Merge pull request #289 from R-icntay/main

Add resources for lesson 08 (logistic regression) and make minor fixes for lesson 05 (intro to regression) and 06 (prepare data)
pull/294/head
Jen Looper 3 years ago committed by GitHub
commit 1cea15a248

@ -1,6 +1,6 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_1-R.ipynb",
@ -19,37 +19,39 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "YJUHCXqK57yz"
},
"source": [
"#Build a regression model: Get started with R and Tidymodels for regression models"
]
],
"metadata": {
"id": "YJUHCXqK57yz"
}
},
{
"cell_type": "markdown",
"source": [
"## Introduction to Regression - Lesson 1\r\n",
"\r\n",
"#### Putting it into perspective\r\n",
"\r\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\r\n",
"\r\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\r\n",
"\r\n",
"That said, let's get started on this task!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/encouRage.jpg\"\r\n",
" width=\"630\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst-->"
],
"metadata": {
"id": "LWNNzfqd6feZ"
},
"source": [
"## Introduction to Regression - Lesson 1\n",
"\n",
"#### Putting it into perspective\n",
"\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\n",
"\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\n",
"\n",
"That said, let's get started on this task!\n",
"\n",
"![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst"
]
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "FIo2YhO26wI9"
},
"source": [
"## 1. Loading up our tool set\n",
"\n",
@ -64,62 +66,62 @@
"`install.packages(c(\"tidyverse\", \"tidymodels\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing."
]
],
"metadata": {
"id": "FIo2YhO26wI9"
}
},
{
"cell_type": "code",
"metadata": {
"id": "cIA9fz9v7Dss",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2df7073b-86b2-4b32-cb86-0da605a0dc11"
},
"execution_count": 2,
"source": [
"if (!require(\"pacman\")) install.packages(\"pacman\")\n",
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"pacman::p_load(tidyverse, tidymodels)"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Loading required package: pacman\n",
"\n"
],
"name": "stderr"
]
}
]
],
"metadata": {
"id": "cIA9fz9v7Dss",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2df7073b-86b2-4b32-cb86-0da605a0dc11"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "gpO_P_6f9WUG"
},
"source": [
"Now, let's load these awesome packages and make them available in our current R session.(This is for mere illustration, `pacman::p_load()` already did that for you)"
]
],
"metadata": {
"id": "gpO_P_6f9WUG"
}
},
{
"cell_type": "code",
"metadata": {
"id": "NLMycgG-9ezO"
},
"execution_count": null,
"source": [
"# load the core Tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# load the core Tidymodels packages\n",
"library(tidymodels)\n"
"# load the core Tidyverse packages\r\n",
"library(tidyverse)\r\n",
"\r\n",
"# load the core Tidymodels packages\r\n",
"library(tidymodels)\r\n"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "NLMycgG-9ezO"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "KM6iXLH996Cl"
},
"source": [
"## 2. The diabetes dataset\n",
"\n",
@ -156,34 +158,34 @@
"Before going any further, let's also introduce something you will encounter often in R code 🥁🥁: the pipe operator `%>%`\n",
"\n",
"The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code."
]
],
"metadata": {
"id": "KM6iXLH996Cl"
}
},
{
"cell_type": "code",
"metadata": {
"id": "Z1geAMhM-bSP"
},
"execution_count": null,
"source": [
"# Import the data set\n",
"diabetes <- read_table2(file = \"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(diabetes)\n",
"\n",
"\n",
"# Select the first 5 rows of the data\n",
"diabetes %>% \n",
"# Import the data set\r\n",
"diabetes <- read_table2(file = \"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt\")\r\n",
"\r\n",
"\r\n",
"# Get a glimpse and dimensions of the data\r\n",
"glimpse(diabetes)\r\n",
"\r\n",
"\r\n",
"# Select the first 5 rows of the data\r\n",
"diabetes %>% \r\n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "Z1geAMhM-bSP"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "UwjVT1Hz-c3Z"
},
"source": [
"`glimpse()` shows us that this data has 442 rows and 11 columns with all the columns being of data type `double` \n",
"\n",
@ -198,65 +200,65 @@
"Now that we have the data, let's narrow down to one feature (`bmi`) to target for this exercise. This will require us to select the desired columns. So, how do we do this?\n",
"\n",
"[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) allows us to *select* (and optionally rename) columns in a data frame."
]
],
"metadata": {
"id": "UwjVT1Hz-c3Z"
}
},
{
"cell_type": "code",
"metadata": {
"id": "RDY1oAKI-m80"
},
"execution_count": null,
"source": [
"# Select predictor feature `bmi` and outcome `y`\n",
"diabetes_select <- diabetes %>% \n",
" select(c(bmi, y))\n",
"\n",
"# Print the first 5 rows\n",
"diabetes_select %>% \n",
"# Select predictor feature `bmi` and outcome `y`\r\n",
"diabetes_select <- diabetes %>% \r\n",
" select(c(bmi, y))\r\n",
"\r\n",
"# Print the first 5 rows\r\n",
"diabetes_select %>% \r\n",
" slice(1:10)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "RDY1oAKI-m80"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "SDk668xK-tc3"
},
"source": [
"## 3. Training and Testing data\n",
"\n",
"It's common practice in supervised learning to *split* the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to see how the model performed.\n",
"\n",
"Now that we have data ready, we can see if a machine can help determine a logical split between the numbers in this dataset. We can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework, to create an object that contains the information on *how* to split the data, and then two more rsample functions to extract the created training and testing sets:\n"
]
],
"metadata": {
"id": "SDk668xK-tc3"
}
},
{
"cell_type": "code",
"metadata": {
"id": "EqtHx129-1h-"
},
"execution_count": null,
"source": [
"set.seed(2056)\n",
"# Split 67% of the data for training and the rest for tesing\n",
"diabetes_split <- diabetes_select %>% \n",
" initial_split(prop = 0.67)\n",
"\n",
"# Extract the resulting train and test sets\n",
"diabetes_train <- training(diabetes_split)\n",
"diabetes_test <- testing(diabetes_split)\n",
"\n",
"# Print the first 3 rows of the training set\n",
"diabetes_train %>% \n",
"set.seed(2056)\r\n",
"# Split 67% of the data for training and the rest for tesing\r\n",
"diabetes_split <- diabetes_select %>% \r\n",
" initial_split(prop = 0.67)\r\n",
"\r\n",
"# Extract the resulting train and test sets\r\n",
"diabetes_train <- training(diabetes_split)\r\n",
"diabetes_test <- testing(diabetes_split)\r\n",
"\r\n",
"# Print the first 3 rows of the training set\r\n",
"diabetes_train %>% \r\n",
" slice(1:10)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "EqtHx129-1h-"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "sBOS-XhB-6v7"
},
"source": [
"## 4. Train a linear regression model with Tidymodels\n",
"\n",
@ -271,68 +273,68 @@
"- Model **engine** is the computational tool which will be used to fit the model. Often these are R packages, such as **`\"lm\"`** or **`\"ranger\"`**\n",
"\n",
"This modeling information is captured in a model specification, so let's build one!"
]
],
"metadata": {
"id": "sBOS-XhB-6v7"
}
},
{
"cell_type": "code",
"metadata": {
"id": "20OwEw20--t3"
},
"execution_count": null,
"source": [
"# Build a linear model specification\n",
"lm_spec <- \n",
" # Type\n",
" linear_reg() %>% \n",
" # Engine\n",
" set_engine(\"lm\") %>% \n",
" # Mode\n",
" set_mode(\"regression\")\n",
"\n",
"\n",
"# Print the model specification\n",
"# Build a linear model specification\r\n",
"lm_spec <- \r\n",
" # Type\r\n",
" linear_reg() %>% \r\n",
" # Engine\r\n",
" set_engine(\"lm\") %>% \r\n",
" # Mode\r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Print the model specification\r\n",
"lm_spec"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "20OwEw20--t3"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "_oDHs89k_CJj"
},
"source": [
"After a model has been *specified*, the model can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, typically using a formula and some data.\n",
"\n",
"`y ~ .` means we'll fit `y` as the predicted quantity/target, explained by all the predictors/features ie, `.` (in this case, we only have one predictor: `bmi` )"
]
],
"metadata": {
"id": "_oDHs89k_CJj"
}
},
{
"cell_type": "code",
"metadata": {
"id": "YlsHqd-q_GJQ"
},
"execution_count": null,
"source": [
"# Build a linear model specification\n",
"lm_spec <- linear_reg() %>% \n",
" set_engine(\"lm\") %>%\n",
" set_mode(\"regression\")\n",
"\n",
"\n",
"# Train a linear regression model\n",
"lm_mod <- lm_spec %>% \n",
" fit(y ~ ., data = diabetes_train)\n",
"\n",
"# Print the model\n",
"# Build a linear model specification\r\n",
"lm_spec <- linear_reg() %>% \r\n",
" set_engine(\"lm\") %>%\r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Train a linear regression model\r\n",
"lm_mod <- lm_spec %>% \r\n",
" fit(y ~ ., data = diabetes_train)\r\n",
"\r\n",
"# Print the model\r\n",
"lm_mod"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "YlsHqd-q_GJQ"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "kGZ22RQj_Olu"
},
"source": [
"From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that gives us the lowest overall error between the actual and predicted variable.\n",
"<br>\n",
@ -340,97 +342,100 @@
"## 5. Make predictions on the test set\n",
"\n",
"Now that we've trained a model, we can use it to predict the disease progression y for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). This will be used to draw the line between data groups."
]
],
"metadata": {
"id": "kGZ22RQj_Olu"
}
},
{
"cell_type": "code",
"metadata": {
"id": "nXHbY7M2_aao"
},
"execution_count": null,
"source": [
"# Make predictions for the test set\n",
"predictions <- lm_mod %>% \n",
" predict(new_data = diabetes_test)\n",
"\n",
"# Print out some of the predictions\n",
"predictions %>% \n",
"# Make predictions for the test set\r\n",
"predictions <- lm_mod %>% \r\n",
" predict(new_data = diabetes_test)\r\n",
"\r\n",
"# Print out some of the predictions\r\n",
"predictions %>% \r\n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "nXHbY7M2_aao"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "R_JstwUY_bIs"
},
"source": [
"Woohoo! 💃🕺 We just trained a model and used it to make predictions!\n",
"\n",
"When making predictions, the tidymodels convention is to always produce a tibble/data frame of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format for subsequent operations such as plotting.\n",
"\n",
"`dplyr::bind_cols()` efficiently binds multiple data frames column."
]
],
"metadata": {
"id": "R_JstwUY_bIs"
}
},
{
"cell_type": "code",
"metadata": {
"id": "RybsMJR7_iI8"
},
"execution_count": null,
"source": [
"# Combine the predictions and the original test set\n",
"results <- diabetes_test %>% \n",
" bind_cols(predictions)\n",
"\n",
"\n",
"results %>% \n",
"# Combine the predictions and the original test set\r\n",
"results <- diabetes_test %>% \r\n",
" bind_cols(predictions)\r\n",
"\r\n",
"\r\n",
"results %>% \r\n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "RybsMJR7_iI8"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "XJbYbMZW_n_s"
},
"source": [
"## 6. Plot modelling results\n",
"\n",
"Now, its time to see this visually 📈. We'll create a scatter plot of all the `y` and `bmi` values of the test set, then use the predictions to draw a line in the most appropriate place, between the model's data groupings.\n",
"\n",
"R has several systems for making graphs, but `ggplot2` is one of the most elegant and most versatile. This allows you to compose graphs by **combining independent components**."
]
],
"metadata": {
"id": "XJbYbMZW_n_s"
}
},
{
"cell_type": "code",
"metadata": {
"id": "R9tYp3VW_sTn"
},
"execution_count": null,
"source": [
"# Set a theme for the plot\n",
"theme_set(theme_light())\n",
"# Create a scatter plot\n",
"results %>% \n",
" ggplot(aes(x = bmi)) +\n",
" # Add a scatter plot\n",
" geom_point(aes(y = y), size = 1.6) +\n",
" # Add a line plot\n",
"# Set a theme for the plot\r\n",
"theme_set(theme_light())\r\n",
"# Create a scatter plot\r\n",
"results %>% \r\n",
" ggplot(aes(x = bmi)) +\r\n",
" # Add a scatter plot\r\n",
" geom_point(aes(y = y), size = 1.6) +\r\n",
" # Add a line plot\r\n",
" geom_line(aes(y = .pred), color = \"blue\", size = 1.5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "R9tYp3VW_sTn"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "zrPtHIxx_tNI"
},
"source": [
"> ✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.\n",
"\n",
"Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!\n"
]
],
"metadata": {
"id": "zrPtHIxx_tNI"
}
}
]
}
}

@ -1,6 +1,6 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_2-R.ipynb",
@ -19,35 +19,39 @@
"cells": [
{
"cell_type": "markdown",
"source": [
"# Build a regression model: prepare and visualize data\r\n",
"\r\n",
"## **Linear Regression for Pumpkins - Lesson 2**\r\n",
"#### Introduction\r\n",
"\r\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\r\n",
"\r\n",
"In this lesson, you will learn:\r\n",
"\r\n",
"- How to prepare your data for model-building.\r\n",
"\r\n",
"- How to use `ggplot2` for data visualization.\r\n",
"\r\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\r\n",
"\r\n",
"Let's see this by working through a practical exercise.\r\n",
"\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/unruly_data.jpg\"\r\n",
" width=\"700\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/unruly_data.jpg)<br>Artwork by \\@allison_horst-->"
],
"metadata": {
"id": "Pg5aexcOPqAZ"
},
"source": [
"# Build a regression model: prepare and visualize data\n",
"\n",
"## **Linear Regression for Pumpkins - Lesson 2**\n",
"#### Introduction\n",
"\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- How to prepare your data for model-building.\n",
"\n",
"- How to use `ggplot2` for data visualization.\n",
"\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\n",
"\n",
"Let's see this by working through a practical exercise.\n",
"\n",
"![Artwork by \\@allison_horst](../images/unruly_data.jpg)<br>Artwork by \\@allison_horst"
]
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "dc5WhyVdXAjR"
},
"source": [
"## 1. Importing pumpkins data and summoning the Tidyverse\n",
"\n",
@ -60,58 +64,58 @@
"`install.packages(c(\"tidyverse\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing."
]
],
"metadata": {
"id": "dc5WhyVdXAjR"
}
},
{
"cell_type": "code",
"metadata": {
"id": "GqPYUZgfXOBt"
},
"execution_count": null,
"source": [
"if (!require(\"pacman\")) install.packages(\"pacman\")\n",
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"pacman::p_load(tidyverse)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "GqPYUZgfXOBt"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "kvjDTPDSXRr2"
},
"source": [
"Now, let's fire up some packages and load the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!"
]
],
"metadata": {
"id": "kvjDTPDSXRr2"
}
},
{
"cell_type": "code",
"metadata": {
"id": "VMri-t2zXqgD"
},
"execution_count": null,
"source": [
"# Load the core Tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the pumpkins data\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(pumpkins)\n",
"\n",
"\n",
"# Print the first 50 rows of the data set\n",
"pumpkins %>% \n",
"# Load the core Tidyverse packages\r\n",
"library(tidyverse)\r\n",
"\r\n",
"# Import the pumpkins data\r\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\r\n",
"\r\n",
"\r\n",
"# Get a glimpse and dimensions of the data\r\n",
"glimpse(pumpkins)\r\n",
"\r\n",
"\r\n",
"# Print the first 50 rows of the data set\r\n",
"pumpkins %>% \r\n",
" slice_head(n =50)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "VMri-t2zXqgD"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "REWcIv9yX29v"
},
"source": [
"A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). The `Date` is of type character and there's also a strange column called `Package` where the data is a mix between `sacks`, `bins` and other values. The data, in fact, is a bit of a mess 😤.\n",
"\n",
@ -120,13 +124,13 @@
"\n",
"> A refresher: The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n",
"\n"
]
],
"metadata": {
"id": "REWcIv9yX29v"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "Zxfb3AM5YbUe"
},
"source": [
"## 2. Check for missing data\n",
"\n",
@ -135,105 +139,112 @@
"So how would we know that the data frame contains missing values?\n",
"<br>\n",
"- One straight forward way would be to use the base R function `anyNA` which returns the logical objects `TRUE` or `FALSE`"
]
],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
},
{
"cell_type": "code",
"metadata": {
"id": "G--DQutAYltj"
},
"execution_count": null,
"source": [
"pumpkins %>% \n",
"pumpkins %>% \r\n",
" anyNA()"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "G--DQutAYltj"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "mU-7-SB6YokF"
},
"source": [
"Great, there seems to be some missing data! That's a good place to start.\n",
"\n",
"- Another way would be to use the function `is.na()` that indicates which individual column elements are missing with a logical `TRUE`."
]
],
"metadata": {
"id": "mU-7-SB6YokF"
}
},
{
"cell_type": "code",
"metadata": {
"id": "W-DxDOR4YxSW"
},
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
"pumpkins %>% \r\n",
" is.na() %>% \r\n",
" head(n = 7)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "W-DxDOR4YxSW"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "xUWxipKYY0o7"
},
"source": [
"Okay, got the job done but with a large data frame such as this, it would be inefficient and practically impossible to review all of the rows and columns individually😴.\n",
"\n",
"- A more intuitive way would be to calculate the sum of the missing values for each column:"
]
],
"metadata": {
"id": "xUWxipKYY0o7"
}
},
{
"cell_type": "code",
"metadata": {
"id": "ZRBWV6P9ZArL"
},
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
"pumpkins %>% \r\n",
" is.na() %>% \r\n",
" colSums()"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "ZRBWV6P9ZArL"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "9gv-crB6ZD1Y"
},
"source": [
"Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.\n",
"\n",
"> Along with the awesome sets of packages and functions, R has a very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function."
]
],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\r\n",
"\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/dplyr_wrangling.png\"\r\n",
" width=\"569\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/dplyr_wrangling.png)<br/>Artwork by \\@allison_horst-->"
],
"metadata": {
"id": "o4jLY5-VZO2C"
},
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n",
"\n",
"![Artwork by \\@allison_horst](../images/dplyr_wrangling.png)<br/>Artwork by \\@allison_horst"
]
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "i5o33MQBZWWw"
},
"source": [
"[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!\n",
"<br>\n"
]
],
"metadata": {
"id": "i5o33MQBZWWw"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "x3VGMAGBZiUr"
},
"source": [
"#### dplyr::select()\n",
"\n",
@ -242,31 +253,31 @@
"To make your data frame easier to work with, drop several of its columns, using `select()`, keeping only the columns you need.\n",
"\n",
"For instance, in this exercise, our analysis will involve the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns."
]
],
"metadata": {
"id": "x3VGMAGBZiUr"
}
},
{
"cell_type": "code",
"metadata": {
"id": "F_FgxQnVZnM0"
},
"execution_count": null,
"source": [
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(Package, `Low Price`, `High Price`, Date)\n",
"\n",
"\n",
"# Print data set\n",
"pumpkins %>% \n",
"# Select desired columns\r\n",
"pumpkins <- pumpkins %>% \r\n",
" select(Package, `Low Price`, `High Price`, Date)\r\n",
"\r\n",
"\r\n",
"# Print data set\r\n",
"pumpkins %>% \r\n",
" slice_head(n = 5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "F_FgxQnVZnM0"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "2KKo0Ed9Z1VB"
},
"source": [
"#### dplyr::mutate()\n",
"\n",
@ -283,66 +294,66 @@
"2. Extract the month from the dates to a new column.\n",
"\n",
"In R, the package [lubridate](https://lubridate.tidyverse.org/) makes it easier to work with Date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()`, `lubridate::month()` and see how to achieve the above objectives. We can drop the Date column since we won't be needing it again in subsequent operations."
]
],
"metadata": {
"id": "2KKo0Ed9Z1VB"
}
},
{
"cell_type": "code",
"metadata": {
"id": "5joszIVSZ6xe"
},
"execution_count": null,
"source": [
"# Load lubridate\n",
"library(lubridate)\n",
"\n",
"pumpkins <- pumpkins %>% \n",
" # Convert the Date column to a date object\n",
" mutate(Date = mdy(Date)) %>% \n",
" # Extract month from Date\n",
" mutate(Month = month(Date)) %>% \n",
" # Drop Date column\n",
" select(-Date)\n",
"\n",
"# View the first few rows\n",
"pumpkins %>% \n",
"# Load lubridate\r\n",
"library(lubridate)\r\n",
"\r\n",
"pumpkins <- pumpkins %>% \r\n",
" # Convert the Date column to a date object\r\n",
" mutate(Date = mdy(Date)) %>% \r\n",
" # Extract month from Date\r\n",
" mutate(Month = month(Date)) %>% \r\n",
" # Drop Date column\r\n",
" select(-Date)\r\n",
"\r\n",
"# View the first few rows\r\n",
"pumpkins %>% \r\n",
" slice_head(n = 7)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "5joszIVSZ6xe"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "nIgLjNMCZ-6Y"
},
"source": [
"Woohoo! 🤩\n",
"\n",
"Next, let's create a new column `Price`, which represents the average price of a pumpkin. Now, let's take the average of the `Low Price` and `High Price` columns to populate the new Price column.\n",
"<br>"
]
],
"metadata": {
"id": "nIgLjNMCZ-6Y"
}
},
{
"cell_type": "code",
"metadata": {
"id": "Zo0BsqqtaJw2"
},
"execution_count": null,
"source": [
"# Create a new column Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
"\n",
"# View the first few rows of the data\n",
"pumpkins %>% \n",
"# Create a new column Price\r\n",
"pumpkins <- pumpkins %>% \r\n",
" mutate(Price = (`Low Price` + `High Price`)/2)\r\n",
"\r\n",
"# View the first few rows of the data\r\n",
"pumpkins %>% \r\n",
" slice_head(n = 5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "Zo0BsqqtaJw2"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "p77WZr-9aQAR"
},
"source": [
"Yeees!💪\n",
"\n",
@ -351,38 +362,38 @@
"If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, and some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes with varying widths.\n",
"\n",
"Let's verify this:"
]
],
"metadata": {
"id": "p77WZr-9aQAR"
}
},
{
"cell_type": "code",
"metadata": {
"id": "XISGfh0IaUy6"
},
"execution_count": null,
"source": [
"# Verify the distinct observations in Package column\n",
"pumpkins %>% \n",
"# Verify the distinct observations in Package column\r\n",
"pumpkins %>% \r\n",
" distinct(Package)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "XISGfh0IaUy6"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "7sMjiVujaZxY"
},
"source": [
"Amazing!👏\n",
"\n",
"Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string *bushel* in the `Package` column and put this in a new data frame `new_pumpkins`.\n",
"<br>"
]
],
"metadata": {
"id": "7sMjiVujaZxY"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "L8Qfcs92ageF"
},
"source": [
"#### dplyr::filter() and stringr::str_detect()\n",
"\n",
@ -391,43 +402,43 @@
"[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n",
"\n",
"The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations."
]
],
"metadata": {
"id": "L8Qfcs92ageF"
}
},
{
"cell_type": "code",
"metadata": {
"id": "hy_SGYREampd"
},
"execution_count": null,
"source": [
"# Retain only pumpkins with \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(Package, \"bushel\"))\n",
"\n",
"# Get the dimensions of the new data\n",
"dim(new_pumpkins)\n",
"\n",
"# View a few rows of the new data\n",
"new_pumpkins %>% \n",
"# Retain only pumpkins with \"bushel\"\r\n",
"new_pumpkins <- pumpkins %>% \r\n",
" filter(str_detect(Package, \"bushel\"))\r\n",
"\r\n",
"# Get the dimensions of the new data\r\n",
"dim(new_pumpkins)\r\n",
"\r\n",
"# View a few rows of the new data\r\n",
"new_pumpkins %>% \r\n",
" slice_head(n = 5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "hy_SGYREampd"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "VrDwF031avlR"
},
"source": [
"You can see that we have narrowed down to 415 or so rows of data containing pumpkins by the bushel.🤩\n",
"<br>"
]
],
"metadata": {
"id": "VrDwF031avlR"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "mLpw2jH4a0tx"
},
"source": [
"#### dplyr::case_when()\n",
"\n",
@ -436,33 +447,33 @@
"Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
"\n",
"We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` allows you to vectorise multiple `if_else()`statements.\n"
]
],
"metadata": {
"id": "mLpw2jH4a0tx"
}
},
{
"cell_type": "code",
"metadata": {
"id": "P68kLVQmbM6I"
},
"execution_count": null,
"source": [
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
"# Convert the price if the Package contains fractional bushel values\r\n",
"new_pumpkins <- new_pumpkins %>% \r\n",
" mutate(Price = case_when(\r\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\r\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\r\n",
" TRUE ~ Price))\r\n",
"\r\n",
"# View the first few rows of the data\r\n",
"new_pumpkins %>% \r\n",
" slice_head(n = 30)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "P68kLVQmbM6I"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "pS2GNPagbSdb"
},
"source": [
"Now, we can analyze the pricing per unit based on their bushel measurement. All this study of bushels of pumpkins, however, goes to show how very `important` it is to `understand the nature of your data`!\n",
"\n",
@ -470,103 +481,109 @@
">\n",
"> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.\n",
"<br>\n"
]
],
"metadata": {
"id": "pS2GNPagbSdb"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "qql1SowfbdnP"
},
"source": [
"Now lastly, for the sheer sake of adventure 💁‍♀️, let's also move the Month column to the first position i.e `before` column `Package`.\n",
"\n",
"`dplyr::relocate()` is used to change column positions."
]
],
"metadata": {
"id": "qql1SowfbdnP"
}
},
{
"cell_type": "code",
"metadata": {
"id": "JJ1x6kw8bixF"
},
"execution_count": null,
"source": [
"# Create a new data frame new_pumpkins\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"new_pumpkins %>% \n",
"# Create a new data frame new_pumpkins\r\n",
"new_pumpkins <- new_pumpkins %>% \r\n",
" relocate(Month, .before = Package)\r\n",
"\r\n",
"new_pumpkins %>% \r\n",
" slice_head(n = 7)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "JJ1x6kw8bixF"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "y8TJ0Za_bn5Y"
},
"source": [
"Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!\n",
"<br>"
]
],
"metadata": {
"id": "y8TJ0Za_bn5Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 4. Data visualization with ggplot2\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/data-visualization.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](../images/data-visualization.png){width=\"600\"}-->\r\n",
"\r\n",
"There is a *wise* saying that goes like this:\r\n",
"\r\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\r\n",
"\r\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\r\n",
"\r\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\r\n",
"\r\n",
"R offers a number of several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\r\n",
"\r\n",
"Let's start with a simple scatter plot for the Price and Month columns.\r\n",
"\r\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)) then add a layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\r\n"
],
"metadata": {
"id": "mYSH6-EtbvNa"
},
"source": [
"## 4. Data visualization with ggplot2\n",
"\n",
"![Infographic by Dasani Madipalli](../images/data-visualization.png){width=\"600\"}\n",
"\n",
"There is a *wise* saying that goes like this:\n",
"\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n",
"\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\n",
"\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\n",
"\n",
"R offers a number of several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\n",
"\n",
"Let's start with a simple scatter plot for the Price and Month columns.\n",
"\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)) then add a layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\n"
]
}
},
{
"cell_type": "code",
"metadata": {
"id": "g2YjnGeOcLo4"
},
"execution_count": null,
"source": [
"# Set a theme for the plots\n",
"theme_set(theme_light())\n",
"\n",
"# Create a scatter plot\n",
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n",
"# Set a theme for the plots\r\n",
"theme_set(theme_light())\r\n",
"\r\n",
"# Create a scatter plot\r\n",
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\r\n",
"p + geom_point()"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "g2YjnGeOcLo4"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ml7SDCLQcPvE"
},
"source": [
"Is this a useful plot 🤷? Does anything about it surprise you?\n",
"\n",
"It's not particularly useful as all it does is display in your data as a spread of points in a given month.\n",
"<br>"
]
],
"metadata": {
"id": "Ml7SDCLQcPvE"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "jMakvJZIcVkh"
},
"source": [
"### **How do we make it useful?**\n",
"\n",
@ -583,62 +600,65 @@
"- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.\n",
"\n",
"For example, we can use the `dplyr::group_by() %>% summarize()` to group the pumpkins into groups based on the **Month** columns and then find the **mean price** for each month."
]
],
"metadata": {
"id": "jMakvJZIcVkh"
}
},
{
"cell_type": "code",
"metadata": {
"id": "6kVSUa2Bcilf"
},
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month\n",
"new_pumpkins %>%\n",
" group_by(Month) %>% \n",
"# Find the average price of pumpkins per month\r\n",
"new_pumpkins %>%\r\n",
" group_by(Month) %>% \r\n",
" summarise(mean_price = mean(Price))"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "6kVSUa2Bcilf"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "Kds48GUBcj3W"
},
"source": [
"Succinct!✨\n",
"\n",
"Categorical features such as months are better represented using a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n",
"\n",
"Let's whip up one!"
]
],
"metadata": {
"id": "Kds48GUBcj3W"
}
},
{
"cell_type": "code",
"metadata": {
"id": "VNbU1S3BcrxO"
},
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\n",
"new_pumpkins %>%\n",
" group_by(Month) %>% \n",
" summarise(mean_price = mean(Price)) %>% \n",
" ggplot(aes(x = Month, y = mean_price)) +\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\r\n",
"new_pumpkins %>%\r\n",
" group_by(Month) %>% \r\n",
" summarise(mean_price = mean(Price)) %>% \r\n",
" ggplot(aes(x = Month, y = mean_price)) +\r\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
" ylab(\"Pumpkin Price\")"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "VNbU1S3BcrxO"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "zDm0VOzzcuzR"
},
"source": [
"🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?\n",
"\n",
"Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!"
]
],
"metadata": {
"id": "zDm0VOzzcuzR"
}
}
]
}
}

Binary file not shown.


@ -0,0 +1,751 @@
{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "Untitled10.ipynb",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Build a regression model: logistic regression\n",
"<br>\n"
],
"metadata": {
"id": "fVfEucLYkV9T"
}
},
{
"cell_type": "markdown",
"source": [
"## Build a logistic regression model - Lesson 4\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/logistic-linear.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](../images/logistic-linear.png){width=\"600\"}-->"
],
"metadata": {
"id": "QizKKpzakfx2"
}
},
{
"cell_type": "markdown",
"source": [
"#### ** [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/15/)**\n",
"\n",
"#### Introduction\n",
"\n",
"In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict `binary` `categories`. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- Techniques for logistic regression\n",
"\n",
"✅ Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)\n",
"\n",
"#### **Prerequisite**\n",
"\n",
"Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.\n",
"\n",
"Let's build a logistic regression model to predict that, given some variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).\n",
"\n",
"> Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.\n",
"\n",
"For this lesson, we'll require the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n",
"\n",
"- `ggbeeswarm`: The [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) provides methods to create beeswarm-style plots using ggplot2.\n",
"\n",
"You can have them installed as:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n",
"\n",
"Alternatiely, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing."
],
"metadata": {
"id": "KPmut75XkmXY"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"\r\n",
"pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)"
],
"outputs": [],
"metadata": {
"id": "dnIGNNttkx_O"
}
},
{
"cell_type": "markdown",
"source": [
"## ** Define the question**\r\n",
"\r\n",
"For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.\r\n",
"\r\n",
"> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!\r\n",
"\r\n",
"## **About logistic regression**\r\n",
"\r\n",
"Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\r\n",
"\r\n",
"#### **Binary classification**\r\n",
"\r\n",
"Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"orange or not orange\") whereas the latter is capable of predicting `continual values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/pumpkin-classifier.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/pumpkin-classifier.png){width=\"600\"}-->"
],
"metadata": {
"id": "ws-hP_SXk2O6"
}
},
{
"cell_type": "markdown",
"source": [
"#### **Other classifications**\r\n",
"\r\n",
"There are other types of logistic regression, including multinomial and ordinal:\r\n",
"\r\n",
"- **Multinomial**, which involves having more than one category - \"Orange, White, and Striped\".\r\n",
"\r\n",
"- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/multinomial-ordinal.png\"\r\n",
" width=\"700\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/multinomial-ordinal.png){width=\"600\"}-->"
],
"metadata": {
"id": "LkLN-ZgDlBEc"
}
},
{
"cell_type": "markdown",
"source": [
"**It's still linear**\n",
"\n",
"Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.\n",
"\n",
"#### **Variables DO NOT have to correlate**\n",
"\n",
"Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.\n",
"\n",
"#### **You need a lot of clean data**\n",
"\n",
"Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.\n",
"\n",
"✅ Think about the types of data that would lend themselves well to logistic regression\n"
],
"metadata": {
"id": "D8_JoVZtlHUt"
}
},
{
"cell_type": "markdown",
"source": [
"## 1. Tidy the data\n",
"\n",
"Now, the fun begins! Let's start by importing the data, cleaning the data a bit, dropping rows containing missing values and selecting only some of the columns:"
],
"metadata": {
"id": "LPj8Ib1AlIua"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core tidyverse packages\r\n",
"library(tidyverse)\r\n",
"\r\n",
"# Import the data and clean column names\r\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \r\n",
" clean_names()\r\n",
"\r\n",
"# Select desired columns\r\n",
"pumpkins_select <- pumpkins %>% \r\n",
" select(c(city_name, package, variety, origin, item_size, color)) \r\n",
"\r\n",
"# Drop rows containing missing values and encode color as factor (category)\r\n",
"pumpkins_select <- pumpkins_select %>% \r\n",
" drop_na() %>% \r\n",
" mutate(color = factor(color))\r\n",
"\r\n",
"# View the first few rows\r\n",
"pumpkins_select %>% \r\n",
" slice_head(n = 5)\r\n"
],
"outputs": [],
"metadata": {
"id": "Q8oKJ8PAlLM0"
}
},
{
"cell_type": "markdown",
"source": [
"Sometimes, we may want some little more information on our data. We can have a look at the `data`, `its structure` and the `data type` of its features by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below:"
],
"metadata": {
"id": "tKY5eN8alPNn"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins_select %>% \r\n",
" glimpse()"
],
"outputs": [],
"metadata": {
"id": "wDpatL1WlShu"
}
},
{
"cell_type": "markdown",
"source": [
"Wow! Seems that all our columns are all of type *character*, further alluding that they are all categorical.\n",
"\n",
"Let's confirm that we will actually be doing a binary classification problem:"
],
"metadata": {
"id": "QbdC2b0JlU2G"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Subset distinct observations in outcome column\r\n",
"pumpkins_select %>% \r\n",
" distinct(color)"
],
"outputs": [],
"metadata": {
"id": "Gys-Q18rlZpE"
}
},
{
"cell_type": "markdown",
"source": [
"🥳🥳 That went down well!\n",
"\n",
"## 2. Explore the data\n",
"\n",
"The goal of data exploration is to try to understand the `relationships` between its attributes; in particular, any apparent correlation between the *features* and the *label* your model will try to predict. One way of doing this is by using data visualization.\n",
"\n",
"Given our the data types of our columns, we can `encode` them and be on our way to making some visualizations. This simply involves `translating` a column with `categorical values` for example our columns of type *char*, into one or more `numeric columns` that take the place of the original. - Something we did in our [last lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/3-Linear/solution/lesson_3-R.ipynb).\n",
"\n",
"Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/)- a package for preprocessing data. We'll define a `recipe` that specifies that all predictor columns should be encoded into a set of integers , `prep` it to estimates the required quantities and statistics needed by any operations and finally `bake` to apply the computations to new data.\n",
"\n",
"> Normally, recipes is usually used as a preprocessor for modelling where it defines what steps should be applied to a data set in order to get it ready for modelling. In that case it is **highly recommend** that you use a `workflow()` instead of manually estimating a recipe using prep and bake. We'll see all this in just a moment.\n",
">\n",
"> However for now, we are using recipes + prep + bake to specify what steps should be applied to a data set in order to get it ready for data analysis and then extract the preprocessed data with the steps applied."
],
"metadata": {
"id": "kn_20wSPldVH"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Preprocess and extract data to allow some data analysis\r\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>% \r\n",
" # Encode all columns to a set of integers\r\n",
" step_integer(all_predictors(), zero_based = T) %>% \r\n",
" prep() %>% \r\n",
" bake(new_data = NULL)\r\n",
"\r\n",
"\r\n",
"# Display the first few rows of preprocessed data\r\n",
"baked_pumpkins %>% \r\n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "syaCgFQ_lijg"
}
},
{
"cell_type": "markdown",
"source": [
"Now let's compare the feature distributions for each label value using box plots. We'll begin by formatting the data to a *long* format to make it somewhat easier to make multiple `facets`."
],
"metadata": {
"id": "RlkOZ_C5lldq"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Pivot data to long format\r\n",
"baked_pumpkins_long <- baked_pumpkins %>% \r\n",
" pivot_longer(!color, names_to = \"features\", values_to = \"values\")\r\n",
"\r\n",
"\r\n",
"# Print out restructured data\r\n",
"baked_pumpkins_long %>% \r\n",
" slice_head(n = 10)\r\n"
],
"outputs": [],
"metadata": {
"id": "putq8DagltUQ"
}
},
{
"cell_type": "markdown",
"source": [
"Now, let's make some boxplots showing the distribution of the predictors with respect to the outcome color."
],
"metadata": {
"id": "-RHm-12zlt-B"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"theme_set(theme_light())\r\n",
"#Make a box plot for each predictor feature\r\n",
"baked_pumpkins_long %>% \r\n",
" mutate(color = factor(color)) %>% \r\n",
" ggplot(mapping = aes(x = color, y = values, fill = features)) +\r\n",
" geom_boxplot() + \r\n",
" facet_wrap(~ features, scales = \"free\", ncol = 3) +\r\n",
" scale_color_viridis_d(option = \"cividis\", end = .8) +\r\n",
" theme(legend.position = \"none\")"
],
"outputs": [],
"metadata": {
"id": "3Py4i1p1l3hP"
}
},
{
"cell_type": "markdown",
"source": [
"Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.\n",
"\n",
"#### **Use a swarm plot**\n",
"\n",
"Color is a binary category (Orange or Not), it's called `categorical data`. There are other various ways of [visualizing categorical data](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar).\n",
"\n",
"Try a `swarm plot` to show the distribution of color with respect to the item_size.\n",
"\n",
"We'll use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) which provides methods to create beeswarm-style plots using ggplot2. Beeswarm plots are a way of plotting points that would ordinarily overlap so that they fall next to each other instead."
],
"metadata": {
"id": "2LSj6_LCl68V"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create beeswarm plots of color and item_size\r\n",
"baked_pumpkins %>% \r\n",
" mutate(color = factor(color)) %>% \r\n",
" ggplot(mapping = aes(x = color, y = item_size, color = color)) +\r\n",
" geom_quasirandom() +\r\n",
" scale_color_brewer(palette = \"Dark2\", direction = -1) +\r\n",
" theme(legend.position = \"none\")"
],
"outputs": [],
"metadata": {
"id": "hGKeRgUemMTb"
}
},
{
"cell_type": "markdown",
"source": [
"#### **Violin plot**\n",
"\n",
"A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. [`Violin plots`](https://en.wikipedia.org/wiki/Violin_plot) are similar to box plots, except that they also show the probability density of the data at different values. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'."
],
"metadata": {
"id": "_9wdZJH5mOvN"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a violin plot of color and item_size\r\n",
"baked_pumpkins %>%\r\n",
" mutate(color = factor(color)) %>% \r\n",
" ggplot(mapping = aes(x = color, y = item_size, fill = color)) +\r\n",
" geom_violin() +\r\n",
" geom_boxplot(color = \"black\", fill = \"white\", width = 0.02) +\r\n",
" scale_fill_brewer(palette = \"Dark2\", direction = -1) +\r\n",
" theme(legend.position = \"none\")"
],
"outputs": [],
"metadata": {
"id": "LFFFymujmTAZ"
}
},
{
"cell_type": "markdown",
"source": [
"Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\r\n",
"\r\n",
"## 3. Build your logistic regression model\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/logistic-linear.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"> **🧮 Show Me The Math**\r\n",
">\r\n",
"> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:\r\n",
">\r\n",
"> \r\n",
"<p >\r\n",
" <img src=\"../images/sigmoid.png\">\r\n",
"\r\n",
"\r\n",
"> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.\r\n",
"\r\n",
"Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.\r\n",
"\r\n",
"It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:"
],
"metadata": {
"id": "RA_bnMS9mVo8"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Split data into 80% for training and 20% for testing\r\n",
"set.seed(2056)\r\n",
"pumpkins_split <- pumpkins_select %>% \r\n",
" initial_split(prop = 0.8)\r\n",
"\r\n",
"# Extract the data in each split\r\n",
"pumpkins_train <- training(pumpkins_split)\r\n",
"pumpkins_test <- testing(pumpkins_split)\r\n",
"\r\n",
"# Print out the first 5 rows of the training set\r\n",
"pumpkins_train %>% \r\n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "PQdpEYYPmdGW"
}
},
{
"cell_type": "markdown",
"source": [
"🙌 We are now ready to train a model by fitting the training features to the training label (color).\n",
"\n",
"We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling i.e: encoding categorical variables into a set of integers.\n",
"\n",
"There are quite a number of ways to specify a logistic regression model in Tidymodels. See `?logistic_reg()` For now, we'll specify a logistic regression model via the default `stats::glm()` engine."
],
"metadata": {
"id": "MX9LipSimhn0"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\r\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \r\n",
" step_integer(all_predictors(), zero_based = TRUE)\r\n",
"\r\n",
"\r\n",
"# Create a logistic model specification\r\n",
"log_reg <- logistic_reg() %>% \r\n",
" set_engine(\"glm\") %>% \r\n",
" set_mode(\"classification\")\r\n"
],
"outputs": [],
"metadata": {
"id": "0Eo5-SbSmm2-"
}
},
{
"cell_type": "markdown",
"source": [
"Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data (prep+bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities.\n",
"\n",
"In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/) and conveniently holds your modeling components."
],
"metadata": {
"id": "G599GKhXmqWf"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Bundle modelling components in a workflow\r\n",
"log_reg_wf <- workflow() %>% \r\n",
" add_recipe(pumpkins_recipe) %>% \r\n",
" add_model(log_reg)\r\n",
"\r\n",
"# Print out the workflow\r\n",
"log_reg_wf\r\n"
],
"outputs": [],
"metadata": {
"id": "cRoU0tpbmu1T"
}
},
{
"cell_type": "markdown",
"source": [
"After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow will estimate a recipe and preprocess the data before training, so we won't have to manually do that using prep and bake."
],
"metadata": {
"id": "JnRXKmREnEpd"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Train the model\r\n",
"wf_fit <- log_reg_wf %>% \r\n",
" fit(data = pumpkins_train)\r\n",
"\r\n",
"# Print the trained workflow\r\n",
"wf_fit"
],
"outputs": [],
"metadata": {
"id": "ehFwfkjWnNCb"
}
},
{
"cell_type": "markdown",
"source": [
"The model print out shows the coefficients learned during training.\n",
"\n",
"Now we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the probability is more than 0.5, the predict class is `ORANGE` else `WHITE`."
],
"metadata": {
"id": "w01dGNZjnOJQ"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make predictions for color and corresponding probabilities\r\n",
"results <- pumpkins_test %>% select(color) %>% \r\n",
" bind_cols(wf_fit %>% \r\n",
" predict(new_data = pumpkins_test)) %>%\r\n",
" bind_cols(wf_fit %>%\r\n",
" predict(new_data = pumpkins_test, type = \"prob\"))\r\n",
"\r\n",
"# Compare predictions\r\n",
"results %>% \r\n",
" slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "K8PNjPfTnak2"
}
},
{
"cell_type": "markdown",
"source": [
"Very nice! This provides some more insights into how logistic regression works.\n",
"\n",
"Comparing each prediction with its corresponding \"ground truth\" actual value isn't a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics.\n",
"\n",
"One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs. A confusion matrix tabulates how many examples in each class were correctly classified by a model. In our case, it will show you how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; the confusion matrix also shows you how many were classified into the **wrong** categories.\n",
"\n",
"The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes."
],
"metadata": {
"id": "N3J-yW0wngKo"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Confusion matrix for prediction results\r\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)"
],
"outputs": [],
"metadata": {
"id": "0RD77Dq1nl2j"
}
},
{
"cell_type": "markdown",
"source": [
"Let's interpret the confusion matrix. Our model is asked to classify pumpkins between two binary categories, category `orange` and category `not-orange`\n",
"\n",
"- If your model predicts a pumpkin as orange and it belongs to category 'orange' in reality we call it a `true positive`, shown by the top left number.\n",
"\n",
"- If your model predicts a pumpkin as not orange and it belongs to category 'orange' in reality we call it a `false negative`, shown by the bottom left number.\n",
"\n",
"- If your model predicts a pumpkin as orange and it belongs to category 'not-orange' in reality we call it a `false positive`, shown by the top right number.\n",
"\n",
"- If your model predicts a pumpkin as not orange and it belongs to category 'not-orange' in reality we call it a `true negative`, shown by the bottom right number.\n",
"\n",
"\n",
"| **Truth** |\n",
"|:-----:|\n",
"\n",
"\n",
"| | | |\n",
"|---------------|--------|-------|\n",
"| **Predicted** | ORANGE | WHITE |\n",
"| ORANGE | TP | FP |\n",
"| WHITE | FN | TN |"
],
"metadata": {
"id": "H61sFwdOnoiO"
}
},
{
"cell_type": "markdown",
"source": [
"As you might have guessed it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.\n",
"\n",
"The confusion matrix is helpful since it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let's go through some of them:\n",
"\n",
"🎓 Precision: `TP/(TP + FP)` defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\")\n",
"\n",
"🎓 Recall: `TP/(TP + FN)` defined as the proportion of positive results out of the number of samples which were actually positive. Also known as `sensitivity`.\n",
"\n",
"🎓 Specificity: `TN/(TN + FP)` defined as the proportion of negative results out of the number of samples which were actually negative.\n",
"\n",
"🎓 Accuracy: `TP + TN/(TP + TN + FP + FN)` The percentage of labels predicted accurately for a sample.\n",
"\n",
"🎓 F Measure: A weighted average of the precision and recall, with best being 1 and worst being 0.\n",
"\n",
"Let's calculate these metrics!"
],
"metadata": {
"id": "Yc6QUie2oQUr"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Combine metric functions and calculate them all at once\r\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\r\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)"
],
"outputs": [],
"metadata": {
"id": "p6rXx_T3oVxX"
}
},
{
"cell_type": "markdown",
"source": [
"#### **Visualize the ROC curve of this model**\n",
"\n",
"For a start, this is not a bad model; its precision, recall, F measure and accuracy are in the 80% range so ideally you could use it to predict the color of a pumpkin given a set of variables. It also seems that our model was not really able to identify the white pumpkins 🧐. Could you guess why? One reason could be because of the high prevalence of ORANGE pumpkins in our training set making our model more inclined to predict the majority class.\n",
"\n",
"Let's do one more visualization to see the so-called [`ROC score`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):"
],
"metadata": {
"id": "JcenzZo1oaKR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a roc_curve\r\n",
"results %>% \r\n",
" roc_curve(color, .pred_ORANGE) %>% \r\n",
" autoplot()"
],
"outputs": [],
"metadata": {
"id": "BcmkHHHwogRB"
}
},
{
"cell_type": "markdown",
"source": [
"ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. ROC curves typically feature `True Positive Rate`/Sensitivity on the Y axis, and `False Positive Rate`/1-Specificity on the X axis. Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly.\n",
"\n",
"Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the Curve. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example."
],
"metadata": {
"id": "P_an3vc1oqjI"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Calculate area under curve\r\n",
"results %>% \r\n",
" roc_auc(color, .pred_ORANGE)"
],
"outputs": [],
"metadata": {
"id": "SZyy5BT8ovew"
}
},
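{
"cell_type": "markdown",
"source": [
"As a quick optional sanity check of that ranking interpretation (a sketch, assuming the `results` tibble above is still in memory), you can approximate the same number by pairing every truly orange pumpkin with every truly white one and counting how often the orange one receives the higher `.pred_ORANGE` probability."
],
"metadata": {
"id": "aucRankingNote"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Hypothetical sanity check: the AUC can be read as the probability that a\r\n",
"# randomly chosen ORANGE pumpkin receives a higher .pred_ORANGE than a\r\n",
"# randomly chosen WHITE pumpkin (ties counted as one half)\r\n",
"orange_probs <- results %>% filter(color == \"ORANGE\") %>% pull(.pred_ORANGE)\r\n",
"white_probs <- results %>% filter(color == \"WHITE\") %>% pull(.pred_ORANGE)\r\n",
"\r\n",
"expand_grid(orange = orange_probs, white = white_probs) %>% \r\n",
"  summarise(approx_auc = mean((orange > white) + 0.5 * (orange == white)))"
],
"outputs": [],
"metadata": {
"id": "aucRankingChk"
}
},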
{
"cell_type": "markdown",
"source": [
"The result is around `0.67053`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\r\n",
"\r\n",
"In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\r\n",
"\r\n",
"But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!\r\n",
"\r\n",
"You R awesome!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/r_learners_sm.jpeg\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"<!--![Artwork by @allison_horst](images/r_learners_sm.jpeg)-->\r\n"
],
"metadata": {
"id": "5jtVKLTVoy6u"
}
}
]
}

@ -0,0 +1,430 @@
---
title: 'Build a regression model: logistic regression'
output:
html_document:
df_print: paged
theme: flatly
highlight: breezedark
toc: yes
toc_float: yes
code_download: yes
---
## Build a logistic regression model - Lesson 4
![Infographic by Dasani Madipalli](../images/logistic-linear.png){width="600"}
#### **[Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/15/)**
#### Introduction
In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict `binary` `categories`. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?
In this lesson, you will learn:
- Techniques for logistic regression
✅ Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
#### **Prerequisite**
Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.
Let's build a logistic regression model to predict that, given some variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).
> Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.
For this lesson, we'll require the following packages:
- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!
- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.
- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.
- `ggbeeswarm`: The [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) provides methods to create beeswarm-style plots using ggplot2.
You can have them installed as:
`install.packages(c("tidyverse", "tidymodels", "janitor", "ggbeeswarm"))`
Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.
```{r, message=F, warning=F}
suppressWarnings(if (!require("pacman"))install.packages("pacman"))
pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)
```
## **Define the question**
For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.
> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!
## **About logistic regression**
Logistic regression differs from linear regression, which you learned about previously, in a few important ways.
#### **Binary classification**
Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` ("orange or not orange") whereas the latter is capable of predicting `continuous values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.
![Infographic by Dasani Madipalli](../images/pumpkin-classifier.png){width="600"}
#### **Other classifications**
There are other types of logistic regression, including multinomial and ordinal:
- **Multinomial**, which involves having more than one category - "Orange, White, and Striped" (see the sketch after the infographic below).
- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).
![Infographic by Dasani Madipalli](../images/multinomial-ordinal.png){width="600"}
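As a point of reference, here's a quick sketch of how a multinomial model could be specified in Tidymodels with `multinom_reg()` rather than `logistic_reg()` (we'll stick to the binary case in this lesson):
```{r multinom_sketch, eval=FALSE}
# Hypothetical sketch: specifying a multinomial model, in case the outcome
# had more than two classes (e.g. ORANGE, WHITE and STRIPED)
multinom_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")
```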
#### **It's still linear**
Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
#### **Variables DO NOT have to correlate**
Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.
#### **You need a lot of clean data**
Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.
✅ Think about the types of data that would lend themselves well to logistic regression
## 1. Tidy the data
Now, the fun begins! Let's start by importing the data, cleaning it a bit, dropping rows containing missing values and selecting only some of the columns:
```{r, tidyr, message=F, warning=F}
# Load the core tidyverse packages
library(tidyverse)
# Import the data and clean column names
pumpkins <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv") %>%
clean_names()
# Select desired columns
pumpkins_select <- pumpkins %>%
select(c(city_name, package, variety, origin, item_size, color))
# Drop rows containing missing values and encode color as factor (category)
pumpkins_select <- pumpkins_select %>%
drop_na() %>%
mutate(color = factor(color))
# View the first few rows
pumpkins_select %>%
slice_head(n = 5)
```
Sometimes, we may want a little more information about our data. We can have a look at the `data`, `its structure` and the `data type` of its features by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below:
```{r glimpse}
pumpkins_select %>%
glimpse()
```
Wow! It seems that all our columns are of type *character*, further suggesting that they are all categorical.
Let's confirm that we will actually be doing a binary classification problem:
```{r distinct color}
# Subset distinct observations in outcome column
pumpkins_select %>%
distinct(color)
```
🥳🥳 That went down well!
## 2. Explore the data
The goal of data exploration is to try to understand the `relationships` between the data's attributes; in particular, any apparent correlation between the *features* and the *label* your model will try to predict. One way of doing this is by using data visualization.
Given the data types of our columns, we can `encode` them and be on our way to making some visualizations. This simply involves `translating` a column with `categorical values`, for example our columns of type *char*, into one or more `numeric columns` that take the place of the original, something we did in our [last lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/3-Linear/solution/lesson_3-R.ipynb).
Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/), a package for preprocessing data. We'll define a `recipe` that specifies that all predictor columns should be encoded into a set of integers, `prep` it to estimate the required quantities and statistics needed by any operations, and finally `bake` to apply the computations to new data.
> Normally, recipes is used as a preprocessor for modelling, where it defines what steps should be applied to a data set in order to get it ready for modelling. In that case it is **highly recommended** that you use a `workflow()` instead of manually estimating a recipe using prep and bake. We'll see all this in just a moment.
>
> However for now, we are using recipes + prep + bake to specify what steps should be applied to a data set in order to get it ready for data analysis and then extract the preprocessed data with the steps applied.
```{r recipe_prep_bake}
# Preprocess and extract data to allow some data analysis
baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%
# Encode all columns to a set of integers
step_integer(all_predictors(), zero_based = T) %>%
prep() %>%
bake(new_data = NULL)
# Display the first few rows of preprocessed data
baked_pumpkins %>%
slice_head(n = 5)
```
Now let's compare the feature distributions for each label value using box plots. We'll begin by formatting the data to a *long* format to make it somewhat easier to make multiple `facets`.
```{r pivot}
# Pivot data to long format
baked_pumpkins_long <- baked_pumpkins %>%
pivot_longer(!color, names_to = "features", values_to = "values")
# Print out restructured data
baked_pumpkins_long %>%
slice_head(n = 10)
```
Now, let's make some boxplots showing the distribution of the predictors with respect to the outcome color!
```{r boxplots}
theme_set(theme_light())
#Make a box plot for each predictor feature
baked_pumpkins_long %>%
mutate(color = factor(color)) %>%
ggplot(mapping = aes(x = color, y = values, fill = features)) +
geom_boxplot() +
facet_wrap(~ features, scales = "free", ncol = 3) +
scale_color_viridis_d(option = "cividis", end = .8) +
theme(legend.position = "none")
```
Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.
#### **Use a swarm plot**
Color is a binary category (Orange or Not); this is called `categorical data`. There are various other ways of [visualizing categorical data](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar).
Try a `swarm plot` to show the distribution of color with respect to the item_size.
We'll use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) which provides methods to create beeswarm-style plots using ggplot2. Beeswarm plots are a way of plotting points that would ordinarily overlap so that they fall next to each other instead.
```{r bee_swarm plot}
# Create beeswarm plots of color and item_size
baked_pumpkins %>%
mutate(color = factor(color)) %>%
ggplot(mapping = aes(x = color, y = item_size, color = color)) +
geom_quasirandom() +
scale_color_brewer(palette = "Dark2", direction = -1) +
theme(legend.position = "none")
```
#### **Violin plot**
A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. [`Violin plots`](https://en.wikipedia.org/wiki/Violin_plot) are similar to box plots, except that they also show the probability density of the data at different values. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'.
```{r violin_plot}
# Create a violin plot of color and item_size
baked_pumpkins %>%
mutate(color = factor(color)) %>%
ggplot(mapping = aes(x = color, y = item_size, fill = color)) +
geom_violin() +
geom_boxplot(color = "black", fill = "white", width = 0.02) +
scale_fill_brewer(palette = "Dark2", direction = -1) +
theme(legend.position = "none")
```
Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
## 3. Build your model
> **🧮 Show Me The Math**
>
> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
>
> ![](../images/sigmoid.png)
>
> where the sigmoid's midpoint is at x = 0, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.
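>
> In symbols (assuming the figure above shows the logistic curve just described), that is $f(x) = \frac{L}{1 + e^{-kx}}$; the plain sigmoid used in logistic regression takes $L = 1$ and $k = 1$, giving $\sigma(x) = \frac{1}{1 + e^{-x}}$.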
Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.
It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
```{r split_data}
# Split data into 80% for training and 20% for testing
set.seed(2056)
pumpkins_split <- pumpkins_select %>%
initial_split(prop = 0.8)
# Extract the data in each split
pumpkins_train <- training(pumpkins_split)
pumpkins_test <- testing(pumpkins_split)
# Print out the first 5 rows of the training set
pumpkins_train %>%
slice_head(n = 5)
```
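Because orange pumpkins heavily outnumber white ones, you could also stratify the split on the outcome so that both sets keep similar ORANGE/WHITE proportions. Here's a sketch of that variant (we'll keep the simple random split above):
```{r stratified_split, eval=FALSE}
# Hypothetical variant: stratify on color so the ORANGE/WHITE proportions
# are preserved in both the training and test sets
pumpkins_split_strat <- pumpkins_select %>%
  initial_split(prop = 0.8, strata = color)
pumpkins_train_strat <- training(pumpkins_split_strat)
pumpkins_test_strat <- testing(pumpkins_split_strat)
```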
🙌 We are now ready to train a model by fitting the training features to the training label (color).
We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling, i.e. encoding categorical variables into a set of integers.
There are quite a number of ways to specify a logistic regression model in Tidymodels. See `?logistic_reg()`. For now, we'll specify a logistic regression model via the default `stats::glm()` engine.
```{r log_reg}
# Create a recipe that specifies preprocessing steps for modelling
pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>%
step_integer(all_predictors(), zero_based = TRUE)
# Create a logistic model specification
log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
```
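As an aside, `?logistic_reg()` lists other engines too. For example, here's a sketch of a lasso-regularized specification via the glmnet engine (assuming the glmnet package is installed; we won't use it further):
```{r log_reg_glmnet, eval=FALSE}
# Hypothetical alternative: lasso-penalized logistic regression via the
# glmnet engine instead of the default stats::glm()
log_reg_lasso <- logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
```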
Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data (prep+bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities.
In Tidymodels, this object is called a [`workflow`](https://workflows.tidymodels.org/) and it conveniently holds your modeling components.
```{r workflow}
# Bundle modelling components in a workflow
log_reg_wf <- workflow() %>%
add_recipe(pumpkins_recipe) %>%
add_model(log_reg)
# Print out the workflow
log_reg_wf
```
After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow will estimate a recipe and preprocess the data before training, so we won't have to manually do that using prep and bake.
```{r train}
# Train the model
wf_fit <- log_reg_wf %>%
fit(data = pumpkins_train)
# Print the trained workflow
wf_fit
```
The model print out shows the coefficients learned during training.
Now that we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the probability is more than 0.5, the predicted class is `ORANGE`; otherwise, it is `WHITE`.
```{r test_pred}
# Make predictions for color and corresponding probabilities
results <- pumpkins_test %>% select(color) %>%
bind_cols(wf_fit %>%
predict(new_data = pumpkins_test)) %>%
bind_cols(wf_fit %>%
predict(new_data = pumpkins_test, type = "prob"))
# Compare predictions
results %>%
slice_head(n = 10)
```
Very nice! This provides some more insights into how logistic regression works.
Comparing each prediction with its corresponding "ground truth" actual value isn't a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics.
One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were correctly classified by the model. In our case, it will show you how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; it also shows you how many were classified into the **wrong** categories.
The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes.
```{r conf_mat}
# Confusion matrix for prediction results
conf_mat(data = results, truth = color, estimate = .pred_class)
```
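If you prefer a picture, the confusion matrix object also has an `autoplot()` method. Here's an optional sketch (the numbers printed above are what we interpret next):
```{r conf_mat_plot, eval=FALSE}
# Optional: visualize the same confusion matrix as a heatmap
conf_mat(data = results, truth = color, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
```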
Let's interpret the confusion matrix. Our model is asked to classify pumpkins between two binary categories, category `orange` and category `not-orange`.
- If your model predicts a pumpkin as orange and it belongs to category 'orange' in reality we call it a `true positive`, shown by the top left number.
- If your model predicts a pumpkin as not orange and it belongs to category 'orange' in reality we call it a `false negative`, shown by the bottom left number.
- If your model predicts a pumpkin as orange and it belongs to category 'not-orange' in reality we call it a `false positive`, shown by the top right number.
- If your model predicts a pumpkin as not orange and it belongs to category 'not-orange' in reality we call it a `true negative`, shown by the bottom right number.
|                      | Truth: ORANGE | Truth: WHITE |
|----------------------|:-------------:|:------------:|
| **Predicted ORANGE** | TP            | FP           |
| **Predicted WHITE**  | FN            | TN           |
As you might have guessed, it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.
The confusion matrix is helpful since it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let's go through some of them:
🎓 Precision: `TP/(TP + FP)` defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value "Positive predictive value")
🎓 Recall: `TP/(TP + FN)` defined as the proportion of positive results out of the number of samples which were actually positive. Also known as `sensitivity`.
🎓 Specificity: `TN/(TN + FP)` defined as the proportion of negative results out of the number of samples which were actually negative.
🎓 Accuracy: `(TP + TN)/(TP + TN + FP + FN)` The proportion of labels predicted correctly for a sample.
🎓 F Measure: The harmonic mean of precision and recall, with best being 1 and worst being 0.
Let's calculate these metrics!
```{r metric_set}
# Combine metric functions and calculate them all at once
eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)
eval_metrics(data = results, truth = color, estimate = .pred_class)
```
#### **Visualize the ROC curve of this model**
For a start, this is not a bad model; its precision, recall, F measure and accuracy are in the 80% range, so you could reasonably use it to predict the color of a pumpkin given a set of variables. However, it also seems that our model was not really able to identify the white pumpkins 🧐. Could you guess why? One reason could be the high prevalence of ORANGE pumpkins in our training set, which makes the model more inclined to predict the majority class.
Let's do one more visualization to see the so-called [`ROC score`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):
```{r roc_curve}
# Make a roc_curve
results %>%
roc_curve(color, .pred_ORANGE) %>%
autoplot()
```
ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. ROC curves typically feature `True Positive Rate`/Sensitivity on the Y axis, and `False Positive Rate`/1-Specificity on the X axis. Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly.
Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the Curve. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
```{r roc_aoc}
# Calculate area under curve
results %>%
roc_auc(color, .pred_ORANGE)
```
The `roc_auc()` result is around `0.67053`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is only *fair*: better than random guessing, but with clear room for improvement.
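As a quick optional sanity check of that ranking interpretation, you can approximate the same number by pairing every truly orange pumpkin with every truly white one and counting how often the orange one receives the higher `.pred_ORANGE` probability (a sketch, assuming `results` is still in memory):
```{r auc_ranking_check, eval=FALSE}
# Hypothetical sanity check: the AUC can be read as the probability that a
# randomly chosen ORANGE pumpkin receives a higher .pred_ORANGE than a
# randomly chosen WHITE pumpkin (ties counted as one half)
orange_probs <- results %>% filter(color == "ORANGE") %>% pull(.pred_ORANGE)
white_probs <- results %>% filter(color == "WHITE") %>% pull(.pred_ORANGE)
expand_grid(orange = orange_probs, white = white_probs) %>%
  summarise(approx_auc = mean((orange > white) + 0.5 * (orange == white)))
```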
In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).
But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!
You R awesome!
![Artwork by \@allison_horst](../images/r_learners_sm.jpeg)