"#Build a regression model: Get started with R and Tidymodels for regression models"
"#Build a regression model: Get started with R and Tidymodels for regression models"
],
"metadata": {
"id": "YJUHCXqK57yz"
@ -29,22 +29,22 @@
{
"cell_type": "markdown",
"source": [
"## Introduction to Regression - Lesson 1\r\n",
"\r\n",
"#### Putting it into perspective\r\n",
"\r\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\r\n",
"\r\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\r\n",
"\r\n",
"That said, let's get started on this task!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/encouRage.jpg\"\r\n",
" width=\"630\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst-->"
"## Introduction to Regression - Lesson 1\n",
"\n",
"#### Putting it into perspective\n",
"\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\n",
"\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\n",
"\n",
"That said, let's get started on this task!\n",
"\n",
"<p >\n",
" <img src=\"../../images/encouRage.jpg\"\n",
" width=\"630\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../images/encouRage.jpg)<br>Artwork by @allison_horst-->"
"# Build a regression model: prepare and visualize data\r\n",
"\r\n",
"## **Linear Regression for Pumpkins - Lesson 2**\r\n",
"#### Introduction\r\n",
"\r\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\r\n",
"\r\n",
"In this lesson, you will learn:\r\n",
"\r\n",
"- How to prepare your data for model-building.\r\n",
"\r\n",
"- How to use `ggplot2` for data visualization.\r\n",
"\r\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\r\n",
"\r\n",
"Let's see this by working through a practical exercise.\r\n",
"\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/unruly_data.jpg\"\r\n",
" width=\"700\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/unruly_data.jpg)<br>Artwork by \\@allison_horst-->"
"# Build a regression model: prepare and visualize data\n",
"\n",
"## **Linear Regression for Pumpkins - Lesson 2**\n",
"#### Introduction\n",
"\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- How to prepare your data for model-building.\n",
"\n",
"- How to use `ggplot2` for data visualization.\n",
"\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\n",
"\n",
"Let's see this by working through a practical exercise.\n",
"\n",
"\n",
"<p >\n",
" <img src=\"../../images/unruly_data.jpg\"\n",
" width=\"700\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../images/unruly_data.jpg)<br>Artwork by \\@allison_horst-->"
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](../images/data-visualization.png){width=\"600\"}-->\r\n",
"\r\n",
"There is a *wise* saying that goes like this:\r\n",
"\r\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\r\n",
"\r\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\r\n",
"\r\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\r\n",
"\r\n",
"R offers a number of several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\r\n",
"\r\n",
"Let's start with a simple scatter plot for the Price and Month columns.\r\n",
"\r\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)) then add a layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\r\n"
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/data-visualization.png){width=\"600\"}-->\n",
"\n",
"There is a *wise* saying that goes like this:\n",
"\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n",
"\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\n",
"\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\n",
"\n",
"R offers a number of several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\n",
"\n",
"Let's start with a simple scatter plot for the Price and Month columns.\n",
"\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)) then add a layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\n"
@ -26,7 +26,7 @@ The question you need answered will determine what type of ML algorithms you wil
Let's see this by working through a practical exercise.
![Artwork by \@allison_horst](../images/unruly_data.jpg){width="700"}
![Artwork by \@allison_horst](../../images/unruly_data.jpg){width="700"}
## 1. Importing pumpkins data and summoning the Tidyverse
@ -113,7 +113,7 @@ Much better! There is missing data, but maybe it won't matter for the task at ha
## 3. Dplyr: A Grammar of Data Manipulation
![Artwork by \@allison_horst](../images/dplyr_wrangling.png){width="569"}
![Artwork by \@allison_horst](../../images/dplyr_wrangling.png){width="569"}
[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!
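As a quick taste of how these verbs chain together, here is a minimal sketch on a made-up tibble. The `toy_pumpkins` data and its column names are assumptions for illustration only; they are not the lesson's pumpkins dataset.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the lesson's pumpkins dataset
toy_pumpkins <- tibble(
  variety    = c("PIE TYPE", "MINIATURE", "PIE TYPE"),
  low_price  = c(15, 18, 17),
  high_price = c(17, 20, 19)
)

# Chain verbs with the pipe: keep some rows, add a column, keep some columns
pie_prices <- toy_pumpkins %>%
  filter(variety == "PIE TYPE") %>%
  mutate(price = (low_price + high_price) / 2) %>%
  select(variety, price)

pie_prices
```

Each verb takes a data frame and returns a data frame, which is what makes the pipe (`%>%`) read like a sentence: filter, then mutate, then select.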
@ -270,7 +270,7 @@ Good job!👌 You now have a clean, tidy dataset on which you can build your new
## 4. Data visualization with ggplot2
![Infographic by Dasani Madipalli](../images/data-visualization.png){width="600"}
![Infographic by Dasani Madipalli](../../images/data-visualization.png){width="600"}
There is a *wise* saying that goes like this:
@ -342,4 +342,4 @@ new_pumpkins %>%
🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
Congratulations on finishing the second lesson 👏! You did prepared your data for model building, then uncovered more insights using visualizations!\
Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/linear-polynomial.png){width=\"800\"}-->\r\n",
"\r\n",
"#### Introduction\r\n",
"\r\n",
"So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using `ggplot2`.💪\r\n",
"\r\n",
"Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: *basic linear regression* and *polynomial regression*, along with some of the math underlying these techniques.\r\n",
"\r\n",
"> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.\r\n",
"\r\n",
"#### Preparation\r\n",
"\r\n",
"As a reminder, you are loading this data so as to ask questions of it.\r\n",
"\r\n",
"- When is the best time to buy pumpkins?\r\n",
"\r\n",
"- What price can I expect of a case of miniature pumpkins?\r\n",
"\r\n",
"- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box? Let's keep digging into this data.\r\n",
"\r\n",
"In the previous lesson, you created a `tibble` (a modern reimagining of the data frame) and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points and only for the fall months. Maybe we can get a little more detail about the nature of the data by cleaning it more? We'll see... 🕵️♀️\r\n",
"\r\n",
"For this task, we'll require the following packages:\r\n",
"\r\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\r\n",
"\r\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\r\n",
"\r\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\r\n",
"\r\n",
"- `corrplot`: The [corrplot package](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width=\"800\"}-->\n",
"\n",
"#### Introduction\n",
"\n",
"So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using `ggplot2`.💪\n",
"\n",
"Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: *basic linear regression* and *polynomial regression*, along with some of the math underlying these techniques.\n",
"\n",
"> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.\n",
"\n",
"#### Preparation\n",
"\n",
"As a reminder, you are loading this data so as to ask questions of it.\n",
"\n",
"- When is the best time to buy pumpkins?\n",
"\n",
"- What price can I expect of a case of miniature pumpkins?\n",
"\n",
"- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box? Let's keep digging into this data.\n",
"\n",
"In the previous lesson, you created a `tibble` (a modern reimagining of the data frame) and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points and only for the fall months. Maybe we can get a little more detail about the nature of the data by cleaning it more? We'll see... 🕵️♀️\n",
"\n",
"For this task, we'll require the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n",
"\n",
"- `corrplot`: The [corrplot package](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.\n",
"Awesome! 👏 We just created our first recipe that specifies an outcome (price) and its corresponding predictors and that all the predictor columns should be encoded into a set of integers 🙌! Let's quickly break it down:\r\n",
"\r\n",
"- The call to `recipe()` with a formula tells the recipe the *roles* of the variables using `new_pumpkins` data as the reference. For instance the `price` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.\r\n",
"\r\n",
"- `step_integer(all_predictors(), zero_based = TRUE)` specifies that all the predictors should be converted into a set of integers with the numbering starting at 0.\r\n",
"\r\n",
"We are sure you may be having thoughts such as: \"This is so cool!! But what if I needed to confirm that the recipes are doing exactly what I expect them to do? 🤔\"\r\n",
"\r\n",
"That's an awesome thought! You see, once your recipe is defined, you can estimate the parameters required to actually preprocess the data, and then extract the processed data. You don't typically need to do this when you use Tidymodels (we'll see the normal convention in just a minute-\\> `workflows`) but it can come in handy when you want to do some kind of sanity check for confirming that recipes are doing what you expect.\r\n",
"\r\n",
"For that, you'll need two more verbs: `prep()` and `bake()` and as always, our little R friends by [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) help you in understanding this better!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/recipes.png\"\r\n",
" width=\"550\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](images/recipes.png){width=\"550\"}-->"
"Awesome! 👏 We just created our first recipe that specifies an outcome (price) and its corresponding predictors and that all the predictor columns should be encoded into a set of integers 🙌! Let's quickly break it down:\n",
"\n",
"- The call to `recipe()` with a formula tells the recipe the *roles* of the variables using `new_pumpkins` data as the reference. For instance the `price` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.\n",
"\n",
"- `step_integer(all_predictors(), zero_based = TRUE)` specifies that all the predictors should be converted into a set of integers with the numbering starting at 0.\n",
"\n",
"We are sure you may be having thoughts such as: \"This is so cool!! But what if I needed to confirm that the recipes are doing exactly what I expect them to do? 🤔\"\n",
"\n",
"That's an awesome thought! You see, once your recipe is defined, you can estimate the parameters required to actually preprocess the data, and then extract the processed data. You don't typically need to do this when you use Tidymodels (we'll see the normal convention in just a minute-\\> `workflows`) but it can come in handy when you want to do some kind of sanity check for confirming that recipes are doing what you expect.\n",
"\n",
"For that, you'll need two more verbs: `prep()` and `bake()` and as always, our little R friends by [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) help you in understanding this better!\n",
"\n",
"<p >\n",
" <img src=\"../../images/recipes.png\"\n",
" width=\"550\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../images/recipes.png){width=\"550\"}-->"
],
"metadata": {
"id": "KEiO0v7kuC9O"
@ -478,14 +478,14 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Prep the recipe\r\n",
"pumpkins_prep <- prep(pumpkins_recipe)\r\n",
"\r\n",
"# Bake the recipe to extract a preprocessed new_pumpkins data\r\n",
"A good question to now ask of this data will be: '`What price can I expect of a given pumpkin package?`' Let's get right into it!\r\n",
"\r\n",
 Note: When you">
"> Note: When you **`bake()`** the prepped recipe **`pumpkins_prep`** with **`new_data = NULL`**, you extract the processed (i.e. encoded) training data. If you had another data set, for example a test set, and wanted to see how the recipe would pre-process it, you would simply bake **`pumpkins_prep`** with **`new_data = test_set`**.\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/linear-polynomial.png){width=\"800\"}-->"
"🤩🤩 Much better.\n",
"\n",
"A good question to now ask of this data will be: '`What price can I expect of a given pumpkin package?`' Let's get right into it!\n",
"\n",
 Note: When you">
"> Note: When you **`bake()`** the prepped recipe **`pumpkins_prep`** with **`new_data = NULL`**, you extract the processed (i.e. encoded) training data. If you had another data set, for example a test set, and wanted to see how the recipe would pre-process it, you would simply bake **`pumpkins_prep`** with **`new_data = test_set`**.\n",
"Great! As you can see, the linear regression model does not really well generalize the relationship between a package and its corresponding price.\r\n",
"\r\n",
"🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/linear-polynomial.png){width=\"800\"}-->"
"Great! As you can see, the linear regression model does not really well generalize the relationship between a package and its corresponding price.\n",
"\n",
"🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!\n",
## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width="800"}
#### Introduction
@ -78,13 +78,13 @@ We do so since we want to model a line that has the least cumulative distance fr
>
> `X` is the '`explanatory variable` or `predictor`'. `Y` is the '`dependent variable` or `outcome`'. The slope of the line is `b` and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.
>
> ![Infographic by Jen Looper](../images/slope.png){width="400"}
> ![Infographic by Jen Looper](../../images/slope.png){width="400"}
>
> First, calculate the slope `b`.
>
> In other words, and referring to our pumpkin data's original question: "predict the price of a pumpkin per bushel by month", `X` would refer to the price and `Y` would refer to the month of sale.
>
> ![Infographic by Jen Looper](../images/calculation.png)
> ![Infographic by Jen Looper](../../images/calculation.png)
>
> Calculate the value of Y. If you're paying around \$4, it must be April!
>
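To make the line equation `Y = a + bX` concrete, here is a minimal base-R sketch that computes the least-squares slope `b` and intercept `a` by hand and checks them against `lm()`. The `x` and `y` vectors are made-up numbers, assumed purely for illustration.

```r
# Hypothetical toy data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope: covariance of X and Y divided by the variance of X
b <- cov(x, y) / var(x)

# Intercept: the fitted line passes through the point (mean(x), mean(y))
a <- mean(y) - b * mean(x)

# lm() fits the same line; its coefficients are (intercept, slope)
fit <- lm(y ~ x)
same_line <- isTRUE(all.equal(unname(coef(fit)), c(a, b)))
same_line
```

The hand-computed `a` and `b` agree with what `lm()` reports, which is the same least-squares fit the lesson describes.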
@ -102,7 +102,7 @@ A good linear regression model will be one that has a high (nearer to 1 than 0)
## **2. A dance with data: creating a data frame that will be used for modelling**
![Artwork by \@allison_horst](../images/janitor.jpg){width="700"}
![Artwork by \@allison_horst](../../images/janitor.jpg){width="700"}
Load up required libraries and dataset. Convert the data to a data frame containing a subset of the data:
@ -346,7 +346,7 @@ A good question to now ask of this data will be: '`What price can I expect of a
## 4. Build a linear regression model
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width="800"}
Now that we have built a recipe, and actually confirmed that the data will be pre-processed appropriately, let's now build a regression model to answer the question: `What price can I expect of a given pumpkin package?`
@ -498,7 +498,7 @@ Great! As you can see, the linear regression model does not really well generali
## 5. Build a polynomial regression model
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width="800"}
Sometimes our data may not have a linear relationship, but we still want to predict an outcome. Polynomial regression can help us make predictions for more complex non-linear relationships.
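As a rough illustration of why a polynomial term can help, the base-R sketch below fits a straight line and a degree-2 polynomial to simulated curved data and compares their R-squared values. The data are simulated and the variable names are assumptions; this is not the lesson's Tidymodels workflow.

```r
# Simulated, hypothetical data with a curved (quadratic) relationship
set.seed(1)
x <- seq(-3, 3, length.out = 50)
y <- 1 + 2 * x^2 + rnorm(50, sd = 0.5)

# A straight line cannot follow the curve
linear_fit <- lm(y ~ x)

# poly() adds polynomial terms of x up to the given degree
poly_fit <- lm(y ~ poly(x, degree = 2))

# Compare how much of the variance in y each model explains
linear_r2 <- summary(linear_fit)$r.squared
poly_r2 <- summary(poly_fit)$r.squared
c(linear = linear_r2, polynomial = poly_r2)
```

On data like these, the polynomial fit explains far more of the variance, because the straight line has no way to bend with the curve.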
"For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.\r\n",
"\r\n",
"> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!\r\n",
"\r\n",
"## **About logistic regression**\r\n",
"\r\n",
"Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\r\n",
"\r\n",
"#### **Binary classification**\r\n",
"\r\n",
"Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"orange or not orange\") whereas the latter is capable of predicting `continual values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/pumpkin-classifier.png){width=\"600\"}-->"
"## ** Define the question**\n",
"\n",
"For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.\n",
"\n",
"> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!\n",
"\n",
"## **About logistic regression**\n",
"\n",
"Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\n",
"\n",
"#### **Binary classification**\n",
"\n",
"Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"orange or not orange\") whereas the latter is capable of predicting `continual values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png){width=\"600\"}-->"
],
"metadata": {
"id": "ws-hP_SXk2O6"
@ -129,20 +129,20 @@
{
"cell_type": "markdown",
"source": [
"#### **Other classifications**\r\n",
"\r\n",
"There are other types of logistic regression, including multinomial and ordinal:\r\n",
"\r\n",
"- **Multinomial**, which involves having more than one category - \"Orange, White, and Striped\".\r\n",
"\r\n",
"- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/multinomial-ordinal.png){width=\"600\"}-->"
"#### **Other classifications**\n",
"\n",
"There are other types of logistic regression, including multinomial and ordinal:\n",
"\n",
"- **Multinomial**, which involves having more than one category - \"Orange, White, and Striped\".\n",
"\n",
"- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).\n",
" scale_fill_brewer(palette = \"Dark2\", direction = -1) +\n",
" theme(legend.position = \"none\")"
],
"outputs": [],
@ -417,28 +417,28 @@
{
"cell_type": "markdown",
"source": [
"Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"> **🧮 Show Me The Math**\r\n",
">\r\n",
"> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:\r\n",
">\r\n",
"> \r\n",
"<p >\r\n",
" <img src=\"../images/sigmoid.png\">\r\n",
"\r\n",
"\r\n",
"> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.\r\n",
"\r\n",
"Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.\r\n",
"\r\n",
"Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"> **🧮 Show Me The Math**\n",
">\n",
"> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:\n",
">\n",
"> \n",
"<p >\n",
" <img src=\"../../images/sigmoid.png\">\n",
"\n",
"\n",
"> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.\n",
"\n",
"Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.\n",
"\n",
"It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:"
],
"metadata": {
@ -449,17 +449,17 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Split data into 80% for training and 20% for testing\r\n",
"set.seed(2056)\r\n",
"pumpkins_split <- pumpkins_select %>% \r\n",
" initial_split(prop = 0.8)\r\n",
"\r\n",
"# Extract the data in each split\r\n",
"pumpkins_train <- training(pumpkins_split)\r\n",
"pumpkins_test <- testing(pumpkins_split)\r\n",
"\r\n",
"# Print out the first 5 rows of the training set\r\n",
"pumpkins_train %>% \r\n",
"# Split data into 80% for training and 20% for testing\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@ -484,15 +484,15 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\r\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \r\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)"
],
"outputs": [],
@ -691,9 +691,9 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a roc_curve\r\n",
"results %>% \r\n",
" roc_curve(color, .pred_ORANGE) %>% \r\n",
"# Make a roc_curve\n",
"results %>% \n",
" roc_curve(color, .pred_ORANGE) %>% \n",
" autoplot()"
],
"outputs": [],
@ -716,8 +716,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Calculate area under curve\r\n",
"results %>% \r\n",
"# Calculate area under curve\n",
"results %>% \n",
" roc_auc(color, .pred_ORANGE)"
],
"outputs": [],
@ -728,20 +728,20 @@
{
"cell_type": "markdown",
"source": [
"The result is around `0.67053`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\r\n",
"\r\n",
"In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\r\n",
"\r\n",
"But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!\r\n",
"\r\n",
"You R awesome!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/r_learners_sm.jpeg\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"<!--![Artwork by @allison_horst](images/r_learners_sm.jpeg)-->\r\n"
"The result is around `0.67053`. The AUC ranges from 0 to 1: a model that is 100% correct in its predictions has an AUC of 1, while random guessing scores about 0.5. At roughly 0.67, this model does better than chance but leaves plenty of room for improvement.\n",
"\n",
"In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\n",
"\n",
"But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!\n",
@ -70,7 +70,7 @@ Logistic regression differs from linear regression, which you learned about prev
Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` ("orange or not orange") whereas the latter is capable of predicting `continuous values`, for example, given the origin of a pumpkin and the time of harvest, *how much its price will rise*.
![Infographic by Dasani Madipalli](../images/pumpkin-classifier.png){width="600"}
![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png){width="600"}
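
To make the contrast concrete, here is a minimal base R sketch. It is an illustration only: it assumes R's built-in `mtcars` data rather than the pumpkin data, and the object names (`linear_fit`, `logistic_fit`, and so on) are made up for this example.

```r
# Illustrative only: uses the built-in mtcars data, not the pumpkin data

# Linear regression: predict a continuous value (miles per gallon)
linear_fit <- lm(mpg ~ wt, data = mtcars)
linear_pred <- predict(linear_fit, newdata = data.frame(wt = 3))

# Logistic regression: predict a binary category
# (am: 0 = automatic, 1 = manual transmission)
logistic_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# type = "response" returns a probability between 0 and 1
logistic_prob <- predict(logistic_fit,
                         newdata = data.frame(wt = 3),
                         type = "response")

# Convert the probability into a class label using a 0.5 threshold
logistic_class <- ifelse(logistic_prob > 0.5, 1, 0)

linear_pred
logistic_prob
logistic_class
```

The linear model returns a number on the scale of the outcome, while the logistic model returns a probability that still has to be turned into one of the two categories.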
#### **Other classifications**
@ -80,7 +80,7 @@ There are other types of logistic regression, including multinomial and ordinal:
- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).
![Infographic by Dasani Madipalli](../images/multinomial-ordinal.png){width="600"}
![Infographic by Dasani Madipalli](../../images/multinomial-ordinal.png){width="600"}
\
**It's still linear**
@ -244,7 +244,7 @@ Now that we have an idea of the relationship between the binary categories of co
>
> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
>
> ![](../images/sigmoid.png)
> ![](../../images/sigmoid.png)
>
> where the sigmoid's midpoint sits at x = 0, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.
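
The description above can be sketched in a few lines of base R. This assumes the pictured formula is the general logistic function, `L / (1 + exp(-k * (x - x0)))`; the function name `sigmoid` and the argument `x0` (the midpoint) are chosen here for illustration.

```r
# General logistic (sigmoid) function, assuming the form L / (1 + exp(-k * (x - x0)))
#   L  : the curve's maximum value
#   k  : the curve's steepness
#   x0 : the x value of the curve's midpoint
sigmoid <- function(x, L = 1, k = 1, x0 = 0) {
  L / (1 + exp(-k * (x - x0)))
}

# Evaluate the curve at a few points
x <- c(-4, 0, 4)
p <- sigmoid(x)

# Assign class 1 when the output is more than 0.5, otherwise class 0
labels <- ifelse(p > 0.5, 1, 0)

data.frame(x = x, p = p, label = labels)
```

With the defaults, the midpoint `x = 0` maps to exactly 0.5, so it falls into class 0 under the "more than 0.5" rule.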
@ -425,6 +425,6 @@ But for now, congratulations 🎉🎉🎉! You've completed these regression les
You R awesome!
![Artwork by \@allison_horst](../images/r_learners_sm.jpeg)
![Artwork by \@allison_horst](../../images/r_learners_sm.jpeg)
"# Build a classification model: Delicious Asian and Indian Cuisines"
]
],
"metadata": {
"id": "ItETB4tSFprR"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "ri5bQxZ-Fz_0"
},
"source": [
"## Introduction to classification: Clean, prep, and visualize your data\n",
"\n",
"In these four lessons, you will explore a fundamental focus of classic machine learning - *classification*. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!\n",
"\n",
"<p >\n",
" <img src=\"../images/pinch.png\"\n",
" <img src=\"../../images/pinch.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper</figcaption>\n",
"\n",
@ -62,7 +59,7 @@
"To state the process in a more scientific way, your classification method creates a predictive model that enables you to map the relationship between input variables to output variables.\n",
"Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing."
"We'll later load these awesome packages and make them available in our current R session. (This is for mere illustration, `pacman::p_load()` already did that for you)"
]
],
"metadata": {
"id": "YkKAxOJvGD4C"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "PFkQDlk0GN5O"
},
"source": [
"## Exercise - clean and balance your data\n",
"\n",
"The first task at hand, before starting this project, is to clean and **balance** your data to get better results\n",
"Interesting! From the looks of it, the first column is a kind of `id` column. Let's get a little more information about the data."
]
],
"metadata": {
"id": "XrWnlgSrGVmR"
}
},
{
"cell_type": "code",
"metadata": {
"id": "4UcGmxRxGieA"
},
"execution_count": null,
"source": [
"# Basic information about the data\n",
"df %>%\n",
@ -173,27 +171,27 @@
"df %>% \n",
" plot_intro(ggtheme = theme_light())"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "4UcGmxRxGieA"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "AaPubl__GmH5"
},
"source": [
"From the output, we can immediately see that we have `2448` rows and `385` columns and `0` missing values. We also have 1 discrete column, *cuisine*.\n",
"\n",
"## Exercise - learning about cuisines\n",
"\n",
"Now the work starts to become more interesting. Let's discover the distribution of data, per cuisine."
"There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.\n",
"\n",
"Next, let's assign each cuisine into its individual table and find out how much data is available (rows, columns) per cuisine.\n",
"\n",
"<p >\n",
" <img src=\"../images/dplyr_filter.jpg\"\n",
" <img src=\"../../images/dplyr_filter.jpg\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n"
]
],
"metadata": {
"id": "vVvyDb1kG2in"
}
},
{
"cell_type": "code",
"metadata": {
"id": "0TvXUxD3G8Bk"
},
"execution_count": null,
"source": [
"# Create individual tables for the cuisines\n",
"thai_df <- df %>% \n",
@ -254,14 +252,13 @@
" \"indian_df:\", dim(indian_df), \"\\n\",\n",
" \"korean_df:\", dim(korean_df))"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "0TvXUxD3G8Bk"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "K3RF5bSCHC76"
},
"source": [
"Perfect!😋\n",
"\n",
@ -297,13 +294,14 @@
"- `dplyr::mutate()`: helps you to create or modify columns.\n",
"\n",
"Check out this [*art*-filled learnr tutorial](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) by Allison Horst, that introduces some useful data wrangling functions in dplyr *(part of the Tidyverse)*"
]
],
"metadata": {
"id": "K3RF5bSCHC76"
}
},
{
"cell_type": "code",
"metadata": {
"id": "uB_0JR82HTPa"
},
"execution_count": null,
"source": [
"# Create a function that returns the top ingredients by class\n",
"\n",
@ -325,23 +323,23 @@
" return(ingredient_df)\n",
"} # End of function"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "uB_0JR82HTPa"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "h9794WF8HWmc"
},
"source": [
"Now we can use the function to get an idea of the ten most popular ingredients by cuisine. Let's take it out for a spin with `thai_df`.\n"
]
],
"metadata": {
"id": "h9794WF8HWmc"
}
},
{
"cell_type": "code",
"metadata": {
"id": "agQ-1HrcHaEA"
},
"execution_count": null,
"source": [
"# Call create_ingredient and display popular ingredients\n",
"From the data visualizations, we can now drop the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.\n",
"\n",
"Everyone loves rice, garlic and ginger!"
]
],
"metadata": {
"id": "iO4veMXuIEta"
}
},
{
"cell_type": "code",
"metadata": {
"id": "iHJPiG6rIUcK"
},
"execution_count": null,
"source": [
"# Drop id column, rice, garlic and ginger from our original data set\n",
"df_select <- df %>% \n",
@ -502,41 +500,41 @@
"df_select %>% \n",
" slice_head(n = 5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "iHJPiG6rIUcK"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "kkFd-JxdIaL6"
},
"source": [
"## Preprocessing data using recipes 👩🍳👨🍳 - Dealing with imbalanced data ⚖️\n",
"\n",
"<p >\n",
" <img src=\"../images/recipes.png\"\n",
" <img src=\"../../images/recipes.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"Given that this lesson is about cuisines, we have to put `recipes` into context.\n",
"\n",
"Tidymodels provides yet another neat package: `recipes`, a package for preprocessing data.\n"
]
],
"metadata": {
"id": "kkFd-JxdIaL6"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "6l2ubtTPJAhY"
},
"source": [
"Let's take a look at the distribution of our cuisines again.\n"
]
],
"metadata": {
"id": "6l2ubtTPJAhY"
}
},
{
"cell_type": "code",
"metadata": {
"id": "1e-E9cb7JDVi"
},
"execution_count": null,
"source": [
"# Distribution of cuisines\n",
"old_label_count <- df_select %>% \n",
@ -545,14 +543,13 @@
"\n",
"old_label_count"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "1e-E9cb7JDVi"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "soAw6826JKx9"
},
"source": [
"\n",
"As you can see, the cuisines are quite unevenly distributed: there are almost three times as many Korean observations as Thai ones. Imbalanced data often has negative effects on model performance. Think about a binary classification. If most of your data belongs to one class, an ML model is going to predict that class more frequently, just because there is more data for it. Balancing the data helps remove this skew. Many models perform best when the number of observations per class is equal and, thus, tend to struggle with unbalanced data.\n",
@ -564,13 +561,14 @@
"- removing observations from majority class: `Under-sampling`\n",
"\n",
"Let's now demonstrate how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set in order to get it ready for data analysis."
]
],
"metadata": {
"id": "soAw6826JKx9"
}
},
{
"cell_type": "code",
"metadata": {
"id": "HS41brUIJVJy"
},
"execution_count": null,
"source": [
"# Load themis package for dealing with imbalanced data\n",
"library(themis)\n",
@ -581,14 +579,13 @@
"\n",
"cuisines_recipe"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "HS41brUIJVJy"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "Yb-7t7XcJaC8"
},
"source": [
"Let's break down our preprocessing steps.\n",
"\n",
@ -601,13 +598,14 @@
"`prep()`: estimates the required parameters from a training set that can be later applied to other data sets.\n",
"\n",
"`bake()`: takes a prepped recipe and applies the operations to any data set.\n"
]
],
"metadata": {
"id": "Yb-7t7XcJaC8"
}
},
{
"cell_type": "code",
"metadata": {
"id": "9QhSgdpxJl44"
},
"execution_count": null,
"source": [
"# Prep and bake the recipe\n",
"preprocessed_df <- cuisines_recipe %>% \n",
@ -623,23 +621,23 @@
"preprocessed_df %>% \n",
" introduce()"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "9QhSgdpxJl44"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "dmidELh_LdV7"
},
"source": [
"Let's now check the distribution of our cuisines and compare them with the imbalanced data."
]
],
"metadata": {
"id": "dmidELh_LdV7"
}
},
{
"cell_type": "code",
"metadata": {
"id": "aSh23klBLwDz"
},
"execution_count": null,
"source": [
"# Distribution of cuisines\n",
"new_label_count <- preprocessed_df %>% \n",
@ -649,14 +647,13 @@
"list(new_label_count = new_label_count,\n",
" old_label_count = old_label_count)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "aSh23klBLwDz"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "HEu80HZ8L7ae"
},
"source": [
"Yum! The data is nice and clean, balanced, and very delicious 😋!\n",
"\n",
@ -667,25 +664,25 @@
"> When you **`bake()`** a prepped recipe with **`new_data = NULL`**, you get the data that you provided when defining the recipe back, but having undergone the preprocessing steps.\n",
"\n",
"Let's now save a copy of this data for use in future lessons:\n"
"This fresh CSV can now be found in the root data folder.\n",
"\n",
@ -710,10 +707,13 @@
"[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
In these four lessons, you will explore a fundamental focus of classic machine learning - *classification*. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!
![Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper](../images/pinch.png)
![Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper](../../images/pinch.png)
Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. In classification, you train a model to predict which `category` an item belongs to. If machine learning is all about assigning values or names to things by using datasets, then classification generally falls into two groups: *binary classification* and *multiclass classification*.
@ -417,4 +417,4 @@ This curriculum contains several interesting datasets. Dig through the `data` fo
[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️
![Artwork by \@allison_horst](../images/r_learners_sm.jpeg)
![Artwork by \@allison_horst](../../images/r_learners_sm.jpeg)