editing image paths for R lessons, moving R and linking

pull/315/head
Jen Looper 3 years ago
parent 591a25f436
commit f415ae81ba

@@ -6,7 +6,7 @@
## [Pre-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/9/)
> ### [This lesson is available in R!](./solution/lesson_1-R.ipynb)
> ### [This lesson is available in R!](./solution/R/lesson_1-R.ipynb)
## Introduction

@@ -0,0 +1 @@
This is a temporary placeholder

@@ -29,22 +29,22 @@
{
"cell_type": "markdown",
"source": [
"## Introduction to Regression - Lesson 1\r\n",
"\r\n",
"#### Putting it into perspective\r\n",
"\r\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\r\n",
"\r\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\r\n",
"\r\n",
"That said, let's get started on this task!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/encouRage.jpg\"\r\n",
" width=\"630\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst-->"
"## Introduction to Regression - Lesson 1\n",
"\n",
"#### Putting it into perspective\n",
"\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\n",
"\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\n",
"\n",
"That said, let's get started on this task!\n",
"\n",
"<p >\n",
" <img src=\"../../images/encouRage.jpg\"\n",
" width=\"630\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../images/encouRage.jpg)<br>Artwork by @allison_horst-->"
],
"metadata": {
"id": "LWNNzfqd6feZ"
@@ -75,7 +75,7 @@
"cell_type": "code",
"execution_count": 2,
"source": [
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
"pacman::p_load(tidyverse, tidymodels)"
],
"outputs": [

@@ -20,7 +20,7 @@ In this section, you will work with a [small dataset about diabetes](https://www
That said, let's get started on this task!
![Artwork by \@allison_horst](../images/encouRage.jpg){width="630"}
![Artwork by \@allison_horst](../../images/encouRage.jpg){width="630"}
## 1. Loading up our tool set

@@ -0,0 +1 @@
This is a temporary placeholder

@@ -20,31 +20,31 @@
{
"cell_type": "markdown",
"source": [
"# Build a regression model: prepare and visualize data\r\n",
"\r\n",
"## **Linear Regression for Pumpkins - Lesson 2**\r\n",
"#### Introduction\r\n",
"\r\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\r\n",
"\r\n",
"In this lesson, you will learn:\r\n",
"\r\n",
"- How to prepare your data for model-building.\r\n",
"\r\n",
"- How to use `ggplot2` for data visualization.\r\n",
"\r\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\r\n",
"\r\n",
"Let's see this by working through a practical exercise.\r\n",
"\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/unruly_data.jpg\"\r\n",
" width=\"700\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/unruly_data.jpg)<br>Artwork by \\@allison_horst-->"
"# Build a regression model: prepare and visualize data\n",
"\n",
"## **Linear Regression for Pumpkins - Lesson 2**\n",
"#### Introduction\n",
"\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potential of your dataset.\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- How to prepare your data for model-building.\n",
"\n",
"- How to use `ggplot2` for data visualization.\n",
"\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\n",
"\n",
"Let's see this by working through a practical exercise.\n",
"\n",
"\n",
"<p >\n",
" <img src=\"../../images/unruly_data.jpg\"\n",
" width=\"700\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../images/unruly_data.jpg)<br>Artwork by \\@allison_horst-->"
],
"metadata": {
"id": "Pg5aexcOPqAZ"
@@ -73,7 +73,7 @@
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
"pacman::p_load(tidyverse)"
],
"outputs": [],
@@ -94,19 +94,19 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core Tidyverse packages\r\n",
"library(tidyverse)\r\n",
"\r\n",
"# Import the pumpkins data\r\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\r\n",
"\r\n",
"\r\n",
"# Get a glimpse and dimensions of the data\r\n",
"glimpse(pumpkins)\r\n",
"\r\n",
"\r\n",
"# Print the first 50 rows of the data set\r\n",
"pumpkins %>% \r\n",
"# Load the core Tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the pumpkins data\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(pumpkins)\n",
"\n",
"\n",
"# Print the first 50 rows of the data set\n",
"pumpkins %>% \n",
" slice_head(n =50)"
],
"outputs": [],
@@ -148,7 +148,7 @@
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \r\n",
"pumpkins %>% \n",
" anyNA()"
],
"outputs": [],
@@ -171,8 +171,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \r\n",
" is.na() %>% \r\n",
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"outputs": [],
@@ -195,8 +195,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \r\n",
" is.na() %>% \r\n",
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"outputs": [],
@@ -218,16 +218,16 @@
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\r\n",
"\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/dplyr_wrangling.png\"\r\n",
" width=\"569\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/dplyr_wrangling.png)<br/>Artwork by \\@allison_horst-->"
"## 3. Dplyr: A Grammar of Data Manipulation\n",
"\n",
"\n",
"<p >\n",
" <img src=\"../../images/dplyr_wrangling.png\"\n",
" width=\"569\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../images/dplyr_wrangling.png)<br/>Artwork by \\@allison_horst-->"
],
"metadata": {
"id": "o4jLY5-VZO2C"
@@ -262,13 +262,13 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Select desired columns\r\n",
"pumpkins <- pumpkins %>% \r\n",
" select(Package, `Low Price`, `High Price`, Date)\r\n",
"\r\n",
"\r\n",
"# Print data set\r\n",
"pumpkins %>% \r\n",
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(Package, `Low Price`, `High Price`, Date)\n",
"\n",
"\n",
"# Print data set\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@@ -303,19 +303,19 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Load lubridate\r\n",
"library(lubridate)\r\n",
"\r\n",
"pumpkins <- pumpkins %>% \r\n",
" # Convert the Date column to a date object\r\n",
" mutate(Date = mdy(Date)) %>% \r\n",
" # Extract month from Date\r\n",
" mutate(Month = month(Date)) %>% \r\n",
" # Drop Date column\r\n",
" select(-Date)\r\n",
"\r\n",
"# View the first few rows\r\n",
"pumpkins %>% \r\n",
"# Load lubridate\n",
"library(lubridate)\n",
"\n",
"pumpkins <- pumpkins %>% \n",
" # Convert the Date column to a date object\n",
" mutate(Date = mdy(Date)) %>% \n",
" # Extract month from Date\n",
" mutate(Month = month(Date)) %>% \n",
" # Drop Date column\n",
" select(-Date)\n",
"\n",
"# View the first few rows\n",
"pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
@@ -339,12 +339,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a new column Price\r\n",
"pumpkins <- pumpkins %>% \r\n",
" mutate(Price = (`Low Price` + `High Price`)/2)\r\n",
"\r\n",
"# View the first few rows of the data\r\n",
"pumpkins %>% \r\n",
"# Create a new column Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
"\n",
"# View the first few rows of the data\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@@ -371,8 +371,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Verify the distinct observations in Package column\r\n",
"pumpkins %>% \r\n",
"# Verify the distinct observations in Package column\n",
"pumpkins %>% \n",
" distinct(Package)"
],
"outputs": [],
@@ -411,15 +411,15 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Retain only pumpkins with \"bushel\"\r\n",
"new_pumpkins <- pumpkins %>% \r\n",
" filter(str_detect(Package, \"bushel\"))\r\n",
"\r\n",
"# Get the dimensions of the new data\r\n",
"dim(new_pumpkins)\r\n",
"\r\n",
"# View a few rows of the new data\r\n",
"new_pumpkins %>% \r\n",
"# Retain only pumpkins with \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(Package, \"bushel\"))\n",
"\n",
"# Get the dimensions of the new data\n",
"dim(new_pumpkins)\n",
"\n",
"# View a few rows of the new data\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@@ -456,15 +456,15 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Convert the price if the Package contains fractional bushel values\r\n",
"new_pumpkins <- new_pumpkins %>% \r\n",
" mutate(Price = case_when(\r\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\r\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\r\n",
" TRUE ~ Price))\r\n",
"\r\n",
"# View the first few rows of the data\r\n",
"new_pumpkins %>% \r\n",
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
" slice_head(n = 30)"
],
"outputs": [],
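A quick aside on the arithmetic in the hunk above: a price quoted for a 1 1/9-bushel carton is divided by 10/9 to get a per-bushel price, and a price for a half-bushel sack is divided by 1/2 (that is, doubled). A tiny standalone sketch with hypothetical prices (not from the dataset):

```r
# Hypothetical package prices, converted to per-bushel equivalents,
# mirroring the case_when() conversions in the cell above
price_carton <- 15   # price of a 1 1/9 bushel carton
price_sack   <- 15   # price of a 1/2 bushel sack

price_carton / (1 + 1/9)   # 15 / (10/9) = 13.5 per bushel
price_sack / (1/2)         # 15 * 2 = 30 per bushel
```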
@@ -501,11 +501,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a new data frame new_pumpkins\r\n",
"new_pumpkins <- new_pumpkins %>% \r\n",
" relocate(Month, .before = Package)\r\n",
"\r\n",
"new_pumpkins %>% \r\n",
"# Create a new data frame new_pumpkins\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"new_pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
@@ -526,29 +526,29 @@
{
"cell_type": "markdown",
"source": [
"## 4. Data visualization with ggplot2\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/data-visualization.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](../images/data-visualization.png){width=\"600\"}-->\r\n",
"\r\n",
"There is a *wise* saying that goes like this:\r\n",
"\r\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\r\n",
"\r\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\r\n",
"\r\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\r\n",
"\r\n",
"R offers a number of several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\r\n",
"\r\n",
"Let's start with a simple scatter plot for the Price and Month columns.\r\n",
"\r\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)) then add a layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\r\n"
"## 4. Data visualization with ggplot2\n",
"\n",
"<p >\n",
" <img src=\"../../images/data-visualization.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/data-visualization.png){width=\"600\"}-->\n",
"\n",
"There is a *wise* saying that goes like this:\n",
"\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n",
"\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create visualizations such as plots, graphs, and charts that show different aspects of the data. In this way, they can visually reveal relationships and gaps that are otherwise hard to uncover.\n",
"\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\n",
"\n",
"R offers several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\n",
"\n",
"Let's start with a simple scatter plot for the Price and Month columns.\n",
"\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), then add layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\n"
],
"metadata": {
"id": "mYSH6-EtbvNa"
@@ -558,11 +558,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Set a theme for the plots\r\n",
"theme_set(theme_light())\r\n",
"\r\n",
"# Create a scatter plot\r\n",
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\r\n",
"# Set a theme for the plots\n",
"theme_set(theme_light())\n",
"\n",
"# Create a scatter plot\n",
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n",
"p + geom_point()"
],
"outputs": [],
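Because `ggplot2` composes plots from independent components, the data-and-mapping object `p` built in the cell above can take different layers without being redefined. A minimal sketch, assuming the `p` and `new_pumpkins` objects from the cells above:

```r
# Reuse the same mapping with different layers
p + geom_jitter(height = 0.3, alpha = 0.5)   # spread overlapping points

# Or stack additional components on top of the original layer
p + geom_point() +
  labs(title = "Pumpkin price by month")
```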

@@ -26,7 +26,7 @@ The question you need answered will determine what type of ML algorithms you wil
Let's see this by working through a practical exercise.
![Artwork by \@allison_horst](../images/unruly_data.jpg){width="700"}
![Artwork by \@allison_horst](../../images/unruly_data.jpg){width="700"}
## 1. Importing pumpkins data and summoning the Tidyverse
@@ -113,7 +113,7 @@ Much better! There is missing data, but maybe it won't matter for the task at ha
## 3. Dplyr: A Grammar of Data Manipulation
![Artwork by \@allison_horst](../images/dplyr_wrangling.png){width="569"}
![Artwork by \@allison_horst](../../images/dplyr_wrangling.png){width="569"}
[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!
@@ -270,7 +270,7 @@ Good job!👌 You now have a clean, tidy dataset on which you can build your new
## 4. Data visualization with ggplot2
![Infographic by Dasani Madipalli](../images/data-visualization.png){width="600"}
![Infographic by Dasani Madipalli](../../images/data-visualization.png){width="600"}
There is a *wise* saying that goes like this:
@@ -342,4 +342,4 @@
🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
Congratulations on finishing the second lesson 👏! You did prepared your data for model building, then uncovered more insights using visualizations!\
Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!

@@ -0,0 +1 @@
This is a temporary placeholder

@@ -29,49 +29,49 @@
{
"cell_type": "markdown",
"source": [
"## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3\r\n",
"<p >\r\n",
" <img src=\"../images/linear-polynomial.png\"\r\n",
" width=\"800\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/linear-polynomial.png){width=\"800\"}-->\r\n",
"\r\n",
"#### Introduction\r\n",
"\r\n",
"So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using `ggplot2`.💪\r\n",
"\r\n",
"Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: *basic linear regression* and *polynomial regression*, along with some of the math underlying these techniques.\r\n",
"\r\n",
"> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.\r\n",
"\r\n",
"#### Preparation\r\n",
"\r\n",
"As a reminder, you are loading this data so as to ask questions of it.\r\n",
"\r\n",
"- When is the best time to buy pumpkins?\r\n",
"\r\n",
"- What price can I expect of a case of miniature pumpkins?\r\n",
"\r\n",
"- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box? Let's keep digging into this data.\r\n",
"\r\n",
"In the previous lesson, you created a `tibble` (a modern reimagining of the data frame) and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points and only for the fall months. Maybe we can get a little more detail about the nature of the data by cleaning it more? We'll see... 🕵️‍♀️\r\n",
"\r\n",
"For this task, we'll require the following packages:\r\n",
"\r\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\r\n",
"\r\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\r\n",
"\r\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\r\n",
"\r\n",
"- `corrplot`: The [corrplot package](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.\r\n",
"\r\n",
"You can have them installed as:\r\n",
"\r\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"corrplot\"))`\r\n",
"\r\n",
"## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3\n",
"<p >\n",
" <img src=\"../../images/linear-polynomial.png\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width=\"800\"}-->\n",
"\n",
"#### Introduction\n",
"\n",
"So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using `ggplot2`.💪\n",
"\n",
"Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: *basic linear regression* and *polynomial regression*, along with some of the math underlying these techniques.\n",
"\n",
"> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.\n",
"\n",
"#### Preparation\n",
"\n",
"As a reminder, you are loading this data so as to ask questions of it.\n",
"\n",
"- When is the best time to buy pumpkins?\n",
"\n",
"- What price can I expect for a case of miniature pumpkins?\n",
"\n",
"- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box? Let's keep digging into this data.\n",
"\n",
"In the previous lesson, you created a `tibble` (a modern reimagining of the data frame) and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points and only for the fall months. Maybe we can get a little more detail about the nature of the data by cleaning it more? We'll see... 🕵️‍♀️\n",
"\n",
"For this task, we'll require the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n",
"\n",
"- `corrplot`: The [corrplot package](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.\n",
"\n",
"You can install them with:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"corrplot\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case they are missing."
],
"metadata": {
@@ -82,8 +82,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"\r\n",
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, corrplot)"
],
"outputs": [],
@@ -153,15 +153,15 @@
{
"cell_type": "markdown",
"source": [
"## **2. A dance with data: creating a data frame that will be used for modelling**\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/janitor.jpg\"\r\n",
" width=\"700\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](images/janitor.jpg){width=\"700\"}-->"
"## **2. A dance with data: creating a data frame that will be used for modelling**\n",
"\n",
"<p >\n",
" <img src=\"../../images/janitor.jpg\"\n",
" width=\"700\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../images/janitor.jpg){width=\"700\"}-->"
],
"metadata": {
"id": "WdUKXk7Bs8-V"
@@ -190,20 +190,20 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core Tidyverse packages\r\n",
"library(tidyverse)\r\n",
"library(lubridate)\r\n",
"\r\n",
"# Import the pumpkins data\r\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\r\n",
"\r\n",
"\r\n",
"# Get a glimpse and dimensions of the data\r\n",
"glimpse(pumpkins)\r\n",
"\r\n",
"\r\n",
"# Print the first 50 rows of the data set\r\n",
"pumpkins %>% \r\n",
"# Load the core Tidyverse packages\n",
"library(tidyverse)\n",
"library(lubridate)\n",
"\n",
"# Import the pumpkins data\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(pumpkins)\n",
"\n",
"\n",
"# Print the first 50 rows of the data set\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@@ -224,8 +224,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Return column names\r\n",
"pumpkins %>% \r\n",
"# Return column names\n",
"pumpkins %>% \n",
" names()"
],
"outputs": [],
@@ -246,12 +246,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Clean names to the snake_case convention\r\n",
"pumpkins <- pumpkins %>% \r\n",
" clean_names(case = \"snake\")\r\n",
"\r\n",
"# Return column names\r\n",
"pumpkins %>% \r\n",
"# Clean names to the snake_case convention\n",
"pumpkins <- pumpkins %>% \n",
" clean_names(case = \"snake\")\n",
"\n",
"# Return column names\n",
"pumpkins %>% \n",
" names()"
],
"outputs": [],
@@ -272,44 +272,44 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Select desired columns\r\n",
"pumpkins <- pumpkins %>% \r\n",
" select(variety, city_name, package, low_price, high_price, date)\r\n",
"\r\n",
"\r\n",
"\r\n",
"# Extract the month from the dates to a new column\r\n",
"pumpkins <- pumpkins %>%\r\n",
" mutate(date = mdy(date),\r\n",
" month = month(date)) %>% \r\n",
" select(-date)\r\n",
"\r\n",
"\r\n",
"\r\n",
"# Create a new column for average Price\r\n",
"pumpkins <- pumpkins %>% \r\n",
" mutate(price = (low_price + high_price)/2)\r\n",
"\r\n",
"\r\n",
"# Retain only pumpkins with the string \"bushel\"\r\n",
"new_pumpkins <- pumpkins %>% \r\n",
" filter(str_detect(string = package, pattern = \"bushel\"))\r\n",
"\r\n",
"\r\n",
"# Normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel\r\n",
"new_pumpkins <- new_pumpkins %>% \r\n",
" mutate(price = case_when(\r\n",
" str_detect(package, \"1 1/9\") ~ price/(1.1),\r\n",
" str_detect(package, \"1/2\") ~ price*2,\r\n",
" TRUE ~ price))\r\n",
"\r\n",
"# Relocate column positions\r\n",
"new_pumpkins <- new_pumpkins %>% \r\n",
" relocate(month, .before = variety)\r\n",
"\r\n",
"\r\n",
"# Display the first 5 rows\r\n",
"new_pumpkins %>% \r\n",
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(variety, city_name, package, low_price, high_price, date)\n",
"\n",
"\n",
"\n",
"# Extract the month from the dates to a new column\n",
"pumpkins <- pumpkins %>%\n",
" mutate(date = mdy(date),\n",
" month = month(date)) %>% \n",
" select(-date)\n",
"\n",
"\n",
"\n",
"# Create a new column for average Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(price = (low_price + high_price)/2)\n",
"\n",
"\n",
"# Retain only pumpkins with the string \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(string = package, pattern = \"bushel\"))\n",
"\n",
"\n",
"# Normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(price = case_when(\n",
" str_detect(package, \"1 1/9\") ~ price/(1.1),\n",
" str_detect(package, \"1/2\") ~ price*2,\n",
" TRUE ~ price))\n",
"\n",
"# Relocate column positions\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(month, .before = variety)\n",
"\n",
"\n",
"# Display the first 5 rows\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@@ -332,13 +332,13 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Set theme\r\n",
"theme_set(theme_light())\r\n",
"\r\n",
"# Make a scatter plot of month and price\r\n",
"new_pumpkins %>% \r\n",
" ggplot(mapping = aes(x = month, y = price)) +\r\n",
" geom_point(size = 1.6)\r\n"
"# Set theme\n",
"theme_set(theme_light())\n",
"\n",
"# Make a scatter plot of month and price\n",
"new_pumpkins %>% \n",
" ggplot(mapping = aes(x = month, y = price)) +\n",
" geom_point(size = 1.6)\n"
],
"outputs": [],
"metadata": {
@@ -360,8 +360,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Display first 5 rows\r\n",
"new_pumpkins %>% \r\n",
"# Display first 5 rows\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@@ -421,12 +421,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Specify a recipe\r\n",
"pumpkins_recipe <- recipe(price ~ ., data = new_pumpkins) %>% \r\n",
" step_integer(all_predictors(), zero_based = TRUE)\r\n",
"\r\n",
"\r\n",
"# Print out the recipe\r\n",
"# Specify a recipe\n",
"pumpkins_recipe <- recipe(price ~ ., data = new_pumpkins) %>% \n",
" step_integer(all_predictors(), zero_based = TRUE)\n",
"\n",
"\n",
"# Print out the recipe\n",
"pumpkins_recipe"
],
"outputs": [],
@@ -437,25 +437,25 @@
{
"cell_type": "markdown",
"source": [
"Awesome! 👏 We just created our first recipe that specifies an outcome (price) and its corresponding predictors and that all the predictor columns should be encoded into a set of integers 🙌! Let's quickly break it down:\r\n",
"\r\n",
"- The call to `recipe()` with a formula tells the recipe the *roles* of the variables using `new_pumpkins` data as the reference. For instance the `price` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.\r\n",
"\r\n",
"- `step_integer(all_predictors(), zero_based = TRUE)` specifies that all the predictors should be converted into a set of integers with the numbering starting at 0.\r\n",
"\r\n",
"We are sure you may be having thoughts such as: \"This is so cool!! But what if I needed to confirm that the recipes are doing exactly what I expect them to do? 🤔\"\r\n",
"\r\n",
"That's an awesome thought! You see, once your recipe is defined, you can estimate the parameters required to actually preprocess the data, and then extract the processed data. You don't typically need to do this when you use Tidymodels (we'll see the normal convention in just a minute-\\> `workflows`) but it can come in handy when you want to do some kind of sanity check for confirming that recipes are doing what you expect.\r\n",
"\r\n",
"For that, you'll need two more verbs: `prep()` and `bake()` and as always, our little R friends by [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) help you in understanding this better!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/recipes.png\"\r\n",
" width=\"550\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](images/recipes.png){width=\"550\"}-->"
"Awesome! 👏 We just created our first recipe: it specifies an outcome (price) and its corresponding predictors, and that all the predictor columns should be encoded into a set of integers 🙌! Let's quickly break it down:\n",
"\n",
"- The call to `recipe()` with a formula tells the recipe the *roles* of the variables using `new_pumpkins` data as the reference. For instance the `price` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.\n",
"\n",
"- `step_integer(all_predictors(), zero_based = TRUE)` specifies that all the predictors should be converted into a set of integers with the numbering starting at 0.\n",
"\n",
"You may well be thinking: \"This is so cool!! But what if I needed to confirm that the recipes are doing exactly what I expect them to do? 🤔\"\n",
"\n",
"That's an awesome thought! Once your recipe is defined, you can estimate the parameters required to actually preprocess the data, and then extract the processed data. You don't typically need to do this when you use Tidymodels (we'll see the usual convention, `workflows`, in just a minute), but it can come in handy when you want to run a sanity check to confirm that recipes are doing what you expect.\n",
"\n",
"For that, you'll need two more verbs: `prep()` and `bake()`. As always, our little R friends by [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) help you understand this better!\n",
"\n",
"<p >\n",
" <img src=\"../../images/recipes.png\"\n",
" width=\"550\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../images/recipes.png){width=\"550\"}-->"
],
"metadata": {
"id": "KEiO0v7kuC9O"
@ -478,14 +478,14 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Prep the recipe\r\n",
"pumpkins_prep <- prep(pumpkins_recipe)\r\n",
"\r\n",
"# Bake the recipe to extract a preprocessed new_pumpkins data\r\n",
"baked_pumpkins <- bake(pumpkins_prep, new_data = NULL)\r\n",
"\r\n",
"# Print out the baked data set\r\n",
"baked_pumpkins %>% \r\n",
"# Prep the recipe\n",
"pumpkins_prep <- prep(pumpkins_recipe)\n",
"\n",
"# Bake the recipe to extract a preprocessed new_pumpkins data\n",
"baked_pumpkins <- bake(pumpkins_prep, new_data = NULL)\n",
"\n",
"# Print out the baked data set\n",
"baked_pumpkins %>% \n",
" slice_head(n = 10)"
],
"outputs": [],
@ -510,11 +510,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the correlation between the city_name and the price\r\n",
"cor(baked_pumpkins$city_name, baked_pumpkins$price)\r\n",
"\r\n",
"# Find the correlation between the package and the price\r\n",
"cor(baked_pumpkins$package, baked_pumpkins$price)\r\n"
"# Find the correlation between the city_name and the price\n",
"cor(baked_pumpkins$city_name, baked_pumpkins$price)\n",
"\n",
"# Find the correlation between the package and the price\n",
"cor(baked_pumpkins$package, baked_pumpkins$price)\n"
],
"outputs": [],
"metadata": {
@ -536,15 +536,15 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the corrplot package\r\n",
"library(corrplot)\r\n",
"\r\n",
"# Obtain correlation matrix\r\n",
"corr_mat <- cor(baked_pumpkins %>% \r\n",
" # Drop columns that are not really informative\r\n",
" select(-c(low_price, high_price)))\r\n",
"\r\n",
"# Make a correlation plot between the variables\r\n",
"# Load the corrplot package\n",
"library(corrplot)\n",
"\n",
"# Obtain correlation matrix\n",
"corr_mat <- cor(baked_pumpkins %>% \n",
" # Drop columns that are not really informative\n",
" select(-c(low_price, high_price)))\n",
"\n",
"# Make a correlation plot between the variables\n",
"corrplot(corr_mat, method = \"shade\", shade.col = NA, tl.col = \"black\", tl.srt = 45, addCoef.col = \"black\", cl.pos = \"n\", order = \"original\")"
],
"outputs": [],
@ -555,21 +555,21 @@
{
"cell_type": "markdown",
"source": [
"🤩🤩 Much better.\r\n",
"\r\n",
"A good question to now ask of this data will be: '`What price can I expect of a given pumpkin package?`' Let's get right into it!\r\n",
"\r\n",
"> Note: When you **`bake()`** the prepped recipe **`pumpkins_prep`** with **`new_data = NULL`**, you extract the processed (i.e. encoded) training data. If you had another data set for example a test set and would want to see how a recipe would pre-process it, you would simply bake **`pumpkins_prep`** with **`new_data = test_set`**\r\n",
"\r\n",
"## 4. Build a linear regression model\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/linear-polynomial.png\"\r\n",
" width=\"800\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/linear-polynomial.png){width=\"800\"}-->"
"🤩🤩 Much better.\n",
"\n",
"A good question to ask of this data now would be: '`What price can I expect for a given pumpkin package?`' Let's get right into it!\n",
"\n",
"> Note: When you **`bake()`** the prepped recipe **`pumpkins_prep`** with **`new_data = NULL`**, you extract the processed (i.e. encoded) training data. If you had another data set, for example a test set, and wanted to see how a recipe would pre-process it, you would simply bake **`pumpkins_prep`** with **`new_data = test_set`**.\n",
"\n",
"## 4. Build a linear regression model\n",
"\n",
"<p >\n",
" <img src=\"../../images/linear-polynomial.png\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width=\"800\"}-->"
],
"metadata": {
"id": "YqXjLuWavNxW"
@ -594,27 +594,27 @@
"cell_type": "code",
"execution_count": null,
"source": [
"set.seed(2056)\r\n",
"# Split the data into training and test sets\r\n",
"pumpkins_split <- new_pumpkins %>% \r\n",
" initial_split(prop = 0.8)\r\n",
"\r\n",
"\r\n",
"# Extract training and test data\r\n",
"pumpkins_train <- training(pumpkins_split)\r\n",
"pumpkins_test <- testing(pumpkins_split)\r\n",
"\r\n",
"\r\n",
"\r\n",
"# Create a recipe for preprocessing the data\r\n",
"lm_pumpkins_recipe <- recipe(price ~ package, data = pumpkins_train) %>% \r\n",
" step_integer(all_predictors(), zero_based = TRUE)\r\n",
"\r\n",
"\r\n",
"\r\n",
"# Create a linear model specification\r\n",
"lm_spec <- linear_reg() %>% \r\n",
" set_engine(\"lm\") %>% \r\n",
"set.seed(2056)\n",
"# Split the data into training and test sets\n",
"pumpkins_split <- new_pumpkins %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"\n",
"# Extract training and test data\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"\n",
"\n",
"# Create a recipe for preprocessing the data\n",
"lm_pumpkins_recipe <- recipe(price ~ package, data = pumpkins_train) %>% \n",
" step_integer(all_predictors(), zero_based = TRUE)\n",
"\n",
"\n",
"\n",
"# Create a linear model specification\n",
"lm_spec <- linear_reg() %>% \n",
" set_engine(\"lm\") %>% \n",
" set_mode(\"regression\")"
],
"outputs": [],
@ -639,12 +639,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Hold modelling components in a workflow\r\n",
"lm_wf <- workflow() %>% \r\n",
" add_recipe(lm_pumpkins_recipe) %>% \r\n",
" add_model(lm_spec)\r\n",
"\r\n",
"# Print out the workflow\r\n",
"# Hold modelling components in a workflow\n",
"lm_wf <- workflow() %>% \n",
" add_recipe(lm_pumpkins_recipe) %>% \n",
" add_model(lm_spec)\n",
"\n",
"# Print out the workflow\n",
"lm_wf"
],
"outputs": [],
@ -666,11 +666,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Train the model\r\n",
"lm_wf_fit <- lm_wf %>% \r\n",
" fit(data = pumpkins_train)\r\n",
"\r\n",
"# Print the model coefficients learned \r\n",
"# Train the model\n",
"lm_wf_fit <- lm_wf %>% \n",
" fit(data = pumpkins_train)\n",
"\n",
"# Print the model coefficients learned \n",
"lm_wf_fit"
],
"outputs": [],
@ -700,19 +700,19 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Make predictions for the test set\r\n",
"predictions <- lm_wf_fit %>% \r\n",
" predict(new_data = pumpkins_test)\r\n",
"\r\n",
"\r\n",
"# Bind predictions to the test set\r\n",
"lm_results <- pumpkins_test %>% \r\n",
" select(c(package, price)) %>% \r\n",
" bind_cols(predictions)\r\n",
"\r\n",
"\r\n",
"# Print the first ten rows of the tibble\r\n",
"lm_results %>% \r\n",
"# Make predictions for the test set\n",
"predictions <- lm_wf_fit %>% \n",
" predict(new_data = pumpkins_test)\n",
"\n",
"\n",
"# Bind predictions to the test set\n",
"lm_results <- pumpkins_test %>% \n",
" select(c(package, price)) %>% \n",
" bind_cols(predictions)\n",
"\n",
"\n",
"# Print the first ten rows of the tibble\n",
"lm_results %>% \n",
" slice_head(n = 10)"
],
"outputs": [],
@ -740,9 +740,9 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Evaluate performance of linear regression\r\n",
"metrics(data = lm_results,\r\n",
" truth = price,\r\n",
"# Evaluate performance of linear regression\n",
"metrics(data = lm_results,\n",
" truth = price,\n",
" estimate = .pred)"
],
"outputs": [],
@ -765,33 +765,33 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Encode package column\r\n",
"package_encode <- lm_pumpkins_recipe %>% \r\n",
" prep() %>% \r\n",
" bake(new_data = pumpkins_test) %>% \r\n",
" select(package)\r\n",
"\r\n",
"\r\n",
"# Bind encoded package column to the results\r\n",
"lm_results <- lm_results %>% \r\n",
" bind_cols(package_encode %>% \r\n",
" rename(package_integer = package)) %>% \r\n",
" relocate(package_integer, .after = package)\r\n",
"\r\n",
"\r\n",
"# Print new results data frame\r\n",
"lm_results %>% \r\n",
" slice_head(n = 5)\r\n",
"\r\n",
"\r\n",
"# Make a scatter plot\r\n",
"lm_results %>% \r\n",
" ggplot(mapping = aes(x = package_integer, y = price)) +\r\n",
" geom_point(size = 1.6) +\r\n",
" # Overlay a line of best fit\r\n",
" geom_line(aes(y = .pred), color = \"orange\", size = 1.2) +\r\n",
" xlab(\"package\")\r\n",
" \r\n"
"# Encode package column\n",
"package_encode <- lm_pumpkins_recipe %>% \n",
" prep() %>% \n",
" bake(new_data = pumpkins_test) %>% \n",
" select(package)\n",
"\n",
"\n",
"# Bind encoded package column to the results\n",
"lm_results <- lm_results %>% \n",
" bind_cols(package_encode %>% \n",
" rename(package_integer = package)) %>% \n",
" relocate(package_integer, .after = package)\n",
"\n",
"\n",
"# Print new results data frame\n",
"lm_results %>% \n",
" slice_head(n = 5)\n",
"\n",
"\n",
"# Make a scatter plot\n",
"lm_results %>% \n",
" ggplot(mapping = aes(x = package_integer, y = price)) +\n",
" geom_point(size = 1.6) +\n",
" # Overlay a line of best fit\n",
" geom_line(aes(y = .pred), color = \"orange\", size = 1.2) +\n",
" xlab(\"package\")\n",
" \n"
],
"outputs": [],
"metadata": {
@ -801,19 +801,19 @@
{
"cell_type": "markdown",
"source": [
"Great! As you can see, the linear regression model does not really well generalize the relationship between a package and its corresponding price.\r\n",
"\r\n",
"🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!\r\n",
"\r\n",
"## 5. Build a polynomial regression model\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/linear-polynomial.png\"\r\n",
" width=\"800\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/linear-polynomial.png){width=\"800\"}-->"
"Great! As you can see, the linear regression model does not generalize the relationship between a package and its corresponding price very well.\n",
"\n",
"🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!\n",
"\n",
"## 5. Build a polynomial regression model\n",
"\n",
"<p >\n",
" <img src=\"../../images/linear-polynomial.png\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width=\"800\"}-->"
],
"metadata": {
"id": "HOCqJXLTwtWI"

@ -12,7 +12,7 @@ output:
## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width="800"}
#### Introduction
@ -78,13 +78,13 @@ We do so since we want to model a line that has the least cumulative distance fr
>
> `X` is the '`explanatory variable` or `predictor`'. `Y` is the '`dependent variable` or `outcome`'. The slope of the line is `b` and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.
>
> ![Infographic by Jen Looper](../images/slope.png){width="400"}
> ![Infographic by Jen Looper](../../images/slope.png){width="400"}
>
> First, calculate the slope `b`.
>
> In other words, and referring to our pumpkin data's original question: "predict the price of a pumpkin per bushel by month", `X` would refer to the price and `Y` would refer to the month of sale.
>
> ![Infographic by Jen Looper](../images/calculation.png)
> ![Infographic by Jen Looper](../../images/calculation.png)
>
> Calculate the value of Y. If you're paying around \$4, it must be April!
>
@ -102,7 +102,7 @@ A good linear regression model will be one that has a high (nearer to 1 than 0)
## **2. A dance with data: creating a data frame that will be used for modelling**
![Artwork by \@allison_horst](../images/janitor.jpg){width="700"}
![Artwork by \@allison_horst](../../images/janitor.jpg){width="700"}
Load up required libraries and dataset. Convert the data to a data frame containing a subset of the data:
@ -346,7 +346,7 @@ A good question to now ask of this data will be: '`What price can I expect of a
## 4. Build a linear regression model
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width="800"}
Now that we have built a recipe, and actually confirmed that the data will be pre-processed appropriately, let's build a regression model to answer the question: `What price can I expect of a given pumpkin package?`
@ -498,7 +498,7 @@ Great! As you can see, the linear regression model does not really well generali
## 5. Build a polynomial regression model
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
![Infographic by Dasani Madipalli](../../images/linear-polynomial.png){width="800"}
Sometimes our data may not have a linear relationship, but we still want to predict an outcome. Polynomial regression can help us make predictions for more complex non-linear relationships.

@ -0,0 +1 @@
This is a temporary placeholder

@ -29,14 +29,14 @@
{
"cell_type": "markdown",
"source": [
"## Build a logistic regression model - Lesson 4\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/logistic-linear.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](../images/logistic-linear.png){width=\"600\"}-->"
"## Build a logistic regression model - Lesson 4\n",
"\n",
"<p >\n",
" <img src=\"../../images/logistic-linear.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/logistic-linear.png){width=\"600\"}-->"
],
"metadata": {
"id": "QizKKpzakfx2"
@ -89,8 +89,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"\r\n",
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)"
],
"outputs": [],
@ -101,26 +101,26 @@
{
"cell_type": "markdown",
"source": [
"## ** Define the question**\r\n",
"\r\n",
"For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.\r\n",
"\r\n",
"> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!\r\n",
"\r\n",
"## **About logistic regression**\r\n",
"\r\n",
"Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\r\n",
"\r\n",
"#### **Binary classification**\r\n",
"\r\n",
"Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"orange or not orange\") whereas the latter is capable of predicting `continual values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/pumpkin-classifier.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/pumpkin-classifier.png){width=\"600\"}-->"
"## **Define the question**\n",
"\n",
"For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.\n",
"\n",
"> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!\n",
"\n",
"## **About logistic regression**\n",
"\n",
"Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\n",
"\n",
"#### **Binary classification**\n",
"\n",
"Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"orange or not orange\") whereas the latter is capable of predicting `continuous values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n",
"\n",
"<p >\n",
" <img src=\"../../images/pumpkin-classifier.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png){width=\"600\"}-->"
],
"metadata": {
"id": "ws-hP_SXk2O6"
@ -129,20 +129,20 @@
{
"cell_type": "markdown",
"source": [
"#### **Other classifications**\r\n",
"\r\n",
"There are other types of logistic regression, including multinomial and ordinal:\r\n",
"\r\n",
"- **Multinomial**, which involves having more than one category - \"Orange, White, and Striped\".\r\n",
"\r\n",
"- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/multinomial-ordinal.png\"\r\n",
" width=\"700\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](images/multinomial-ordinal.png){width=\"600\"}-->"
"#### **Other classifications**\n",
"\n",
"There are other types of logistic regression, including multinomial and ordinal:\n",
"\n",
"- **Multinomial**, which involves having more than one category - \"Orange, White, and Striped\".\n",
"\n",
"- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).\n",
"\n",
"<p >\n",
" <img src=\"../../images/multinomial-ordinal.png\"\n",
" width=\"700\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../images/multinomial-ordinal.png){width=\"600\"}-->"
],
"metadata": {
"id": "LkLN-ZgDlBEc"
@ -184,25 +184,25 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core tidyverse packages\r\n",
"library(tidyverse)\r\n",
"\r\n",
"# Import the data and clean column names\r\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \r\n",
" clean_names()\r\n",
"\r\n",
"# Select desired columns\r\n",
"pumpkins_select <- pumpkins %>% \r\n",
" select(c(city_name, package, variety, origin, item_size, color)) \r\n",
"\r\n",
"# Drop rows containing missing values and encode color as factor (category)\r\n",
"pumpkins_select <- pumpkins_select %>% \r\n",
" drop_na() %>% \r\n",
" mutate(color = factor(color))\r\n",
"\r\n",
"# View the first few rows\r\n",
"pumpkins_select %>% \r\n",
" slice_head(n = 5)\r\n"
"# Load the core tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the data and clean column names\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n",
" clean_names()\n",
"\n",
"# Select desired columns\n",
"pumpkins_select <- pumpkins %>% \n",
" select(c(city_name, package, variety, origin, item_size, color)) \n",
"\n",
"# Drop rows containing missing values and encode color as factor (category)\n",
"pumpkins_select <- pumpkins_select %>% \n",
" drop_na() %>% \n",
" mutate(color = factor(color))\n",
"\n",
"# View the first few rows\n",
"pumpkins_select %>% \n",
" slice_head(n = 5)\n"
],
"outputs": [],
"metadata": {
@ -222,7 +222,7 @@
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins_select %>% \r\n",
"pumpkins_select %>% \n",
" glimpse()"
],
"outputs": [],
@ -245,8 +245,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Subset distinct observations in outcome column\r\n",
"pumpkins_select %>% \r\n",
"# Subset distinct observations in outcome column\n",
"pumpkins_select %>% \n",
" distinct(color)"
],
"outputs": [],
@ -279,16 +279,16 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Preprocess and extract data to allow some data analysis\r\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>% \r\n",
" # Encode all columns to a set of integers\r\n",
" step_integer(all_predictors(), zero_based = T) %>% \r\n",
" prep() %>% \r\n",
" bake(new_data = NULL)\r\n",
"\r\n",
"\r\n",
"# Display the first few rows of preprocessed data\r\n",
"baked_pumpkins %>% \r\n",
"# Preprocess and extract data to allow some data analysis\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>% \n",
" # Encode all columns to a set of integers\n",
" step_integer(all_predictors(), zero_based = T) %>% \n",
" prep() %>% \n",
" bake(new_data = NULL)\n",
"\n",
"\n",
"# Display the first few rows of preprocessed data\n",
"baked_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@ -309,14 +309,14 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Pivot data to long format\r\n",
"baked_pumpkins_long <- baked_pumpkins %>% \r\n",
" pivot_longer(!color, names_to = \"features\", values_to = \"values\")\r\n",
"\r\n",
"\r\n",
"# Print out restructured data\r\n",
"baked_pumpkins_long %>% \r\n",
" slice_head(n = 10)\r\n"
"# Pivot data to long format\n",
"baked_pumpkins_long <- baked_pumpkins %>% \n",
" pivot_longer(!color, names_to = \"features\", values_to = \"values\")\n",
"\n",
"\n",
"# Print out restructured data\n",
"baked_pumpkins_long %>% \n",
" slice_head(n = 10)\n"
],
"outputs": [],
"metadata": {
@ -336,14 +336,14 @@
"cell_type": "code",
"execution_count": null,
"source": [
"theme_set(theme_light())\r\n",
"#Make a box plot for each predictor feature\r\n",
"baked_pumpkins_long %>% \r\n",
" mutate(color = factor(color)) %>% \r\n",
" ggplot(mapping = aes(x = color, y = values, fill = features)) +\r\n",
" geom_boxplot() + \r\n",
" facet_wrap(~ features, scales = \"free\", ncol = 3) +\r\n",
" scale_color_viridis_d(option = \"cividis\", end = .8) +\r\n",
"theme_set(theme_light())\n",
"#Make a box plot for each predictor feature\n",
"baked_pumpkins_long %>% \n",
" mutate(color = factor(color)) %>% \n",
" ggplot(mapping = aes(x = color, y = values, fill = features)) +\n",
" geom_boxplot() + \n",
" facet_wrap(~ features, scales = \"free\", ncol = 3) +\n",
" scale_color_viridis_d(option = \"cividis\", end = .8) +\n",
" theme(legend.position = \"none\")"
],
"outputs": [],
@ -372,12 +372,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Create beeswarm plots of color and item_size\r\n",
"baked_pumpkins %>% \r\n",
" mutate(color = factor(color)) %>% \r\n",
" ggplot(mapping = aes(x = color, y = item_size, color = color)) +\r\n",
" geom_quasirandom() +\r\n",
" scale_color_brewer(palette = \"Dark2\", direction = -1) +\r\n",
"# Create beeswarm plots of color and item_size\n",
"baked_pumpkins %>% \n",
" mutate(color = factor(color)) %>% \n",
" ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n",
" geom_quasirandom() +\n",
" scale_color_brewer(palette = \"Dark2\", direction = -1) +\n",
" theme(legend.position = \"none\")"
],
"outputs": [],
@ -400,13 +400,13 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a violin plot of color and item_size\r\n",
"baked_pumpkins %>%\r\n",
" mutate(color = factor(color)) %>% \r\n",
" ggplot(mapping = aes(x = color, y = item_size, fill = color)) +\r\n",
" geom_violin() +\r\n",
" geom_boxplot(color = \"black\", fill = \"white\", width = 0.02) +\r\n",
" scale_fill_brewer(palette = \"Dark2\", direction = -1) +\r\n",
"# Create a violin plot of color and item_size\n",
"baked_pumpkins %>%\n",
" mutate(color = factor(color)) %>% \n",
" ggplot(mapping = aes(x = color, y = item_size, fill = color)) +\n",
" geom_violin() +\n",
" geom_boxplot(color = \"black\", fill = \"white\", width = 0.02) +\n",
" scale_fill_brewer(palette = \"Dark2\", direction = -1) +\n",
" theme(legend.position = \"none\")"
],
"outputs": [],
@ -417,28 +417,28 @@
{
"cell_type": "markdown",
"source": [
"Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\r\n",
"\r\n",
"## 3. Build your logistic regression model\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/logistic-linear.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"> **🧮 Show Me The Math**\r\n",
">\r\n",
"> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:\r\n",
">\r\n",
"> \r\n",
"<p >\r\n",
" <img src=\"../images/sigmoid.png\">\r\n",
"\r\n",
"\r\n",
"> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.\r\n",
"\r\n",
"Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.\r\n",
"\r\n",
"Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n",
"\n",
"## 3. Build your logistic regression model\n",
"\n",
"<p >\n",
" <img src=\"../../images/logistic-linear.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"> **🧮 Show Me The Math**\n",
">\n",
"> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:\n",
">\n",
"> \n",
"<p >\n",
" <img src=\"../../images/sigmoid.png\">\n",
"\n",
"\n",
"> where the sigmoid's midpoint is at x = 0, L is the curve's maximum value, and k is the curve's steepness. If the output of the function is greater than 0.5, the label in question is assigned class 1 of the binary choice; if not, it is classified as 0.\n",
"\n",
"Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.\n",
"\n",
"It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:"
],
"metadata": {
@ -449,17 +449,17 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Split data into 80% for training and 20% for testing\r\n",
"set.seed(2056)\r\n",
"pumpkins_split <- pumpkins_select %>% \r\n",
" initial_split(prop = 0.8)\r\n",
"\r\n",
"# Extract the data in each split\r\n",
"pumpkins_train <- training(pumpkins_split)\r\n",
"pumpkins_test <- testing(pumpkins_split)\r\n",
"\r\n",
"# Print out the first 5 rows of the training set\r\n",
"pumpkins_train %>% \r\n",
"# Split data into 80% for training and 20% for testing\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
@ -484,15 +484,15 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\r\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \r\n",
" step_integer(all_predictors(), zero_based = TRUE)\r\n",
"\r\n",
"\r\n",
"# Create a logistic model specification\r\n",
"log_reg <- logistic_reg() %>% \r\n",
" set_engine(\"glm\") %>% \r\n",
" set_mode(\"classification\")\r\n"
"# Create a recipe that specifies preprocessing steps for modelling\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n",
" step_integer(all_predictors(), zero_based = TRUE)\n",
"\n",
"\n",
"# Create a logistic model specification\n",
"log_reg <- logistic_reg() %>% \n",
" set_engine(\"glm\") %>% \n",
" set_mode(\"classification\")\n"
],
"outputs": [],
"metadata": {
@ -514,13 +514,13 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Bundle modelling components in a workflow\r\n",
"log_reg_wf <- workflow() %>% \r\n",
" add_recipe(pumpkins_recipe) %>% \r\n",
" add_model(log_reg)\r\n",
"\r\n",
"# Print out the workflow\r\n",
"log_reg_wf\r\n"
"# Bundle modelling components in a workflow\n",
"log_reg_wf <- workflow() %>% \n",
" add_recipe(pumpkins_recipe) %>% \n",
" add_model(log_reg)\n",
"\n",
"# Print out the workflow\n",
"log_reg_wf\n"
],
"outputs": [],
"metadata": {
@ -540,11 +540,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Train the model\r\n",
"wf_fit <- log_reg_wf %>% \r\n",
" fit(data = pumpkins_train)\r\n",
"\r\n",
"# Print the trained workflow\r\n",
"# Train the model\n",
"wf_fit <- log_reg_wf %>% \n",
" fit(data = pumpkins_train)\n",
"\n",
"# Print the trained workflow\n",
"wf_fit"
],
"outputs": [],
@ -567,15 +567,15 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Make predictions for color and corresponding probabilities\r\n",
"results <- pumpkins_test %>% select(color) %>% \r\n",
" bind_cols(wf_fit %>% \r\n",
" predict(new_data = pumpkins_test)) %>%\r\n",
" bind_cols(wf_fit %>%\r\n",
" predict(new_data = pumpkins_test, type = \"prob\"))\r\n",
"\r\n",
"# Compare predictions\r\n",
"results %>% \r\n",
"# Make predictions for color and corresponding probabilities\n",
"results <- pumpkins_test %>% select(color) %>% \n",
" bind_cols(wf_fit %>% \n",
" predict(new_data = pumpkins_test)) %>%\n",
" bind_cols(wf_fit %>%\n",
" predict(new_data = pumpkins_test, type = \"prob\"))\n",
"\n",
"# Compare predictions\n",
"results %>% \n",
" slice_head(n = 10)"
],
"outputs": [],
@ -602,7 +602,7 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Confusion matrix for prediction results\r\n",
"# Confusion matrix for prediction results\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)"
],
"outputs": [],
@ -665,8 +665,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Combine metric functions and calculate them all at once\r\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\r\n",
"# Combine metric functions and calculate them all at once\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)"
],
"outputs": [],
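All of these metrics (`ppv`/precision, recall, specificity, F-measure, accuracy) derive from the four confusion-matrix counts. A minimal Python sketch of the formulas, assuming the positive class is the event of interest (this is not yardstick's implementation):

```python
def binary_metrics(tp, fp, fn, tn):
    """Classification metrics computed from 2x2 confusion-matrix counts."""
    precision = tp / (tp + fp)            # ppv: of predicted positives, how many were right
    recall = tp / (tp + fn)               # sensitivity: of actual positives, how many found
    specificity = tn / (tn + fp)          # of actual negatives, how many found
    f_meas = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"ppv": precision, "recall": recall, "spec": specificity,
            "f_meas": f_meas, "accuracy": accuracy}

m = binary_metrics(tp=5, fp=2, fn=3, tn=10)
```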
@ -691,9 +691,9 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a roc_curve\r\n",
"results %>% \r\n",
" roc_curve(color, .pred_ORANGE) %>% \r\n",
"# Make a roc_curve\n",
"results %>% \n",
" roc_curve(color, .pred_ORANGE) %>% \n",
" autoplot()"
],
"outputs": [],
@ -716,8 +716,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Calculate area under curve\r\n",
"results %>% \r\n",
"# Calculate area under curve\n",
"results %>% \n",
" roc_auc(color, .pred_ORANGE)"
],
"outputs": [],
@ -728,20 +728,20 @@
{
"cell_type": "markdown",
"source": [
"The result is around `0.67053`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\r\n",
"\r\n",
"In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\r\n",
"\r\n",
"But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!\r\n",
"\r\n",
"You R awesome!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/r_learners_sm.jpeg\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"<!--![Artwork by @allison_horst](images/r_learners_sm.jpeg)-->\r\n"
"The result is around `0.67053`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\n",
"\n",
"In future lessons on classification, you will learn how to improve your model's scores (for example, by dealing with the imbalanced data in this case).\n",
"\n",
"But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!\n",
"\n",
"You R awesome!\n",
"\n",
"<p >\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"<!--![Artwork by @allison_horst](../../images/r_learners_sm.jpeg)-->\n"
],
"metadata": {
"id": "5jtVKLTVoy6u"
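`roc_auc()` can be read as the probability that a randomly chosen positive example receives a higher predicted probability than a randomly chosen negative one. A minimal pairwise Python sketch of that interpretation (illustrative only; yardstick uses a trapezoidal computation, not this loop):

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# one positive/negative pair is mis-ranked (0.4 < 0.6), so 3 of 4 pairs win
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```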

@ -12,7 +12,7 @@ output:
## Build a logistic regression model - Lesson 4
![Infographic by Dasani Madipalli](../images/logistic-linear.png){width="600"}
![Infographic by Dasani Madipalli](../../images/logistic-linear.png){width="600"}
#### ** [Pre-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/15/)**
@ -70,7 +70,7 @@ Logistic regression differs from linear regression, which you learned about prev
Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` ("orange or not orange") whereas the latter is capable of predicting `continuous values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.
![Infographic by Dasani Madipalli](../images/pumpkin-classifier.png){width="600"}
![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png){width="600"}
#### **Other classifications**
@ -80,7 +80,7 @@ There are other types of logistic regression, including multinomial and ordinal:
- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).
![Infographic by Dasani Madipalli](../images/multinomial-ordinal.png){width="600"}
![Infographic by Dasani Madipalli](../../images/multinomial-ordinal.png){width="600"}
\
**It's still linear**
@ -244,7 +244,7 @@ Now that we have an idea of the relationship between the binary categories of co
>
> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
>
> ![](../images/sigmoid.png)
> ![](../../images/sigmoid.png)
>
> where the sigmoid's midpoint sits at x = 0, L is the curve's maximum value, and k is the curve's steepness. If the function's output is greater than 0.5, the label in question is assigned class 1 of the binary choice; otherwise, it is assigned class 0.
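>
> Written out, the image above is the general logistic function in its standard notation:
>
> $$f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$$
>
> where $x_0$ is the sigmoid's midpoint. With $L = 1$, $k = 1$, and $x_0 = 0$, it reduces to the familiar $\sigma(x) = \frac{1}{1 + e^{-x}}$.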
@ -425,6 +425,6 @@ But for now, congratulations 🎉🎉🎉! You've completed these regression les
You R awesome!
![Artwork by \@allison_horst](../images/r_learners_sm.jpeg)
![Artwork by \@allison_horst](../../images/r_learners_sm.jpeg)

@ -1,6 +1,6 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_10-R.ipynb",
@ -18,25 +18,22 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "ItETB4tSFprR"
},
"source": [
"# Build a classification model: Delicious Asian and Indian Cuisines"
]
],
"metadata": {
"id": "ItETB4tSFprR"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "ri5bQxZ-Fz_0"
},
"source": [
"## Introduction to classification: Clean, prep, and visualize your data\n",
"\n",
"In these four lessons, you will explore a fundamental focus of classic machine learning - *classification*. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!\n",
"\n",
"<p >\n",
" <img src=\"../images/pinch.png\"\n",
" <img src=\"../../images/pinch.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper</figcaption>\n",
"\n",
@ -62,7 +59,7 @@
"To state the process in a more scientific way, your classification method creates a predictive model that enables you to map the relationship between input variables to output variables.\n",
"\n",
"<p >\n",
" <img src=\"../images/binary-multiclass.png\"\n",
" <img src=\"../../images/binary-multiclass.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Binary vs. multiclass problems for classification algorithms to handle. Infographic by Jen Looper</figcaption>\n",
"\n",
@ -97,48 +94,49 @@
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"here\"))`\n",
"\n",
"Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing."
]
],
"metadata": {
"id": "ri5bQxZ-Fz_0"
}
},
{
"cell_type": "code",
"metadata": {
"id": "KIPxa4elGAPI"
},
"execution_count": null,
"source": [
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "KIPxa4elGAPI"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "YkKAxOJvGD4C"
},
"source": [
"We'll later load these awesome packages and make them available in our current R session. (This is just for illustration; `pacman::p_load()` already did that for you.)"
]
],
"metadata": {
"id": "YkKAxOJvGD4C"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "PFkQDlk0GN5O"
},
"source": [
"## Exercise - clean and balance your data\n",
"\n",
"The first task at hand, before starting this project, is to clean and **balance** your data to get better results.\n",
"\n",
"Let's meet the data!🕵️"
]
],
"metadata": {
"id": "PFkQDlk0GN5O"
}
},
{
"cell_type": "code",
"metadata": {
"id": "Qccw7okxGT0S"
},
"execution_count": null,
"source": [
"# Import data\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\n",
@ -147,23 +145,23 @@
"df %>% \n",
" slice_head(n = 5)\n"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "Qccw7okxGT0S"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "XrWnlgSrGVmR"
},
"source": [
"Interesting! From the looks of it, the first column is a kind of `id` column. Let's get a little more information about the data."
]
],
"metadata": {
"id": "XrWnlgSrGVmR"
}
},
{
"cell_type": "code",
"metadata": {
"id": "4UcGmxRxGieA"
},
"execution_count": null,
"source": [
"# Basic information about the data\n",
"df %>%\n",
@ -173,27 +171,27 @@
"df %>% \n",
" plot_intro(ggtheme = theme_light())"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "4UcGmxRxGieA"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "AaPubl__GmH5"
},
"source": [
"From the output, we can immediately see that we have `2448` rows, `385` columns, and `0` missing values. We also have one discrete column, *cuisine*.\n",
"\n",
"## Exercise - learning about cuisines\n",
"\n",
"Now the work starts to become more interesting. Let's discover the distribution of data, per cuisine."
]
],
"metadata": {
"id": "AaPubl__GmH5"
}
},
{
"cell_type": "code",
"metadata": {
"id": "FRsBVy5eGrrv"
},
"execution_count": null,
"source": [
"# Count observations per cuisine\n",
"df %>% \n",
@ -208,31 +206,31 @@
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
" ylab(\"cuisine\")"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "FRsBVy5eGrrv"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "vVvyDb1kG2in"
},
"source": [
"There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.\n",
"\n",
"Next, let's assign each cuisine into its individual table and find out how much data is available (rows, columns) per cuisine.\n",
"\n",
"<p >\n",
" <img src=\"../images/dplyr_filter.jpg\"\n",
" <img src=\"../../images/dplyr_filter.jpg\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n"
]
],
"metadata": {
"id": "vVvyDb1kG2in"
}
},
{
"cell_type": "code",
"metadata": {
"id": "0TvXUxD3G8Bk"
},
"execution_count": null,
"source": [
"# Create individual tables for the cuisines\n",
"thai_df <- df %>% \n",
@ -254,14 +252,13 @@
" \"indian_df:\", dim(indian_df), \"\\n\",\n",
" \"korean_df:\", dim(korean_df))"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "0TvXUxD3G8Bk"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "K3RF5bSCHC76"
},
"source": [
"Perfect!😋\n",
"\n",
@ -297,13 +294,14 @@
"- `dplyr::mutate()`: helps you to create or modify columns.\n",
"\n",
"Check out this [*art*-filled learnr tutorial](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) by Allison Horst, which introduces some useful data wrangling functions in dplyr *(part of the Tidyverse)*."
]
],
"metadata": {
"id": "K3RF5bSCHC76"
}
},
{
"cell_type": "code",
"metadata": {
"id": "uB_0JR82HTPa"
},
"execution_count": null,
"source": [
"# Create a function that returns the top ingredients by class\n",
"\n",
@ -325,23 +323,23 @@
" return(ingredient_df)\n",
"} # End of function"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "uB_0JR82HTPa"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "h9794WF8HWmc"
},
"source": [
"Now we can use the function to get an idea of the top ten most popular ingredients per cuisine. Let's take it for a spin with `thai_df`.\n"
]
],
"metadata": {
"id": "h9794WF8HWmc"
}
},
{
"cell_type": "code",
"metadata": {
"id": "agQ-1HrcHaEA"
},
"execution_count": null,
"source": [
"# Call create_ingredient and display popular ingredients\n",
"thai_ingredient_df <- create_ingredient(df = thai_df)\n",
@ -349,23 +347,23 @@
"thai_ingredient_df %>% \n",
" slice_head(n = 10)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "agQ-1HrcHaEA"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "kHu9ffGjHdcX"
},
"source": [
"In the previous section we used `geom_col()`; now let's see how you can also use `geom_bar()` to create bar charts. Use `?geom_bar` for further reading."
]
],
"metadata": {
"id": "kHu9ffGjHdcX"
}
},
{
"cell_type": "code",
"metadata": {
"id": "fb3Bx_3DHj6e"
},
"execution_count": null,
"source": [
"# Make a bar chart for popular thai cuisines\n",
"thai_ingredient_df %>% \n",
@ -374,23 +372,23 @@
" geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\n",
" xlab(\"\") + ylab(\"\")"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "fb3Bx_3DHj6e"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "RHP_xgdkHnvM"
},
"source": [
"Let's do the same for the Japanese data"
]
],
"metadata": {
"id": "RHP_xgdkHnvM"
}
},
{
"cell_type": "code",
"metadata": {
"id": "019v8F0XHrRU"
},
"execution_count": null,
"source": [
"# Get popular ingredients for Japanese cuisines and make bar chart\n",
"create_ingredient(df = japanese_df) %>% \n",
@ -399,23 +397,23 @@
" geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\n",
" xlab(\"\") + ylab(\"\")\n"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "019v8F0XHrRU"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "iIGM7vO8Hu3v"
},
"source": [
"What about the Chinese cuisines?\n"
]
],
"metadata": {
"id": "iIGM7vO8Hu3v"
}
},
{
"cell_type": "code",
"metadata": {
"id": "lHd9_gd2HyzU"
},
"execution_count": null,
"source": [
"# Get popular ingredients for Chinese cuisines and make bar chart\n",
"create_ingredient(df = chinese_df) %>% \n",
@ -424,23 +422,23 @@
" geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\n",
" xlab(\"\") + ylab(\"\")"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "lHd9_gd2HyzU"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "ir8qyQbNH1c7"
},
"source": [
"Let's take a look at the Indian cuisines 🌶️."
]
],
"metadata": {
"id": "ir8qyQbNH1c7"
}
},
{
"cell_type": "code",
"metadata": {
"id": "ApukQtKjH5FO"
},
"execution_count": null,
"source": [
"# Get popular ingredients for Indian cuisines and make bar chart\n",
"create_ingredient(df = indian_df) %>% \n",
@ -449,23 +447,23 @@
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\n",
" xlab(\"\") + ylab(\"\")"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "ApukQtKjH5FO"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "qv30cwY1H-FM"
},
"source": [
"Finally, plot the Korean ingredients."
]
],
"metadata": {
"id": "qv30cwY1H-FM"
}
},
{
"cell_type": "code",
"metadata": {
"id": "lumgk9cHIBie"
},
"execution_count": null,
"source": [
"# Get popular ingredients for Korean cuisines and make bar chart\n",
"create_ingredient(df = korean_df) %>% \n",
@ -474,25 +472,25 @@
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\n",
" xlab(\"\") + ylab(\"\")"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "lumgk9cHIBie"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "iO4veMXuIEta"
},
"source": [
"From the data visualizations, we can now drop the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.\n",
"\n",
"Everyone loves rice, garlic and ginger!"
]
],
"metadata": {
"id": "iO4veMXuIEta"
}
},
{
"cell_type": "code",
"metadata": {
"id": "iHJPiG6rIUcK"
},
"execution_count": null,
"source": [
"# Drop id column, rice, garlic and ginger from our original data set\n",
"df_select <- df %>% \n",
@ -502,41 +500,41 @@
"df_select %>% \n",
" slice_head(n = 5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "iHJPiG6rIUcK"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "kkFd-JxdIaL6"
},
"source": [
"## Preprocessing data using recipes 👩‍🍳👨‍🍳 - Dealing with imbalanced data ⚖️\n",
"\n",
"<p >\n",
" <img src=\"../images/recipes.png\"\n",
" <img src=\"../../images/recipes.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"Given that this lesson is about cuisines, we have to put `recipes` into context.\n",
"\n",
"Tidymodels provides yet another neat package: `recipes`, a package for preprocessing data.\n"
]
],
"metadata": {
"id": "kkFd-JxdIaL6"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "6l2ubtTPJAhY"
},
"source": [
"Let's take a look at the distribution of our cuisines again.\n"
]
],
"metadata": {
"id": "6l2ubtTPJAhY"
}
},
{
"cell_type": "code",
"metadata": {
"id": "1e-E9cb7JDVi"
},
"execution_count": null,
"source": [
"# Distribution of cuisines\n",
"old_label_count <- df_select %>% \n",
@ -545,14 +543,13 @@
"\n",
"old_label_count"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "1e-E9cb7JDVi"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "soAw6826JKx9"
},
"source": [
"\n",
"As you can see, the cuisines are quite unequally distributed: there are almost three times as many Korean observations as Thai ones. Imbalanced data often has negative effects on model performance. Think about a binary classification: if most of your data belongs to one class, an ML model is going to predict that class more frequently, just because there is more data for it. Balancing the data removes this skew. Many models perform best when the number of observations per class is equal and, thus, tend to struggle with unbalanced data.\n",
@ -564,13 +561,14 @@
"- removing observations from majority class: `Under-sampling`\n",
"\n",
"Let's now demonstrate how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set in order to get it ready for data analysis."
]
],
"metadata": {
"id": "soAw6826JKx9"
}
},
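The over-sampling idea described above (adding observations to the minority class until all classes match) can be sketched in Python by random duplication. This is illustrative only; `step_smote()` in themis synthesizes *new* examples by interpolating between neighbors rather than copying rows:

```python
import random

def oversample(rows, labels, seed=2056):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # duplicate randomly chosen members until this class reaches the target size
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

rows, labels = oversample([1, 2, 3, 4], ["korean", "korean", "korean", "thai"])
# the single "thai" row is duplicated until both classes have 3 observations
```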
{
"cell_type": "code",
"metadata": {
"id": "HS41brUIJVJy"
},
"execution_count": null,
"source": [
"# Load themis package for dealing with imbalanced data\n",
"library(themis)\n",
@ -581,14 +579,13 @@
"\n",
"cuisines_recipe"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "HS41brUIJVJy"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "Yb-7t7XcJaC8"
},
"source": [
"Let's break down our preprocessing steps.\n",
"\n",
@ -601,13 +598,14 @@
"`prep()`: estimates the required parameters from a training set that can be later applied to other data sets.\n",
"\n",
"`bake()`: takes a prepped recipe and applies the operations to any data set.\n"
]
],
"metadata": {
"id": "Yb-7t7XcJaC8"
}
},
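The `prep()`/`bake()` split mirrors the familiar fit/transform pattern: parameters are estimated once from one data set, then applied to any other. A minimal Python sketch of that pattern (illustrative, not the recipes API), using zero-based integer encoding as the estimated step:

```python
class IntegerRecipe:
    """Estimate a category-to-integer mapping on one data set, apply it to any other."""

    def prep(self, column):
        # "estimate": learn a zero-based integer code per category, in sorted order
        self.mapping = {level: i for i, level in enumerate(sorted(set(column)))}
        return self

    def bake(self, column):
        # "apply": use the learned mapping on new data without re-estimating it
        return [self.mapping[level] for level in column]

recipe = IntegerRecipe().prep(["ORANGE", "WHITE", "ORANGE"])
codes = recipe.bake(["WHITE", "ORANGE"])
```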
{
"cell_type": "code",
"metadata": {
"id": "9QhSgdpxJl44"
},
"execution_count": null,
"source": [
"# Prep and bake the recipe\n",
"preprocessed_df <- cuisines_recipe %>% \n",
@ -623,23 +621,23 @@
"preprocessed_df %>% \n",
" introduce()"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "9QhSgdpxJl44"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "dmidELh_LdV7"
},
"source": [
"Let's now check the distribution of our cuisines and compare them with the imbalanced data."
]
],
"metadata": {
"id": "dmidELh_LdV7"
}
},
{
"cell_type": "code",
"metadata": {
"id": "aSh23klBLwDz"
},
"execution_count": null,
"source": [
"# Distribution of cuisines\n",
"new_label_count <- preprocessed_df %>% \n",
@ -649,14 +647,13 @@
"list(new_label_count = new_label_count,\n",
" old_label_count = old_label_count)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "aSh23klBLwDz"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "HEu80HZ8L7ae"
},
"source": [
"Yum! The data is nice and clean, balanced, and very delicious 😋!\n",
"\n",
@ -667,25 +664,25 @@
"> When you **`bake()`** a prepped recipe with **`new_data = NULL`**, you get the data that you provided when defining the recipe back, but having undergone the preprocessing steps.\n",
"\n",
"Let's now save a copy of this data for use in future lessons:\n"
]
],
"metadata": {
"id": "HEu80HZ8L7ae"
}
},
{
"cell_type": "code",
"metadata": {
"id": "cBmCbIgrMOI6"
},
"execution_count": null,
"source": [
"# Save preprocessed data\n",
"write_csv(preprocessed_df, \"../../data/cleaned_cuisines_R.csv\")"
"write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "cBmCbIgrMOI6"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "WQs5621pMGwf"
},
"source": [
"This fresh CSV can now be found in the root data folder.\n",
"\n",
@ -710,10 +707,13 @@
"[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
"\n",
"<p >\n",
" <img src=\"../images/r_learners_sm.jpeg\"\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
]
],
"metadata": {
"id": "WQs5621pMGwf"
}
}
]
}

@ -14,7 +14,7 @@ output:
In these four lessons, you will explore a fundamental focus of classic machine learning - *classification*. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!
![Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper](../images/pinch.png)
![Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper](../../images/pinch.png)
Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. In classification, you train a model to predict which `category` an item belongs to. If machine learning is all about using datasets to predict values or assign names to things, then classification generally falls into two groups: *binary classification* and *multiclass classification*.
@ -417,4 +417,4 @@ This curriculum contains several interesting datasets. Dig through the `data` fo
[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️
![Artwork by \@allison_horst](../images/r_learners_sm.jpeg)
![Artwork by \@allison_horst](../../images/r_learners_sm.jpeg)

@ -0,0 +1 @@
This is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder

@ -0,0 +1 @@
This is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder

@ -0,0 +1 @@
This is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder

@ -0,0 +1 @@
This is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder

@ -0,0 +1 @@
This is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder

@ -0,0 +1 @@
This is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder

@ -0,0 +1 @@
This is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder

@ -0,0 +1 @@
This is a temporary placeholder

@ -0,0 +1 @@
this is a temporary placeholder