Suppress pacman warning and render images better for lessons 05 and 06

pull/288/head
Eric 4 years ago
parent 2521fc2ba4
commit 108ffc45f2

@ -1,6 +1,6 @@
{ {
"nbformat": 4, "nbformat": 4,
"nbformat_minor": 0, "nbformat_minor": 2,
"metadata": { "metadata": {
"colab": { "colab": {
"name": "lesson_1-R.ipynb", "name": "lesson_1-R.ipynb",
@ -19,37 +19,39 @@
"cells": [ "cells": [
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "YJUHCXqK57yz"
},
"source": [ "source": [
"#Build a regression model: Get started with R and Tidymodels for regression models" "#Build a regression model: Get started with R and Tidymodels for regression models"
] ],
"metadata": {
"id": "YJUHCXqK57yz"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"source": [
"## Introduction to Regression - Lesson 1\r\n",
"\r\n",
"#### Putting it into perspective\r\n",
"\r\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\r\n",
"\r\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\r\n",
"\r\n",
"That said, let's get started on this task!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/encouRage.jpg\"\r\n",
" width=\"630\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst-->"
],
"metadata": { "metadata": {
"id": "LWNNzfqd6feZ" "id": "LWNNzfqd6feZ"
}, }
"source": [
"## Introduction to Regression - Lesson 1\n",
"\n",
"#### Putting it into perspective\n",
"\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\n",
"\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\n",
"\n",
"That said, let's get started on this task!\n",
"\n",
"![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst"
]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "FIo2YhO26wI9"
},
"source": [ "source": [
"## 1. Loading up our tool set\n", "## 1. Loading up our tool set\n",
"\n", "\n",
@ -64,62 +66,62 @@
"`install.packages(c(\"tidyverse\", \"tidymodels\"))`\n", "`install.packages(c(\"tidyverse\", \"tidymodels\"))`\n",
"\n", "\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing." "The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing."
] ],
"metadata": {
"id": "FIo2YhO26wI9"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": 2,
"id": "cIA9fz9v7Dss",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2df7073b-86b2-4b32-cb86-0da605a0dc11"
},
"source": [ "source": [
"if (!require(\"pacman\")) install.packages(\"pacman\")\n", "suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"pacman::p_load(tidyverse, tidymodels)" "pacman::p_load(tidyverse, tidymodels)"
], ],
"execution_count": 2,
"outputs": [ "outputs": [
{ {
"output_type": "stream", "output_type": "stream",
"name": "stderr",
"text": [ "text": [
"Loading required package: pacman\n", "Loading required package: pacman\n",
"\n" "\n"
]
}
], ],
"name": "stderr" "metadata": {
"id": "cIA9fz9v7Dss",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2df7073b-86b2-4b32-cb86-0da605a0dc11"
} }
]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "gpO_P_6f9WUG"
},
"source": [ "source": [
"Now, let's load these awesome packages and make them available in our current R session.(This is for mere illustration, `pacman::p_load()` already did that for you)" "Now, let's load these awesome packages and make them available in our current R session.(This is for mere illustration, `pacman::p_load()` already did that for you)"
] ],
"metadata": {
"id": "gpO_P_6f9WUG"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "NLMycgG-9ezO"
},
"source": [ "source": [
"# load the core Tidyverse packages\n", "# load the core Tidyverse packages\r\n",
"library(tidyverse)\n", "library(tidyverse)\r\n",
"\n", "\r\n",
"# load the core Tidymodels packages\n", "# load the core Tidymodels packages\r\n",
"library(tidymodels)\n" "library(tidymodels)\r\n"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "NLMycgG-9ezO"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "KM6iXLH996Cl"
},
"source": [ "source": [
"## 2. The diabetes dataset\n", "## 2. The diabetes dataset\n",
"\n", "\n",
@ -156,34 +158,34 @@
"Before going any further, let's also introduce something you will encounter often in R code 🥁🥁: the pipe operator `%>%`\n", "Before going any further, let's also introduce something you will encounter often in R code 🥁🥁: the pipe operator `%>%`\n",
"\n", "\n",
"The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code." "The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code."
] ],
"metadata": {
"id": "KM6iXLH996Cl"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "Z1geAMhM-bSP"
},
"source": [ "source": [
"# Import the data set\n", "# Import the data set\r\n",
"diabetes <- read_table2(file = \"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt\")\n", "diabetes <- read_table2(file = \"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt\")\r\n",
"\n", "\r\n",
"\n", "\r\n",
"# Get a glimpse and dimensions of the data\n", "# Get a glimpse and dimensions of the data\r\n",
"glimpse(diabetes)\n", "glimpse(diabetes)\r\n",
"\n", "\r\n",
"\n", "\r\n",
"# Select the first 5 rows of the data\n", "# Select the first 5 rows of the data\r\n",
"diabetes %>% \n", "diabetes %>% \r\n",
" slice(1:5)" " slice(1:5)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "Z1geAMhM-bSP"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "UwjVT1Hz-c3Z"
},
"source": [ "source": [
"`glimpse()` shows us that this data has 442 rows and 11 columns with all the columns being of data type `double` \n", "`glimpse()` shows us that this data has 442 rows and 11 columns with all the columns being of data type `double` \n",
"\n", "\n",
@ -198,65 +200,65 @@
"Now that we have the data, let's narrow down to one feature (`bmi`) to target for this exercise. This will require us to select the desired columns. So, how do we do this?\n", "Now that we have the data, let's narrow down to one feature (`bmi`) to target for this exercise. This will require us to select the desired columns. So, how do we do this?\n",
"\n", "\n",
"[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) allows us to *select* (and optionally rename) columns in a data frame." "[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) allows us to *select* (and optionally rename) columns in a data frame."
] ],
"metadata": {
"id": "UwjVT1Hz-c3Z"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "RDY1oAKI-m80"
},
"source": [ "source": [
"# Select predictor feature `bmi` and outcome `y`\n", "# Select predictor feature `bmi` and outcome `y`\r\n",
"diabetes_select <- diabetes %>% \n", "diabetes_select <- diabetes %>% \r\n",
" select(c(bmi, y))\n", " select(c(bmi, y))\r\n",
"\n", "\r\n",
"# Print the first 5 rows\n", "# Print the first 5 rows\r\n",
"diabetes_select %>% \n", "diabetes_select %>% \r\n",
" slice(1:10)" " slice(1:10)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "RDY1oAKI-m80"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "SDk668xK-tc3"
},
"source": [ "source": [
"## 3. Training and Testing data\n", "## 3. Training and Testing data\n",
"\n", "\n",
"It's common practice in supervised learning to *split* the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to see how the model performed.\n", "It's common practice in supervised learning to *split* the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to see how the model performed.\n",
"\n", "\n",
"Now that we have data ready, we can see if a machine can help determine a logical split between the numbers in this dataset. We can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework, to create an object that contains the information on *how* to split the data, and then two more rsample functions to extract the created training and testing sets:\n" "Now that we have data ready, we can see if a machine can help determine a logical split between the numbers in this dataset. We can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework, to create an object that contains the information on *how* to split the data, and then two more rsample functions to extract the created training and testing sets:\n"
] ],
"metadata": {
"id": "SDk668xK-tc3"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "EqtHx129-1h-"
},
"source": [ "source": [
"set.seed(2056)\n", "set.seed(2056)\r\n",
"# Split 67% of the data for training and the rest for tesing\n", "# Split 67% of the data for training and the rest for tesing\r\n",
"diabetes_split <- diabetes_select %>% \n", "diabetes_split <- diabetes_select %>% \r\n",
" initial_split(prop = 0.67)\n", " initial_split(prop = 0.67)\r\n",
"\n", "\r\n",
"# Extract the resulting train and test sets\n", "# Extract the resulting train and test sets\r\n",
"diabetes_train <- training(diabetes_split)\n", "diabetes_train <- training(diabetes_split)\r\n",
"diabetes_test <- testing(diabetes_split)\n", "diabetes_test <- testing(diabetes_split)\r\n",
"\n", "\r\n",
"# Print the first 3 rows of the training set\n", "# Print the first 3 rows of the training set\r\n",
"diabetes_train %>% \n", "diabetes_train %>% \r\n",
" slice(1:10)" " slice(1:10)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "EqtHx129-1h-"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "sBOS-XhB-6v7"
},
"source": [ "source": [
"## 4. Train a linear regression model with Tidymodels\n", "## 4. Train a linear regression model with Tidymodels\n",
"\n", "\n",
@ -271,68 +273,68 @@
"- Model **engine** is the computational tool which will be used to fit the model. Often these are R packages, such as **`\"lm\"`** or **`\"ranger\"`**\n", "- Model **engine** is the computational tool which will be used to fit the model. Often these are R packages, such as **`\"lm\"`** or **`\"ranger\"`**\n",
"\n", "\n",
"This modeling information is captured in a model specification, so let's build one!" "This modeling information is captured in a model specification, so let's build one!"
] ],
"metadata": {
"id": "sBOS-XhB-6v7"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "20OwEw20--t3"
},
"source": [ "source": [
"# Build a linear model specification\n", "# Build a linear model specification\r\n",
"lm_spec <- \n", "lm_spec <- \r\n",
" # Type\n", " # Type\r\n",
" linear_reg() %>% \n", " linear_reg() %>% \r\n",
" # Engine\n", " # Engine\r\n",
" set_engine(\"lm\") %>% \n", " set_engine(\"lm\") %>% \r\n",
" # Mode\n", " # Mode\r\n",
" set_mode(\"regression\")\n", " set_mode(\"regression\")\r\n",
"\n", "\r\n",
"\n", "\r\n",
"# Print the model specification\n", "# Print the model specification\r\n",
"lm_spec" "lm_spec"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "20OwEw20--t3"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "_oDHs89k_CJj"
},
"source": [ "source": [
"After a model has been *specified*, the model can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, typically using a formula and some data.\n", "After a model has been *specified*, the model can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, typically using a formula and some data.\n",
"\n", "\n",
"`y ~ .` means we'll fit `y` as the predicted quantity/target, explained by all the predictors/features ie, `.` (in this case, we only have one predictor: `bmi` )" "`y ~ .` means we'll fit `y` as the predicted quantity/target, explained by all the predictors/features ie, `.` (in this case, we only have one predictor: `bmi` )"
] ],
"metadata": {
"id": "_oDHs89k_CJj"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "YlsHqd-q_GJQ"
},
"source": [ "source": [
"# Build a linear model specification\n", "# Build a linear model specification\r\n",
"lm_spec <- linear_reg() %>% \n", "lm_spec <- linear_reg() %>% \r\n",
" set_engine(\"lm\") %>%\n", " set_engine(\"lm\") %>%\r\n",
" set_mode(\"regression\")\n", " set_mode(\"regression\")\r\n",
"\n", "\r\n",
"\n", "\r\n",
"# Train a linear regression model\n", "# Train a linear regression model\r\n",
"lm_mod <- lm_spec %>% \n", "lm_mod <- lm_spec %>% \r\n",
" fit(y ~ ., data = diabetes_train)\n", " fit(y ~ ., data = diabetes_train)\r\n",
"\n", "\r\n",
"# Print the model\n", "# Print the model\r\n",
"lm_mod" "lm_mod"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "YlsHqd-q_GJQ"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "kGZ22RQj_Olu"
},
"source": [ "source": [
"From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that gives us the lowest overall error between the actual and predicted variable.\n", "From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that gives us the lowest overall error between the actual and predicted variable.\n",
"<br>\n", "<br>\n",
@ -340,97 +342,100 @@
"## 5. Make predictions on the test set\n", "## 5. Make predictions on the test set\n",
"\n", "\n",
"Now that we've trained a model, we can use it to predict the disease progression y for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). This will be used to draw the line between data groups." "Now that we've trained a model, we can use it to predict the disease progression y for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). This will be used to draw the line between data groups."
] ],
"metadata": {
"id": "kGZ22RQj_Olu"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "nXHbY7M2_aao"
},
"source": [ "source": [
"# Make predictions for the test set\n", "# Make predictions for the test set\r\n",
"predictions <- lm_mod %>% \n", "predictions <- lm_mod %>% \r\n",
" predict(new_data = diabetes_test)\n", " predict(new_data = diabetes_test)\r\n",
"\n", "\r\n",
"# Print out some of the predictions\n", "# Print out some of the predictions\r\n",
"predictions %>% \n", "predictions %>% \r\n",
" slice(1:5)" " slice(1:5)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "nXHbY7M2_aao"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "R_JstwUY_bIs"
},
"source": [ "source": [
"Woohoo! 💃🕺 We just trained a model and used it to make predictions!\n", "Woohoo! 💃🕺 We just trained a model and used it to make predictions!\n",
"\n", "\n",
"When making predictions, the tidymodels convention is to always produce a tibble/data frame of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format for subsequent operations such as plotting.\n", "When making predictions, the tidymodels convention is to always produce a tibble/data frame of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format for subsequent operations such as plotting.\n",
"\n", "\n",
"`dplyr::bind_cols()` efficiently binds multiple data frames column." "`dplyr::bind_cols()` efficiently binds multiple data frames column."
] ],
"metadata": {
"id": "R_JstwUY_bIs"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "RybsMJR7_iI8"
},
"source": [ "source": [
"# Combine the predictions and the original test set\n", "# Combine the predictions and the original test set\r\n",
"results <- diabetes_test %>% \n", "results <- diabetes_test %>% \r\n",
" bind_cols(predictions)\n", " bind_cols(predictions)\r\n",
"\n", "\r\n",
"\n", "\r\n",
"results %>% \n", "results %>% \r\n",
" slice(1:5)" " slice(1:5)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "RybsMJR7_iI8"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "XJbYbMZW_n_s"
},
"source": [ "source": [
"## 6. Plot modelling results\n", "## 6. Plot modelling results\n",
"\n", "\n",
"Now, its time to see this visually 📈. We'll create a scatter plot of all the `y` and `bmi` values of the test set, then use the predictions to draw a line in the most appropriate place, between the model's data groupings.\n", "Now, its time to see this visually 📈. We'll create a scatter plot of all the `y` and `bmi` values of the test set, then use the predictions to draw a line in the most appropriate place, between the model's data groupings.\n",
"\n", "\n",
"R has several systems for making graphs, but `ggplot2` is one of the most elegant and most versatile. This allows you to compose graphs by **combining independent components**." "R has several systems for making graphs, but `ggplot2` is one of the most elegant and most versatile. This allows you to compose graphs by **combining independent components**."
] ],
"metadata": {
"id": "XJbYbMZW_n_s"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "R9tYp3VW_sTn"
},
"source": [ "source": [
"# Set a theme for the plot\n", "# Set a theme for the plot\r\n",
"theme_set(theme_light())\n", "theme_set(theme_light())\r\n",
"# Create a scatter plot\n", "# Create a scatter plot\r\n",
"results %>% \n", "results %>% \r\n",
" ggplot(aes(x = bmi)) +\n", " ggplot(aes(x = bmi)) +\r\n",
" # Add a scatter plot\n", " # Add a scatter plot\r\n",
" geom_point(aes(y = y), size = 1.6) +\n", " geom_point(aes(y = y), size = 1.6) +\r\n",
" # Add a line plot\n", " # Add a line plot\r\n",
" geom_line(aes(y = .pred), color = \"blue\", size = 1.5)" " geom_line(aes(y = .pred), color = \"blue\", size = 1.5)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "R9tYp3VW_sTn"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "zrPtHIxx_tNI"
},
"source": [ "source": [
"> ✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.\n", "> ✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.\n",
"\n", "\n",
"Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!\n" "Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!\n"
] ],
"metadata": {
"id": "zrPtHIxx_tNI"
}
} }
] ]
} }
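
Note on the `pacman` guard changed in this commit: the sketch below is one way to see what the `suppressWarnings()` wrapper does. `require()` returns `FALSE` and emits a warning when a package cannot be loaded; the wrapper hides the warning but leaves the return value intact. The package name `notARealPackage123` and the variable names are assumptions chosen for illustration and do not appear in the notebooks.

```r
# Hypothetical package name, chosen only so that the lookup fails.
missing_pkg <- "notARealPackage123"

# require() returns FALSE (and emits a warning) when a package cannot be loaded.
# suppressWarnings() hides that warning but still passes the value through.
is_available <- suppressWarnings(require(missing_pkg, character.only = TRUE))

# The same guard pattern the notebooks use, shown for a package that ships
# with R. `stats` is always available, so install.packages() is never reached.
suppressWarnings(if (!require("stats")) install.packages("stats"))
```

Only warnings are silenced here. Messages such as "Loading required package: pacman" are not warnings, so they may still appear in the cell output, as in the recorded `stderr` stream above.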

@ -1,6 +1,6 @@
{ {
"nbformat": 4, "nbformat": 4,
"nbformat_minor": 0, "nbformat_minor": 2,
"metadata": { "metadata": {
"colab": { "colab": {
"name": "lesson_2-R.ipynb", "name": "lesson_2-R.ipynb",
@ -19,35 +19,39 @@
"cells": [ "cells": [
{ {
"cell_type": "markdown", "cell_type": "markdown",
"source": [
"# Build a regression model: prepare and visualize data\r\n",
"\r\n",
"## **Linear Regression for Pumpkins - Lesson 2**\r\n",
"#### Introduction\r\n",
"\r\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\r\n",
"\r\n",
"In this lesson, you will learn:\r\n",
"\r\n",
"- How to prepare your data for model-building.\r\n",
"\r\n",
"- How to use `ggplot2` for data visualization.\r\n",
"\r\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\r\n",
"\r\n",
"Let's see this by working through a practical exercise.\r\n",
"\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/unruly_data.jpg\"\r\n",
" width=\"700\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/unruly_data.jpg)<br>Artwork by \\@allison_horst-->"
],
"metadata": { "metadata": {
"id": "Pg5aexcOPqAZ" "id": "Pg5aexcOPqAZ"
}, }
"source": [
"# Build a regression model: prepare and visualize data\n",
"\n",
"## **Linear Regression for Pumpkins - Lesson 2**\n",
"#### Introduction\n",
"\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- How to prepare your data for model-building.\n",
"\n",
"- How to use `ggplot2` for data visualization.\n",
"\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\n",
"\n",
"Let's see this by working through a practical exercise.\n",
"\n",
"![Artwork by \\@allison_horst](../images/unruly_data.jpg)<br>Artwork by \\@allison_horst"
]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "dc5WhyVdXAjR"
},
"source": [ "source": [
"## 1. Importing pumpkins data and summoning the Tidyverse\n", "## 1. Importing pumpkins data and summoning the Tidyverse\n",
"\n", "\n",
@ -60,58 +64,58 @@
"`install.packages(c(\"tidyverse\"))`\n", "`install.packages(c(\"tidyverse\"))`\n",
"\n", "\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing." "The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing."
] ],
"metadata": {
"id": "dc5WhyVdXAjR"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "GqPYUZgfXOBt"
},
"source": [ "source": [
"if (!require(\"pacman\")) install.packages(\"pacman\")\n", "suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"pacman::p_load(tidyverse)" "pacman::p_load(tidyverse)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "GqPYUZgfXOBt"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "kvjDTPDSXRr2"
},
"source": [ "source": [
"Now, let's fire up some packages and load the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!" "Now, let's fire up some packages and load the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!"
] ],
"metadata": {
"id": "kvjDTPDSXRr2"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "VMri-t2zXqgD"
},
"source": [ "source": [
"# Load the core Tidyverse packages\n", "# Load the core Tidyverse packages\r\n",
"library(tidyverse)\n", "library(tidyverse)\r\n",
"\n", "\r\n",
"# Import the pumpkins data\n", "# Import the pumpkins data\r\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n", "pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\r\n",
"\n", "\r\n",
"\n", "\r\n",
"# Get a glimpse and dimensions of the data\n", "# Get a glimpse and dimensions of the data\r\n",
"glimpse(pumpkins)\n", "glimpse(pumpkins)\r\n",
"\n", "\r\n",
"\n", "\r\n",
"# Print the first 50 rows of the data set\n", "# Print the first 50 rows of the data set\r\n",
"pumpkins %>% \n", "pumpkins %>% \r\n",
" slice_head(n =50)" " slice_head(n =50)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "VMri-t2zXqgD"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "REWcIv9yX29v"
},
"source": [ "source": [
"A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). The `Date` is of type character and there's also a strange column called `Package` where the data is a mix between `sacks`, `bins` and other values. The data, in fact, is a bit of a mess 😤.\n", "A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). The `Date` is of type character and there's also a strange column called `Package` where the data is a mix between `sacks`, `bins` and other values. The data, in fact, is a bit of a mess 😤.\n",
"\n", "\n",
@ -120,13 +124,13 @@
"\n", "\n",
"> A refresher: The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n", "> A refresher: The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n",
"\n" "\n"
] ],
"metadata": {
"id": "REWcIv9yX29v"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "Zxfb3AM5YbUe"
},
"source": [ "source": [
"## 2. Check for missing data\n", "## 2. Check for missing data\n",
"\n", "\n",
@ -135,105 +139,112 @@
"So how would we know that the data frame contains missing values?\n", "So how would we know that the data frame contains missing values?\n",
"<br>\n", "<br>\n",
"- One straight forward way would be to use the base R function `anyNA` which returns the logical objects `TRUE` or `FALSE`" "- One straight forward way would be to use the base R function `anyNA` which returns the logical objects `TRUE` or `FALSE`"
] ],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "G--DQutAYltj"
},
"source": [ "source": [
"pumpkins %>% \n", "pumpkins %>% \r\n",
" anyNA()" " anyNA()"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "G--DQutAYltj"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "mU-7-SB6YokF"
},
"source": [ "source": [
"Great, there seems to be some missing data! That's a good place to start.\n", "Great, there seems to be some missing data! That's a good place to start.\n",
"\n", "\n",
"- Another way would be to use the function `is.na()` that indicates which individual column elements are missing with a logical `TRUE`." "- Another way would be to use the function `is.na()` that indicates which individual column elements are missing with a logical `TRUE`."
] ],
"metadata": {
"id": "mU-7-SB6YokF"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "W-DxDOR4YxSW"
},
"source": [ "source": [
"pumpkins %>% \n", "pumpkins %>% \r\n",
" is.na() %>% \n", " is.na() %>% \r\n",
" head(n = 7)" " head(n = 7)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "W-DxDOR4YxSW"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "xUWxipKYY0o7"
},
"source": [ "source": [
"Okay, got the job done but with a large data frame such as this, it would be inefficient and practically impossible to review all of the rows and columns individually😴.\n", "Okay, got the job done but with a large data frame such as this, it would be inefficient and practically impossible to review all of the rows and columns individually😴.\n",
"\n", "\n",
"- A more intuitive way would be to calculate the sum of the missing values for each column:" "- A more intuitive way would be to calculate the sum of the missing values for each column:"
] ],
"metadata": {
"id": "xUWxipKYY0o7"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "ZRBWV6P9ZArL"
},
"source": [ "source": [
"pumpkins %>% \n", "pumpkins %>% \r\n",
" is.na() %>% \n", " is.na() %>% \r\n",
" colSums()" " colSums()"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "ZRBWV6P9ZArL"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "9gv-crB6ZD1Y"
},
"source": [ "source": [
"Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.\n", "Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.\n",
"\n", "\n",
"> Along with its awesome sets of packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function." "> Along with its awesome sets of packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function."
] ],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\r\n",
"\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/dplyr_wrangling.png\"\r\n",
" width=\"569\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/dplyr_wrangling.png)<br/>Artwork by \\@allison_horst-->"
],
"metadata": { "metadata": {
"id": "o4jLY5-VZO2C" "id": "o4jLY5-VZO2C"
}, }
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n",
"\n",
"![Artwork by \\@allison_horst](../images/dplyr_wrangling.png)<br/>Artwork by \\@allison_horst"
]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "i5o33MQBZWWw"
},
"source": [ "source": [
"[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!\n", "[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!\n",
"<br>\n" "<br>\n"
] ],
"metadata": {
"id": "i5o33MQBZWWw"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "x3VGMAGBZiUr"
},
"source": [ "source": [
"#### dplyr::select()\n", "#### dplyr::select()\n",
"\n", "\n",
@ -242,31 +253,31 @@
"To make your data frame easier to work with, drop several of its columns, using `select()`, keeping only the columns you need.\n", "To make your data frame easier to work with, drop several of its columns, using `select()`, keeping only the columns you need.\n",
"\n", "\n",
"For instance, in this exercise, our analysis will involve the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns." "For instance, in this exercise, our analysis will involve the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns."
] ],
"metadata": {
"id": "x3VGMAGBZiUr"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "F_FgxQnVZnM0"
},
"source": [ "source": [
"# Select desired columns\n", "# Select desired columns\r\n",
"pumpkins <- pumpkins %>% \n", "pumpkins <- pumpkins %>% \r\n",
" select(Package, `Low Price`, `High Price`, Date)\n", " select(Package, `Low Price`, `High Price`, Date)\r\n",
"\n", "\r\n",
"\n", "\r\n",
"# Print data set\n", "# Print data set\r\n",
"pumpkins %>% \n", "pumpkins %>% \r\n",
" slice_head(n = 5)" " slice_head(n = 5)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "F_FgxQnVZnM0"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "2KKo0Ed9Z1VB"
},
"source": [ "source": [
"#### dplyr::mutate()\n", "#### dplyr::mutate()\n",
"\n", "\n",
@ -283,66 +294,66 @@
"2. Extract the month from the dates to a new column.\n", "2. Extract the month from the dates to a new column.\n",
"\n", "\n",
"In R, the package [lubridate](https://lubridate.tidyverse.org/) makes it easier to work with Date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()`, `lubridate::month()` and see how to achieve the above objectives. We can drop the Date column since we won't be needing it again in subsequent operations." "In R, the package [lubridate](https://lubridate.tidyverse.org/) makes it easier to work with Date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()`, `lubridate::month()` and see how to achieve the above objectives. We can drop the Date column since we won't be needing it again in subsequent operations."
] ],
"metadata": {
"id": "2KKo0Ed9Z1VB"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "5joszIVSZ6xe"
},
"source": [ "source": [
"# Load lubridate\n", "# Load lubridate\r\n",
"library(lubridate)\n", "library(lubridate)\r\n",
"\n", "\r\n",
"pumpkins <- pumpkins %>% \n", "pumpkins <- pumpkins %>% \r\n",
" # Convert the Date column to a date object\n", " # Convert the Date column to a date object\r\n",
" mutate(Date = mdy(Date)) %>% \n", " mutate(Date = mdy(Date)) %>% \r\n",
" # Extract month from Date\n", " # Extract month from Date\r\n",
" mutate(Month = month(Date)) %>% \n", " mutate(Month = month(Date)) %>% \r\n",
" # Drop Date column\n", " # Drop Date column\r\n",
" select(-Date)\n", " select(-Date)\r\n",
"\n", "\r\n",
"# View the first few rows\n", "# View the first few rows\r\n",
"pumpkins %>% \n", "pumpkins %>% \r\n",
" slice_head(n = 7)" " slice_head(n = 7)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "5joszIVSZ6xe"
}
}, },
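Review note: a small sketch of how `lubridate::mdy()` and `lubridate::month()` behave, assuming dates written as month/day/year strings. The `toy_dates` tibble and its values are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical dates stored as month/day/year character strings
toy_dates <- tibble(Date = c("9/24/2016", "10/1/2016", "12/31/2016"))

toy_dates <- toy_dates %>%
  # Parse the character column into Date objects
  mutate(Date = mdy(Date)) %>%
  # Extract the month number (1-12) into a new column
  mutate(Month = month(Date))

toy_dates
```

Parsing first matters: `month()` needs a date-like value, so the character column is converted with `mdy()` before the month is extracted.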
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "nIgLjNMCZ-6Y"
},
"source": [ "source": [
"Woohoo! 🤩\n", "Woohoo! 🤩\n",
"\n", "\n",
"Next, let's create a new column `Price`, which represents the average price of a pumpkin. To populate it, we'll take the average of the `Low Price` and `High Price` columns.\n", "Next, let's create a new column `Price`, which represents the average price of a pumpkin. To populate it, we'll take the average of the `Low Price` and `High Price` columns.\n",
"<br>" "<br>"
] ],
"metadata": {
"id": "nIgLjNMCZ-6Y"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "Zo0BsqqtaJw2"
},
"source": [ "source": [
"# Create a new column Price\n", "# Create a new column Price\r\n",
"pumpkins <- pumpkins %>% \n", "pumpkins <- pumpkins %>% \r\n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n", " mutate(Price = (`Low Price` + `High Price`)/2)\r\n",
"\n", "\r\n",
"# View the first few rows of the data\n", "# View the first few rows of the data\r\n",
"pumpkins %>% \n", "pumpkins %>% \r\n",
" slice_head(n = 5)" " slice_head(n = 5)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "Zo0BsqqtaJw2"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "p77WZr-9aQAR"
},
"source": [ "source": [
"Yeees!💪\n", "Yeees!💪\n",
"\n", "\n",
@ -351,38 +362,38 @@
"If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, and some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes with varying widths.\n", "If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, and some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes with varying widths.\n",
"\n", "\n",
"Let's verify this:" "Let's verify this:"
] ],
"metadata": {
"id": "p77WZr-9aQAR"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "XISGfh0IaUy6"
},
"source": [ "source": [
"# Verify the distinct observations in Package column\n", "# Verify the distinct observations in Package column\r\n",
"pumpkins %>% \n", "pumpkins %>% \r\n",
" distinct(Package)" " distinct(Package)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "XISGfh0IaUy6"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "7sMjiVujaZxY"
},
"source": [ "source": [
"Amazing!👏\n", "Amazing!👏\n",
"\n", "\n",
"Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string *bushel* in the `Package` column and put this in a new data frame `new_pumpkins`.\n", "Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string *bushel* in the `Package` column and put this in a new data frame `new_pumpkins`.\n",
"<br>" "<br>"
] ],
"metadata": {
"id": "7sMjiVujaZxY"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "L8Qfcs92ageF"
},
"source": [ "source": [
"#### dplyr::filter() and stringr::str_detect()\n", "#### dplyr::filter() and stringr::str_detect()\n",
"\n", "\n",
@ -391,43 +402,43 @@
"[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n", "[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n",
"\n", "\n",
"The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations." "The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations."
] ],
"metadata": {
"id": "L8Qfcs92ageF"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "hy_SGYREampd"
},
"source": [ "source": [
"# Retain only pumpkins with \"bushel\"\n", "# Retain only pumpkins with \"bushel\"\r\n",
"new_pumpkins <- pumpkins %>% \n", "new_pumpkins <- pumpkins %>% \r\n",
" filter(str_detect(Package, \"bushel\"))\n", " filter(str_detect(Package, \"bushel\"))\r\n",
"\n", "\r\n",
"# Get the dimensions of the new data\n", "# Get the dimensions of the new data\r\n",
"dim(new_pumpkins)\n", "dim(new_pumpkins)\r\n",
"\n", "\r\n",
"# View a few rows of the new data\n", "# View a few rows of the new data\r\n",
"new_pumpkins %>% \n", "new_pumpkins %>% \r\n",
" slice_head(n = 5)" " slice_head(n = 5)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "hy_SGYREampd"
}
}, },
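Review note: a minimal sketch of combining `filter()` with `str_detect()`, using made-up package descriptions. The `toy` and `by_bushel` names are hypothetical. `str_detect()` returns one logical value per row, and `filter()` keeps the rows where it is `TRUE`.

```r
library(dplyr)
library(stringr)

# Hypothetical package descriptions and prices
toy <- tibble(
  Package = c("24 inch bins", "1 1/9 bushel cartons", "each", "1/2 bushel cartons"),
  Price   = c(270, 15, 3, 18)
)

# Keep only the rows whose Package text contains "bushel"
by_bushel <- toy %>%
  filter(str_detect(Package, "bushel"))

by_bushel
```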
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "VrDwF031avlR"
},
"source": [ "source": [
"You can see that we have narrowed down to 415 or so rows of data containing pumpkins by the bushel.🤩\n", "You can see that we have narrowed down to 415 or so rows of data containing pumpkins by the bushel.🤩\n",
"<br>" "<br>"
] ],
"metadata": {
"id": "VrDwF031avlR"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "mLpw2jH4a0tx"
},
"source": [ "source": [
"#### dplyr::case_when()\n", "#### dplyr::case_when()\n",
"\n", "\n",
@ -436,33 +447,33 @@
"Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n", "Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
"\n", "\n",
"We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` allows you to vectorise multiple `if_else()` statements.\n" "We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` allows you to vectorise multiple `if_else()` statements.\n"
] ],
"metadata": {
"id": "mLpw2jH4a0tx"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "P68kLVQmbM6I"
},
"source": [ "source": [
"# Convert the price if the Package contains fractional bushel values\n", "# Convert the price if the Package contains fractional bushel values\r\n",
"new_pumpkins <- new_pumpkins %>% \n", "new_pumpkins <- new_pumpkins %>% \r\n",
" mutate(Price = case_when(\n", " mutate(Price = case_when(\r\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n", " str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\r\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n", " str_detect(Package, \"1/2\") ~ Price/(1/2),\r\n",
" TRUE ~ Price))\n", " TRUE ~ Price))\r\n",
"\n", "\r\n",
"# View the first few rows of the data\n", "# View the first few rows of the data\r\n",
"new_pumpkins %>% \n", "new_pumpkins %>% \r\n",
" slice_head(n = 30)" " slice_head(n = 30)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "P68kLVQmbM6I"
}
}, },
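Review note: a worked sketch of the per-bushel normalization on made-up prices. The `toy` and `per_bushel` names and all values are hypothetical. A price of 15 for a 1 1/9 bushel carton becomes 15 / (10/9) = 13.5 per bushel, and a price of 18 for a 1/2 bushel carton becomes 18 / 0.5 = 36 per bushel.

```r
library(dplyr)
library(stringr)

# Hypothetical prices for three package sizes
toy <- tibble(
  Package = c("1 1/9 bushel cartons", "1/2 bushel cartons", "bushel baskets"),
  Price   = c(15, 18, 20)
)

per_bushel <- toy %>%
  mutate(Price = case_when(
    # 1 1/9 bushel: divide by 10/9 to get the price of one bushel
    str_detect(Package, "1 1/9") ~ Price / (1 + 1/9),
    # 1/2 bushel: divide by 1/2, which doubles the price
    str_detect(Package, "1/2") ~ Price / (1/2),
    # Anything else is assumed to be priced per whole bushel already
    TRUE ~ Price
  ))

per_bushel
```

`case_when()` evaluates its conditions in order and uses the first match for each row, so the fallback `TRUE ~ Price` only applies to rows that matched neither pattern.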
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "pS2GNPagbSdb"
},
"source": [ "source": [
"Now, we can analyze the pricing per unit based on their bushel measurement. All this study of bushels of pumpkins, however, goes to show how very `important` it is to `understand the nature of your data`!\n", "Now, we can analyze the pricing per unit based on their bushel measurement. All this study of bushels of pumpkins, however, goes to show how very `important` it is to `understand the nature of your data`!\n",
"\n", "\n",
@ -470,103 +481,109 @@
">\n", ">\n",
"> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.\n", "> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.\n",
"<br>\n" "<br>\n"
] ],
"metadata": {
"id": "pS2GNPagbSdb"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "qql1SowfbdnP"
},
"source": [ "source": [
"Lastly, for the sheer sake of adventure 💁‍♀️, let's also move the `Month` column to the first position, i.e. before the `Package` column.\n", "Lastly, for the sheer sake of adventure 💁‍♀️, let's also move the `Month` column to the first position, i.e. before the `Package` column.\n",
"\n", "\n",
"`dplyr::relocate()` is used to change column positions." "`dplyr::relocate()` is used to change column positions."
] ],
"metadata": {
"id": "qql1SowfbdnP"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "JJ1x6kw8bixF"
},
"source": [ "source": [
"# Move the Month column before the Package column\n", "# Move the Month column before the Package column\r\n",
"new_pumpkins <- new_pumpkins %>% \n", "new_pumpkins <- new_pumpkins %>% \r\n",
" relocate(Month, .before = Package)\n", " relocate(Month, .before = Package)\r\n",
"\n", "\r\n",
"new_pumpkins %>% \n", "new_pumpkins %>% \r\n",
" slice_head(n = 7)" " slice_head(n = 7)"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "JJ1x6kw8bixF"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "y8TJ0Za_bn5Y"
},
"source": [ "source": [
"Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!\n", "Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!\n",
"<br>" "<br>"
] ],
"metadata": {
"id": "y8TJ0Za_bn5Y"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"source": [
"## 4. Data visualization with ggplot2\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/data-visualization.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\r\n",
"\r\n",
"\r\n",
"<!--![Infographic by Dasani Madipalli](../images/data-visualization.png){width=\"600\"}-->\r\n",
"\r\n",
"There is a *wise* saying that goes like this:\r\n",
"\r\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\r\n",
"\r\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\r\n",
"\r\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\r\n",
"\r\n",
"R offers several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\r\n",
"\r\n",
"Let's start with a simple scatter plot for the Price and Month columns.\r\n",
"\r\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), then add a layer (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html) for scatter plots).\r\n"
],
"metadata": { "metadata": {
"id": "mYSH6-EtbvNa" "id": "mYSH6-EtbvNa"
}, }
"source": [
"## 4. Data visualization with ggplot2\n",
"\n",
"![Infographic by Dasani Madipalli](../images/data-visualization.png){width=\"600\"}\n",
"\n",
"There is a *wise* saying that goes like this:\n",
"\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n",
"\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\n",
"\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\n",
"\n",
"R offers several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\n",
"\n",
"Let's start with a simple scatter plot for the Price and Month columns.\n",
"\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), then add a layer (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html) for scatter plots).\n"
]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "g2YjnGeOcLo4"
},
"source": [ "source": [
"# Set a theme for the plots\n", "# Set a theme for the plots\r\n",
"theme_set(theme_light())\n", "theme_set(theme_light())\r\n",
"\n", "\r\n",
"# Create a scatter plot\n", "# Create a scatter plot\r\n",
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n", "p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\r\n",
"p + geom_point()" "p + geom_point()"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "g2YjnGeOcLo4"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "Ml7SDCLQcPvE"
},
"source": [ "source": [
"Is this a useful plot 🤷? Does anything about it surprise you?\n", "Is this a useful plot 🤷? Does anything about it surprise you?\n",
"\n", "\n",
"It's not particularly useful, as all it does is display your data as a spread of points in a given month.\n", "It's not particularly useful, as all it does is display your data as a spread of points in a given month.\n",
"<br>" "<br>"
] ],
"metadata": {
"id": "Ml7SDCLQcPvE"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "jMakvJZIcVkh"
},
"source": [ "source": [
"### **How do we make it useful?**\n", "### **How do we make it useful?**\n",
"\n", "\n",
@ -583,62 +600,65 @@
"- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.\n", "- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.\n",
"\n", "\n",
"For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins by the **Month** column and then find the **mean price** for each month." "For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins by the **Month** column and then find the **mean price** for each month."
] ],
"metadata": {
"id": "jMakvJZIcVkh"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "6kVSUa2Bcilf"
},
"source": [ "source": [
"# Find the average price of pumpkins per month\n", "# Find the average price of pumpkins per month\r\n",
"new_pumpkins %>%\n", "new_pumpkins %>%\r\n",
" group_by(Month) %>% \n", " group_by(Month) %>% \r\n",
" summarise(mean_price = mean(Price))" " summarise(mean_price = mean(Price))"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "6kVSUa2Bcilf"
}
}, },
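Review note: a minimal sketch of the `group_by() %>% summarise()` pattern on made-up numbers, so the grouping arithmetic can be checked by hand. The `toy` and `monthly` names and the values are hypothetical. Month 9 has prices 10 and 20 (mean 15), and month 10 has prices 30, 40 and 50 (mean 40).

```r
library(dplyr)

# Hypothetical prices observed in two months
toy <- tibble(
  Month = c(9, 9, 10, 10, 10),
  Price = c(10, 20, 30, 40, 50)
)

# One row per month, with the mean of that month's prices
monthly <- toy %>%
  group_by(Month) %>%
  summarise(mean_price = mean(Price))

monthly
```

The result has one row per distinct `Month`, which is the shape a bar chart of monthly averages expects.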
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "Kds48GUBcj3W"
},
"source": [ "source": [
"Succinct!✨\n", "Succinct!✨\n",
"\n", "\n",
"Categorical features such as months are better represented using a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n", "Categorical features such as months are better represented using a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n",
"\n", "\n",
"Let's whip up one!" "Let's whip up one!"
] ],
"metadata": {
"id": "Kds48GUBcj3W"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"metadata": { "execution_count": null,
"id": "VNbU1S3BcrxO"
},
"source": [ "source": [
"# Find the average price of pumpkins per month then plot a bar chart\n", "# Find the average price of pumpkins per month then plot a bar chart\r\n",
"new_pumpkins %>%\n", "new_pumpkins %>%\r\n",
" group_by(Month) %>% \n", " group_by(Month) %>% \r\n",
" summarise(mean_price = mean(Price)) %>% \n", " summarise(mean_price = mean(Price)) %>% \r\n",
" ggplot(aes(x = Month, y = mean_price)) +\n", " ggplot(aes(x = Month, y = mean_price)) +\r\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\n", " geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
" ylab(\"Pumpkin Price\")" " ylab(\"Pumpkin Price\")"
], ],
"execution_count": null, "outputs": [],
"outputs": [] "metadata": {
"id": "VNbU1S3BcrxO"
}
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "zDm0VOzzcuzR"
},
"source": [ "source": [
"🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?\n", "🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?\n",
"\n", "\n",
"Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!" "Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!"
] ],
"metadata": {
"id": "zDm0VOzzcuzR"
}
} }
] ]
} }