You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
664 lines
24 KiB
664 lines
24 KiB
3 years ago
|
{
|
||
|
"nbformat": 4,
|
||
3 years ago
|
"nbformat_minor": 2,
|
||
3 years ago
|
"metadata": {
|
||
|
"colab": {
|
||
|
"name": "lesson_2-R.ipynb",
|
||
|
"provenance": [],
|
||
|
"collapsed_sections": [],
|
||
|
"toc_visible": true
|
||
|
},
|
||
|
"kernelspec": {
|
||
|
"name": "ir",
|
||
|
"display_name": "R"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"name": "R"
|
||
|
}
|
||
|
},
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Build a regression model: prepare and visualize data\n",
|
||
|
"\n",
|
||
|
"## **Linear Regression for Pumpkins - Lesson 2**\n",
|
||
|
"#### Introduction\n",
|
||
|
"\n",
|
||
|
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\n",
|
||
|
"\n",
|
||
|
"In this lesson, you will learn:\n",
|
||
|
"\n",
|
||
|
"- How to prepare your data for model-building.\n",
|
||
|
"\n",
|
||
|
"- How to use `ggplot2` for data visualization.\n",
|
||
|
"\n",
|
||
|
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\n",
|
||
|
"\n",
|
||
|
"Let's see this by working through a practical exercise.\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"<p >\n",
|
||
|
" <img src=\"../../images/unruly_data.jpg\"\n",
|
||
|
" width=\"700\"/>\n",
|
||
|
" <figcaption>Artwork by @allison_horst</figcaption>\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"<!--![Artwork by \\@allison_horst](../../images/unruly_data.jpg)<br>Artwork by \\@allison_horst-->"
|
||
3 years ago
|
],
|
||
3 years ago
|
"metadata": {
|
||
|
"id": "Pg5aexcOPqAZ"
|
||
3 years ago
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"## 1. Importing pumpkins data and summoning the Tidyverse\n",
|
||
|
"\n",
|
||
|
"We'll require the following packages to slice and dice this lesson:\n",
|
||
|
"\n",
|
||
|
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\n",
|
||
|
"\n",
|
||
|
"You can have them installed as:\n",
|
||
|
"\n",
|
||
|
"`install.packages(c(\"tidyverse\"))`\n",
|
||
|
"\n",
|
||
|
"The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing."
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "dc5WhyVdXAjR"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
|
||
3 years ago
|
"pacman::p_load(tidyverse)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "GqPYUZgfXOBt"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Now, let's fire up some packages and load the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "kvjDTPDSXRr2"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Load the core Tidyverse packages\n",
|
||
|
"library(tidyverse)\n",
|
||
|
"\n",
|
||
|
"# Import the pumpkins data\n",
|
||
|
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# Get a glimpse and dimensions of the data\n",
|
||
|
"glimpse(pumpkins)\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# Print the first 50 rows of the data set\n",
|
||
|
"pumpkins %>% \n",
|
||
3 years ago
|
" slice_head(n =50)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "VMri-t2zXqgD"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). The `Date` is of type character and there's also a strange column called `Package` where the data is a mix between `sacks`, `bins` and other values. The data, in fact, is a bit of a mess 😤.\n",
|
||
|
"\n",
|
||
|
"In fact, it is not very common to be gifted a dataset that is completely ready to use to create a ML model out of the box. But worry not, in this lesson, you will learn how to prepare a raw dataset using standard R libraries 🧑🔧. You will also learn various techniques to visualize the data.📈📊\n",
|
||
|
"<br>\n",
|
||
|
"\n",
|
||
|
"> A refresher: The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n",
|
||
|
"\n"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "REWcIv9yX29v"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"## 2. Check for missing data\n",
|
||
|
"\n",
|
||
|
"One of the most common issues data scientists need to deal with is incomplete or missing data. R represents missing, or unknown values, with special sentinel value: `NA` (Not Available).\n",
|
||
|
"\n",
|
||
|
"So how would we know that the data frame contains missing values?\n",
|
||
|
"<br>\n",
|
||
|
"- One straight forward way would be to use the base R function `anyNA` which returns the logical objects `TRUE` or `FALSE`"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "Zxfb3AM5YbUe"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"pumpkins %>% \n",
|
||
3 years ago
|
" anyNA()"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "G--DQutAYltj"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Great, there seems to be some missing data! That's a good place to start.\n",
|
||
|
"\n",
|
||
|
"- Another way would be to use the function `is.na()` that indicates which individual column elements are missing with a logical `TRUE`."
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "mU-7-SB6YokF"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"pumpkins %>% \n",
|
||
|
" is.na() %>% \n",
|
||
3 years ago
|
" head(n = 7)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "W-DxDOR4YxSW"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Okay, got the job done but with a large data frame such as this, it would be inefficient and practically impossible to review all of the rows and columns individually😴.\n",
|
||
|
"\n",
|
||
|
"- A more intuitive way would be to calculate the sum of the missing values for each column:"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "xUWxipKYY0o7"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"pumpkins %>% \n",
|
||
|
" is.na() %>% \n",
|
||
3 years ago
|
" colSums()"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "ZRBWV6P9ZArL"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.\n",
|
||
|
"\n",
|
||
|
"> Along with the awesome sets of packages and functions, R has a very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function."
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "9gv-crB6ZD1Y"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"## 3. Dplyr: A Grammar of Data Manipulation\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"<p >\n",
|
||
|
" <img src=\"../../images/dplyr_wrangling.png\"\n",
|
||
|
" width=\"569\"/>\n",
|
||
|
" <figcaption>Artwork by @allison_horst</figcaption>\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"<!--![Artwork by \\@allison_horst](../../images/dplyr_wrangling.png)<br/>Artwork by \\@allison_horst-->"
|
||
3 years ago
|
],
|
||
3 years ago
|
"metadata": {
|
||
|
"id": "o4jLY5-VZO2C"
|
||
3 years ago
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!\n",
|
||
|
"<br>\n"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "i5o33MQBZWWw"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"#### dplyr::select()\n",
|
||
|
"\n",
|
||
|
"`select()` is a function in the package `dplyr` which helps you pick columns to keep or exclude.\n",
|
||
|
"\n",
|
||
|
"To make your data frame easier to work with, drop several of its columns, using `select()`, keeping only the columns you need.\n",
|
||
|
"\n",
|
||
|
"For instance, in this exercise, our analysis will involve the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns."
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "x3VGMAGBZiUr"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Select desired columns\n",
|
||
|
"pumpkins <- pumpkins %>% \n",
|
||
|
" select(Package, `Low Price`, `High Price`, Date)\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# Print data set\n",
|
||
|
"pumpkins %>% \n",
|
||
3 years ago
|
" slice_head(n = 5)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "F_FgxQnVZnM0"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"#### dplyr::mutate()\n",
|
||
|
"\n",
|
||
|
"`mutate()` is a function in the package `dplyr` which helps you create or modify columns, while keeping the existing columns.\n",
|
||
|
"\n",
|
||
|
"The general structure of mutate is:\n",
|
||
|
"\n",
|
||
|
"`data %>% mutate(new_column_name = what_it_contains)`\n",
|
||
|
"\n",
|
||
|
"Let's take `mutate` out for a spin using the `Date` column by doing the following operations:\n",
|
||
|
"\n",
|
||
|
"1. Convert the dates (currently of type character) to a month format (these are US dates, so the format is `MM/DD/YYYY`).\n",
|
||
|
"\n",
|
||
|
"2. Extract the month from the dates to a new column.\n",
|
||
|
"\n",
|
||
|
"In R, the package [lubridate](https://lubridate.tidyverse.org/) makes it easier to work with Date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()`, `lubridate::month()` and see how to achieve the above objectives. We can drop the Date column since we won't be needing it again in subsequent operations."
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "2KKo0Ed9Z1VB"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Load lubridate\n",
|
||
|
"library(lubridate)\n",
|
||
|
"\n",
|
||
|
"pumpkins <- pumpkins %>% \n",
|
||
|
" # Convert the Date column to a date object\n",
|
||
|
" mutate(Date = mdy(Date)) %>% \n",
|
||
|
" # Extract month from Date\n",
|
||
|
" mutate(Month = month(Date)) %>% \n",
|
||
|
" # Drop Date column\n",
|
||
|
" select(-Date)\n",
|
||
|
"\n",
|
||
|
"# View the first few rows\n",
|
||
|
"pumpkins %>% \n",
|
||
3 years ago
|
" slice_head(n = 7)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "5joszIVSZ6xe"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Woohoo! 🤩\n",
|
||
|
"\n",
|
||
|
"Next, let's create a new column `Price`, which represents the average price of a pumpkin. Now, let's take the average of the `Low Price` and `High Price` columns to populate the new Price column.\n",
|
||
|
"<br>"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "nIgLjNMCZ-6Y"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Create a new column Price\n",
|
||
|
"pumpkins <- pumpkins %>% \n",
|
||
|
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
|
||
|
"\n",
|
||
|
"# View the first few rows of the data\n",
|
||
|
"pumpkins %>% \n",
|
||
3 years ago
|
" slice_head(n = 5)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "Zo0BsqqtaJw2"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Yeees!💪\n",
|
||
|
"\n",
|
||
|
"\"But wait!\", you'll say after skimming through the whole data set with `View(pumpkins)`, \"There's something odd here!\"🤔\n",
|
||
|
"\n",
|
||
|
"If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, and some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes with varying widths.\n",
|
||
|
"\n",
|
||
|
"Let's verify this:"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "p77WZr-9aQAR"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Verify the distinct observations in Package column\n",
|
||
|
"pumpkins %>% \n",
|
||
3 years ago
|
" distinct(Package)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "XISGfh0IaUy6"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Amazing!👏\n",
|
||
|
"\n",
|
||
|
"Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string *bushel* in the `Package` column and put this in a new data frame `new_pumpkins`.\n",
|
||
|
"<br>"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "7sMjiVujaZxY"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"#### dplyr::filter() and stringr::str_detect()\n",
|
||
|
"\n",
|
||
|
"[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data only containing **rows** that satisfy your conditions, in this case, pumpkins with the string *bushel* in the `Package` column.\n",
|
||
|
"\n",
|
||
|
"[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n",
|
||
|
"\n",
|
||
|
"The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations."
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "L8Qfcs92ageF"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Retain only pumpkins with \"bushel\"\n",
|
||
|
"new_pumpkins <- pumpkins %>% \n",
|
||
|
" filter(str_detect(Package, \"bushel\"))\n",
|
||
|
"\n",
|
||
|
"# Get the dimensions of the new data\n",
|
||
|
"dim(new_pumpkins)\n",
|
||
|
"\n",
|
||
|
"# View a few rows of the new data\n",
|
||
|
"new_pumpkins %>% \n",
|
||
3 years ago
|
" slice_head(n = 5)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "hy_SGYREampd"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"You can see that we have narrowed down to 415 or so rows of data containing pumpkins by the bushel.🤩\n",
|
||
|
"<br>"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "VrDwF031avlR"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"#### dplyr::case_when()\n",
|
||
|
"\n",
|
||
|
"**But wait! There's one more thing to do**\n",
|
||
|
"\n",
|
||
|
"Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
|
||
|
"\n",
|
||
|
"We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` allows you to vectorise multiple `if_else()`statements.\n"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "mLpw2jH4a0tx"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Convert the price if the Package contains fractional bushel values\n",
|
||
|
"new_pumpkins <- new_pumpkins %>% \n",
|
||
|
" mutate(Price = case_when(\n",
|
||
|
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
|
||
|
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
|
||
|
" TRUE ~ Price))\n",
|
||
|
"\n",
|
||
|
"# View the first few rows of the data\n",
|
||
|
"new_pumpkins %>% \n",
|
||
3 years ago
|
" slice_head(n = 30)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "P68kLVQmbM6I"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Now, we can analyze the pricing per unit based on their bushel measurement. All this study of bushels of pumpkins, however, goes to show how very `important` it is to `understand the nature of your data`!\n",
|
||
|
"\n",
|
||
|
"> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. \"A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!\n",
|
||
|
">\n",
|
||
|
"> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.\n",
|
||
|
"<br>\n"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "pS2GNPagbSdb"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Now lastly, for the sheer sake of adventure 💁♀️, let's also move the Month column to the first position i.e `before` column `Package`.\n",
|
||
|
"\n",
|
||
|
"`dplyr::relocate()` is used to change column positions."
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "qql1SowfbdnP"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Create a new data frame new_pumpkins\n",
|
||
|
"new_pumpkins <- new_pumpkins %>% \n",
|
||
|
" relocate(Month, .before = Package)\n",
|
||
|
"\n",
|
||
|
"new_pumpkins %>% \n",
|
||
3 years ago
|
" slice_head(n = 7)"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "JJ1x6kw8bixF"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!\n",
|
||
|
"<br>"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "y8TJ0Za_bn5Y"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"## 4. Data visualization with ggplot2\n",
|
||
|
"\n",
|
||
|
"<p >\n",
|
||
|
" <img src=\"../../images/data-visualization.png\"\n",
|
||
|
" width=\"600\"/>\n",
|
||
|
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"<!--![Infographic by Dasani Madipalli](../../images/data-visualization.png){width=\"600\"}-->\n",
|
||
|
"\n",
|
||
|
"There is a *wise* saying that goes like this:\n",
|
||
|
"\n",
|
||
|
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n",
|
||
|
"\n",
|
||
|
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\n",
|
||
|
"\n",
|
||
|
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\n",
|
||
|
"\n",
|
||
|
"R offers a number of several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\n",
|
||
|
"\n",
|
||
|
"Let's start with a simple scatter plot for the Price and Month columns.\n",
|
||
|
"\n",
|
||
|
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)) then add a layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\n"
|
||
3 years ago
|
],
|
||
3 years ago
|
"metadata": {
|
||
|
"id": "mYSH6-EtbvNa"
|
||
3 years ago
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Set a theme for the plots\n",
|
||
|
"theme_set(theme_light())\n",
|
||
|
"\n",
|
||
|
"# Create a scatter plot\n",
|
||
|
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n",
|
||
3 years ago
|
"p + geom_point()"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "g2YjnGeOcLo4"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Is this a useful plot 🤷? Does anything about it surprise you?\n",
|
||
|
"\n",
|
||
|
"It's not particularly useful as all it does is display in your data as a spread of points in a given month.\n",
|
||
|
"<br>"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "Ml7SDCLQcPvE"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"### **How do we make it useful?**\n",
|
||
|
"\n",
|
||
|
"To get charts to display useful data, you usually need to group the data somehow. For instance in our case, finding the average price of pumpkins for each month would provide more insights to the underlying patterns in our data. This leads us to one more **dplyr** flyby:\n",
|
||
|
"\n",
|
||
|
"#### `dplyr::group_by() %>% summarize()`\n",
|
||
|
"\n",
|
||
|
"Grouped aggregation in R can be easily computed using\n",
|
||
|
"\n",
|
||
|
"`dplyr::group_by() %>% summarize()`\n",
|
||
|
"\n",
|
||
|
"- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups such as per month.\n",
|
||
|
"\n",
|
||
|
"- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.\n",
|
||
|
"\n",
|
||
|
"For example, we can use the `dplyr::group_by() %>% summarize()` to group the pumpkins into groups based on the **Month** columns and then find the **mean price** for each month."
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "jMakvJZIcVkh"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
3 years ago
|
"source": [
|
||
3 years ago
|
"# Find the average price of pumpkins per month\r\n",
|
||
|
"new_pumpkins %>%\r\n",
|
||
|
" group_by(Month) %>% \r\n",
|
||
3 years ago
|
" summarise(mean_price = mean(Price))"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "6kVSUa2Bcilf"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"Succinct!✨\n",
|
||
|
"\n",
|
||
|
"Categorical features such as months are better represented using a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n",
|
||
|
"\n",
|
||
|
"Let's whip up one!"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "Kds48GUBcj3W"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
3 years ago
|
"execution_count": null,
|
||
|
"source": [
|
||
|
"# Find the average price of pumpkins per month then plot a bar chart\r\n",
|
||
|
"new_pumpkins %>%\r\n",
|
||
|
" group_by(Month) %>% \r\n",
|
||
|
" summarise(mean_price = mean(Price)) %>% \r\n",
|
||
|
" ggplot(aes(x = Month, y = mean_price)) +\r\n",
|
||
|
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
|
||
3 years ago
|
" ylab(\"Pumpkin Price\")"
|
||
|
],
|
||
3 years ago
|
"outputs": [],
|
||
|
"metadata": {
|
||
|
"id": "VNbU1S3BcrxO"
|
||
|
}
|
||
3 years ago
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"source": [
|
||
|
"🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?\n",
|
||
|
"\n",
|
||
3 years ago
|
"Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!"
|
||
3 years ago
|
],
|
||
|
"metadata": {
|
||
|
"id": "zDm0VOzzcuzR"
|
||
|
}
|
||
3 years ago
|
}
|
||
|
]
|
||
3 years ago
|
}
|