diff --git a/2-Regression/2-Data/solution/R/lesson_2.html b/2-Regression/2-Data/solution/R/lesson_2.html
new file mode 100644
index 00000000..97af866c
--- /dev/null
+++ b/2-Regression/2-Data/solution/R/lesson_2.html
@@ -0,0 +1,3534 @@
Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potential of your dataset.
+In this lesson, you will learn:
+How to prepare your data for model-building.
How to use ggplot2 for data visualization.
The question you need answered will determine what type of ML algorithm you will leverage, and the quality of the answer you get back will be heavily dependent on the nature of your data.
Let's see this by working through a practical exercise.
We'll require the following packages to slice and dice this lesson:

tidyverse: The tidyverse is a collection of R packages designed to make data science faster, easier and more fun!

You can have them installed as:

install.packages(c("tidyverse"))
The script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse)
Now, let's fire up some packages and load the data provided for this lesson!
+# Load the core Tidyverse packages
+library(tidyverse)
+
# Import the pumpkins data
pumpkins <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv")
+
+# Get a glimpse and dimensions of the data
+glimpse(pumpkins)
## Rows: 1,757
+## Columns: 26
+## $ `City Name` <chr> "BALTIMORE", "BALTIMORE", "BALTIMORE", "BALTIMORE", ~
+## $ Type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Package <chr> "24 inch bins", "24 inch bins", "24 inch bins", "24 ~
+## $ Variety <chr> NA, NA, "HOWDEN TYPE", "HOWDEN TYPE", "HOWDEN TYPE",~
+## $ `Sub Variety` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Grade <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Date <chr> "4/29/17", "5/6/17", "9/24/16", "9/24/16", "11/5/16"~
+## $ `Low Price` <dbl> 270, 270, 160, 160, 90, 90, 160, 160, 160, 160, 160,~
+## $ `High Price` <dbl> 280, 280, 160, 160, 100, 100, 170, 160, 170, 160, 17~
+## $ `Mostly Low` <dbl> 270, 270, 160, 160, 90, 90, 160, 160, 160, 160, 160,~
+## $ `Mostly High` <dbl> 280, 280, 160, 160, 100, 100, 170, 160, 170, 160, 17~
+## $ Origin <chr> "MARYLAND", "MARYLAND", "DELAWARE", "VIRGINIA", "MAR~
+## $ `Origin District` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ `Item Size` <chr> "lge", "lge", "med", "med", "lge", "lge", "med", "lg~
+## $ Color <chr> NA, NA, "ORANGE", "ORANGE", "ORANGE", "ORANGE", "ORA~
+## $ Environment <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ `Unit of Sale` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Quality <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Condition <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Appearance <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Storage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Crop <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ Repack <chr> "E", "E", "N", "N", "N", "N", "N", "N", "N", "N", "N~
+## $ `Trans Mode` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ ...25 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
+## $ ...26 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
# Print the first 50 rows of the data set
pumpkins %>%
  slice_head(n = 50)
A quick glimpse() immediately shows that there are blanks and a mix of strings (chr) and numeric data (dbl). The Date is of type character and there's also a strange column called Package where the data is a mix between sacks, bins and other values. The data, in fact, is a bit of a mess.
In fact, it is not very common to be gifted a dataset that is completely ready to use to create an ML model out of the box. But worry not, in this lesson, you will learn how to prepare a raw dataset using standard R libraries. You will also learn various techniques to visualize the data.
A refresher: The pipe operator (%>%) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying "and then" in your code.
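For instance, here is a minimal sketch (the numbers are made up purely for illustration) of the same computation written without and with the pipe:

# Nested calls read inside-out
round(mean(c(2, 4, 7)), digits = 1)

# Piped calls read top to bottom: take the numbers, and then average them, and then round
c(2, 4, 7) %>%
  mean() %>%
  round(digits = 1)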
One of the most common issues data scientists need to deal with is incomplete or missing data. R represents missing, or unknown, values with the special sentinel value NA (Not Available).
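As a quick, hedged illustration (the prices vector below is made up), NA silently propagates through computations unless you explicitly tell a function to drop it:

prices <- c(270, NA, 160)

mean(prices)               # NA, because one value is missing
mean(prices, na.rm = TRUE) # 215, the mean of the observed values
is.na(prices)              # FALSE  TRUE FALSE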
So how would we know that the data frame contains missing values? One way would be to use the function anyNA(), which returns the logical values TRUE or FALSE.

pumpkins %>%
  anyNA()
## [1] TRUE
Great, there seems to be some missing data! That's a good place to start.

Another way would be to use the function is.na(), which indicates which individual column elements are missing with a logical TRUE.

pumpkins %>%
  is.na() %>%
  head(n = 7)
## City Name Type Package Variety Sub Variety Grade Date Low Price
+## [1,] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
+## [2,] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
+## [3,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
+## [4,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
+## [5,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
+## [6,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
+## [7,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
+## High Price Mostly Low Mostly High Origin Origin District Item Size Color
+## [1,] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
+## [2,] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
+## [3,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
+## [4,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
+## [5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
+## [6,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
+## [7,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
+## Environment Unit of Sale Quality Condition Appearance Storage Crop Repack
+## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
+## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
+## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
+## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
+## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
+## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
+## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
+## Trans Mode ...25 ...26
+## [1,] TRUE TRUE TRUE
+## [2,] TRUE TRUE TRUE
+## [3,] TRUE TRUE TRUE
+## [4,] TRUE TRUE TRUE
+## [5,] TRUE TRUE TRUE
+## [6,] TRUE TRUE TRUE
+## [7,] TRUE TRUE TRUE
Okay, got the job done but with a large data frame such as this, it would be inefficient and practically impossible to review all of the rows and columns individually. A better way would be to calculate the sum of the missing values for each column:

pumpkins %>%
  is.na() %>%
  colSums()
## City Name Type Package Variety Sub Variety
+## 0 1712 0 5 1461
+## Grade Date Low Price High Price Mostly Low
+## 1757 0 0 0 103
+## Mostly High Origin Origin District Item Size Color
+## 103 3 1626 279 616
+## Environment Unit of Sale Quality Condition Appearance
+## 1757 1595 1757 1757 1757
+## Storage Crop Repack Trans Mode ...25
+## 1757 1757 0 1757 1757
+## ...26
+## 1654
Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.
Along with the awesome sets of packages and functions, R has very good documentation. For instance, use help(colSums) or ?colSums to find out more about the function.
dplyr, a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!
select() is a function in the package dplyr which helps you pick columns to keep or exclude. To make your data frame easier to work with, drop several of its columns using select(), keeping only the columns you need.
For instance, in this exercise, our analysis will involve the columns Package, Low Price, High Price and Date. Let's select these columns.

# Select desired columns
pumpkins <- pumpkins %>%
  select(Package, `Low Price`, `High Price`, Date)

# Print data set
pumpkins %>%
  slice_head(n = 5)
mutate() is a function in the package dplyr which helps you create or modify columns, while keeping the existing columns.

The general structure of mutate is:
+data %>% mutate(new_column_name = what_it_contains)
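As a minimal sketch on a made-up tibble (the data and column names are purely illustrative), that structure looks like this:

# A tiny tibble invented for illustration
fruit_prices <- tibble(fruit = c("apple", "pear"), price = c(2, 3))

# Add a new column computed from an existing one, keeping the original columns
fruit_prices %>%
  mutate(double_price = price * 2)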
Let's take mutate out for a spin using the Date column by doing the following operations:
Convert the dates (currently of type character) to a month format (these are US dates, so the format is MM/DD/YYYY).
Extract the month from the dates to a new column.
In R, the package lubridate makes it easier to work with date-time data. So, let's use dplyr::mutate(), lubridate::mdy() and lubridate::month() and see how to achieve the above objectives. We can drop the Date column since we won't be needing it again in subsequent operations.

# Load lubridate
library(lubridate)

pumpkins <- pumpkins %>%
  # Convert the Date column to a date object
  mutate(Date = mdy(Date)) %>%
  # Extract month from Date
  mutate(Month = month(Date)) %>%
  # Drop Date column
  select(-Date)

# View the first few rows
pumpkins %>%
  slice_head(n = 7)
Woohoo!
Next, let's create a new column Price, which represents the average price of a pumpkin. Now, let's take the average of the Low Price and High Price columns to populate the new Price column.
# Create a new column Price
pumpkins <- pumpkins %>%
  mutate(Price = (`Low Price` + `High Price`)/2)

# View the first few rows of the data
pumpkins %>%
  slice_head(n = 5)
Yeees!
"But wait!", you'll say after skimming through the whole data set with View(pumpkins), "There's something odd here!"

If you look at the Package column, pumpkins are sold in many different configurations. Some are sold in 1 1/9 bushel measures, and some in 1/2 bushel measures, some per pumpkin, some per pound, and some in big boxes with varying widths.
Let's verify this:
# Verify the distinct observations in Package column
pumpkins %>%
  distinct(Package)
Amazing!
Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string bushel in the Package column and put this in a new data frame new_pumpkins.
dplyr::filter(): creates a subset of the data only containing rows that satisfy your conditions, in this case, pumpkins with the string bushel in the Package column.

stringr::str_detect(): detects the presence or absence of a pattern in a string.

The stringr package provides simple functions for common string operations.
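As a quick, hedged aside, here is a minimal sketch of str_detect() on two illustrative package strings; it returns a logical vector that filter() can use directly:

str_detect(c("1 1/9 bushel cartons", "36 inch bins"), "bushel")
# TRUE FALSE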
# Retain only pumpkins with "bushel"
new_pumpkins <- pumpkins %>%
  filter(str_detect(Package, "bushel"))

# Get the dimensions of the new data
dim(new_pumpkins)
## [1] 415 5
# View a few rows of the new data
new_pumpkins %>%
  slice_head(n = 5)
You can see that we have narrowed down to 415 or so rows of data containing pumpkins by the bushel.

But wait! There's one more thing to do.

Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.
We'll use the function case_when() to mutate the Price column depending on some conditions. case_when allows you to vectorise multiple if_else() statements.
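As a quick, hedged aside before we touch the pumpkin prices, here is a minimal sketch of case_when() on a made-up vector; conditions are checked in order and the first match wins:

x <- c(0.5, 1.2, 3)

case_when(
  x < 1 ~ "small",
  x < 2 ~ "medium",
  TRUE ~ "large"
)
# "small" "medium" "large"

Now let's apply the same idea to standardize the pumpkin prices: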
# Convert the price if the Package contains fractional bushel values
new_pumpkins <- new_pumpkins %>%
  mutate(Price = case_when(
    str_detect(Package, "1 1/9") ~ Price/(1 + 1/9),
    str_detect(Package, "1/2") ~ Price/(1/2),
    TRUE ~ Price))

# View the first few rows of the data
new_pumpkins %>%
  slice_head(n = 30)
Now, we can analyze the pricing per unit based on their bushel measurement. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!
According to The Spruce Eats, a bushel's weight depends on the type of produce, as it's a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!

Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.
Now lastly, for the sheer sake of adventure, let's also move the Month column to the first position, i.e. before the column Package.
dplyr::relocate() is used to change column positions.
# Create a new data frame new_pumpkins
new_pumpkins <- new_pumpkins %>%
  relocate(Month, .before = Package)

new_pumpkins %>%
  slice_head(n = 7)
Good job! You now have a clean, tidy dataset on which you can build your new regression model!
+There is a wise saying that goes like this:
"The simple graph has brought more information to the data analyst's mind than any other device." - John Tukey
+
Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.
Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.
R offers a number of systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 allows you to compose graphs by combining independent components.
Let's start with a simple scatter plot for the Price and Month columns.
So in this case, we'll start with ggplot(), supply a dataset and aesthetic mapping (with aes()) and then add layers (like geom_point()) for scatter plots.
# Set a theme for the plots
theme_set(theme_light())

# Create a scatter plot
p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))
p + geom_point()
Is this a useful plot? Does anything about it surprise you?
It's not particularly useful as all it does is display your data as a spread of points in a given month.
To get charts to display useful data, you usually need to group the data somehow. For instance, in our case, finding the average price of pumpkins for each month would provide more insight into the underlying patterns in our data. This leads us to one more dplyr flyby:
+dplyr::group_by() %>% summarize()
Grouped aggregation in R can be easily computed using dplyr::group_by() %>% summarize():

dplyr::group_by() changes the unit of analysis from the complete dataset to individual groups such as per month.

dplyr::summarize() creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.
For example, we can use dplyr::group_by() %>% summarize() to group the pumpkins into groups based on the Month column and then find the mean price for each month.
# Find the average price of pumpkins per month
new_pumpkins %>%
  group_by(Month) %>%
  summarise(mean_price = mean(Price))
Succinct!
Categorical features such as months are better represented using a bar plot. The layers responsible for bar charts are geom_bar() and geom_col(). Consult ?geom_bar to find out more.
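As a brief, hedged aside: geom_col() draws bars whose heights come from values already in your data, while geom_bar() counts rows for you. A minimal sketch of the counting variant on the new_pumpkins data frame built above:

# How many bushel listings fall in each month (geom_bar counts the rows itself)
new_pumpkins %>%
  ggplot(aes(x = Month)) +
  geom_bar()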
Let's whip up one!
# Find the average price of pumpkins per month then plot a bar chart
new_pumpkins %>%
  group_by(Month) %>%
  summarise(mean_price = mean(Price)) %>%
  ggplot(aes(x = Month, y = mean_price)) +
  geom_col(fill = "midnightblue", alpha = 0.7) +
  ylab("Pumpkin Price")
This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
Congratulations on finishing the second lesson! You prepared your data for model building, then uncovered more insights using visualizations!
+