diff --git a/2-Regression/3-Linear/solution/R/lesson_3.Rmd b/2-Regression/3-Linear/solution/R/lesson_3.Rmd
index 50d4c134..8f58f371 100644
--- a/2-Regression/3-Linear/solution/R/lesson_3.Rmd
+++ b/2-Regression/3-Linear/solution/R/lesson_3.Rmd
@@ -84,8 +84,8 @@ We do so since we want to model a line that has the least cumulative distance fr
 >
 > In other words, and referring to our pumpkin data's original question: "predict the price of a pumpkin per bushel by month", `X` would refer to the price and `Y` would refer to the month of sale.
 >
-> ![Infographic by Jen Looper](../../images/calculation.png)
->
+![Infographic by Jen Looper](../../images/calculation.png)
+
 > Calculate the value of Y. If you're paying around \$4, it must be April!
 >
 > The math that calculates the line must demonstrate the slope of the line, which is also dependent on the intercept, or where `Y` is situated when `X = 0`.
@@ -114,7 +114,7 @@ Load up required libraries and dataset. Convert the data to a data frame contain
 
 - Convert the price to reflect the pricing by bushel quantity
 
-> We covered these steps in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/2-Data/solution/lesson_2-R.ipynb).
+> We covered these steps in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/2-Data/solution/lesson_2.html).
 
 ```{r load_tidy_verse_models, message=F, warning=F}
 # Load the core Tidyverse packages
@@ -285,7 +285,7 @@ That's an awesome thought! You see, once your recipe is defined, you can estimat
 
 For that, you'll need two more verbs: `prep()` and `bake()` and as always, our little R friends by [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) help you in understanding this better!
 
-![Artwork by \@allison_horst](../images/recipes.png){width="550"}
+![Artwork by \@allison_horst](../../images/recipes.png){width="550"}
 
 [`prep()`](https://recipes.tidymodels.org/reference/prep.html): estimates the required parameters from a training set that can be later applied to other data sets. For instance, for a given predictor column, what observation will be assigned integer 0 or 1 or 2 etc
diff --git a/2-Regression/3-Linear/solution/R/lesson_3.html b/2-Regression/3-Linear/solution/R/lesson_3.html
new file mode 100644
index 00000000..35a3fd1e
--- /dev/null
+++ b/2-Regression/3-Linear/solution/R/lesson_3.html
@@ -0,0 +1,3966 @@
So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using `ggplot2`. 💪

Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: *basic linear regression* and *polynomial regression*, along with some of the math underlying these techniques.

> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.

As a reminder, you are loading this data so as to ask questions of it.

- When is the best time to buy pumpkins?
- What price can I expect for a case of miniature pumpkins?
- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box?

Let's keep digging into this data.

In the previous lesson, you created a tibble (a modern reimagining of the data frame) and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points, and only for the fall months. Maybe we can get a little more detail about the nature of the data by cleaning it more? We'll see… 🕵️‍♀️
For this task, we'll require the following packages:

- `tidyverse`: The tidyverse is a collection of R packages designed to make data science faster, easier and more fun!
- `tidymodels`: The tidymodels framework is a collection of packages for modeling and machine learning.
- `janitor`: The janitor package provides simple little tools for examining and cleaning dirty data.
- `corrplot`: The corrplot package provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.

You can install them as:

```r
install.packages(c("tidyverse", "tidymodels", "janitor", "corrplot"))
```

The script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.

```r
suppressWarnings(if (!require("pacman")) install.packages("pacman"))

pacman::p_load(tidyverse, tidymodels, janitor, corrplot)
```

We'll later load these awesome packages and make them available in our current R session. (This is for mere illustration; `pacman::p_load()` already did that for you.)
As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a *line of best fit* to:

- **Show variable relationships**. Show the relationship between variables.
- **Make predictions**. Make accurate predictions on where a new data point would fall in relationship to that line.

To draw this type of line, we use a statistical technique called **Least-Squares Regression**. The term `least-squares` means that all the data points surrounding the regression line are squared and then added up. Ideally, that final sum is as small as possible, because we want a low number of errors, or `least-squares`. As such, the line of best fit is the line that gives us the lowest value for the sum of the squared errors, hence the name *least-squares regression*.

We do so since we want to model a line that has the least cumulative distance from all of our data points. We also square the terms before adding them since we are concerned with their magnitude rather than their direction.
> 🧮 Show me the math
>
> This line, called the *line of best fit*, can be expressed by an equation:
>
> `Y = a + bX`
>
> `X` is the 'explanatory variable' or 'predictor'. `Y` is the 'dependent variable' or 'outcome'. The slope of the line is `b` and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.
>
> First, calculate the slope `b`.
>
> In other words, and referring to our pumpkin data's original question: "predict the price of a pumpkin per bushel by month", `X` would refer to the price and `Y` would refer to the month of sale.

![Infographic by Jen Looper](../../images/calculation.png)

> Calculate the value of Y. If you're paying around $4, it must be April!
>
> The math that calculates the line must demonstrate the slope of the line, which is also dependent on the intercept, or where `Y` is situated when `X = 0`.
>
> You can observe the method of calculation for these values on the Math is Fun web site. Also visit this Least-squares calculator to watch how the numbers' values impact the line.
Not so scary, right? 🤓
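To make the math concrete, here's a minimal R sketch (toy numbers, not our pumpkin data) that computes the least-squares slope `b` and intercept `a` by hand, then checks them against R's built-in `lm()`:

```r
# Toy data: x could be a month, y a price
x <- c(8, 9, 10, 11, 12)
y <- c(18, 16, 15, 13, 12)

# Least-squares estimates: b = sum of co-deviations / sum of squared x-deviations
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)

# The same fit via lm(), for comparison
coef(lm(y ~ x))
```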
One more term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatterplot, you can quickly visualize this coefficient. A plot with datapoints scattered in a neat line has high correlation, but a plot with datapoints scattered everywhere between X and Y has low correlation.

A good linear regression model will be one that has a high (nearer to 1 than 0) Correlation Coefficient using the Least-Squares Regression method with a line of regression.
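As a quick illustration with made-up vectors (not the pumpkin data), `cor()` returns this coefficient directly:

```r
set.seed(123)
x <- 1:20

# Points that fall near a straight line -> correlation close to 1
y_strong <- 2 * x + rnorm(20, sd = 1)

# Points scattered everywhere -> correlation close to 0
y_weak <- rnorm(20, sd = 10)

c(strong = cor(x, y_strong), weak = cor(x, y_weak))
```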
Load up required libraries and dataset. Convert the data to a data frame containing a subset of the data:

- Only get pumpkins priced by the bushel
- Convert the date to a month
- Calculate the price to be an average of high and low prices
- Convert the price to reflect the pricing by bushel quantity

> We covered these steps in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/2-Data/solution/lesson_2.html).
```r
# Load the core Tidyverse packages
library(tidyverse)
library(lubridate)

# Import the pumpkins data
pumpkins <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv")

# Get a glimpse and dimensions of the data
glimpse(pumpkins)
```

```
## Rows: 1,757
## Columns: 26
## $ `City Name`       <chr> "BALTIMORE", "BALTIMORE", "BALTIMORE", "BALTIMORE", ~
## $ Type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Package           <chr> "24 inch bins", "24 inch bins", "24 inch bins", "24 ~
## $ Variety           <chr> NA, NA, "HOWDEN TYPE", "HOWDEN TYPE", "HOWDEN TYPE",~
## $ `Sub Variety`     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Grade             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Date              <chr> "4/29/17", "5/6/17", "9/24/16", "9/24/16", "11/5/16"~
## $ `Low Price`       <dbl> 270, 270, 160, 160, 90, 90, 160, 160, 160, 160, 160,~
## $ `High Price`      <dbl> 280, 280, 160, 160, 100, 100, 170, 160, 170, 160, 17~
## $ `Mostly Low`      <dbl> 270, 270, 160, 160, 90, 90, 160, 160, 160, 160, 160,~
## $ `Mostly High`     <dbl> 280, 280, 160, 160, 100, 100, 170, 160, 170, 160, 17~
## $ Origin            <chr> "MARYLAND", "MARYLAND", "DELAWARE", "VIRGINIA", "MAR~
## $ `Origin District` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ `Item Size`       <chr> "lge", "lge", "med", "med", "lge", "lge", "med", "lg~
## $ Color             <chr> NA, NA, "ORANGE", "ORANGE", "ORANGE", "ORANGE", "ORA~
## $ Environment       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ `Unit of Sale`    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Quality           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Condition         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Appearance        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Storage           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Crop              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Repack            <chr> "E", "E", "N", "N", "N", "N", "N", "N", "N", "N", "N~
## $ `Trans Mode`      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ...25             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ...26             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
```

```r
# Print the first 5 rows of the data set
pumpkins %>%
  slice_head(n = 5)
```
In the spirit of sheer adventure, let's explore the janitor package, which provides simple functions for examining and cleaning dirty data. For instance, let's take a look at the column names for our data:

```r
# Return column names
pumpkins %>%
  names()
```

```
##  [1] "City Name"       "Type"            "Package"         "Variety"        
##  [5] "Sub Variety"     "Grade"           "Date"            "Low Price"      
##  [9] "High Price"      "Mostly Low"      "Mostly High"     "Origin"         
## [13] "Origin District" "Item Size"       "Color"           "Environment"    
## [17] "Unit of Sale"    "Quality"         "Condition"       "Appearance"     
## [21] "Storage"         "Crop"            "Repack"          "Trans Mode"     
## [25] "...25"           "...26"
```

🤔 We can do better. Let's make these column names `friendR` by converting them to the snake_case convention using `janitor::clean_names`. To find out more about this function: `?clean_names`.
```r
# Clean names to the snake_case convention
pumpkins <- pumpkins %>%
  clean_names(case = "snake")

# Return column names
pumpkins %>%
  names()
```

```
##  [1] "city_name"       "type"            "package"         "variety"        
##  [5] "sub_variety"     "grade"           "date"            "low_price"      
##  [9] "high_price"      "mostly_low"      "mostly_high"     "origin"         
## [13] "origin_district" "item_size"       "color"           "environment"    
## [17] "unit_of_sale"    "quality"         "condition"       "appearance"     
## [21] "storage"         "crop"            "repack"          "trans_mode"     
## [25] "x25"             "x26"
```

Much tidyR 🧹! Now, a dance with the data using `dplyr` as in the previous lesson! 💃
```r
# Select desired columns
pumpkins <- pumpkins %>%
  select(variety, city_name, package, low_price, high_price, date)

# Extract the month from the dates to a new column
pumpkins <- pumpkins %>%
  mutate(date = mdy(date),
         month = month(date)) %>%
  select(-date)

# Create a new column for average Price
pumpkins <- pumpkins %>%
  mutate(price = (low_price + high_price) / 2)

# Retain only pumpkins with the string "bushel"
new_pumpkins <- pumpkins %>%
  filter(str_detect(string = package, pattern = "bushel"))

# Normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel
new_pumpkins <- new_pumpkins %>%
  mutate(price = case_when(
    str_detect(package, "1 1/9") ~ price / (1.1),
    str_detect(package, "1/2") ~ price * 2,
    TRUE ~ price))

# Relocate column positions
new_pumpkins <- new_pumpkins %>%
  relocate(month, .before = variety)

# Display the first 5 rows
new_pumpkins %>%
  slice_head(n = 5)
```

Good job! 👌 You now have a clean, tidy data set on which you can build your new regression model!
+Mind a scatter plot?
```r
# Set theme
theme_set(theme_light())

# Make a scatter plot of month and price
new_pumpkins %>%
  ggplot(mapping = aes(x = month, y = price)) +
  geom_point(size = 1.6)
```

A scatter plot reminds us that we only have month data from August through December. We probably need more data to be able to draw conclusions in a linear fashion.
+Let’s take a look at our modelling data again:
```r
# Display first 5 rows
new_pumpkins %>%
  slice_head(n = 5)
```

What if we wanted to predict the `price` of a pumpkin based on the `city` or `package` columns, which are of type character? Or even more simply, how could we find the correlation (which requires both of its inputs to be numeric) between, say, `package` and `price`? 🤷🤷
Machine learning models work best with numeric features rather than text values, so you generally need to convert categorical features into numeric representations.

This means that we have to find a way to reformat our predictors to make them easier for a model to use effectively, a process known as `feature engineering`.
Different models have different preprocessing requirements. For instance, least squares requires `encoding categorical variables` such as month, variety and city_name. This simply involves translating a column with `categorical values` into one or more `numeric columns` that take the place of the original.

For example, suppose your data includes the following categorical feature:

| city    |
|---------|
| Denver  |
| Nairobi |
| Tokyo   |

You can apply *ordinal encoding* to substitute a unique integer value for each category, like this:

| city |
|------|
| 0    |
| 1    |
| 2    |
And that’s what we’ll do to our data!
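As a quick aside, here's a minimal base-R sketch of that mapping (the lesson will use the recipes package for this in a moment); the `city` vector is just the toy column from the tables above:

```r
# Toy categorical column from the example above
city <- c("Denver", "Nairobi", "Tokyo")

# factor() assigns integer codes 1, 2, 3; subtract 1 for a zero-based encoding
as.integer(factor(city, levels = unique(city))) - 1
#> [1] 0 1 2
```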
In this section, we'll explore another amazing Tidymodels package: recipes, which is designed to help you preprocess your data **before** training your model. At its core, a recipe is an object that defines what steps should be applied to a data set in order to get it ready for modelling.

Now, let's create a recipe that prepares our data for modelling by substituting a unique integer for all the observations in the predictor columns:
```r
# Specify a recipe
pumpkins_recipe <- recipe(price ~ ., data = new_pumpkins) %>%
  step_integer(all_predictors(), zero_based = TRUE)

# Print out the recipe
pumpkins_recipe
```

```
## 
## -- Recipe ----------------------------------------------------------------------
## 
## -- Inputs
## Number of variables by role
## outcome:   1
## predictor: 6
## 
## -- Operations
## * Integer encoding for: all_predictors()
```

Awesome! 👏 We just created our first recipe that specifies an outcome (price) and its corresponding predictors, and that all the predictor columns should be encoded into a set of integers 🙌! Let's quickly break it down:
- The call to `recipe()` with a formula tells the recipe the *roles* of the variables using `new_pumpkins` data as the reference. For instance, the `price` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.
- `step_integer(all_predictors(), zero_based = TRUE)` specifies that all the predictors should be converted into a set of integers with the numbering starting at 0.

We are sure you may be having thoughts such as: "This is so cool!! But what if I needed to confirm that the recipes are doing exactly what I expect them to do? 🤔"
That's an awesome thought! You see, once your recipe is defined, you can estimate the parameters required to actually preprocess the data, and then extract the processed data. You don't typically need to do this when you use Tidymodels (we'll see the normal convention in just a minute -> `workflows`), but it can come in handy when you want to do some kind of sanity check to confirm that recipes are doing what you expect.

For that, you'll need two more verbs: `prep()` and `bake()`, and as always, our little R friends by [Allison Horst](https://github.com/allisonhorst/stats-illustrations) help you in understanding this better!

![Artwork by @allison_horst](../../images/recipes.png){width="550"}
- [`prep()`](https://recipes.tidymodels.org/reference/prep.html): estimates the required parameters from a training set that can be later applied to other data sets. For instance, for a given predictor column, what observation will be assigned integer 0 or 1 or 2, etc.
- [`bake()`](https://recipes.tidymodels.org/reference/bake.html): takes a prepped recipe and applies the operations to any data set.

That said, let's prep and bake our recipes to really confirm that, under the hood, the predictor columns will be first encoded before a model is fit.
```r
# Prep the recipe
pumpkins_prep <- prep(pumpkins_recipe)

# Bake the recipe to extract a preprocessed new_pumpkins data
baked_pumpkins <- bake(pumpkins_prep, new_data = NULL)

# Print out the baked data set
baked_pumpkins %>%
  slice_head(n = 10)
```

Woo-hoo! 🥳 The processed data `baked_pumpkins` has all its predictors encoded, confirming that the preprocessing steps defined in our recipe work as expected. This makes it harder for you to read but much more intelligible for Tidymodels! Take some time to find out what observation has been mapped to a corresponding integer.
It is also worth mentioning that `baked_pumpkins` is a data frame on which we can perform computations.

For instance, let's try to find a good correlation between two points of your data to potentially build a good predictive model. We'll use the function `cor()` to do this. Type `?cor()` to find out more about the function.
```r
# Find the correlation between the city_name and the price
cor(baked_pumpkins$city_name, baked_pumpkins$price)
```

```
## [1] 0.3236397
```

```r
# Find the correlation between the package and the price
cor(baked_pumpkins$package, baked_pumpkins$price)
```

```
## [1] 0.6061713
```

As it turns out, there's only weak correlation between the City and Price. However, there's somewhat better correlation between the Package and its Price. That makes sense, right? Normally, the bigger the produce box, the higher the price.
While we are at it, let's also try and visualize a correlation matrix of all the columns using the `corrplot` package.

```r
# Load the corrplot package
library(corrplot)

# Obtain correlation matrix
corr_mat <- cor(baked_pumpkins %>%
                  # Drop columns that are not really informative
                  select(-c(low_price, high_price)))

# Make a correlation plot between the variables
corrplot(corr_mat, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45, addCoef.col = "black", cl.pos = "n", order = "original")
```
🤩🤩 Much better.
A good question to now ask of this data would be: '`What price can I expect of a given pumpkin package?`' Let's get right into it!
> **Note**: When you `bake()` the prepped recipe `pumpkins_prep` with `new_data = NULL`, you extract the processed (i.e. encoded) training data. If you had another data set, for example a test set, and wanted to see how a recipe would preprocess it, you would simply bake `pumpkins_prep` with `new_data = test_set`.
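For instance, a minimal sketch of that second case (using the `pumpkins_test` split we create in the next section) would be:

```r
# Apply the same learned preprocessing steps to unseen data
baked_test <- bake(pumpkins_prep, new_data = pumpkins_test)
```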
Now that we have built a recipe, and actually confirmed that the data will be preprocessed appropriately, let's now build a regression model to answer the question: `What price can I expect of a given pumpkin package?`
As you may have already figured out, the column *price* is the `outcome` variable while the *package* column is the `predictor` variable.

To do this, we'll first split the data such that 80% goes into a training set and 20% into a test set, then define a recipe that will encode the predictor column into a set of integers, then build a model specification. We won't prep and bake our recipe since we already know it will preprocess the data as expected.
```r
set.seed(2056)
# Split the data into training and test sets
pumpkins_split <- new_pumpkins %>%
  initial_split(prop = 0.8)

# Extract training and test data
pumpkins_train <- training(pumpkins_split)
pumpkins_test <- testing(pumpkins_split)

# Create a recipe for preprocessing the data
lm_pumpkins_recipe <- recipe(price ~ package, data = pumpkins_train) %>%
  step_integer(all_predictors(), zero_based = TRUE)

# Create a linear model specification
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
```
Good job! Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data (prep + bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities. How's that for your peace of mind! 🤩

In Tidymodels, this convenient object is called a `workflow` and conveniently holds your modeling components! This is what we'd call *pipelines* in Python.

So let's bundle everything up into a workflow! 📦
```r
# Hold modelling components in a workflow
lm_wf <- workflow() %>%
  add_recipe(lm_pumpkins_recipe) %>%
  add_model(lm_spec)

# Print out the workflow
lm_wf
```

```
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
## 
## * step_integer()
## 
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
👌 Into the bargain, a workflow can be fit/trained in much the same way a model can.
```r
# Train the model
lm_wf_fit <- lm_wf %>%
  fit(data = pumpkins_train)

# Print the model coefficients learned
lm_wf_fit
```

```
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
## 
## * step_integer()
## 
## -- Model -----------------------------------------------------------------------
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)      package  
##      19.977        4.884
```
From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that gives us the lowest overall error between the actual and predicted variable.

It's time to see how the model performed 📏! How do we do this?
Now that we've trained the model, we can use it to make predictions for the test_set using `parsnip::predict()`. Then we can compare these predictions to the actual label values to evaluate how well (or not!) the model is working.

Let's start with making predictions for the test set, then bind the columns to the test set.
```r
# Make predictions for the test set
predictions <- lm_wf_fit %>%
  predict(new_data = pumpkins_test)

# Bind predictions to the test set
lm_results <- pumpkins_test %>%
  select(c(package, price)) %>%
  bind_cols(predictions)

# Print the first ten rows of the tibble
lm_results %>%
  slice_head(n = 10)
```

Yes, you have just trained a model and used it to make predictions! 🔮 Is it any good? Let's evaluate the model's performance!
In Tidymodels, we do this using `yardstick::metrics()`! For linear regression, let's focus on the following metrics:

- `Root Mean Square Error (RMSE)`: The square root of the MSE. This yields an absolute metric in the same unit as the label (in this case, the price of a pumpkin). The smaller the value, the better the model (in a simplistic sense, it represents the average price by which the predictions are wrong!).
- `Coefficient of Determination (usually known as R-squared or R2)`: A relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.
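Before calling `yardstick::metrics()`, here's a minimal base-R sketch of what these two metrics compute, using hypothetical toy vectors (the names `truth` and `estimate` are illustrative, not our model's output). Note that `yardstick::rsq()` reports the squared correlation between truth and estimate:

```r
# Hypothetical actual and predicted values, for illustration only
truth    <- c(10, 12, 15, 18)
estimate <- c(11, 11, 16, 17)

# RMSE: the square root of the mean of the squared errors
sqrt(mean((truth - estimate)^2))

# R-squared, computed (as yardstick's rsq() does) as the squared correlation
cor(truth, estimate)^2
```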
```r
# Evaluate performance of linear regression
metrics(data = lm_results,
        truth = price,
        estimate = .pred)
```

There goes the model performance. Let's see if we can get a better indication by visualizing a scatter plot of the package and price, then use the predictions made to overlay a line of best fit.

This means we'll have to prep and bake the test set in order to encode the package column, then bind this to the predictions made by our model.
```r
# Encode package column
package_encode <- lm_pumpkins_recipe %>%
  prep() %>%
  bake(new_data = pumpkins_test) %>%
  select(package)

# Bind encoded package column to the results
lm_results <- lm_results %>%
  bind_cols(package_encode %>%
              rename(package_integer = package)) %>%
  relocate(package_integer, .after = package)

# Print new results data frame
lm_results %>%
  slice_head(n = 5)
```
```r
# Make a scatter plot
lm_results %>%
  ggplot(mapping = aes(x = package_integer, y = price)) +
  geom_point(size = 1.6) +
  # Overlay a line of best fit
  geom_line(aes(y = .pred), color = "orange", size = 1.2) +
  xlab("package")
```

```
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```

Great! As you can see, the linear regression model does not generalize the relationship between a package and its corresponding price very well.
🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!

Sometimes our data may not have a linear relationship, but we still want to predict an outcome. Polynomial regression can help us make predictions for more complex non-linear relationships.

Take for instance the relationship between the package and price for our pumpkins data set. While sometimes there's a linear relationship between variables - the bigger the pumpkin in volume, the higher the price - sometimes these relationships can't be plotted as a plane or straight line.
> ✅ Here are some more examples of data that could use polynomial regression.

Take another look at the relationship between Variety and Price in the previous plot. Does this scatterplot seem like it should necessarily be analyzed by a straight line? Perhaps not. In this case, you can try polynomial regression.

> ✅ Polynomials are mathematical expressions that might consist of one or more variables and coefficients.

Polynomial regression creates a curved line to better fit nonlinear data.
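If you're curious what those polynomial features look like, here's a minimal sketch using base R's `poly()`, which produces the orthogonal polynomial basis that `step_poly()` uses by default:

```r
# Expand a single predictor into degree-4 orthogonal polynomial columns
x <- 1:10
head(poly(x, degree = 4))
```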
Let's see whether a polynomial model will perform better in making predictions. We'll follow a somewhat similar procedure as we did before:

- Create a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling, i.e.: encoding predictors and computing polynomials of degree *n*
- Build a model specification
- Bundle the recipe and model specification into a workflow
- Create a model by fitting the workflow
- Evaluate how well the model performs on the test data

Let's get right into it!
```r
# Specify a recipe
poly_pumpkins_recipe <-
  recipe(price ~ package, data = pumpkins_train) %>%
  step_integer(all_predictors(), zero_based = TRUE) %>%
  step_poly(all_predictors(), degree = 4)

# Create a model specification
poly_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Bundle recipe and model spec into a workflow
poly_wf <- workflow() %>%
  add_recipe(poly_pumpkins_recipe) %>%
  add_model(poly_spec)

# Create a model
poly_wf_fit <- poly_wf %>%
  fit(data = pumpkins_train)

# Print learned model coefficients
poly_wf_fit
```
```
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_integer()
## * step_poly()
## 
## -- Model -----------------------------------------------------------------------
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##    (Intercept)  package_poly_1  package_poly_2  package_poly_3  package_poly_4  
##         27.818         104.444        -113.001         -56.399           1.044
```

👏👏 You've built a polynomial model; let's make predictions on the test set!
```r
# Make price predictions on test data
poly_results <- poly_wf_fit %>% predict(new_data = pumpkins_test) %>%
  bind_cols(pumpkins_test %>% select(c(package, price))) %>%
  relocate(.pred, .after = last_col())

# Print the results
poly_results %>%
  slice_head(n = 10)
```

Woo-hoo, let's evaluate how the model performed on the test_set using `yardstick::metrics()`.

```r
metrics(data = poly_results, truth = price, estimate = .pred)
```
🤩🤩 Much better performance.
The `rmse` decreased from about 7 to about 3, an indication of reduced error between the actual price and the predicted price. You can loosely interpret this as meaning that, on average, incorrect predictions are wrong by around $3. The `rsq` increased from about 0.4 to 0.8.
All these metrics indicate that the polynomial model performs way better than the linear model. Good job!
+Let’s see if we can visualize this!
```r
# Bind encoded package column to the results
poly_results <- poly_results %>%
  bind_cols(package_encode %>%
              rename(package_integer = package)) %>%
  relocate(package_integer, .after = package)

# Print new results data frame
poly_results %>%
  slice_head(n = 5)
```

```r
# Make a scatter plot
poly_results %>%
  ggplot(mapping = aes(x = package_integer, y = price)) +
  geom_point(size = 1.6) +
  # Overlay a line of best fit
  geom_line(aes(y = .pred), color = "midnightblue", size = 1.2) +
  xlab("package")
```
You can see a curved line that fits your data better! 🤩
You can make this smoother by passing a polynomial formula to `geom_smooth` like this:
```r
# Make a scatter plot
poly_results %>%
  ggplot(mapping = aes(x = package_integer, y = price)) +
  geom_point(size = 1.6) +
  # Overlay a line of best fit
  geom_smooth(method = lm, formula = y ~ poly(x, degree = 4), color = "midnightblue", size = 1.2, se = FALSE) +
  xlab("package")
```
Much like a smooth curve!🤩
+Here’s how you would make a new prediction:
```r
# Make a hypothetical data frame
hypo_tibble <- tibble(package = "bushel baskets")

# Make predictions using linear model
lm_pred <- lm_wf_fit %>% predict(new_data = hypo_tibble)

# Make predictions using polynomial model
poly_pred <- poly_wf_fit %>% predict(new_data = hypo_tibble)

# Return predictions in a list
list("linear model prediction" = lm_pred,
     "polynomial model prediction" = poly_pred)
```

```
## $`linear model prediction`
## # A tibble: 1 x 1
##   .pred
##   <dbl>
## 1  34.6
## 
## $`polynomial model prediction`
## # A tibble: 1 x 1
##   .pred
##   <dbl>
## 1  46.6
```
The `polynomial model` prediction does make sense, given the scatter plots of `price` and `package`! And, if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!
🏆 Well done! You created two regression models in one lesson. In the final section on regression, you will learn about *logistic regression* to determine categories.

Test several different variables in this notebook to see how correlation corresponds to model accuracy.

In this lesson we learned about Linear Regression. There are other important types of Regression. Read about Stepwise, Ridge, Lasso and Elasticnet techniques. A good course to study to learn more is the Stanford Statistical Learning course.
If you want to learn more about how to use the amazing Tidymodels framework, please check out the following resources:

- Tidymodels website: Get started with Tidymodels
- Max Kuhn and Julia Silge, *Tidy Modeling with R*

Thanks to Allison Horst for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her gallery.
+