{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_3-R.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
},
"coopTranslator": {
"original_hash": "5015d65d61ba75a223bfc56c273aa174",
"translation_date": "2025-11-18T19:19:28+00:00",
"source_file": "2-Regression/3-Linear/solution/R/lesson_3-R.ipynb",
"language_code": "pcm"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Build regression model: linear and polynomial regression model\n"
],
"metadata": {
"id": "EgQw8osnsUV-"
}
},
{
"cell_type": "markdown",
"source": [
"## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3\n",
"<p >\n",
" <img src=\"../../../../../../translated_images/pcm/linear-polynomial.5523c7cb6576ccab.webp\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"#### Introduction\n",
"\n",
"So far, you don learn wetin regression mean wit sample data wey dem gather from pumpkin pricing dataset wey we go use for dis lesson. You don also show am for graph using `ggplot2`.💪\n",
"\n",
"Now, you ready to go deeper into regression for ML. For dis lesson, you go sabi more about two types of regression: *basic linear regression* and *polynomial regression*, plus some of di math wey dey behind dis techniques.\n",
"\n",
"> For dis curriculum, we dey assume say you no sabi plenty math, and we wan make am easy for students wey dey come from other fields, so dey look out for notes, 🧮 callouts, diagrams, and other learning tools wey go help you understand am well.\n",
"\n",
"#### Preparation\n",
"\n",
"Make we remember say you dey load dis data so you fit ask questions about am.\n",
"\n",
"- When e go make sense to buy pumpkins?\n",
"\n",
"- How much I fit expect to pay for one case of miniature pumpkins?\n",
"\n",
"- E better make I buy dem for half-bushel baskets or for di 1 1/9 bushel box? Make we continue to check dis data.\n",
"\n",
"For di last lesson, you create one `tibble` (na modern way to reimagine data frame) and you put part of di original dataset inside, wey standardize di pricing by di bushel. But as you do am like dat, you only fit gather about 400 data points and na only for di fall months. Maybe we fit get more details about di nature of di data if we clean am well? We go see... 🕵️‍♀️\n",
"\n",
"For dis task, we go need di following packages:\n",
"\n",
"- `tidyverse`: Di [tidyverse](https://www.tidyverse.org/) na [collection of R packages](https://www.tidyverse.org/packages) wey dey make data science fast, easy and fun!\n",
"\n",
"- `tidymodels`: Di [tidymodels](https://www.tidymodels.org/) framework na [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `janitor`: Di [janitor package](https://github.com/sfirke/janitor) dey provide simple tools wey dey help check and clean dirty data.\n",
"\n",
"- `corrplot`: Di [corrplot package](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) dey provide one visual tool for correlation matrix wey dey support automatic variable reordering to help find hidden patterns among variables.\n",
"\n",
"You fit install dem like dis:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"corrplot\"))`\n",
"\n",
"Di script wey dey below go check whether you get di packages wey you need to complete dis module and e go install dem for you if dem no dey.\n"
],
"metadata": {
"id": "WqQPS1OAsg3H"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, corrplot)"
],
"outputs": [],
"metadata": {
"id": "tA4C2WN3skCf",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "c06cd805-5534-4edc-f72b-d0d1dab96ac0"
}
},
{
"cell_type": "markdown",
"source": [
"We go later load dis beta packages dem and make dem dey available for our current R session. (Dis na just for show, `pacman::p_load()` don already do am for you)\n",
"\n",
"## 1. Linear regression line\n",
"\n",
"As you learn for Lesson 1, di goal of linear regression na to fit one *line* *of* *best fit* wey go:\n",
"\n",
"- **Show how variables relate**. Show di connection wey dey between di variables dem.\n",
"\n",
"- **Make predictions**. Make correct predictions about where new data point go fall for di line.\n",
"\n",
"To draw dis kain line, we dey use one statistical method wey dem dey call **Least-Squares Regression**. Di word `least-squares` mean say all di data points wey dey around di regression line go dey squared and then we go add dem together. Di goal na to make sure say di final sum dey as small as e fit be, because we want make error no too plenty, or `least-squares`. So, di line of best fit na di line wey go give us di smallest value for di sum of di squared errors - na why dem dey call am *least squares regression*.\n",
"\n",
"We dey do am like dis because we wan model one line wey go get di least total distance from all our data points. We dey square di terms before we add dem because we dey focus on di size (magnitude) and no di direction.\n",
"\n",
"> **🧮 Show me di maths**\n",
">\n",
"> Dis line, wey dem dey call *line of best fit* fit be expressed by [one equation](https://en.wikipedia.org/wiki/Simple_linear_regression):\n",
">\n",
"> Y = a + bX\n",
">\n",
"> `X` na di '`explanatory variable` or `predictor`'. `Y` na di '`dependent variable` or `outcome`'. Di slope of di line na `b` and `a` na di y-intercept, wey mean di value of `Y` when `X = 0`.\n",
">\n",
"\n",
"> ![](../../../../../../2-Regression/3-Linear/solution/images/slope.png \"slope = $y/x$\")\n",
" Infographic by Jen Looper\n",
">\n",
"> First, calculate di slope `b`.\n",
">\n",
"> For example, if we dey talk about our pumpkin data question: \"predict di price of pumpkin per bushel by month\", `X` go mean di price and `Y` go mean di month wey dem sell am.\n",
">\n",
"> ![](../../../../../../translated_images/calculation.989aa7822020d9d0ba9fc781f1ab5192f3421be86ebb88026528aef33c37b0d8.pcm.png)\n",
" Infographic by Jen Looper\n",
"> \n",
"> Calculate di value of Y. If you dey pay around \\$4, e mean say na April!\n",
">\n",
"> Di maths wey dey calculate di line go show di slope of di line, wey still depend on di intercept, or where `Y` dey when `X = 0`.\n",
">\n",
"> You fit see di method wey dem dey use calculate dis values for [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) website. You fit also check [dis Least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to see how di numbers dey affect di line.\n",
"\n",
"E no too hard, abi? 🤓\n",
"\n",
"#### Correlation\n",
"\n",
"One more word wey you need sabi na **Correlation Coefficient** between di X and Y variables. If you use scatterplot, you fit quick see dis coefficient. Scatterplot wey di points dey arrange well for straight line get high correlation, but scatterplot wey di points dey scatter anyhow between X and Y get low correlation.\n",
"\n",
"Beta linear regression model go be di one wey get high (near 1 pass 0) Correlation Coefficient using di Least-Squares Regression method with regression line.\n"
],
"metadata": {
"id": "cdX5FRpvsoP5"
}
},
{
"cell_type": "markdown",
"source": [
"## **2. Dance wit data: how we go take create data frame wey we go use for modelling**\n",
"\n",
"<p >\n",
" <img src=\"../../../../../../translated_images/pcm/janitor.e4a77dd3d3e6a32e.webp\"\n",
" width=\"700\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
],
"metadata": {
"id": "WdUKXk7Bs8-V"
}
},
{
"cell_type": "markdown",
"source": [
"Load di libraries wey you need and di dataset. Change di data to data frame wey get only small part of di data:\n",
"\n",
"- Only collect pumpkins wey dem price by bushel\n",
"\n",
"- Change di date to month\n",
"\n",
"- Calculate di price make e be average of di high and low prices\n",
"\n",
"- Change di price make e show di pricing by bushel quantity\n",
"\n",
"> We don talk about dis steps for [di previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/2-Data/solution/lesson_2-R.ipynb).\n"
],
"metadata": {
"id": "fMCtu2G2s-p8"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core Tidyverse packages\n",
"library(tidyverse)\n",
"library(lubridate)\n",
"\n",
"# Import the pumpkins data\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(pumpkins)\n",
"\n",
"\n",
"# Print the first 50 rows of the data set\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "ryMVZEEPtERn"
}
},
{
"cell_type": "markdown",
"source": [
"For di spirit of adventure, make we explore di [`janitor package`](../../../../../../2-Regression/3-Linear/solution/R/github.com/sfirke/janitor) wey dey provide simple functions to check and clean dirty data. For example, make we look di column names for our data:\n"
],
"metadata": {
"id": "xcNxM70EtJjb"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Return column names\n",
"pumpkins %>% \n",
" names()"
],
"outputs": [],
"metadata": {
"id": "5XtpaIigtPfW"
}
},
{
"cell_type": "markdown",
"source": [
"🤔 We fit do beta. Make we change dis column name `friendR` by convert dem to [snake_case](https://en.wikipedia.org/wiki/Snake_case) way using `janitor::clean_names`. To sabi more about dis function: `?clean_names`\n"
],
"metadata": {
"id": "IbIqrMINtSHe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Clean names to the snake_case convention\n",
"pumpkins <- pumpkins %>% \n",
" clean_names(case = \"snake\")\n",
"\n",
"# Return column names\n",
"pumpkins %>% \n",
" names()"
],
"outputs": [],
"metadata": {
"id": "a2uYvclYtWvX"
}
},
{
"cell_type": "markdown",
"source": [
"Plenty tidyR 🧹! Now, time to dance wit di data usin `dplyr` like we do for di last lesson! 💃\n"
],
"metadata": {
"id": "HfhnuzDDtaDd"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(variety, city_name, package, low_price, high_price, date)\n",
"\n",
"\n",
"\n",
"# Extract the month from the dates to a new column\n",
"pumpkins <- pumpkins %>%\n",
" mutate(date = mdy(date),\n",
" month = month(date)) %>% \n",
" select(-date)\n",
"\n",
"\n",
"\n",
"# Create a new column for average Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(price = (low_price + high_price)/2)\n",
"\n",
"\n",
"# Retain only pumpkins with the string \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(string = package, pattern = \"bushel\"))\n",
"\n",
"\n",
"# Normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(price = case_when(\n",
" str_detect(package, \"1 1/9\") ~ price/(1.1),\n",
" str_detect(package, \"1/2\") ~ price*2,\n",
" TRUE ~ price))\n",
"\n",
"# Relocate column positions\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(month, .before = variety)\n",
"\n",
"\n",
"# Display the first 5 rows\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "X0wU3gQvtd9f"
}
},
{
"cell_type": "markdown",
"source": [
"Good work!👌 You don get clean, correct data set wey you fit use build your new regression model!\n",
"\n",
"How you see scatter plot?\n"
],
"metadata": {
"id": "UpaIwaxqth82"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Set theme\n",
"theme_set(theme_light())\n",
"\n",
"# Make a scatter plot of month and price\n",
"new_pumpkins %>% \n",
" ggplot(mapping = aes(x = month, y = price)) +\n",
" geom_point(size = 1.6)\n"
],
"outputs": [],
"metadata": {
"id": "DXgU-j37tl5K"
}
},
{
"cell_type": "markdown",
"source": [
"Scatter plot dey remind us say we only get month data from August go reach December. E be like say we go need more data to fit take draw conclusion for linear way.\n",
"\n",
"Make we check our modelling data again:\n"
],
"metadata": {
"id": "Ve64wVbwtobI"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Display first 5 rows\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "HFQX2ng1tuSJ"
}
},
{
"cell_type": "markdown",
"source": [
"Wetin go happen if we wan predict di `price` of pumpkin based on di `city` or `package` columns wey be character type? Or even beta, how we fit find di correlation (wey need both inputs to be numeric) between, like, `package` and `price`? 🤷🤷\n",
"\n",
"Machine learning models dey work well wit numeric features pass text values, so normally, you go need change categorical features to numeric format.\n",
"\n",
"Dis mean say we go need find way to reformat our predictors so e go dey easy for model to use am well, dis process na wetin dem dey call `feature engineering`.\n"
],
"metadata": {
"id": "7hsHoxsStyjJ"
}
},
{
"cell_type": "markdown",
"source": [
"## 3. Preprocessing data for modelling wit recipes 👩‍🍳👨‍🍳\n",
"\n",
"Di work wey dey reformat predictor values so dat model go fit use am well well, dem dey call am `feature engineering`.\n",
"\n",
"Different models get different preprocessing requirements. For example, least squares dey need `encoding categorical variables` like month, variety and city_name. Dis one na just to `translate` one column wey get `categorical values` into one or more `numeric columns` wey go replace di original column.\n",
"\n",
"For example, make we say your data get di following categorical feature:\n",
"\n",
"| city |\n",
"|:-------:|\n",
"| Denver |\n",
"| Nairobi |\n",
"| Tokyo |\n",
"\n",
"You fit use *ordinal encoding* to change each category to unique integer value, like dis:\n",
"\n",
"| city |\n",
"|:----:|\n",
"| 0 |\n",
"| 1 |\n",
"| 2 |\n",
"\n",
"Na wetin we go do to our data be dat!\n",
"\n",
"For dis section, we go check out one other correct Tidymodels package: [recipes](https://tidymodels.github.io/recipes/) - e dey help you preprocess your data **before** you train your model. Di main thing for recipe be say e be object wey dey define di steps wey you go apply to di data set so e go ready for modelling.\n",
"\n",
"Now, make we create one recipe wey go prepare our data for modelling by changing all di observations for di predictor columns to unique integer:\n"
],
"metadata": {
"id": "AD5kQbcvt3Xl"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Specify a recipe\n",
"pumpkins_recipe <- recipe(price ~ ., data = new_pumpkins) %>% \n",
" step_integer(all_predictors(), zero_based = TRUE)\n",
"\n",
"\n",
"# Print out the recipe\n",
"pumpkins_recipe"
],
"outputs": [],
"metadata": {
"id": "BNaFKXfRt9TU"
}
},
{
"cell_type": "markdown",
"source": [
"Wow! 👏 We don create our first recipe wey dey show outcome (price) and di predictors wey match am, plus e go make all di predictor columns turn set of integers 🙌! Make we break am down quick:\n",
"\n",
"- Di call to `recipe()` wey get formula dey tell di recipe di *roles* of di variables using `new_pumpkins` data as di reference. For example, di `price` column don get `outcome` role, while di other columns don get `predictor` role.\n",
"\n",
"- `step_integer(all_predictors(), zero_based = TRUE)` dey specify say all di predictors go turn set of integers wey numbering go start from 0.\n",
"\n",
"We sabi say you fit dey reason something like: \"This thing make sense die!! But wetin if I wan confirm say di recipes dey do wetin I expect dem to do? 🤔\"\n",
"\n",
"Na correct question be dat! You see, once you don define your recipe, you fit estimate di parameters wey you need to preprocess di data, and then extract di processed data. Normally, you no go need to do dis when you dey use Tidymodels (we go soon see di normal way-\\> `workflows`), but e go help you do sanity check to confirm say di recipes dey work as you expect.\n",
"\n",
"To do dat, you go need two more verbs: `prep()` and `bake()`, and as usual, our small R friends wey [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) don create go help you understand am better!\n",
"\n",
"<p >\n",
" <img src=\"../../../../../../translated_images/pcm/recipes.9ad10d8a4056bf89.webp\"\n",
" width=\"550\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
],
"metadata": {
"id": "KEiO0v7kuC9O"
}
},
{
"cell_type": "markdown",
"source": [
"[`prep()`](https://recipes.tidymodels.org/reference/prep.html): e go calculate di parameters wey dem need from di training set, wey dem fit use later for other data sets. For example, for one predictor column, which observation dem go give integer 0 or 1 or 2, etc.\n",
"\n",
"[`bake()`](https://recipes.tidymodels.org/reference/bake.html): e go carry one prepped recipe and use di operations for any data set.\n",
"\n",
"So, make we prep and bake our recipes to really confirm say for di background, dem go first encode di predictor columns before dem fit di model.\n"
],
"metadata": {
"id": "Q1xtzebuuTCP"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Prep the recipe\n",
"pumpkins_prep <- prep(pumpkins_recipe)\n",
"\n",
"# Bake the recipe to extract a preprocessed new_pumpkins data\n",
"baked_pumpkins <- bake(pumpkins_prep, new_data = NULL)\n",
"\n",
"# Print out the baked data set\n",
"baked_pumpkins %>% \n",
" slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "FGBbJbP_uUUn"
}
},
{
"cell_type": "markdown",
"source": [
"Yay! 🥳 Di processed data `baked_pumpkins` don get all im predictors encoded, e mean say di preprocessing steps wey we define as our recipe go work well as we expect. E go make am hard for you to read, but e go dey more clear for Tidymodels! Take small time check wetin observation don map to di correct integer.\n",
"\n",
"E good make we mention say `baked_pumpkins` na data frame wey we fit use do calculations.\n",
"\n",
"For example, make we try find better correlation between two points for your data so we fit build better predictive model. We go use di function `cor()` do am. Type `?cor()` to learn more about di function.\n"
],
"metadata": {
"id": "1dvP0LBUueAW"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the correlation between the city_name and the price\n",
"cor(baked_pumpkins$city_name, baked_pumpkins$price)\n",
"\n",
"# Find the correlation between the package and the price\n",
"cor(baked_pumpkins$package, baked_pumpkins$price)\n"
],
"outputs": [],
"metadata": {
"id": "3bQzXCjFuiSV"
}
},
{
"cell_type": "markdown",
"source": [
"E be like say di connection wey dey between City and Price no strong. But di connection wey dey between Package and im Price dey small strong pass. E make sense abi? Normally, di bigger di box wey carry produce, di higher di price.\n",
"\n",
"As we dey talk am, make we try use di `corrplot` package take show di correlation matrix for all di columns.\n"
],
"metadata": {
"id": "BToPWbgjuoZw"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the corrplot package\n",
"library(corrplot)\n",
"\n",
"# Obtain correlation matrix\n",
"corr_mat <- cor(baked_pumpkins %>% \n",
" # Drop columns that are not really informative\n",
" select(-c(low_price, high_price)))\n",
"\n",
"# Make a correlation plot between the variables\n",
"corrplot(corr_mat, method = \"shade\", shade.col = NA, tl.col = \"black\", tl.srt = 45, addCoef.col = \"black\", cl.pos = \"n\", order = \"original\")"
],
"outputs": [],
"metadata": {
"id": "ZwAL3ksmutVR"
}
},
{
"cell_type": "markdown",
"source": [
"🤩🤩 Dis one beta well well.\n",
"\n",
"Beta question wey we fit ask for dis data now be: '`Wetin be di price wey I fit expect for one pumpkin package?`' Make we dive enter am sharp sharp!\n",
"\n",
"> Note: Wen you **`bake()`** di prepped recipe **`pumpkins_prep`** wit **`new_data = NULL`**, you go fit collect di processed (i.e. encoded) training data. If you get another data set, like test set, and you wan see how di recipe go pre-process am, you go just bake **`pumpkins_prep`** wit **`new_data = test_set`**\n",
"\n",
"## 4. Build linear regression model\n",
"\n",
"<p >\n",
" <img src=\"../../../../../../translated_images/pcm/linear-polynomial.5523c7cb6576ccab.webp\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n"
],
"metadata": {
"id": "YqXjLuWavNxW"
}
},
{
"cell_type": "markdown",
"source": [
"Now wey we don build recipe, and confirm say di data go pre-process well, make we now build regression model to answer di question: `Wetin be di price wey I fit expect for one pumpkin package?`\n",
"\n",
"#### Train linear regression model wit di training set\n",
"\n",
"As you don already sabi, di column *price* na di `outcome` variable, while di *package* column na di `predictor` variable.\n",
"\n",
"To do dis one, we go first split di data so dat 80% go enter training set and 20% go enter test set, then we go define recipe wey go encode di predictor column into set of integers, then build model specification. We no go prep and bake di recipe because we don already sabi say e go preprocess di data as we expect.\n"
],
"metadata": {
"id": "Pq0bSzCevW-h"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"set.seed(2056)\n",
"# Split the data into training and test sets\n",
"pumpkins_split <- new_pumpkins %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"\n",
"# Extract training and test data\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"\n",
"\n",
"# Create a recipe for preprocessing the data\n",
"lm_pumpkins_recipe <- recipe(price ~ package, data = pumpkins_train) %>% \n",
" step_integer(all_predictors(), zero_based = TRUE)\n",
"\n",
"\n",
"\n",
"# Create a linear model specification\n",
"lm_spec <- linear_reg() %>% \n",
" set_engine(\"lm\") %>% \n",
" set_mode(\"regression\")"
],
"outputs": [],
"metadata": {
"id": "CyoEh_wuvcLv"
}
},
{
"cell_type": "markdown",
"source": [
"Good job! Now wey we don get recipe and model specification, we need find way to join dem together for one object wey go first preprocess di data (prep+bake for back), fit di model on top di preprocessed data, and still allow for any post-processing wey fit dey. How dat one sound for your peace of mind!🤩\n",
"\n",
"For Tidymodels, dis kain better object na wetin dem dey call [`workflow`](https://workflows.tidymodels.org/) and e dey hold all your modeling components well well! Na wetin we go call *pipelines* for *Python*.\n",
"\n",
"So make we bundle everything together inside one workflow!📦\n"
],
"metadata": {
"id": "G3zF_3DqviFJ"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Hold modelling components in a workflow\n",
"lm_wf <- workflow() %>% \n",
" add_recipe(lm_pumpkins_recipe) %>% \n",
" add_model(lm_spec)\n",
"\n",
"# Print out the workflow\n",
"lm_wf"
],
"outputs": [],
"metadata": {
"id": "T3olroU3v-WX"
}
},
{
"cell_type": "markdown",
"source": [
"👌 On top, you fit/train workflow same way you fit/train model.\n"
],
"metadata": {
"id": "zd1A5tgOwEPX"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Train the model\n",
"lm_wf_fit <- lm_wf %>% \n",
" fit(data = pumpkins_train)\n",
"\n",
"# Print the model coefficients learned \n",
"lm_wf_fit"
],
"outputs": [],
"metadata": {
"id": "NhJagFumwFHf"
}
},
{
"cell_type": "markdown",
"source": [
"From di model output, we fit see di coefficients wey e learn during training. Dem represent di coefficients of di line of best fit wey go give us di lowest overall error between di real and di predicted variable.\n",
"\n",
"#### Check how di model perform wit di test set\n",
"\n",
"E don reach time to see how di model take perform 📏! How we go take do am?\n",
"\n",
"Now wey we don train di model, we fit use am take make predictions for di test_set wit `parsnip::predict()`. After dat, we go fit compare di predictions to di real label values to check how well (or no well!) di model dey work.\n",
"\n",
"Make we start by making predictions for di test set, then join di columns to di test set.\n"
],
"metadata": {
"id": "_4QkGtBTwItF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make predictions for the test set\n",
"predictions <- lm_wf_fit %>% \n",
" predict(new_data = pumpkins_test)\n",
"\n",
"\n",
"# Bind predictions to the test set\n",
"lm_results <- pumpkins_test %>% \n",
" select(c(package, price)) %>% \n",
" bind_cols(predictions)\n",
"\n",
"\n",
"# Print the first ten rows of the tibble\n",
"lm_results %>% \n",
" slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "UFZzTG0gwTs9"
}
},
{
"cell_type": "markdown",
"source": [
"Yes, you don train model finish and use am take make predictions!🔮 E good? Make we check how di model perform!\n",
"\n",
"For Tidymodels, we fit check dis one wit `yardstick::metrics()`! For linear regression, make we focus on dis metrics:\n",
"\n",
"- `Root Mean Square Error (RMSE)`: Na di square root of di [MSE](https://en.wikipedia.org/wiki/Mean_squared_error). Dis one go give us absolute metric wey dey di same unit as di label (for dis case, na di price of pumpkin). Di smaller di value, di better di model (to put am simple, e mean di average price wey di predictions miss!)\n",
"\n",
"- `Coefficient of Determination (wey people dey call R-squared or R2)`: Na relative metric wey di higher di value, di better di model fit. Dis metric dey show how much di model fit explain di variance between di predicted and actual label values.\n"
],
"metadata": {
"id": "0A5MjzM7wW9M"
}
},
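{
"cell_type": "markdown",
"source": [
"The two metrics described above can be reproduced by hand on a few numbers. This is a minimal sketch using hypothetical `truth` and `estimate` vectors (they are not results from this lesson), and it assumes R-squared is computed as the squared correlation between actual and predicted values:\n",
"\n",
"```r\n",
"# Hypothetical actual and predicted prices, for illustration only\n",
"truth    <- c(15, 18, 24, 30, 32)\n",
"estimate <- c(14, 20, 22, 31, 35)\n",
"\n",
"# RMSE: square root of the mean of the squared prediction errors\n",
"rmse_by_hand <- sqrt(mean((truth - estimate)^2))\n",
"\n",
"# R-squared, taken here as the squared correlation between truth and estimate\n",
"rsq_by_hand <- cor(truth, estimate)^2\n",
"\n",
"rmse_by_hand\n",
"rsq_by_hand\n",
"```\n",
"\n",
"RMSE is expressed in the same unit as the prices, while R-squared is unitless and lies between 0 and 1 under this definition.\n"
],
"metadata": {}
},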
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Evaluate performance of linear regression\n",
"metrics(data = lm_results,\n",
" truth = price,\n",
" estimate = .pred)"
],
"outputs": [],
"metadata": {
"id": "reJ0UIhQwcEH"
}
},
{
"cell_type": "markdown",
"source": [
"Model performance don waka. Make we see if we fit get better idea by drawing scatter plot wey go show package and price, then use the predictions wey we make to put line of best fit on top.\n",
"\n",
"Dis one mean say we go need prepare and process the test set so we fit encode the package column, then join am with the predictions wey our model don make.\n"
],
"metadata": {
"id": "fdgjzjkBwfWt"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Encode package column\n",
"package_encode <- lm_pumpkins_recipe %>% \n",
" prep() %>% \n",
" bake(new_data = pumpkins_test) %>% \n",
" select(package)\n",
"\n",
"\n",
"# Bind encoded package column to the results\n",
"lm_results <- lm_results %>% \n",
" bind_cols(package_encode %>% \n",
" rename(package_integer = package)) %>% \n",
" relocate(package_integer, .after = package)\n",
"\n",
"\n",
"# Print new results data frame\n",
"lm_results %>% \n",
" slice_head(n = 5)\n",
"\n",
"\n",
"# Make a scatter plot\n",
"lm_results %>% \n",
" ggplot(mapping = aes(x = package_integer, y = price)) +\n",
" geom_point(size = 1.6) +\n",
" # Overlay a line of best fit\n",
" geom_line(aes(y = .pred), color = \"orange\", size = 1.2) +\n",
" xlab(\"package\")\n",
" \n"
],
"outputs": [],
"metadata": {
"id": "R0nw719lwkHE"
}
},
{
"cell_type": "markdown",
"source": [
"Great! As you fit see, di linear regression model no really sabi well how di relationship between one package and di price wey follow am be.\n",
"\n",
"🎃 Congrats, you don build model wey fit help predict di price of some types of pumpkins. Your holiday pumpkin patch go fine well well. But you fit still make better model!\n",
"\n",
"## 5. Build polynomial regression model\n",
"\n",
"<p >\n",
" <img src=\"../../../../../../translated_images/pcm/linear-polynomial.5523c7cb6576ccab.webp\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../../../../../translated_images/pcm/linear-polynomial.5523c7cb6576ccab.webp){width=\"800\"}-->\n"
],
"metadata": {
"id": "HOCqJXLTwtWI"
}
},
{
"cell_type": "markdown",
"source": [
"Sometimes our data no go get linear relationship, but we still wan predict wetin go happen. Polynomial regression fit help us predict for more complex non-linear relationships.\n",
"\n",
"Make we look example for the relationship between package and price for our pumpkins data set. Sometimes, e get linear relationship between variables - like say the bigger the pumpkin volume, the higher the price - but sometimes, these relationships no fit show as plane or straight line.\n",
"\n",
"> ✅ Here be [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data wey fit use polynomial regression\n",
">\n",
"> Check again the relationship between Variety and Price for the previous plot. This scatterplot, e sure say e suppose dey analyzed by straight line? Maybe no. For this case, you fit try polynomial regression.\n",
">\n",
"> ✅ Polynomials na mathematical expressions wey fit get one or more variables and coefficients\n",
"\n",
"#### Train polynomial regression model using the training set\n",
"\n",
"Polynomial regression dey create *curved line* wey go fit nonlinear data well well.\n",
"\n",
"Make we see whether polynomial model go perform better for predictions. We go follow similar steps like before:\n",
"\n",
"- Create recipe wey go show the preprocessing steps wey we go do for our data to prepare am for modelling, like encoding predictors and calculating polynomials of degree *n*\n",
"\n",
"- Build model specification\n",
"\n",
"- Combine the recipe and model specification inside one workflow\n",
"\n",
"- Create model by fitting the workflow\n",
"\n",
"- Check how well the model dey perform for the test data\n",
"\n",
"Make we start!\n"
],
"metadata": {
"id": "VcEIpRV9wzYr"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Specify a recipe\r\n",
"poly_pumpkins_recipe <-\r\n",
" recipe(price ~ package, data = pumpkins_train) %>%\r\n",
" step_integer(all_predictors(), zero_based = TRUE) %>% \r\n",
" step_poly(all_predictors(), degree = 4)\r\n",
"\r\n",
"\r\n",
"# Create a model specification\r\n",
"poly_spec <- linear_reg() %>% \r\n",
" set_engine(\"lm\") %>% \r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Bundle recipe and model spec into a workflow\r\n",
"poly_wf <- workflow() %>% \r\n",
" add_recipe(poly_pumpkins_recipe) %>% \r\n",
" add_model(poly_spec)\r\n",
"\r\n",
"\r\n",
"# Create a model\r\n",
"poly_wf_fit <- poly_wf %>% \r\n",
" fit(data = pumpkins_train)\r\n",
"\r\n",
"\r\n",
"# Print learned model coefficients\r\n",
"poly_wf_fit\r\n",
"\r\n",
" "
],
"outputs": [],
"metadata": {
"id": "63n_YyRXw3CC"
}
},
{
"cell_type": "markdown",
"source": [
"#### Check how model dey perform\n",
"\n",
"👏👏You don build polynomial model, make we use am predict for di test set!\n"
],
"metadata": {
"id": "-LHZtztSxDP0"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make price predictions on test data\r\n",
"poly_results <- poly_wf_fit %>% predict(new_data = pumpkins_test) %>% \r\n",
" bind_cols(pumpkins_test %>% select(c(package, price))) %>% \r\n",
" relocate(.pred, .after = last_col())\r\n",
"\r\n",
"\r\n",
"# Print the results\r\n",
"poly_results %>% \r\n",
" slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "YUFpQ_dKxJGx"
}
},
{
"cell_type": "markdown",
"source": [
"Woo-hoo, make we check how di model take perform for di test_set using `yardstick::metrics()`.\n"
],
"metadata": {
"id": "qxdyj86bxNGZ"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"metrics(data = poly_results, truth = price, estimate = .pred)"
],
"outputs": [],
"metadata": {
"id": "8AW5ltkBxXDm"
}
},
{
"cell_type": "markdown",
"source": [
"🤩🤩 Performance don beta well well.\n",
"\n",
"Di `rmse` don drop from like 7 go reach like 3. Dis one mean say di mistake wey dey between di real price and di predicted price don reduce. You fit *roughly* talk say on average, di wrong predictions dey miss di real price by around \\$3. Di `rsq` sef don increase from like 0.4 go reach 0.8.\n",
"\n",
"All dis metrics dey show say di polynomial model dey perform pass di linear model. Nice one!\n",
"\n",
"Make we see if we fit show am for graph!\n"
],
"metadata": {
"id": "6gLHNZDwxYaS"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Bind encoded package column to the results\r\n",
"poly_results <- poly_results %>% \r\n",
" bind_cols(package_encode %>% \r\n",
" rename(package_integer = package)) %>% \r\n",
" relocate(package_integer, .after = package)\r\n",
"\r\n",
"\r\n",
"# Print new results data frame\r\n",
"poly_results %>% \r\n",
" slice_head(n = 5)\r\n",
"\r\n",
"\r\n",
"# Make a scatter plot\r\n",
"poly_results %>% \r\n",
" ggplot(mapping = aes(x = package_integer, y = price)) +\r\n",
" geom_point(size = 1.6) +\r\n",
" # Overlay a line of best fit\r\n",
" geom_line(aes(y = .pred), color = \"midnightblue\", size = 1.2) +\r\n",
" xlab(\"package\")\r\n"
],
"outputs": [],
"metadata": {
"id": "A83U16frxdF1"
}
},
{
"cell_type": "markdown",
"source": [
"You fit see one curved line wey match your data well well! 🤩\n",
"\n",
"You fit make am more smooth if you pass polynomial formula give `geom_smooth` like dis:\n"
],
"metadata": {
"id": "4U-7aHOVxlGU"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a scatter plot\r\n",
"poly_results %>% \r\n",
" ggplot(mapping = aes(x = package_integer, y = price)) +\r\n",
" geom_point(size = 1.6) +\r\n",
" # Overlay a line of best fit\r\n",
" geom_smooth(method = lm, formula = y ~ poly(x, degree = 4), color = \"midnightblue\", size = 1.2, se = FALSE) +\r\n",
" xlab(\"package\")"
],
"outputs": [],
"metadata": {
"id": "5vzNT0Uexm-w"
}
},
{
"cell_type": "markdown",
"source": [
"E be like smooth curve!🤩\n",
"\n",
"Na so you go take make new prediction:\n"
],
"metadata": {
"id": "v9u-wwyLxq4G"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a hypothetical data frame\r\n",
"hypo_tibble <- tibble(package = \"bushel baskets\")\r\n",
"\r\n",
"# Make predictions using linear model\r\n",
"lm_pred <- lm_wf_fit %>% predict(new_data = hypo_tibble)\r\n",
"\r\n",
"# Make predictions using polynomial model\r\n",
"poly_pred <- poly_wf_fit %>% predict(new_data = hypo_tibble)\r\n",
"\r\n",
"# Return predictions in a list\r\n",
"list(\"linear model prediction\" = lm_pred, \r\n",
" \"polynomial model prediction\" = poly_pred)\r\n"
],
"outputs": [],
"metadata": {
"id": "jRPSyfQGxuQv"
}
},
{
"cell_type": "markdown",
"source": [
"Di `polynomial model` prediction make sense wella, as we see for di scatter plots of `price` and `package`! And, if dis model beta pass di one wey we do before, based on di same data, you go need plan well for di more expensive pumpkins!\n",
"\n",
"🏆 Well done! You don make two regression models for one lesson. For di final part of regression, you go learn about logistic regression to fit categories.\n",
"\n",
"## **🚀Challenge**\n",
"\n",
"Try test different variables for dis notebook to see how di correlation go match di model accuracy.\n",
"\n",
"## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/14/)\n",
"\n",
"## **Review & Self Study**\n",
"\n",
"For dis lesson, we learn about Linear Regression. But e get other important types of Regression. Go read about Stepwise, Ridge, Lasso and Elasticnet techniques. One better course wey you fit study to sabi more na di [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).\n",
"\n",
"If you wan sabi more about how to use di amazing Tidymodels framework, check dis resources:\n",
"\n",
"- Tidymodels website: [Get started with Tidymodels](https://www.tidymodels.org/start/)\n",
"\n",
"- Max Kuhn and Julia Silge, [*Tidy Modeling with R*](https://www.tmwr.org/)*.*\n",
"\n",
"###### **THANK YOU TO:**\n",
"\n",
"[Allison Horst](https://twitter.com/allison_horst?lang=en) for di amazing illustrations wey make R dey more friendly and fun. You fit see more of her illustrations for her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n"
],
"metadata": {
"id": "8zOLOWqMxzk5"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n\n<!-- CO-OP TRANSLATOR DISCLAIMER START -->\n**Disclaimer**: \nDis dokyument don use AI transle-shun service [Co-op Translator](https://github.com/Azure/co-op-translator) do di transle-shun. Even as we dey try make am correct, abeg make you sabi say machine transle-shun fit get mistake or no dey accurate well. Di original dokyument wey dey for im native language na di one wey you go take as di correct source. For important mata, e good make you use professional human transle-shun. We no go fit take blame for any misunderstanding or wrong interpretation wey fit happen because you use dis transle-shun.\n<!-- CO-OP TRANSLATOR DISCLAIMER END -->\n"
]
}
]
}