{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_3-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "5015d65d61ba75a223bfc56c273aa174", "translation_date": "2025-09-06T15:29:43+00:00", "source_file": "2-Regression/3-Linear/solution/R/lesson_3-R.ipynb", "language_code": "en" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Build a regression model: linear and polynomial regression models\n" ], "metadata": { "id": "EgQw8osnsUV-" } }, { "cell_type": "markdown", "source": [ "## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3\n", "
Infographic by Dasani Madipalli
\n", "\n", "\n", "\n", "\n", "#### Introduction\n", "\n", "Up to this point, you’ve explored what regression is using sample data from the pumpkin pricing dataset that we’ll be working with throughout this lesson. You’ve also visualized the data using `ggplot2`. 💪\n", "\n", "Now, you’re ready to dive deeper into regression for machine learning. In this lesson, you’ll learn more about two types of regression: *basic linear regression* and *polynomial regression*, along with some of the mathematics behind these techniques.\n", "\n", "> Throughout this curriculum, we assume minimal math knowledge and aim to make the content accessible to students from various fields. Look out for notes, 🧮 math callouts, diagrams, and other learning tools to help with understanding.\n", "\n", "#### Preparation\n", "\n", "As a reminder, you’re working with this data to answer specific questions:\n", "\n", "- When is the best time to buy pumpkins?\n", "\n", "- What price can I expect for a case of miniature pumpkins?\n", "\n", "- Should I buy them in half-bushel baskets or in 1 1/9 bushel boxes? Let’s keep exploring this data.\n", "\n", "In the previous lesson, you created a `tibble` (a modern reimagining of the data frame) and filled it with part of the original dataset, standardizing the pricing by the bushel. However, by doing so, you were only able to gather about 400 data points, and only for the fall months. Could we uncover more details about the data by cleaning it further? Let’s find out... 🕵️‍♀️\n", "\n", "For this task, we’ll need the following packages:\n", "\n", "- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more enjoyable!\n", "\n", "- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n", "\n", "- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple tools for examining and cleaning messy data.\n", "\n", "- `corrplot`: The [corrplot package](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) offers a visual exploratory tool for correlation matrices, with automatic variable reordering to help detect hidden patterns among variables.\n", "\n", "You can install them using the following command:\n", "\n", "`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"corrplot\"))`\n", "\n", "The script below checks whether you have the required packages for this module and installs them for you if they’re missing.\n" ], "metadata": { "id": "WqQPS1OAsg3H" } }, { "cell_type": "code", "execution_count": null, "source": [ "suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n", "\n", "pacman::p_load(tidyverse, tidymodels, janitor, corrplot)" ], "outputs": [], "metadata": { "id": "tA4C2WN3skCf", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c06cd805-5534-4edc-f72b-d0d1dab96ac0" } }, { "cell_type": "markdown", "source": [ "We'll later load these awesome packages and make them available in our current R session. (This is just for illustration purposes, `pacman::p_load()` already does this for you)\n", "\n", "## 1. A linear regression line\n", "\n", "As you learned in Lesson 1, the goal of a linear regression exercise is to plot a *line of best fit* to:\n", "\n", "- **Show variable relationships**. 
Illustrate the relationship between variables.\n", "\n", "- **Make predictions**. Accurately predict where a new data point might fall in relation to that line.\n", "\n", "To draw this type of line, we use a statistical technique called **Least-Squares Regression**. The term `least-squares` means that all the data points surrounding the regression line are squared and then summed. Ideally, this final sum is as small as possible, because we want a low number of errors, or `least-squares`. Therefore, the line of best fit is the line that minimizes the sum of the squared errors—hence the name *least squares regression*.\n", "\n", "We do this because we want to model a line that has the smallest cumulative distance from all of our data points. The terms are squared before summing because we care about the magnitude of the errors, not their direction.\n", "\n", "> **🧮 Show me the math**\n", ">\n", "> This line, called the *line of best fit*, can be expressed by [an equation](https://en.wikipedia.org/wiki/Simple_linear_regression):\n", ">\n", "> Y = a + bX\n", ">\n", "> `X` is the '`explanatory variable` or `predictor`'. `Y` is the '`dependent variable` or `outcome`'. The slope of the line is `b`, and `a` is the y-intercept, which represents the value of `Y` when `X = 0`.\n", ">\n", "\n", "> ![](../../../../../../2-Regression/3-Linear/solution/images/slope.png \"slope = $y/x$\")\n", " Infographic by Jen Looper\n", ">\n", "> First, calculate the slope `b`.\n", ">\n", "> In other words, referring to our pumpkin data's original question: \"predict the price of a pumpkin per bushel by month,\" `X` would represent the price, and `Y` would represent the month of sale.\n", ">\n", "> ![](../../../../../../translated_images/calculation.989aa7822020d9d0ba9fc781f1ab5192f3421be86ebb88026528aef33c37b0d8.en.png)\n", " Infographic by Jen Looper\n", "> \n", "> Calculate the value of Y. If you're paying around \\$4, it must be April!\n", ">\n", "> The math used to calculate the line must account for the slope of the line, which also depends on the intercept, or where `Y` is positioned when `X = 0`.\n", ">\n", "> You can see the method for calculating these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) website. Also, check out [this Least-Squares Calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to see how the values affect the line.\n", "\n", "Not so intimidating, right? 🤓\n", "\n", "#### Correlation\n", "\n", "Another important term to understand is the **Correlation Coefficient** between the given X and Y variables. Using a scatterplot, you can quickly visualize this coefficient. A plot where the data points form a neat line has high correlation, while a plot where the data points are scattered randomly between X and Y has low correlation.\n", "\n", "A good linear regression model will have a high Correlation Coefficient (closer to 1 than 0) when using the Least-Squares Regression method with a line of regression.\n" ], "metadata": { "id": "cdX5FRpvsoP5" } }, { "cell_type": "markdown", "source": [ "## **2. A dance with data: creating a data frame that will be used for modeling**\n", "\n", "
Artwork by @allison_horst
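
> Optional aside: before we dance with the data, here is a minimal sketch of the least-squares arithmetic from Section 1, run on a few made-up points (not the pumpkin data). The slope is `b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²`, the intercept is `a = ȳ − b·x̄`, and base R's `lm()` should recover the same coefficients.

```r
# A minimal least-squares sketch on made-up numbers (not the pumpkin data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope: b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: a = mean(y) - b * mean(x)
a <- mean(y) - b * mean(x)

# lm() should agree with the hand computation
c(intercept = a, slope = b)
coef(lm(y ~ x))
```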
\n", "\n", "\n", "\n" ], "metadata": { "id": "WdUKXk7Bs8-V" } }, { "cell_type": "markdown", "source": [ "Load the necessary libraries and dataset. Transform the data into a data frame containing a subset of the data:\n", "\n", "- Only include pumpkins priced by the bushel.\n", "\n", "- Convert the date into a month.\n", "\n", "- Calculate the price as the average of the high and low prices.\n", "\n", "- Adjust the price to reflect the cost per bushel quantity.\n", "\n", "> These steps were covered in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/2-Data/solution/lesson_2-R.ipynb).\n" ], "metadata": { "id": "fMCtu2G2s-p8" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Load the core Tidyverse packages\n", "library(tidyverse)\n", "library(lubridate)\n", "\n", "# Import the pumpkins data\n", "pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n", "\n", "\n", "# Get a glimpse and dimensions of the data\n", "glimpse(pumpkins)\n", "\n", "\n", "# Print the first 50 rows of the data set\n", "pumpkins %>% \n", " slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "ryMVZEEPtERn" } }, { "cell_type": "markdown", "source": [ "In the spirit of sheer adventure, let's explore the [`janitor package`](../../../../../../2-Regression/3-Linear/solution/R/github.com/sfirke/janitor) that provides simple functions for examining and cleaning dirty data. For instance, let's take a look at the column names for our data:\n" ], "metadata": { "id": "xcNxM70EtJjb" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Return column names\n", "pumpkins %>% \n", " names()" ], "outputs": [], "metadata": { "id": "5XtpaIigtPfW" } }, { "cell_type": "markdown", "source": [ "🤔 We can do better. Let's make these column names `friendR` by converting them to the [snake_case](https://en.wikipedia.org/wiki/Snake_case) convention using `janitor::clean_names`. To find out more about this function: `?clean_names`\n" ], "metadata": { "id": "IbIqrMINtSHe" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Clean names to the snake_case convention\n", "pumpkins <- pumpkins %>% \n", " clean_names(case = \"snake\")\n", "\n", "# Return column names\n", "pumpkins %>% \n", " names()" ], "outputs": [], "metadata": { "id": "a2uYvclYtWvX" } }, { "cell_type": "markdown", "source": [ "Much tidyR 🧹! Now, let's dance with the data using `dplyr` just like in the previous lesson! 
💃\n" ], "metadata": { "id": "HfhnuzDDtaDd" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Select desired columns\n", "pumpkins <- pumpkins %>% \n", " select(variety, city_name, package, low_price, high_price, date)\n", "\n", "\n", "\n", "# Extract the month from the dates to a new column\n", "pumpkins <- pumpkins %>%\n", " mutate(date = mdy(date),\n", " month = month(date)) %>% \n", " select(-date)\n", "\n", "\n", "\n", "# Create a new column for average Price\n", "pumpkins <- pumpkins %>% \n", " mutate(price = (low_price + high_price)/2)\n", "\n", "\n", "# Retain only pumpkins with the string \"bushel\"\n", "new_pumpkins <- pumpkins %>% \n", " filter(str_detect(string = package, pattern = \"bushel\"))\n", "\n", "\n", "# Normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel\n", "new_pumpkins <- new_pumpkins %>% \n", " mutate(price = case_when(\n", " str_detect(package, \"1 1/9\") ~ price/(1.1),\n", " str_detect(package, \"1/2\") ~ price*2,\n", " TRUE ~ price))\n", "\n", "# Relocate column positions\n", "new_pumpkins <- new_pumpkins %>% \n", " relocate(month, .before = variety)\n", "\n", "\n", "# Display the first 5 rows\n", "new_pumpkins %>% \n", " slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "X0wU3gQvtd9f" } }, { "cell_type": "markdown", "source": [ "Good job!👌 You now have a clean, organized dataset ready to build your new regression model!\n", "\n", "How about a scatter plot?\n" ], "metadata": { "id": "UpaIwaxqth82" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Set theme\n", "theme_set(theme_light())\n", "\n", "# Make a scatter plot of month and price\n", "new_pumpkins %>% \n", " ggplot(mapping = aes(x = month, y = price)) +\n", " geom_point(size = 1.6)\n" ], "outputs": [], "metadata": { "id": "DXgU-j37tl5K" } }, { "cell_type": "markdown", "source": [ "A scatter plot shows that we only have data for the months from August to December. We likely need more data to make conclusions in a linear manner.\n", "\n", "Let's revisit our modeling data:\n" ], "metadata": { "id": "Ve64wVbwtobI" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Display first 5 rows\n", "new_pumpkins %>% \n", " slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "HFQX2ng1tuSJ" } }, { "cell_type": "markdown", "source": [ "What if we wanted to predict the `price` of a pumpkin based on the `city` or `package` columns, which are character types? Or, even more simply, how could we determine the correlation (which requires both inputs to be numeric) between, for example, `package` and `price`? 🤷🤷\n", "\n", "Machine learning models perform best with numeric features rather than text values, so it's usually necessary to convert categorical features into numeric representations.\n", "\n", "This means we need to find a way to reformat our predictors to make them more suitable for a model to use effectively—a process known as `feature engineering`.\n" ], "metadata": { "id": "7hsHoxsStyjJ" } }, { "cell_type": "markdown", "source": [ "## 3. Preprocessing data for modeling with recipes 👩‍🍳👨‍🍳\n", "\n", "Activities that reshape predictor values to make them more suitable for a model are often referred to as `feature engineering`.\n", "\n", "Different models have different preprocessing needs. For example, least squares requires `encoding categorical variables` like month, variety, and city_name. 
This involves `converting` a column with `categorical values` into one or more `numeric columns` that replace the original.\n", "\n", "For instance, imagine your data includes the following categorical feature:\n", "\n", "| city |\n", "|:-------:|\n", "| Denver |\n", "| Nairobi |\n", "| Tokyo |\n", "\n", "You can use *ordinal encoding* to replace each category with a unique integer value, like this:\n", "\n", "| city |\n", "|:----:|\n", "| 0 |\n", "| 1 |\n", "| 2 |\n", "\n", "And that's exactly what we'll do with our data!\n", "\n", "In this section, we'll dive into another fantastic Tidymodels package: [recipes](https://tidymodels.github.io/recipes/) - designed to help you preprocess your data **before** training your model. At its core, a recipe is an object that specifies the steps to be applied to a dataset to prepare it for modeling.\n", "\n", "Now, let's create a recipe that gets our data ready for modeling by replacing all the observations in the predictor columns with unique integers:\n" ], "metadata": { "id": "AD5kQbcvt3Xl" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Specify a recipe\n", "pumpkins_recipe <- recipe(price ~ ., data = new_pumpkins) %>% \n", " step_integer(all_predictors(), zero_based = TRUE)\n", "\n", "\n", "# Print out the recipe\n", "pumpkins_recipe" ], "outputs": [], "metadata": { "id": "BNaFKXfRt9TU" } }, { "cell_type": "markdown", "source": [ "Awesome! 👏 We just created our first recipe that defines an outcome (price) and its corresponding predictors, ensuring that all predictor columns are encoded as integers 🙌! Let's break it down step by step:\n", "\n", "- The `recipe()` function, combined with a formula, assigns *roles* to the variables using the `new_pumpkins` dataset as a reference. For example, the `price` column is designated as the `outcome`, while the other columns are assigned the `predictor` role.\n", "\n", "- `step_integer(all_predictors(), zero_based = TRUE)` specifies that all predictors should be converted into integers, starting the numbering at 0.\n", "\n", "You might be thinking: \"This is so cool!! But what if I want to make sure the recipes are doing exactly what I expect them to do? 🤔\"\n", "\n", "That's a great question! Once your recipe is defined, you can calculate the parameters needed to preprocess the data and then extract the processed data. While this step isn't typically necessary when using Tidymodels (we'll explore the standard approach soon—via `workflows`), it can be useful for performing a sanity check to ensure the recipes are functioning as intended.\n", "\n", "To do this, you'll need two additional functions: `prep()` and `bake()`. And, as always, our friendly R illustrations by [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) make it easier to understand!\n", "\n", "
Artwork by @allison_horst
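
> Optional aside: the encoding idea behind `step_integer(all_predictors(), zero_based = TRUE)` can be sketched in base R with a made-up `city` column (the exact level ordering `step_integer()` assigns may differ, so treat this as an illustration of the concept only):

```r
# Hypothetical categorical column, made up for illustration
city <- c("Denver", "Nairobi", "Tokyo", "Denver")

# Map each level to an integer; subtract 1 so the encoding starts at 0
as.integer(factor(city)) - 1
#> 0 1 2 0  (factor levels sort alphabetically by default)
```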
\n" ], "metadata": { "id": "KEiO0v7kuC9O" } }, { "cell_type": "markdown", "source": [ "[`prep()`](https://recipes.tidymodels.org/reference/prep.html): calculates the necessary parameters from a training dataset, which can then be used on other datasets. For example, for a specific predictor column, it determines which observation will be assigned integer values like 0, 1, 2, and so on.\n", "\n", "[`bake()`](https://recipes.tidymodels.org/reference/bake.html): uses a prepped recipe to apply the transformations to any dataset.\n", "\n", "With that in mind, let's prep and bake our recipes to confirm that, behind the scenes, the predictor columns will first be encoded before fitting a model.\n" ], "metadata": { "id": "Q1xtzebuuTCP" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Prep the recipe\n", "pumpkins_prep <- prep(pumpkins_recipe)\n", "\n", "# Bake the recipe to extract a preprocessed new_pumpkins data\n", "baked_pumpkins <- bake(pumpkins_prep, new_data = NULL)\n", "\n", "# Print out the baked data set\n", "baked_pumpkins %>% \n", " slice_head(n = 10)" ], "outputs": [], "metadata": { "id": "FGBbJbP_uUUn" } }, { "cell_type": "markdown", "source": [ "Woo-hoo! 🥳 The processed data `baked_pumpkins` has all its predictors encoded, confirming that the preprocessing steps defined in our recipe are working as expected. While this might make it harder for you to read, it becomes much more understandable for Tidymodels! Take a moment to identify which observation has been mapped to its corresponding integer.\n", "\n", "It's also worth noting that `baked_pumpkins` is a data frame that we can use for computations.\n", "\n", "For example, let's try to identify a strong correlation between two points in your data to potentially build a solid predictive model. We'll use the function `cor()` for this. Type `?cor()` to learn more about the function.\n" ], "metadata": { "id": "1dvP0LBUueAW" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Find the correlation between the city_name and the price\n", "cor(baked_pumpkins$city_name, baked_pumpkins$price)\n", "\n", "# Find the correlation between the package and the price\n", "cor(baked_pumpkins$package, baked_pumpkins$price)\n" ], "outputs": [], "metadata": { "id": "3bQzXCjFuiSV" } }, { "cell_type": "markdown", "source": [ "As it turns out, there's only a weak correlation between the City and Price. However, there's a slightly stronger correlation between the Package and its Price. That makes sense, right? 
Typically, the larger the produce box, the higher the price.\n", "\n", "While we're at it, let's also try visualizing a correlation matrix of all the columns using the `corrplot` package.\n" ], "metadata": { "id": "BToPWbgjuoZw" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Load the corrplot package\n", "library(corrplot)\n", "\n", "# Obtain correlation matrix\n", "corr_mat <- cor(baked_pumpkins %>% \n", " # Drop columns that are not really informative\n", " select(-c(low_price, high_price)))\n", "\n", "# Make a correlation plot between the variables\n", "corrplot(corr_mat, method = \"shade\", shade.col = NA, tl.col = \"black\", tl.srt = 45, addCoef.col = \"black\", cl.pos = \"n\", order = \"original\")" ], "outputs": [], "metadata": { "id": "ZwAL3ksmutVR" } }, { "cell_type": "markdown", "source": [ "🤩🤩 Much better.\n", "\n", "A good question to ask now about this data is: '`What price can I expect for a given pumpkin package?`' Let's dive right in!\n", "\n", "> Note: When you **`bake()`** the prepared recipe **`pumpkins_prep`** with **`new_data = NULL`**, you retrieve the processed (i.e., encoded) training data. If you had another dataset, such as a test set, and wanted to see how the recipe would pre-process it, you would simply bake **`pumpkins_prep`** with **`new_data = test_set`**.\n", "\n", "## 4. Build a linear regression model\n", "\n", "
Infographic by Dasani Madipalli
\n" ], "metadata": { "id": "YqXjLuWavNxW" } }, { "cell_type": "markdown", "source": [ "Now that we have created a recipe and confirmed that the data will be pre-processed correctly, let's proceed to build a regression model to answer the question: `What price can I expect for a given pumpkin package?`\n", "\n", "#### Train a linear regression model using the training set\n", "\n", "As you may have already noticed, the column *price* is the `outcome` variable, while the *package* column is the `predictor` variable.\n", "\n", "To achieve this, we'll first split the data so that 80% is allocated to the training set and 20% to the test set. Then, we'll define a recipe that encodes the predictor column into a set of integers, followed by building a model specification. We won't prep and bake our recipe since we already know it will preprocess the data as intended.\n" ], "metadata": { "id": "Pq0bSzCevW-h" } }, { "cell_type": "code", "execution_count": null, "source": [ "set.seed(2056)\n", "# Split the data into training and test sets\n", "pumpkins_split <- new_pumpkins %>% \n", " initial_split(prop = 0.8)\n", "\n", "\n", "# Extract training and test data\n", "pumpkins_train <- training(pumpkins_split)\n", "pumpkins_test <- testing(pumpkins_split)\n", "\n", "\n", "\n", "# Create a recipe for preprocessing the data\n", "lm_pumpkins_recipe <- recipe(price ~ package, data = pumpkins_train) %>% \n", " step_integer(all_predictors(), zero_based = TRUE)\n", "\n", "\n", "\n", "# Create a linear model specification\n", "lm_spec <- linear_reg() %>% \n", " set_engine(\"lm\") %>% \n", " set_mode(\"regression\")" ], "outputs": [], "metadata": { "id": "CyoEh_wuvcLv" } }, { "cell_type": "markdown", "source": [ "Great job! Now that we have a recipe and a model specification, we need a way to combine them into a single object that will preprocess the data (prep + bake behind the scenes), fit the model on the preprocessed data, and even allow for potential post-processing steps. Sounds reassuring, right? 🤩\n", "\n", "In Tidymodels, this handy object is called a [`workflow`](https://workflows.tidymodels.org/), and it conveniently organizes all your modeling components! This is similar to what we’d refer to as *pipelines* in *Python*.\n", "\n", "So, let’s package everything into a workflow! 📦\n" ], "metadata": { "id": "G3zF_3DqviFJ" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Hold modelling components in a workflow\n", "lm_wf <- workflow() %>% \n", " add_recipe(lm_pumpkins_recipe) %>% \n", " add_model(lm_spec)\n", "\n", "# Print out the workflow\n", "lm_wf" ], "outputs": [], "metadata": { "id": "T3olroU3v-WX" } }, { "cell_type": "markdown", "source": [ "In addition, a workflow can be adjusted or trained in a similar way to how a model is.\n" ], "metadata": { "id": "zd1A5tgOwEPX" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Train the model\n", "lm_wf_fit <- lm_wf %>% \n", " fit(data = pumpkins_train)\n", "\n", "# Print the model coefficients learned \n", "lm_wf_fit" ], "outputs": [], "metadata": { "id": "NhJagFumwFHf" } }, { "cell_type": "markdown", "source": [ "From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that minimizes the overall error between the actual and predicted variable.\n", "\n", "#### Evaluate model performance using the test set\n", "\n", "It's time to check how the model performed 📏! 
How do we do this?\n", "\n", "Now that we've trained the model, we can use it to make predictions for the test_set using `parsnip::predict()`. Afterward, we can compare these predictions to the actual label values to assess how well (or poorly!) the model is functioning.\n", "\n", "Let's begin by making predictions for the test set and then binding the columns to the test set.\n" ], "metadata": { "id": "_4QkGtBTwItF" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Make predictions for the test set\n", "predictions <- lm_wf_fit %>% \n", " predict(new_data = pumpkins_test)\n", "\n", "\n", "# Bind predictions to the test set\n", "lm_results <- pumpkins_test %>% \n", " select(c(package, price)) %>% \n", " bind_cols(predictions)\n", "\n", "\n", "# Print the first ten rows of the tibble\n", "lm_results %>% \n", " slice_head(n = 10)" ], "outputs": [], "metadata": { "id": "UFZzTG0gwTs9" } }, { "cell_type": "markdown", "source": [ "Yes, you’ve just trained a model and used it to make predictions! 🔮 How good is it? Let’s evaluate the model’s performance!\n", "\n", "In Tidymodels, we do this using `yardstick::metrics()`! For linear regression, let’s focus on the following metrics:\n", "\n", "- `Root Mean Square Error (RMSE)`: The square root of the [MSE](https://en.wikipedia.org/wiki/Mean_squared_error). This gives an absolute metric in the same unit as the target variable (in this case, the price of a pumpkin). The smaller the value, the better the model (in simple terms, it represents the average amount by which the predictions are off!).\n", "\n", "- `Coefficient of Determination (commonly known as R-squared or R2)`: A relative metric where a higher value indicates a better fit of the model. Essentially, this metric shows how much of the variance between the predicted and actual target values the model is able to explain.\n" ], "metadata": { "id": "0A5MjzM7wW9M" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Evaluate performance of linear regression\n", "metrics(data = lm_results,\n", " truth = price,\n", " estimate = .pred)" ], "outputs": [], "metadata": { "id": "reJ0UIhQwcEH" } }, { "cell_type": "markdown", "source": [ "There goes the model performance. Let's see if we can get a clearer picture by visualizing a scatter plot of the package and price, then overlaying a line of best fit using the predictions.\n", "\n", "To do this, we'll need to prepare and process the test set to encode the package column, and then combine it with the predictions generated by our model.\n" ], "metadata": { "id": "fdgjzjkBwfWt" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Encode package column\n", "package_encode <- lm_pumpkins_recipe %>% \n", " prep() %>% \n", " bake(new_data = pumpkins_test) %>% \n", " select(package)\n", "\n", "\n", "# Bind encoded package column to the results\n", "lm_results <- lm_results %>% \n", " bind_cols(package_encode %>% \n", " rename(package_integer = package)) %>% \n", " relocate(package_integer, .after = package)\n", "\n", "\n", "# Print new results data frame\n", "lm_results %>% \n", " slice_head(n = 5)\n", "\n", "\n", "# Make a scatter plot\n", "lm_results %>% \n", " ggplot(mapping = aes(x = package_integer, y = price)) +\n", " geom_point(size = 1.6) +\n", " # Overlay a line of best fit\n", " geom_line(aes(y = .pred), color = \"orange\", size = 1.2) +\n", " xlab(\"package\")\n", " \n" ], "outputs": [], "metadata": { "id": "R0nw719lwkHE" } }, { "cell_type": "markdown", "source": [ "Great! 
As you can see, the linear regression model doesn't do a very good job of capturing the relationship between a package and its corresponding price.\n", "\n", "🎃 Congratulations, you've just created a model that can help predict the price of several types of pumpkins. Your holiday pumpkin patch will look amazing! But you can probably build an even better model.\n", "\n", "## 5. Build a polynomial regression model\n", "\n", "
Infographic by Dasani Madipalli
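
> Optional aside: we'll score this model with `yardstick::metrics()` at the end of this section, just as we did for the linear model. As a rough sanity check, here is what its `rmse` and `rsq` values compute, shown on made-up truth/estimate numbers (with real results you'd use `price` and `.pred`):

```r
# Made-up actual and predicted values, for illustration only
truth    <- c(10, 12, 9, 15)
estimate <- c(11, 11, 10, 14)

# RMSE: the square root of the mean squared error
sqrt(mean((truth - estimate)^2))

# yardstick::rsq() reports the squared correlation between truth and estimate
cor(truth, estimate)^2
```

> (The "1 − SSres/SStot" variant of R-squared is available separately as `yardstick::rsq_trad()`.)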
\n", "\n", "\n", "\n" ], "metadata": { "id": "HOCqJXLTwtWI" } }, { "cell_type": "markdown", "source": [ "Sometimes our data might not follow a linear pattern, but we still want to predict an outcome. Polynomial regression can help us make predictions for more complex, non-linear relationships.\n", "\n", "For example, consider the relationship between package size and price in our pumpkin dataset. While there might sometimes be a linear relationship between variables—like the larger the pumpkin's volume, the higher the price—there are cases where these relationships can't be represented by a plane or straight line.\n", "\n", "> ✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could benefit from polynomial regression.\n", ">\n", "> Take another look at the relationship between Variety and Price in the previous plot. Does this scatterplot seem like it should necessarily be analyzed with a straight line? Perhaps not. In such cases, you can try polynomial regression.\n", ">\n", "> ✅ Polynomials are mathematical expressions that may include one or more variables and coefficients.\n", "\n", "#### Train a polynomial regression model using the training set\n", "\n", "Polynomial regression creates a *curved line* to better fit non-linear data.\n", "\n", "Let's explore whether a polynomial model can perform better in making predictions. We'll follow a process similar to what we did earlier:\n", "\n", "- Create a recipe that outlines the preprocessing steps needed to prepare our data for modeling, such as encoding predictors and calculating polynomials of degree *n*.\n", "\n", "- Define a model specification.\n", "\n", "- Combine the recipe and model specification into a workflow.\n", "\n", "- Train the model by fitting the workflow.\n", "\n", "- Assess how well the model performs on the test data.\n", "\n", "Let's dive in!\n" ], "metadata": { "id": "VcEIpRV9wzYr" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Specify a recipe\r\n", "poly_pumpkins_recipe <-\r\n", " recipe(price ~ package, data = pumpkins_train) %>%\r\n", " step_integer(all_predictors(), zero_based = TRUE) %>% \r\n", " step_poly(all_predictors(), degree = 4)\r\n", "\r\n", "\r\n", "# Create a model specification\r\n", "poly_spec <- linear_reg() %>% \r\n", " set_engine(\"lm\") %>% \r\n", " set_mode(\"regression\")\r\n", "\r\n", "\r\n", "# Bundle recipe and model spec into a workflow\r\n", "poly_wf <- workflow() %>% \r\n", " add_recipe(poly_pumpkins_recipe) %>% \r\n", " add_model(poly_spec)\r\n", "\r\n", "\r\n", "# Create a model\r\n", "poly_wf_fit <- poly_wf %>% \r\n", " fit(data = pumpkins_train)\r\n", "\r\n", "\r\n", "# Print learned model coefficients\r\n", "poly_wf_fit\r\n", "\r\n", " " ], "outputs": [], "metadata": { "id": "63n_YyRXw3CC" } }, { "cell_type": "markdown", "source": [ "#### Evaluate model performance\n", "\n", "👏👏You've created a polynomial model—let's make predictions on the test set!\n" ], "metadata": { "id": "-LHZtztSxDP0" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Make price predictions on test data\r\n", "poly_results <- poly_wf_fit %>% predict(new_data = pumpkins_test) %>% \r\n", " bind_cols(pumpkins_test %>% select(c(package, price))) %>% \r\n", " relocate(.pred, .after = last_col())\r\n", "\r\n", "\r\n", "# Print the results\r\n", "poly_results %>% \r\n", " slice_head(n = 10)" ], "outputs": [], "metadata": { "id": "YUFpQ_dKxJGx" } }, { "cell_type": "markdown", "source": [ "Woo-hoo, let's evaluate how the model performed 
on the test_set using `yardstick::metrics()`.\n" ], "metadata": { "id": "qxdyj86bxNGZ" } }, { "cell_type": "code", "execution_count": null, "source": [ "metrics(data = poly_results, truth = price, estimate = .pred)" ], "outputs": [], "metadata": { "id": "8AW5ltkBxXDm" } }, { "cell_type": "markdown", "source": [ "🤩🤩 Much better performance.\n", "\n", "The `rmse` dropped from approximately 7 to around 3, indicating a smaller error between the actual price and the predicted price. You can *roughly* interpret this as meaning that, on average, incorrect predictions are off by about $3. The `rsq` improved from roughly 0.4 to 0.8.\n", "\n", "All these metrics show that the polynomial model significantly outperforms the linear model. Great work!\n", "\n", "Let’s see if we can visualize this!\n" ], "metadata": { "id": "6gLHNZDwxYaS" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Bind encoded package column to the results\r\n", "poly_results <- poly_results %>% \r\n", " bind_cols(package_encode %>% \r\n", " rename(package_integer = package)) %>% \r\n", " relocate(package_integer, .after = package)\r\n", "\r\n", "\r\n", "# Print new results data frame\r\n", "poly_results %>% \r\n", " slice_head(n = 5)\r\n", "\r\n", "\r\n", "# Make a scatter plot\r\n", "poly_results %>% \r\n", " ggplot(mapping = aes(x = package_integer, y = price)) +\r\n", " geom_point(size = 1.6) +\r\n", " # Overlay a line of best fit\r\n", " geom_line(aes(y = .pred), color = \"midnightblue\", size = 1.2) +\r\n", " xlab(\"package\")\r\n" ], "outputs": [], "metadata": { "id": "A83U16frxdF1" } }, { "cell_type": "markdown", "source": [ "You can see a curved line that better fits your data! 🤩\n", "\n", "You can make it even smoother by providing a polynomial formula to `geom_smooth` like this:\n" ], "metadata": { "id": "4U-7aHOVxlGU" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Make a scatter plot\r\n", "poly_results %>% \r\n", " ggplot(mapping = aes(x = package_integer, y = price)) +\r\n", " geom_point(size = 1.6) +\r\n", " # Overlay a line of best fit\r\n", " geom_smooth(method = lm, formula = y ~ poly(x, degree = 4), color = \"midnightblue\", size = 1.2, se = FALSE) +\r\n", " xlab(\"package\")" ], "outputs": [], "metadata": { "id": "5vzNT0Uexm-w" } }, { "cell_type": "markdown", "source": [ "Much like a smooth curve!🤩\n", "\n", "Here's how you would make a new prediction:\n" ], "metadata": { "id": "v9u-wwyLxq4G" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Make a hypothetical data frame\r\n", "hypo_tibble <- tibble(package = \"bushel baskets\")\r\n", "\r\n", "# Make predictions using linear model\r\n", "lm_pred <- lm_wf_fit %>% predict(new_data = hypo_tibble)\r\n", "\r\n", "# Make predictions using polynomial model\r\n", "poly_pred <- poly_wf_fit %>% predict(new_data = hypo_tibble)\r\n", "\r\n", "# Return predictions in a list\r\n", "list(\"linear model prediction\" = lm_pred, \r\n", " \"polynomial model prediction\" = poly_pred)\r\n" ], "outputs": [], "metadata": { "id": "jRPSyfQGxuQv" } }, { "cell_type": "markdown", "source": [ "The `polynomial model` prediction makes sense when you look at the scatter plots of `price` and `package`! And, if this model performs better than the previous one based on the same data, you’ll need to plan for these pricier pumpkins!\n", "\n", "🏆 Great job! You’ve built two regression models in one lesson. 
In the final section on regression, you’ll explore logistic regression to classify categories.\n", "\n", "## **🚀Challenge**\n", "\n", "Experiment with different variables in this notebook to see how correlation impacts model accuracy.\n", "\n", "## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/14/)\n", "\n", "## **Review & Self Study**\n", "\n", "In this lesson, we covered Linear Regression. There are other significant types of Regression. Look into Stepwise, Ridge, Lasso, and Elasticnet techniques. A great resource to dive deeper is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).\n", "\n", "If you’re interested in learning more about the powerful Tidymodels framework, check out these resources:\n", "\n", "- Tidymodels website: [Get started with Tidymodels](https://www.tidymodels.org/start/)\n", "\n", "- Max Kuhn and Julia Silge, [*Tidy Modeling with R*](https://www.tmwr.org/)*.*\n", "\n", "###### **THANK YOU TO:**\n", "\n", "[Allison Horst](https://twitter.com/allison_horst?lang=en) for creating the wonderful illustrations that make R more approachable and fun. You can find more of her work in her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n" ], "metadata": { "id": "8zOLOWqMxzk5" } }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.\n" ] } ] }