{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_1-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "c18d3bd0bd8ae3878597e89dcd1fa5c1", "translation_date": "2025-09-06T15:31:44+00:00", "source_file": "2-Regression/1-Tools/solution/R/lesson_1-R.ipynb", "language_code": "en" } }, "cells": [ { "cell_type": "markdown", "source": [], "metadata": { "id": "YJUHCXqK57yz" } }, { "cell_type": "markdown", "source": [ "## Introduction to Regression - Lesson 1\n", "\n", "#### Putting it into perspective\n", "\n", "✅ There are many types of regression methods, and the one you choose depends on the question you're trying to answer. For example, if you want to predict the likely height of a person based on their age, you would use `linear regression`, as you're looking for a **numerical value**. On the other hand, if you're trying to determine whether a type of cuisine should be classified as vegan or not, you're dealing with a **category assignment**, so you would use `logistic regression`. You'll learn more about logistic regression later. Take a moment to think about some questions you could ask of data and which of these methods might be most suitable.\n", "\n", "In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine you wanted to test a treatment for diabetic patients. Machine Learning models could help you identify which patients might respond better to the treatment based on combinations of variables. Even a very simple regression model, when visualized, could reveal insights about variables that might assist in organizing your theoretical clinical trials.\n", "\n", "With that in mind, let's dive into this task!\n", "\n", "
\n",
" \n",
"
\n",
"\n",
"> `glimpse()` and `slice()` are functions from [`dplyr`](https://dplyr.tidyverse.org/). Dplyr, part of the Tidyverse, is a collection of tools for data manipulation that provides a consistent set of verbs to address common data manipulation tasks.\n",
"\n",
"
\n",
"\n",
"Now that we have the dataset, let's focus on one feature (`bmi`) for this exercise. To do this, we need to select the relevant columns. So, how can we achieve this?\n",
"\n",
"[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) enables us to *choose* (and optionally rename) specific columns in a data frame.\n"
],
"metadata": {
"id": "UwjVT1Hz-c3Z"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Select predictor feature `bmi` and outcome `y`\r\n",
"diabetes_select <- diabetes %>% \r\n",
" select(c(bmi, y))\r\n",
"\r\n",
"# Print the first 5 rows\r\n",
"diabetes_select %>% \r\n",
" slice(1:10)"
],
"outputs": [],
"metadata": {
"id": "RDY1oAKI-m80"
}
},
{
"cell_type": "markdown",
"source": [
"## 3. Training and Testing Data\n",
"\n",
"In supervised learning, it's a common approach to *divide* the data into two subsets: a (usually larger) set used to train the model, and a smaller \"reserved\" set used to evaluate the model's performance.\n",
"\n",
"Now that our data is prepared, we can explore whether a machine can assist in identifying a logical way to split the numbers in this dataset. To achieve this, we can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework. This package allows us to create an object that contains the details of *how* the data should be split, followed by two additional rsample functions to extract the resulting training and testing sets:\n"
],
"metadata": {
"id": "SDk668xK-tc3"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"set.seed(2056)\r\n",
"# Split 67% of the data for training and the rest for tesing\r\n",
"diabetes_split <- diabetes_select %>% \r\n",
" initial_split(prop = 0.67)\r\n",
"\r\n",
"# Extract the resulting train and test sets\r\n",
"diabetes_train <- training(diabetes_split)\r\n",
"diabetes_test <- testing(diabetes_split)\r\n",
"\r\n",
"# Print the first 3 rows of the training set\r\n",
"diabetes_train %>% \r\n",
" slice(1:10)"
],
"outputs": [],
"metadata": {
"id": "EqtHx129-1h-"
}
},
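{
"cell_type": "markdown",
"source": [
"> As an optional sanity check (a small sketch, not part of the original lesson), we can compare the number of rows in each split to confirm that roughly 67% of the observations landed in the training set.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Optional check: compare the sizes of the two splits\r\n",
"nrow(diabetes_train)\r\n",
"nrow(diabetes_test)"
],
"outputs": [],
"metadata": {}
},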
{
"cell_type": "markdown",
"source": [
"## 4. Train a linear regression model with Tidymodels\n",
"\n",
"Now it's time to train our model!\n",
"\n",
"In Tidymodels, models are defined using `parsnip()` by specifying three key aspects:\n",
"\n",
"- The **type** of model distinguishes between options like linear regression, logistic regression, decision tree models, and others.\n",
"\n",
"- The **mode** of the model refers to common tasks like regression or classification; some model types can handle both modes, while others are limited to one.\n",
"\n",
"- The **engine** is the computational tool that will be used to fit the model. These are often R packages, such as **`\"lm\"`** or **`\"ranger\"`**.\n",
"\n",
"This modeling information is stored in a model specification, so let's create one!\n"
],
"metadata": {
"id": "sBOS-XhB-6v7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Build a linear model specification\r\n",
"lm_spec <- \r\n",
" # Type\r\n",
" linear_reg() %>% \r\n",
" # Engine\r\n",
" set_engine(\"lm\") %>% \r\n",
" # Mode\r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Print the model specification\r\n",
"lm_spec"
],
"outputs": [],
"metadata": {
"id": "20OwEw20--t3"
}
},
{
"cell_type": "markdown",
"source": [
"After a model has been *defined*, it can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, usually with a formula and some data.\n",
"\n",
"`y ~ .` indicates that we will fit `y` as the target or predicted value, explained by all the predictors/features, i.e., `.` (in this case, we only have one predictor: `bmi`).\n"
],
"metadata": {
"id": "_oDHs89k_CJj"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Build a linear model specification\r\n",
"lm_spec <- linear_reg() %>% \r\n",
" set_engine(\"lm\") %>%\r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Train a linear regression model\r\n",
"lm_mod <- lm_spec %>% \r\n",
" fit(y ~ ., data = diabetes_train)\r\n",
"\r\n",
"# Print the model\r\n",
"lm_mod"
],
"outputs": [],
"metadata": {
"id": "YlsHqd-q_GJQ"
}
},
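{
"cell_type": "markdown",
"source": [
"> Optionally, the estimated coefficients can also be pulled into a tidy table. The sketch below assumes the broom package (attached along with the rest of Tidymodels) is available; `tidy()` works directly on the fitted parsnip model.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Optional: summarise the fitted model as a one-row-per-term tibble\r\n",
"lm_mod %>% \r\n",
"  tidy()"
],
"outputs": [],
"metadata": {}
},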
{
"cell_type": "markdown",
"source": [
"From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that minimizes the overall error between the actual and predicted variable.\n",
"\n",
"
\n",
"\n",
"## 5. Make predictions on the test set\n",
"\n",
"Now that we've trained a model, we can use it to predict the disease progression y for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). This will help us draw the line separating the data groups.\n"
],
"metadata": {
"id": "kGZ22RQj_Olu"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make predictions for the test set\r\n",
"predictions <- lm_mod %>% \r\n",
" predict(new_data = diabetes_test)\r\n",
"\r\n",
"# Print out some of the predictions\r\n",
"predictions %>% \r\n",
" slice(1:5)"
],
"outputs": [],
"metadata": {
"id": "nXHbY7M2_aao"
}
},
{
"cell_type": "markdown",
"source": [
"Woohoo! 💃🕺 We just trained a model and used it to make predictions!\n",
"\n",
"When making predictions, the tidymodels convention is to always generate a tibble/data frame of results with standardized column names. This ensures that combining the original data with the predictions is straightforward and results in a format that can be easily used for further tasks like plotting.\n",
"\n",
"`dplyr::bind_cols()` efficiently combines multiple data frames by columns.\n"
],
"metadata": {
"id": "R_JstwUY_bIs"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Combine the predictions and the original test set\r\n",
"results <- diabetes_test %>% \r\n",
" bind_cols(predictions)\r\n",
"\r\n",
"\r\n",
"results %>% \r\n",
" slice(1:5)"
],
"outputs": [],
"metadata": {
"id": "RybsMJR7_iI8"
}
},
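{
"cell_type": "markdown",
"source": [
"> Before plotting, we could also quantify how well the predictions match the actual `y` values. This is a hedged, optional sketch: it assumes the yardstick package (attached with Tidymodels), whose `metrics()` function reports RMSE, R squared and MAE for a numeric outcome.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Optional: evaluate the predictions against the true values\r\n",
"results %>% \r\n",
"  metrics(truth = y, estimate = .pred)"
],
"outputs": [],
"metadata": {}
},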
{
"cell_type": "markdown",
"source": [
"## 6. Plot modelling results\n",
"\n",
"Now, it's time to visualize this 📈. We'll create a scatter plot of all the `y` and `bmi` values from the test set, and then use the predictions to draw a line in the most suitable position, reflecting the model's data groupings.\n",
"\n",
"R offers several systems for creating graphs, but `ggplot2` is one of the most elegant and versatile. It allows you to build graphs by **combining independent components**.\n"
],
"metadata": {
"id": "XJbYbMZW_n_s"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Set a theme for the plot\r\n",
"theme_set(theme_light())\r\n",
"# Create a scatter plot\r\n",
"results %>% \r\n",
" ggplot(aes(x = bmi)) +\r\n",
" # Add a scatter plot\r\n",
" geom_point(aes(y = y), size = 1.6) +\r\n",
" # Add a line plot\r\n",
" geom_line(aes(y = .pred), color = \"blue\", size = 1.5)"
],
"outputs": [],
"metadata": {
"id": "R9tYp3VW_sTn"
}
},
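{
"cell_type": "markdown",
"source": [
"> As a side note (an optional sketch, not part of the original lesson), ggplot2 can also fit and overlay a comparable straight line by itself with `geom_smooth(method = \"lm\")`. Because `geom_smooth()` refits the line on the test points rather than reusing the model trained above, it won't be identical, but it should land very close and serves as a handy visual cross-check.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Optional cross-check: let ggplot2 fit and draw its own least-squares line\r\n",
"results %>% \r\n",
"  ggplot(aes(x = bmi, y = y)) +\r\n",
"  geom_point(size = 1.6) +\r\n",
"  geom_smooth(method = \"lm\", se = FALSE, color = \"darkred\")"
],
"outputs": [],
"metadata": {}
},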
{
"cell_type": "markdown",
"source": [
"> ✅ Take a moment to think about what's happening here. A straight line is passing through many small data points, but what is its purpose exactly? Can you understand how this line can help predict where a new, unseen data point might fall in relation to the plot's y-axis? Try to describe in words the practical application of this model.\n",
"\n",
"Congratulations, you've built your first linear regression model, made a prediction with it, and visualized it in a plot!\n"
],
"metadata": {
"id": "zrPtHIxx_tNI"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.\n"
]
}
]
}