Suppress warnings for pacman and resize images for lesson_1-R.ipynb

pull/289/head
R-icntay 3 years ago
parent 4dbfa4f31d
commit 95faf1b7ca

@ -1,6 +1,6 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_1-R.ipynb",
@ -19,37 +19,39 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "YJUHCXqK57yz"
},
"source": [
"# Build a regression model: Get started with R and Tidymodels for regression models"
]
],
"metadata": {
"id": "YJUHCXqK57yz"
}
},
{
"cell_type": "markdown",
"source": [
"## Introduction to Regression - Lesson 1\r\n",
"\r\n",
"#### Putting it into perspective\r\n",
"\r\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\r\n",
"\r\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\r\n",
"\r\n",
"That said, let's get started on this task!\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../images/encouRage.jpg\"\r\n",
" width=\"630\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"<!--![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst-->"
],
"metadata": {
"id": "LWNNzfqd6feZ"
},
"source": [
"## Introduction to Regression - Lesson 1\n",
"\n",
"#### Putting it into perspective\n",
"\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\n",
"\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\n",
"\n",
"That said, let's get started on this task!\n",
"\n",
"![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst"
]
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "FIo2YhO26wI9"
},
"source": [
"## 1. Loading up our tool set\n",
"\n",
@ -64,62 +66,62 @@
"`install.packages(c(\"tidyverse\", \"tidymodels\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing."
]
],
"metadata": {
"id": "FIo2YhO26wI9"
}
},
{
"cell_type": "code",
"metadata": {
"id": "cIA9fz9v7Dss",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2df7073b-86b2-4b32-cb86-0da605a0dc11"
},
"execution_count": 2,
"source": [
"if (!require(\"pacman\")) install.packages(\"pacman\")\n",
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"pacman::p_load(tidyverse, tidymodels)"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Loading required package: pacman\n",
"\n"
],
"name": "stderr"
]
}
]
],
"metadata": {
"id": "cIA9fz9v7Dss",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2df7073b-86b2-4b32-cb86-0da605a0dc11"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "gpO_P_6f9WUG"
},
"source": [
"Now, let's load these awesome packages and make them available in our current R session. (This is only for illustration; `pacman::p_load()` already did that for you.)"
]
],
"metadata": {
"id": "gpO_P_6f9WUG"
}
},
{
"cell_type": "code",
"metadata": {
"id": "NLMycgG-9ezO"
},
"execution_count": null,
"source": [
"# load the core Tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# load the core Tidymodels packages\n",
"library(tidymodels)\n"
"# load the core Tidyverse packages\r\n",
"library(tidyverse)\r\n",
"\r\n",
"# load the core Tidymodels packages\r\n",
"library(tidymodels)\r\n"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "NLMycgG-9ezO"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "KM6iXLH996Cl"
},
"source": [
"## 2. The diabetes dataset\n",
"\n",
@ -156,34 +158,34 @@
"Before going any further, let's also introduce something you will encounter often in R code 🥁🥁: the pipe operator `%>%`\n",
"\n",
"The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code."
]
],
"metadata": {
"id": "KM6iXLH996Cl"
}
},
{
"cell_type": "code",
"metadata": {
"id": "Z1geAMhM-bSP"
},
"execution_count": null,
"source": [
"# Import the data set\n",
"diabetes <- read_table2(file = \"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(diabetes)\n",
"\n",
"\n",
"# Select the first 5 rows of the data\n",
"diabetes %>% \n",
"# Import the data set\r\n",
"diabetes <- read_table2(file = \"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt\")\r\n",
"\r\n",
"\r\n",
"# Get a glimpse and dimensions of the data\r\n",
"glimpse(diabetes)\r\n",
"\r\n",
"\r\n",
"# Select the first 5 rows of the data\r\n",
"diabetes %>% \r\n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "Z1geAMhM-bSP"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "UwjVT1Hz-c3Z"
},
"source": [
"`glimpse()` shows us that this data has 442 rows and 11 columns with all the columns being of data type `double` \n",
"\n",
@ -198,65 +200,65 @@
"Now that we have the data, let's narrow down to one feature (`bmi`) to target for this exercise. This will require us to select the desired columns. So, how do we do this?\n",
"\n",
"[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) allows us to *select* (and optionally rename) columns in a data frame."
]
],
"metadata": {
"id": "UwjVT1Hz-c3Z"
}
},
{
"cell_type": "code",
"metadata": {
"id": "RDY1oAKI-m80"
},
"execution_count": null,
"source": [
"# Select predictor feature `bmi` and outcome `y`\n",
"diabetes_select <- diabetes %>% \n",
" select(c(bmi, y))\n",
"\n",
"# Print the first 10 rows\n",
"diabetes_select %>% \n",
"# Select predictor feature `bmi` and outcome `y`\r\n",
"diabetes_select <- diabetes %>% \r\n",
" select(c(bmi, y))\r\n",
"\r\n",
"# Print the first 10 rows\r\n",
"diabetes_select %>% \r\n",
" slice(1:10)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "RDY1oAKI-m80"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "SDk668xK-tc3"
},
"source": [
"## 3. Training and Testing data\n",
"\n",
"It's common practice in supervised learning to *split* the data into two subsets: a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to evaluate how the model performs.\n",
"\n",
"Now that we have data ready, we can see if a machine can help determine a logical split between the numbers in this dataset. We can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework, to create an object that contains the information on *how* to split the data, and then two more rsample functions to extract the created training and testing sets:\n"
]
],
"metadata": {
"id": "SDk668xK-tc3"
}
},
{
"cell_type": "code",
"metadata": {
"id": "EqtHx129-1h-"
},
"execution_count": null,
"source": [
"set.seed(2056)\n",
"# Split 67% of the data for training and the rest for testing\n",
"diabetes_split <- diabetes_select %>% \n",
" initial_split(prop = 0.67)\n",
"\n",
"# Extract the resulting train and test sets\n",
"diabetes_train <- training(diabetes_split)\n",
"diabetes_test <- testing(diabetes_split)\n",
"\n",
"# Print the first 10 rows of the training set\n",
"diabetes_train %>% \n",
"set.seed(2056)\r\n",
"# Split 67% of the data for training and the rest for testing\r\n",
"diabetes_split <- diabetes_select %>% \r\n",
" initial_split(prop = 0.67)\r\n",
"\r\n",
"# Extract the resulting train and test sets\r\n",
"diabetes_train <- training(diabetes_split)\r\n",
"diabetes_test <- testing(diabetes_split)\r\n",
"\r\n",
"# Print the first 10 rows of the training set\r\n",
"diabetes_train %>% \r\n",
" slice(1:10)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "EqtHx129-1h-"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "sBOS-XhB-6v7"
},
"source": [
"## 4. Train a linear regression model with Tidymodels\n",
"\n",
@ -271,68 +273,68 @@
"- Model **engine** is the computational tool which will be used to fit the model. Often these are R packages, such as **`\"lm\"`** or **`\"ranger\"`**\n",
"\n",
"This modeling information is captured in a model specification, so let's build one!"
]
],
"metadata": {
"id": "sBOS-XhB-6v7"
}
},
{
"cell_type": "code",
"metadata": {
"id": "20OwEw20--t3"
},
"execution_count": null,
"source": [
"# Build a linear model specification\n",
"lm_spec <- \n",
" # Type\n",
" linear_reg() %>% \n",
" # Engine\n",
" set_engine(\"lm\") %>% \n",
" # Mode\n",
" set_mode(\"regression\")\n",
"\n",
"\n",
"# Print the model specification\n",
"# Build a linear model specification\r\n",
"lm_spec <- \r\n",
" # Type\r\n",
" linear_reg() %>% \r\n",
" # Engine\r\n",
" set_engine(\"lm\") %>% \r\n",
" # Mode\r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Print the model specification\r\n",
"lm_spec"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "20OwEw20--t3"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "_oDHs89k_CJj"
},
"source": [
"After a model has been *specified*, the model can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, typically using a formula and some data.\n",
"\n",
"`y ~ .` means we'll fit `y` as the predicted quantity/target, explained by all the predictors/features, i.e., `.` (in this case, we only have one predictor: `bmi`)."
]
],
"metadata": {
"id": "_oDHs89k_CJj"
}
},
{
"cell_type": "code",
"metadata": {
"id": "YlsHqd-q_GJQ"
},
"execution_count": null,
"source": [
"# Build a linear model specification\n",
"lm_spec <- linear_reg() %>% \n",
" set_engine(\"lm\") %>%\n",
" set_mode(\"regression\")\n",
"\n",
"\n",
"# Train a linear regression model\n",
"lm_mod <- lm_spec %>% \n",
" fit(y ~ ., data = diabetes_train)\n",
"\n",
"# Print the model\n",
"# Build a linear model specification\r\n",
"lm_spec <- linear_reg() %>% \r\n",
" set_engine(\"lm\") %>%\r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Train a linear regression model\r\n",
"lm_mod <- lm_spec %>% \r\n",
" fit(y ~ ., data = diabetes_train)\r\n",
"\r\n",
"# Print the model\r\n",
"lm_mod"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "YlsHqd-q_GJQ"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "kGZ22RQj_Olu"
},
"source": [
"From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that gives us the lowest overall error between the actual and predicted variable.\n",
"<br>\n",
@ -340,97 +342,100 @@
"## 5. Make predictions on the test set\n",
"\n",
"Now that we've trained a model, we can use it to predict the disease progression y for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). This will be used to draw the line between data groups."
]
],
"metadata": {
"id": "kGZ22RQj_Olu"
}
},
{
"cell_type": "code",
"metadata": {
"id": "nXHbY7M2_aao"
},
"execution_count": null,
"source": [
"# Make predictions for the test set\n",
"predictions <- lm_mod %>% \n",
" predict(new_data = diabetes_test)\n",
"\n",
"# Print out some of the predictions\n",
"predictions %>% \n",
"# Make predictions for the test set\r\n",
"predictions <- lm_mod %>% \r\n",
" predict(new_data = diabetes_test)\r\n",
"\r\n",
"# Print out some of the predictions\r\n",
"predictions %>% \r\n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "nXHbY7M2_aao"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "R_JstwUY_bIs"
},
"source": [
"Woohoo! 💃🕺 We just trained a model and used it to make predictions!\n",
"\n",
"When making predictions, the tidymodels convention is to always produce a tibble/data frame of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format for subsequent operations such as plotting.\n",
"\n",
"`dplyr::bind_cols()` efficiently binds multiple data frames by column."
]
],
"metadata": {
"id": "R_JstwUY_bIs"
}
},
{
"cell_type": "code",
"metadata": {
"id": "RybsMJR7_iI8"
},
"execution_count": null,
"source": [
"# Combine the predictions and the original test set\n",
"results <- diabetes_test %>% \n",
" bind_cols(predictions)\n",
"\n",
"\n",
"results %>% \n",
"# Combine the predictions and the original test set\r\n",
"results <- diabetes_test %>% \r\n",
" bind_cols(predictions)\r\n",
"\r\n",
"\r\n",
"results %>% \r\n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "RybsMJR7_iI8"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "XJbYbMZW_n_s"
},
"source": [
"## 6. Plot modelling results\n",
"\n",
"Now, it's time to see this visually 📈. We'll create a scatter plot of all the `y` and `bmi` values of the test set, then use the predictions to draw a line in the most appropriate place, between the model's data groupings.\n",
"\n",
"R has several systems for making graphs, but `ggplot2` is one of the most elegant and most versatile. This allows you to compose graphs by **combining independent components**."
]
],
"metadata": {
"id": "XJbYbMZW_n_s"
}
},
{
"cell_type": "code",
"metadata": {
"id": "R9tYp3VW_sTn"
},
"execution_count": null,
"source": [
"# Set a theme for the plot\n",
"theme_set(theme_light())\n",
"# Create a scatter plot\n",
"results %>% \n",
" ggplot(aes(x = bmi)) +\n",
" # Add a scatter plot\n",
" geom_point(aes(y = y), size = 1.6) +\n",
" # Add a line plot\n",
"# Set a theme for the plot\r\n",
"theme_set(theme_light())\r\n",
"# Create a scatter plot\r\n",
"results %>% \r\n",
" ggplot(aes(x = bmi)) +\r\n",
" # Add a scatter plot\r\n",
" geom_point(aes(y = y), size = 1.6) +\r\n",
" # Add a line plot\r\n",
" geom_line(aes(y = .pred), color = \"blue\", size = 1.5)"
],
"execution_count": null,
"outputs": []
"outputs": [],
"metadata": {
"id": "R9tYp3VW_sTn"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "zrPtHIxx_tNI"
},
"source": [
"> ✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.\n",
"\n",
"Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!\n"
]
],
"metadata": {
"id": "zrPtHIxx_tNI"
}
}
]
}
}
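Note on the `suppressWarnings()` change: the commit wraps the whole `if (!require("pacman")) install.packages("pacman")` expression, so warnings raised by either `require()` or `install.packages()` are hidden while the `FALSE` return value of `require()` still drives the install. The sketch below is only an illustration of that behaviour, not part of the notebook. It assumes a made-up package name (`missing_pkg`) that is not installed, and `count_warnings()` is a hypothetical helper written for this example; nothing is installed when it runs.

```r
# Hypothetical package name, assumed NOT to be installed (illustration only).
missing_pkg <- "notARealPackage12345"

# Hypothetical helper: evaluate an expression and count the warnings it raises,
# muffling each one so nothing is printed.
count_warnings <- function(expr) {
  n <- 0
  withCallingHandlers(
    expr,
    warning = function(w) {
      n <<- n + 1
      invokeRestart("muffleWarning")
    }
  )
  n
}

# Plain require(): a missing package triggers a warning.
plain <- count_warnings(
  require(missing_pkg, character.only = TRUE)
)

# Wrapped in suppressWarnings(): the warning is muffled before it
# reaches the outer handler, so none are counted.
wrapped <- count_warnings(
  suppressWarnings(require(missing_pkg, character.only = TRUE))
)

# The logical result of require() is unaffected by suppressWarnings(),
# which is why the `if (!require(...))` check in the notebook still works.
found <- suppressWarnings(require(missing_pkg, character.only = TRUE))

print(c(plain = plain, wrapped = wrapped))
print(found)
```

One trade-off worth keeping in mind: because the wrapper covers `install.packages()` too, a genuine installation warning would also be hidden. Wrapping only the `require()` call would be a narrower alternative.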