{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "Untitled10.ipynb",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Build a regression model: logistic regression\n",
"
\n"
],
"metadata": {
"id": "fVfEucLYkV9T"
}
},
{
"cell_type": "markdown",
"source": [
"## Build a logistic regression model - Lesson 4\n",
"\n",
"
\n",
"\n",
"> In the logistic function $f(x) = \\frac{L}{1 + e^{-k(x - x_0)}}$, $x_0$ is the x value of the sigmoid's midpoint, $L$ is the curve's maximum value, and $k$ is the curve's steepness. If the outcome of the function is more than 0.5, the label in question is assigned class 1 of the binary choice. Otherwise, it is classified as 0.\n",
"\n",
"Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.\n",
"\n",
"It is best practice to hold out some of your data for **testing**: comparing the predicted labels with the already known labels in the test set gives a better estimate of how your model will perform on new data. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:"
],
"metadata": {
"id": "RA_bnMS9mVo8"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load Tidymodels (attaches rsample, recipes, parsnip, workflows, yardstick and dplyr)\n",
"library(tidymodels)\n",
"\n",
"# Split data into 80% for training and 20% for testing\n",
"# (`pumpkins_select` is the prepared data frame from earlier in this lesson)\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>%\n",
"  initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>%\n",
"  slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "PQdpEYYPmdGW"
}
},
{
"cell_type": "markdown",
"source": [
"We are now ready to train a model by fitting the training features to the training label (color).\n",
"\n",
"We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling, i.e., encoding categorical variables into a set of integers.\n",
"\n",
"There are several ways to specify a logistic regression model in Tidymodels.
See `?logistic_reg()` for the available options. For now, we'll specify a logistic regression model via the default `stats::glm()` engine."
],
"metadata": {
"id": "MX9LipSimhn0"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a recipe that specifies preprocessing steps for modelling:\n",
"# encode every predictor as zero-based integers\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>%\n",
"  step_integer(all_predictors(), zero_based = TRUE)\n",
"\n",
"\n",
"# Create a logistic model specification\n",
"log_reg <- logistic_reg() %>%\n",
"  set_engine(\"glm\") %>%\n",
"  set_mode(\"classification\")\n"
],
"outputs": [],
"metadata": {
"id": "0Eo5-SbSmm2-"
}
},
{
"cell_type": "markdown",
"source": [
"Now that we have a recipe and a model specification, we need a way of bundling them together into an object that will first preprocess the data (prep + bake behind the scenes), then fit the model on the preprocessed data, and also allow for potential post-processing activities.\n",
"\n",
"In Tidymodels, this object is called a [`workflow`](https://workflows.tidymodels.org/) and conveniently holds your modeling components."
],
"metadata": {
"id": "G599GKhXmqWf"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Bundle modelling components in a workflow\n",
"log_reg_wf <- workflow() %>%\n",
"  add_recipe(pumpkins_recipe) %>%\n",
"  add_model(log_reg)\n",
"\n",
"# Print out the workflow\n",
"log_reg_wf\n"
],
"outputs": [],
"metadata": {
"id": "cRoU0tpbmu1T"
}
},
{
"cell_type": "markdown",
"source": [
"After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow will estimate the recipe and preprocess the data before training, so we won't have to do that manually using `prep()` and `bake()`."
],
"metadata": {
"id": "JnRXKmREnEpd"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Train the model\n",
"wf_fit <- log_reg_wf %>%\n",
"  fit(data = pumpkins_train)\n",
"\n",
"# Print the trained workflow\n",
"wf_fit"
],
"outputs": [],
"metadata": {
"id": "ehFwfkjWnNCb"
}
},
{
"cell_type": "markdown",
"source": [
"The model printout shows the coefficients learned during training.\n",
"\n",
"Now that we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the predicted probability of `ORANGE` is more than 0.5, the predicted class is `ORANGE`; otherwise it is `WHITE`."
],
"metadata": {
"id": "w01dGNZjnOJQ"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make predictions for color and corresponding probabilities\n",
"results <- pumpkins_test %>% select(color) %>%\n",
"  bind_cols(wf_fit %>%\n",
"              predict(new_data = pumpkins_test)) %>%\n",
"  bind_cols(wf_fit %>%\n",
"              predict(new_data = pumpkins_test, type = \"prob\"))\n",
"\n",
"# Compare predictions\n",
"results %>%\n",
"  slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "K8PNjPfTnak2"
}
},
{
"cell_type": "markdown",
"source": [
"Very nice! This provides some more insight into how logistic regression works.\n",
"\n",
"Comparing each prediction with its corresponding \"ground truth\" actual value isn't a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/), a package used to measure the effectiveness of models using performance metrics.\n",
"\n",
"One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs.
It tabulates how many examples in each class were correctly classified by a model. In our case, it will show you how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; the confusion matrix also shows you how many were classified into the **wrong** categories.\n",
"\n",
"The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes."
],
"metadata": {
"id": "N3J-yW0wngKo"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Confusion matrix for prediction results\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)"
],
"outputs": [],
"metadata": {
"id": "0RD77Dq1nl2j"
}
},
{
"cell_type": "markdown",
"source": [
"Let's interpret the confusion matrix. Our model is asked to classify pumpkins into two binary categories, category `orange` and category `not-orange`.\n",
"\n",
"- If your model predicts a pumpkin as orange and it belongs to category 'orange' in reality, we call it a `true positive`, shown by the top left number.\n",
"\n",
"- If your model predicts a pumpkin as not orange and it belongs to category 'orange' in reality, we call it a `false negative`, shown by the bottom left number.\n",
"\n",
"- If your model predicts a pumpkin as orange and it belongs to category 'not-orange' in reality, we call it a `false positive`, shown by the top right number.\n",
"\n",
"- If your model predicts a pumpkin as not orange and it belongs to category 'not-orange' in reality, we call it a `true negative`, shown by the bottom right number.\n",
"\n",
"\n",
"|                       | **Truth: ORANGE** | **Truth: WHITE** |\n",
"|-----------------------|:-----------------:|:----------------:|\n",
"| **Predicted: ORANGE** | TP                | FP               |\n",
"| **Predicted: WHITE**  | FN                | TN               |"
],
"metadata": {
"id": "H61sFwdOnoiO"
}
},
{
"cell_type": "markdown",
"source": [
"As you might have guessed, it's preferable to have a
larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.\n",
"\n",
"The confusion matrix is helpful since it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let's go through some of them:\n",
"\n",
"- Precision: `TP/(TP + FP)`, defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\").\n",
"\n",
"- Recall: `TP/(TP + FN)`, defined as the proportion of positive results out of the number of samples which were actually positive. Also known as `sensitivity`.\n",
"\n",
"- Specificity: `TN/(TN + FP)`, defined as the proportion of negative results out of the number of samples which were actually negative.\n",
"\n",
"- Accuracy: `(TP + TN)/(TP + TN + FP + FN)`, the proportion of all labels that were predicted correctly.\n",
"\n",
"- F Measure: a weighted (harmonic) average of precision and recall, with best being 1 and worst being 0.\n",
"\n",
"Let's calculate these metrics!"
],
"metadata": {
"id": "Yc6QUie2oQUr"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Combine metric functions and calculate them all at once\n",
"# (ppv = precision, spec = specificity, f_meas = F measure)\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)"
],
"outputs": [],
"metadata": {
"id": "p6rXx_T3oVxX"
}
},
{
"cell_type": "markdown",
"source": [
"#### **Visualize the ROC curve of this model**\n",
"\n",
"For a start, this is not a bad model; its precision, recall, F measure and accuracy are in the 80% range, so ideally you could use it to predict the color of a pumpkin given a set of variables. It also seems that our model was not really able to identify the white pumpkins. Could you guess why?
One reason could be the high prevalence of ORANGE pumpkins in our training set, which makes our model more inclined to predict the majority class.\n",
"\n",
"Let's do one more visualization to see the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):"
],
"metadata": {
"id": "JcenzZo1oaKR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a roc_curve\n",
"results %>%\n",
"  roc_curve(color, .pred_ORANGE) %>%\n",
"  autoplot()"
],
"outputs": [],
"metadata": {
"id": "BcmkHHHwogRB"
}
},
{
"cell_type": "markdown",
"source": [
"ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. ROC curves typically feature `True Positive Rate`/Sensitivity on the Y axis, and `False Positive Rate`/1-Specificity on the X axis. Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly.\n",
"\n",
"Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the Curve. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example."
],
"metadata": {
"id": "P_an3vc1oqjI"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Calculate area under curve\n",
"results %>%\n",
"  roc_auc(color, .pred_ORANGE)"
],
"outputs": [],
"metadata": {
"id": "SZyy5BT8ovew"
}
},
{
"cell_type": "markdown",
"source": [
"The result is around `0.67053`.
Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1. A model that guesses at random scores about 0.5, so in this case the model does better than chance but still has room to improve.\n",
"\n",
"In future lessons on classification, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\n",
"\n",
"But for now, congratulations! You've completed these regression lessons!\n",
"\n",
"You R awesome!\n",
"\n",
"