{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build a logistic regression model - Lesson 4\n",
"\n",
"![Logistic vs. linear regression infographic](../../../../../../translated_images/linear-vs-logistic.ba180bf95e7ee66721ba10ebf2dac2666acbd64a88b003c83928712433a13c7d.en.png)\n",
"\n",
"#### **[Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n",
"\n",
"#### Introduction\n",
"\n",
"In this final lesson on Regression, one of the fundamental *classic* ML techniques, we will explore Logistic Regression. This method is used to identify patterns for predicting binary categories. For example: Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- Techniques for logistic regression\n",
"\n",
"✅ Enhance your understanding of this type of regression by exploring this [Learn module](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott)\n",
"\n",
"## Prerequisite\n",
"\n",
"After working with the pumpkin data, we are now familiar enough with it to identify a binary category we can work with: `Color`.\n",
"\n",
"Let's create a logistic regression model to predict, based on certain variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).\n",
"\n",
"> Why are we discussing binary classification in a lesson series about regression? It's mainly for linguistic convenience, as logistic regression is [actually a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), though it is based on linear principles. Youll learn about other classification methods in the next lesson series.\n",
"\n",
"For this lesson, we'll need the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more enjoyable!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) offers simple tools for examining and cleaning messy data.\n",
"\n",
"- `ggbeeswarm`: The [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) provides methods for creating beeswarm-style plots using ggplot2.\n",
"\n",
"You can install them using:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n",
"\n",
"Alternatively, the script below checks whether you have the necessary packages for this module and installs them for you if they are missing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Define the question**\n",
"\n",
"For our purposes, we will frame this as a binary question: 'White' or 'Not White'. Our dataset also includes a 'striped' category, but there are very few instances of it, so we will exclude it. In fact, it disappears once we remove null values from the dataset.\n",
"\n",
"> 🎃 Fun fact: White pumpkins are sometimes called 'ghost' pumpkins. They're not very easy to carve, which makes them less popular than the orange ones, but they do look pretty cool! So we could also rephrase our question as: 'Ghost' or 'Not Ghost'. 👻\n",
"\n",
"## **About logistic regression**\n",
"\n",
"Logistic regression differs from linear regression, which you learned about earlier, in several key ways.\n",
"\n",
"#### **Binary classification**\n",
"\n",
"Logistic regression doesn't provide the same functionality as linear regression. The former predicts a `binary category` (\"orange or not orange\"), while the latter can predict `continuous values`, such as estimating *how much the price of a pumpkin will increase* based on its origin and harvest time.\n",
"\n",
"![Infographic by Dasani Madipalli](../../../../../../translated_images/pumpkin-classifier.562771f104ad5436b87d1c67bca02a42a17841133556559325c0a0e348e5b774.en.png)\n",
"\n",
"### Other classifications\n",
"\n",
"There are other types of logistic regression, including multinomial and ordinal:\n",
"\n",
"- **Multinomial**, which deals with more than two categories - \"Orange, White, and Striped.\"\n",
"\n",
"- **Ordinal**, which involves ordered categories. This is useful if we want to logically rank our outcomes, such as pumpkins categorized by a finite set of sizes (mini, sm, med, lg, xl, xxl).\n",
"\n",
"![Multinomial vs ordinal regression](../../../../../../translated_images/multinomial-vs-ordinal.36701b4850e37d86c9dd49f7bef93a2f94dbdb8fe03443eb68f0542f97f28f29.en.png)\n",
"\n",
"#### **Variables DO NOT have to correlate**\n",
"\n",
"Remember how linear regression worked better with highly correlated variables? Logistic regression is different—it doesn't require the variables to be strongly correlated. This is helpful for our dataset, which has relatively weak correlations.\n",
"\n",
"#### **You need a lot of clean data**\n",
"\n",
"Logistic regression produces more accurate results when you have a larger dataset. Our small dataset isn't ideal for this task, so keep that in mind.\n",
"\n",
"✅ Consider the types of data that are well-suited for logistic regression.\n",
"\n",
"## Exercise - tidy the data\n",
"\n",
"First, clean the data by removing null values and selecting only specific columns:\n",
"\n",
"1. Add the following code:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Load the core tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the data and clean column names\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n",
" clean_names()\n",
"\n",
"# Select desired columns\n",
"pumpkins_select <- pumpkins %>% \n",
" select(c(city_name, package, variety, origin, item_size, color)) \n",
"\n",
"# Drop rows containing missing values and encode color as factor (category)\n",
"pumpkins_select <- pumpkins_select %>% \n",
" drop_na() %>% \n",
" mutate(color = factor(color))\n",
"\n",
"# View the first few rows\n",
"pumpkins_select %>% \n",
" slice_head(n = 5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can always take a look at your new dataframe by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as shown below:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"pumpkins_select %>% \n",
" glimpse()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's confirm that we are indeed working on a binary classification problem:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Subset distinct observations in outcome column\n",
"pumpkins_select %>% \n",
" distinct(color)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualization - categorical plot\n",
"By now, you have reloaded the pumpkin data and cleaned it to retain a dataset with a few variables, including Color. Let's visualize the dataframe in the notebook using the ggplot library.\n",
"\n",
"The ggplot library provides some great tools for visualizing your data. For instance, you can compare the distributions of data for each Variety and Color using a categorical plot.\n",
"\n",
"1. Create this type of plot by using the geombar function, applying it to our pumpkin data, and specifying a color mapping for each pumpkin category (orange or white):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Specify colors for each value of the hue variable\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# Create the bar plot\n",
"ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n",
" geom_bar(position = \"dodge\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(y = \"Variety\", fill = \"Color\") +\n",
" theme_minimal()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By observing the data, you can see how the Color data relates to Variety.\n",
"\n",
"✅ Based on this categorical plot, what are some interesting analyses or insights you can think of?\n"
]
},
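{
"cell_type": "markdown",
"metadata": {},
"source": [
"To back up those visual impressions with numbers, a quick tabulation helps. The short sketch below simply counts the rows for each `variety`/`color` combination using `dplyr::count()`:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Count how many pumpkins of each variety fall into each color group\n",
"pumpkins_select %>% \n",
" count(variety, color) %>% \n",
" arrange(variety, desc(n))\n"
]
},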
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data pre-processing: feature encoding\n",
"\n",
"Our pumpkins dataset contains string values for all its columns. While humans find it intuitive to work with categorical data, machines do not. Machine learning algorithms perform better with numerical data. This makes encoding a crucial step in the data pre-processing phase, as it allows us to convert categorical data into numerical data without losing any information. Effective encoding contributes to building a strong model.\n",
"\n",
"There are two main types of encoders for feature encoding:\n",
"\n",
"1. **Ordinal encoder**: This is suitable for ordinal variables, which are categorical variables with a logical order, like the `item_size` column in our dataset. It creates a mapping where each category is represented by a number corresponding to its order in the column.\n",
"\n",
"2. **Categorical encoder**: This is suitable for nominal variables, which are categorical variables without a logical order, like all the features other than `item_size` in our dataset. It uses one-hot encoding, meaning each category is represented by a binary column: the encoded variable equals 1 if the pumpkin belongs to that variety and 0 otherwise.\n",
"\n",
"Tidymodels offers another useful package: [recipes](https://recipes.tidymodels.org/)—a package for data preprocessing. We'll define a `recipe` to specify that all predictor columns should be encoded into a set of integers, `prep` it to estimate the necessary quantities and statistics for any operations, and finally `bake` it to apply the computations to new data.\n",
"\n",
"> Typically, recipes are used as preprocessors for modeling, where they define the steps to be applied to a dataset to prepare it for modeling. In such cases, it is **highly recommended** to use a `workflow()` instead of manually estimating a recipe using prep and bake. We'll explore this shortly.\n",
">\n",
"> For now, however, we are using recipes + prep + bake to specify the steps to be applied to a dataset to prepare it for data analysis and then extract the preprocessed data with the steps applied.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Preprocess and extract data to allow some data analysis\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n",
" # Define ordering for item_size column\n",
" step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
" # Convert factors to numbers using the order defined above (Ordinal encoding)\n",
" step_integer(item_size, zero_based = F) %>%\n",
" # Encode all other predictors using one hot encoding\n",
" step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n",
" prep(data = pumpkin_select) %>%\n",
" bake(new_data = NULL)\n",
"\n",
"# Display the first few rows of preprocessed data\n",
"baked_pumpkins %>% \n",
" slice_head(n = 5)\n"
]
},
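{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to confirm what the ordinal encoding produced, a minimal sketch like the one below lists the distinct integer codes now stored in `item_size`. With `zero_based = FALSE` they should run from 1 upward, following the level order defined in the recipe:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Inspect the integer codes that the ordinal encoding assigned to item_size\n",
"baked_pumpkins %>% \n",
" distinct(item_size) %>% \n",
" arrange(item_size)\n"
]
},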
{
"cell_type": "markdown",
"metadata": {},
"source": [
"✅ What are the advantages of using an ordinal encoder for the Item Size column?\n",
"\n",
"### Analyse relationships between variables\n",
"\n",
"Now that we have pre-processed our data, we can analyze the relationships between the features and the label to get an idea of how well the model will be able to predict the label based on the features. The best way to perform this type of analysis is by plotting the data. \n",
"We'll use the ggplot geom_boxplot_ function again to visualize the relationships between Item Size, Variety, and Color in a categorical plot. To better visualize the data, we'll use the encoded Item Size column and the unencoded Variety column.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Define the color palette\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# We need the encoded Item Size column to use it as the x-axis values in the plot\n",
"pumpkins_select_plot<-pumpkins_select\n",
"pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n",
"\n",
"# Create the grouped box plot\n",
"ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +\n",
" geom_boxplot() +\n",
" facet_grid(variety ~ ., scales = \"free_x\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(x = \"Item Size\", y = \"\") +\n",
" theme_minimal() +\n",
" theme(strip.text = element_text(size = 12)) +\n",
" theme(axis.text.x = element_text(size = 10)) +\n",
" theme(axis.title.x = element_text(size = 12)) +\n",
" theme(axis.title.y = element_blank()) +\n",
" theme(legend.position = \"bottom\") +\n",
" guides(fill = guide_legend(title = \"Color\")) +\n",
" theme(panel.spacing = unit(0.5, \"lines\"))+\n",
" theme(strip.text.y = element_text(size = 4, hjust = 0)) \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Use a swarm plot\n",
"\n",
"Since Color is a binary category (White or Not), it requires '[a specialized approach](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf) to visualization.'\n",
"\n",
"Try using a `swarm plot` to display the distribution of color in relation to item_size.\n",
"\n",
"We'll use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm), which offers methods for creating beeswarm-style plots with ggplot2. Beeswarm plots are a way to arrange points that would normally overlap so that they are positioned next to each other instead.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create beeswarm plots of color and item_size\n",
"baked_pumpkins %>% \n",
" mutate(color = factor(color)) %>% \n",
" ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n",
" geom_quasirandom() +\n",
" scale_color_brewer(palette = \"Dark2\", direction = -1) +\n",
" theme(legend.position = \"none\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we understand the connection between the binary categories of color and the broader group of sizes, let's dive into logistic regression to predict the probable color of a pumpkin.\n",
"\n",
"## Build your model\n",
"\n",
"Choose the variables you want to include in your classification model and divide the data into training and testing sets. [rsample](https://rsample.tidymodels.org/), a package within Tidymodels, offers tools for efficient data splitting and resampling:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Split data into 80% for training and 20% for testing\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>% \n",
" slice_head(n = 5)\n"
]
},
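{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the split, you can compare the row counts of the two sets; with `prop = 0.8` they should come out to roughly 80% and 20% of the original data:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Compare the sizes of the training and test sets\n",
"cat(\"Training rows:\", nrow(pumpkins_train), \"\\n\")\n",
"cat(\"Test rows    :\", nrow(pumpkins_test), \"\\n\")\n"
]
},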
{
"cell_type": "markdown",
"metadata": {},
"source": [
"🙌 We are now ready to train a model by fitting the training features to the training label (color).\n",
"\n",
"We'll start by creating a recipe that outlines the preprocessing steps needed to prepare our data for modeling, such as encoding categorical variables into integers. Similar to `baked_pumpkins`, we create a `pumpkins_recipe` but do not `prep` and `bake` it, as these steps will be incorporated into a workflow, which you'll see in just a few steps.\n",
"\n",
"There are several ways to define a logistic regression model in Tidymodels. Check out `?logistic_reg()` for more details. For now, we'll specify a logistic regression model using the default `stats::glm()` engine.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n",
" step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
" step_integer(item_size, zero_based = F) %>% \n",
" step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n",
"\n",
"# Create a logistic model specification\n",
"log_reg <- logistic_reg() %>% \n",
" set_engine(\"glm\") %>% \n",
" set_mode(\"classification\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a recipe and a model specification, we need a way to combine them into an object that will preprocess the data (prep + bake behind the scenes), fit the model using the preprocessed data, and also enable any potential post-processing steps.\n",
"\n",
"In Tidymodels, this handy object is called a [`workflow`](https://workflows.tidymodels.org/) and it conveniently organizes your modeling components.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Bundle modelling components in a workflow\n",
"log_reg_wf <- workflow() %>% \n",
" add_recipe(pumpkins_recipe) %>% \n",
" add_model(log_reg)\n",
"\n",
"# Print out the workflow\n",
"log_reg_wf\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After a workflow has been *defined*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow will estimate a recipe and preprocess the data before training, so we don't need to manually handle that with prep and bake.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Train the model\n",
"wf_fit <- log_reg_wf %>% \n",
" fit(data = pumpkins_train)\n",
"\n",
"# Print the trained workflow\n",
"wf_fit\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model printout shows the coefficients learned during training.\n",
"\n",
"Now that we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's begin by using the model to predict the labels for our test set and the probabilities for each label. If the probability is greater than 0.5, the predicted class is `WHITE`; otherwise, it is `ORANGE`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make predictions for color and corresponding probabilities\n",
"results <- pumpkins_test %>% select(color) %>% \n",
" bind_cols(wf_fit %>% \n",
" predict(new_data = pumpkins_test)) %>%\n",
" bind_cols(wf_fit %>%\n",
" predict(new_data = pumpkins_test, type = \"prob\"))\n",
"\n",
"# Compare predictions\n",
"results %>% \n",
" slice_head(n = 10)\n"
]
},
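{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the 0.5 threshold at work, here is a minimal sketch that recreates the class prediction directly from the probability column and checks that it matches `.pred_class`. It assumes the probability columns are named `.pred_ORANGE` and `.pred_WHITE`, as in the output above:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Recreate the class prediction from the probability of WHITE using a 0.5 threshold\n",
"results %>% \n",
" mutate(manual_pred = if_else(.pred_WHITE > 0.5, \"WHITE\", \"ORANGE\")) %>% \n",
" summarise(all_match = all(manual_pred == as.character(.pred_class)))\n"
]
},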
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Very nice! This provides some additional insights into how logistic regression operates.\n",
"\n",
"### Better understanding through a confusion matrix\n",
"\n",
"Comparing each prediction with its corresponding \"ground truth\" actual value isn't the most efficient way to evaluate how well the model is performing. Luckily, Tidymodels offers some additional tools: [`yardstick`](https://yardstick.tidymodels.org/) - a package designed to assess model performance using various metrics.\n",
"\n",
"One commonly used metric for classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix provides a summary of how well a classification model performs. It shows how many examples in each class were correctly classified by the model. In our case, it will indicate how many orange pumpkins were correctly identified as orange and how many white pumpkins were correctly identified as white. Additionally, the confusion matrix highlights how many examples were misclassified into the **wrong** categories.\n",
"\n",
"The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick computes this cross-tabulation of observed and predicted classes.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Confusion matrix for prediction results\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's interpret the confusion matrix. Our model is tasked with classifying pumpkins into two binary categories: `white` and `not-white`.\n",
"\n",
"- If your model predicts a pumpkin as white and it actually belongs to the 'white' category, we call it a `true positive`, represented by the top left number.\n",
"\n",
"- If your model predicts a pumpkin as not white and it actually belongs to the 'white' category, we call it a `false negative`, represented by the bottom left number.\n",
"\n",
"- If your model predicts a pumpkin as white and it actually belongs to the 'not-white' category, we call it a `false positive`, represented by the top right number.\n",
"\n",
"- If your model predicts a pumpkin as not white and it actually belongs to the 'not-white' category, we call it a `true negative`, represented by the bottom right number.\n",
"\n",
"| Truth |\n",
"|:-----:|\n",
"\n",
"\n",
"| | | |\n",
"|---------------|--------|-------|\n",
"| **Predicted** | WHITE | ORANGE |\n",
"| WHITE | TP | FP |\n",
"| ORANGE | FN | TN |\n",
"\n",
"As you might have guessed, it's preferable to have a higher number of true positives and true negatives, and a lower number of false positives and false negatives, as this indicates better model performance.\n",
"\n",
"The confusion matrix is useful because it leads to other metrics that help us evaluate the performance of a classification model more effectively. Let's go through some of them:\n",
"\n",
"🎓 Precision: `TP/(TP + FP)` Defined as the proportion of predicted positives that are actually positive. Also referred to as [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\").\n",
"\n",
"🎓 Recall: `TP/(TP + FN)` Defined as the proportion of positive results out of the total number of samples that were actually positive. Also known as `sensitivity`.\n",
"\n",
"🎓 Specificity: `TN/(TN + FP)` Defined as the proportion of negative results out of the total number of samples that were actually negative.\n",
"\n",
"🎓 Accuracy: `(TP + TN)/(TP + TN + FP + FN)` The percentage of labels correctly predicted for a sample.\n",
"\n",
"🎓 F Measure: A weighted average of precision and recall, where the best value is 1 and the worst is 0.\n",
"\n",
"Let's calculate these metrics!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Combine metric functions and calculate them all at once\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)\n"
]
},
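{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check, the sketch below recomputes the same metrics by hand from the confusion matrix counts, using the formulas above. It assumes yardstick's default of treating the first factor level (`ORANGE`) as the positive class:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Pull the raw counts out of the confusion matrix (rows = prediction, columns = truth)\n",
"cm <- conf_mat(data = results, truth = color, estimate = .pred_class)$table\n",
"tp <- cm[\"ORANGE\", \"ORANGE\"]\n",
"fp <- cm[\"ORANGE\", \"WHITE\"]\n",
"fn <- cm[\"WHITE\", \"ORANGE\"]\n",
"tn <- cm[\"WHITE\", \"WHITE\"]\n",
"\n",
"# Apply the formulas from the lesson text\n",
"tibble(\n",
" precision = tp / (tp + fp),\n",
" recall = tp / (tp + fn),\n",
" specificity = tn / (tn + fp),\n",
" accuracy = (tp + tn) / (tp + tn + fp + fn)\n",
")\n"
]
},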
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the ROC curve of this model\n",
"\n",
"Let's create another visualization to examine the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make a roc_curve\n",
"results %>% \n",
" roc_curve(color, .pred_ORANGE) %>% \n",
" autoplot()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC curves are often used to visualize the performance of a classifier by comparing its true positives to false positives. Typically, ROC curves display `True Positive Rate`/Sensitivity on the Y-axis and `False Positive Rate`/1-Specificity on the X-axis. Therefore, the steepness of the curve and the distance between the diagonal line and the curve are important: you want a curve that rises sharply and moves away from the diagonal. In our case, there are some false positives at the beginning, but the curve eventually rises and moves away as expected.\n",
"\n",
"Finally, let's use `yardstick::roc_auc()` to compute the actual Area Under the Curve. One way to interpret AUC is as the probability that the model assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Calculate area under curve\n",
"results %>% \n",
" roc_auc(color, .pred_ORANGE)\n"
]
},
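{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that interpretation concrete, here is a small sketch (assuming the column names used above) that estimates the AUC directly as the proportion of positive/negative pairs the model ranks correctly, counting ties as half. It should agree with `roc_auc()` up to rounding:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Scores for actual positives (ORANGE, the first factor level) and negatives (WHITE)\n",
"pos_scores <- results %>% filter(color == \"ORANGE\") %>% pull(.pred_ORANGE)\n",
"neg_scores <- results %>% filter(color == \"WHITE\") %>% pull(.pred_ORANGE)\n",
"\n",
"# Proportion of positive/negative pairs ranked correctly (ties count as 0.5)\n",
"mean(outer(pos_scores, neg_scores, \">\") + 0.5 * outer(pos_scores, neg_scores, \"==\"))\n"
]
},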
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is around `0.975`. Since the AUC ranges from 0 to 1, you want a high score, as a model that is 100% accurate in its predictions will have an AUC of 1. In this case, the model is *pretty good*.\n",
"\n",
"In future lessons on classification, you will learn how to improve your model's scores (such as addressing imbalanced data in this scenario).\n",
"\n",
"## 🚀Challenge\n",
"\n",
"There's a lot more to explore about logistic regression! But the best way to learn is by experimenting. Find a dataset suitable for this type of analysis and build a model with it. What insights do you gain? Tip: check out [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.\n",
"\n",
"## Review & Self Study\n",
"\n",
"Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) to learn about some practical applications of logistic regression. Reflect on tasks that are better suited for one type of regression versus another, based on what we've studied so far. Which approach would work best?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.\n"
]
}
],
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"langauge": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"coopTranslator": {
"original_hash": "feaf125f481a89c468fa115bf2aed580",
"translation_date": "2025-09-06T15:30:55+00:00",
"source_file": "2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb",
"language_code": "en"
}
},
"nbformat": 4,
"nbformat_minor": 1
}