diff --git a/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb b/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb index bb241334..38741c33 100644 --- a/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb +++ b/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb @@ -12,13 +12,13 @@ "\n", "#### Introduction\n", "\n", - "In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict `binary` `categories`. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n", + "In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n", "\n", "In this lesson, you will learn:\n", "\n", "- Techniques for logistic regression\n", "\n", - "✅ Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-77952-leestott)\n", + "✅ Deepen your understanding of working with this type of regression in this [Learn module](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott)\n", "\n", "#### **Prerequisite**\n", "\n", @@ -48,7 +48,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n", @@ -74,7 +78,17 @@ "\n", "Logistic regression does not offer the same features as linear regression. 
The former offers a prediction about a `binary category` (\"orange or not orange\") whereas the latter is capable of predicting `continuous values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n", "\n", - "![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png){width=\"600\"}\n", + "![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png)\n", + "\n", + "### Other classifications\n", + "\n", + "There are other types of logistic regression, including multinomial and ordinal:\n", + "\n", + "- **Multinomial**, which involves more than two categories - \"Orange, White, and Striped\".\n", + "\n", + "- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).\n", + "\n", + "![Multinomial vs ordinal regression](./images/multinomial-vs-ordinal.png)\n", "\n", "#### **Variables DO NOT have to correlate**\n", "\n", @@ -86,15 +100,21 @@ "\n", "✅ Think about the types of data that would lend themselves well to logistic regression\n", "\n", - "## 1. Tidy the data\n", + "## Exercise - tidy the data\n", + "\n", + "First, clean the data a bit, dropping null values and selecting only some of the columns:\n", "\n", - "Now, the fun begins! Let's start by importing the data, cleaning the data a bit, dropping rows containing missing values and selecting only some of the columns:\n" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Load the core tidyverse packages\n", @@ -122,14 +142,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Sometimes, we may want some little more information on our data. 
We can have a look at the `data`, `its structure` and the `data type` of its features by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below:\n", + "You can always take a peek at your new dataframe by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "pumpkins_select %>% \n", " glimpse()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Wow! Seems that all our columns are all of type *character*, further alluding that they are all categorical.\n", - "\n", "Let's confirm that we will actually be doing a binary classification problem:\n" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Subset distinct observations in outcome column\n", "pumpkins_select %>% \n", " distinct(color)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "🥳🥳 That went down well!\n", + "### Visualization - categorical plot\n", + "By now you have loaded up the pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including Color. Let's visualize the dataframe in the notebook using the ggplot2 library.\n", "\n", - "## 2. Explore the data\n", + "The ggplot2 library offers some neat ways to visualize your data. For example, you can compare distributions of the data for each Variety and Color in a categorical plot.\n", "\n", - "The goal of data exploration is to try to understand the `relationships` between its attributes; in particular, any apparent correlation between the *features* and the *label* your model will try to predict. One way of doing this is by using data visualization.\n", + "1. 
Create such a plot by using the geom_bar function, using our pumpkin data, and specifying a color mapping for each pumpkin category (orange or white):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Specify colors for each value of the hue variable\n", "palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n", "\n", "# Create the bar plot\n", "ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n", " geom_bar(position = \"dodge\") +\n", " scale_fill_manual(values = palette) +\n", " labs(y = \"Variety\", fill = \"Color\") +\n", " theme_minimal()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By observing the data, you can see how the Color data relates to Variety.\n", "\n", "✅ Given this categorical plot, what are some interesting explorations you can envision?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data pre-processing: feature encoding\n", "\n", - "Given our the data types of our columns, we can `encode` them and be on our way to making some visualizations. This simply involves `translating` a column with `categorical values` for example our columns of type *char*, into one or more `numeric columns` that take the place of the original. - Something we did in our [last lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/3-Linear/solution/lesson_3.html).\n", + "Our pumpkins dataset contains string values for all its columns. Working with categorical data is intuitive for humans but not for machines. Machine learning algorithms work well with numbers. That's why encoding is a very important step in the data pre-processing phase, since it enables us to turn categorical data into numerical data, without losing any information. 
Good encoding leads to building a good model.\n", "\n", "For feature encoding there are two main types of encoders:\n", "\n", @@ -184,7 +248,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Preprocess and extract data to allow some data analysis\n", @@ -207,40 +275,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, let's make a categorical plot showing the distribution of the predictors with respect to the outcome color!\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Specify colors for each value of the hue variable\n", - "palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n", + "✅ What are the advantages of using an ordinal encoder for the Item Size column?\n", "\n", - "# Create the bar plot\n", - "ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n", - " geom_bar(position = \"dodge\") +\n", - " scale_fill_manual(values = palette) +\n", - " labs(y = \"Variety\", fill = \"Color\") +\n", - " theme_minimal()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.\n", + "### Analyse relationships between variables\n", "\n", - "### **Analysing relationships between features and label**\n" + "Now that we have pre-processed our data, we can analyse the relationships between the features and the label to grasp an idea of how well the model will be able to predict the label given the features. The best way to perform this kind of analysis is plotting the data. 
\n", + "We'll be using again the ggplot geom_boxplot_ function, to visualize the relationships between Item Size, Variety and Color in a categorical plot. To better plot the data we'll be using the encoded Item Size column and the unencoded Variety column.\n" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Define the color palette\n", @@ -271,11 +321,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's now focus on a specific relationship: Item Size and Color!\n", - "\n", - "#### **Use a swarm plot**\n", + "#### Use a swarm plot\n", "\n", - "Color is a binary category (Orange or Not), it's called `categorical data`. There are other various ways of [visualizing categorical data](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar).\n", + "Since Color is a binary category (White or Not), it needs 'a [specialized approach](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf) to visualization'.\n", "\n", "Try a `swarm plot` to show the distribution of color with respect to the item_size.\n", "\n", @@ -285,7 +333,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Create beeswarm plots of color and item_size\n", @@ -303,17 +355,19 @@ "source": [ "Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n", "\n", - "## 3. Build your model\n", - "\n", - "Let's begin by splitting the data into `training` and `test` sets. 
The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.\n", + "## Build your model\n", "\n", - "It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:\n" + "Select the variables you want to use in your classification model and split the data into training and test sets. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:\n" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Split data into 80% for training and 20% for testing\n", @@ -344,7 +398,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Create a recipe that specifies preprocessing steps for modelling\n", @@ -371,7 +429,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Bundle modelling components in a workflow\n", @@ -394,7 +456,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Train the model\n", @@ -417,7 +483,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Make predictions for color and corresponding probabilities\n", @@ -438,6 +508,8 @@ "source": [ "Very nice! 
This provides some more insights into how logistic regression works.\n", "\n", + "### Better comprehension via a confusion matrix\n", + "\n", "Comparing each prediction with its corresponding \"ground truth\" actual value isn't a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics.\n", "\n", "One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs. A confusion matrix tabulates how many examples in each class were correctly classified by a model. In our case, it will show you how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; the confusion matrix also shows you how many were classified into the **wrong** categories.\n", @@ -448,7 +520,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Confusion matrix for prediction results\n", @@ -499,7 +575,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Combine metric functions and calculate them all at once\n", @@ -511,17 +591,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### **Visualize the ROC curve of this model**\n", + "## Visualize the ROC curve of this model\n", "\n", - "For a start, this is not a bad model; its precision, recall, F measure and accuracy are in the 90% range so ideally you could use it to predict the color of a pumpkin given a set of variables. It also seems that our model was not really able to identify the white pumpkins 🧐. Could you guess why? 
One reason could be because of the high prevalence of ORANGE pumpkins in our training set making our model more inclined to predict the majority class.\n", - "\n", - "Let's do one more visualization to see the so-called [`ROC score`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n" + "Let's do one more visualization to see the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Make a roc_curve\n", @@ -542,7 +624,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "vscode": { + "languageId": "r" + } + }, "outputs": [], "source": [ "# Calculate area under curve\n", @@ -554,15 +640,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The result is around `0.947`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\n", + "The result is around `0.975`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\n", "\n", "In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\n", "\n", - "But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!\n", + "## 🚀Challenge\n", + "\n", + "There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? 
Tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.\n", "\n", - "You R awesome!\n", + "## Review & Self Study\n", "\n", - "![Artwork by \\@allison_horst](../../images/r_learners_sm.jpeg)\n" + "Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about which tasks are better suited to each of the types of regression we have studied up to this point. What would work best?\n" ] } ],