From f0c7f74b46516d4f480756ff9188b7206ba2da16 Mon Sep 17 00:00:00 2001
From: Vidushi Gupta <55969597+Vidushi-Gupta@users.noreply.github.com>
Date: Thu, 15 Jun 2023 15:29:19 +0530
Subject: [PATCH] Added html file

---
 .../2-Classifiers-1/solution/R/lesson_11.html | 3560 +++++++++++++++++
 1 file changed, 3560 insertions(+)
 create mode 100644 4-Classification/2-Classifiers-1/solution/R/lesson_11.html

diff --git a/4-Classification/2-Classifiers-1/solution/R/lesson_11.html b/4-Classification/2-Classifiers-1/solution/R/lesson_11.html
new file mode 100644
index 00000000..8a6c783a
--- /dev/null
+++ b/4-Classification/2-Classifiers-1/solution/R/lesson_11.html
@@ -0,0 +1,3560 @@
+Build a classification model: Delicious Asian and Indian Cuisines
+ + + +
+
+
+
+
+ +
+ + + + + + + +
+

Cuisine classifiers 1

+

In this lesson, we’ll explore a variety of classifiers to predict +a given national cuisine based on a group of ingredients. While +doing so, we’ll learn more about some of the ways that algorithms can be +leveraged for classification tasks.

+ +
+

Preparation

+

This lesson builds on our previous +lesson, where we:

+
    +
  • Gave a gentle introduction to classification using a dataset +about all the brilliant cuisines of Asia and India 😋.

  • +
  • Explored some dplyr +verbs to prep and clean our data.

  • +
  • Made beautiful visualizations using ggplot2.

  • +
  • Demonstrated how to deal with imbalanced data by preprocessing it +using recipes.

  • +
  • Demonstrated how to prep and bake our +recipe to confirm that it works as it is supposed to.

  • +
+
+

Prerequisite

+

For this lesson, we’ll require the following packages to clean, prep +and visualize our data:

+
    +
  • tidyverse: The tidyverse is a collection of R packages +designed to make data science faster, easier and more fun!

  • +
  • tidymodels: The tidymodels framework is a collection of packages +for modeling and machine learning.

  • +
  • DataExplorer: The DataExplorer +package is meant to simplify and automate the EDA process and report +generation.

  • +
  • themis: The themis package provides Extra +Recipes Steps for Dealing with Unbalanced Data.

  • +
  • nnet: The nnet +package provides functions for estimating feed-forward neural +networks with a single hidden layer, and for multinomial logistic +regression models.

  • +
+

You can install them with:

+

install.packages(c("tidyverse", "tidymodels", "DataExplorer", "themis", "nnet", "here"))

+

Alternatively, the script below checks whether you have the packages +required to complete this module and installs them for you in case they +are missing.

+
suppressWarnings(if (!require("pacman"))install.packages("pacman"))
+
+pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)
+

Now, let’s hit the ground running!

+
+
+
+
+

1. Split the data into training and test sets.

+

We’ll start by picking a few steps from our previous lesson.

+
+

Drop the most common ingredients that create confusion between +distinct cuisines, using dplyr::select().

+

Everyone loves rice, garlic and ginger!

+
# Load the original cuisines data
+df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")
+
## New names:
+## Rows: 2448 Columns: 385
+## ── Column specification
+## ──────────────────────────────────────────────────────── Delimiter: "," chr
+## (1): cuisine dbl (384): ...1, almond, angelica, anise, anise_seed, apple,
+## apple_brandy, a...
+## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
+## Specify the column types or set `show_col_types = FALSE` to quiet this message.
+## • `` -> `...1`
+
# Drop id column, rice, garlic and ginger from our original data set
+df_select <- df %>% 
+  select(-c(1, rice, garlic, ginger)) %>%
+  # Encode cuisine column as categorical
+  mutate(cuisine = factor(cuisine))
+
+# Display new data set
+df_select %>% 
+  slice_head(n = 5)
+
+ +
+
# Display distribution of cuisines
+df_select %>% 
+  count(cuisine) %>% 
+  arrange(desc(n))
+
+ +
+

Perfect! Now, time to split the data such that 70% of the data goes +to training and 30% goes to testing. We’ll also apply a +stratification technique when splitting the data to +maintain the proportion of each cuisine in the training and +test datasets.

+

rsample, a package in +Tidymodels, provides infrastructure for efficient data splitting and +resampling:

+
# Load the core Tidymodels packages into R session
+library(tidymodels)
+
+# Create split specification
+set.seed(2056)
+cuisines_split <- initial_split(data = df_select,
+                                strata = cuisine,
+                                prop = 0.7)
+
+# Extract the data in each split
+cuisines_train <- training(cuisines_split)
+cuisines_test <- testing(cuisines_split)
+
+# Print the number of cases in each split
+cat("Training cases: ", nrow(cuisines_train), "\n",
+    "Test cases: ", nrow(cuisines_test), sep = "")
+
## Training cases: 1712
+## Test cases: 736
+
# Display the first few rows of the training set
+cuisines_train %>% 
+  slice_head(n = 5)
+
+ +
+
# Display distribution of cuisines in the training set
+cuisines_train %>% 
+  count(cuisine) %>% 
+  arrange(desc(n))
+
+ +
+
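To see what stratification buys us, here is a minimal base R sketch on a made-up, imbalanced label column. The `toy` data frame and its class counts are hypothetical and are not the cuisines data. The sketch samples 70% of the rows within each class, which is the idea behind `strata = cuisine`; it is not meant to mirror how rsample is implemented internally.

```r
# Toy, imbalanced labels (hypothetical data, not the cuisines set)
set.seed(2056)
toy <- data.frame(
  id = 1:100,
  cuisine = factor(rep(c("korean", "indian", "thai"), times = c(50, 30, 20)))
)

# Stratified 70/30 split: draw 70% of the row ids within each class
train_idx <- unlist(lapply(split(toy$id, toy$cuisine), function(ids) {
  sample(ids, size = round(0.7 * length(ids)))
}))

toy_train <- toy[toy$id %in% train_idx, ]
toy_test  <- toy[!toy$id %in% train_idx, ]

# Class proportions in each partition
prop.table(table(toy_train$cuisine))
prop.table(table(toy_test$cuisine))
```

Both tables should show the same class proportions, which is the property we want the cuisines split to have as well.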
+
+
+

2. Deal with imbalanced data

+

As you might have noticed in the original data set as well as in our +training set, the cuisines are quite unequally represented: there are +almost 3 times as many Korean recipes as Thai ones. +Imbalanced data often has negative effects on model performance. +Many models perform best when each class has a similar number of observations and, +thus, tend to struggle with unbalanced data.

+

There are two main ways of dealing with imbalanced data sets:

+
    +
  • adding observations to the minority class: +Over-sampling, e.g. using the SMOTE algorithm, which +synthetically generates new examples of the minority class using the nearest +neighbors of these cases.

  • +
  • removing observations from the majority class: +Under-sampling

  • +
+
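The over-sampling idea can be sketched in a few lines of base R. This is a simplified illustration of SMOTE-style interpolation under stated assumptions: the `minority` points and the helper `smote_one()` are made up for this example, and the code is not the implementation that themis uses.

```r
set.seed(2056)

# Hypothetical minority-class observations in a 2-D feature space
minority <- matrix(c(1.0, 1.0,
                     1.2, 0.9,
                     0.8, 1.3,
                     1.1, 1.4), ncol = 2, byrow = TRUE)

# Build one synthetic point from row i and one of its k nearest neighbors
smote_one <- function(x, i, k = 2) {
  centre <- matrix(x[i, ], nrow(x), ncol(x), byrow = TRUE)
  d <- sqrt(rowSums((x - centre)^2))   # Euclidean distance to every row
  d[i] <- Inf                          # exclude the point itself
  neighbors <- order(d)[seq_len(k)]    # indices of the k closest rows
  j <- neighbors[sample.int(k, 1)]     # pick one neighbor at random
  gap <- runif(1)                      # random position along the segment
  x[i, ] + gap * (x[j, ] - x[i, ])     # interpolate between the two points
}

# One synthetic observation per original minority row
synthetic <- t(sapply(seq_len(nrow(minority)), function(i) smote_one(minority, i)))
synthetic
```

Every synthetic row lies on a line segment between two real minority points, so the new examples stay inside the region the minority class already occupies.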

In our previous lesson, we demonstrated how to deal with imbalanced +data sets using a recipe. A recipe can be thought of as a +blueprint that describes what steps should be applied to a data set in +order to get it ready for data analysis. In our case, we want every +cuisine to be equally represented in our +training set. Let’s get right into it.

+
# Load themis package for dealing with imbalanced data
+library(themis)
+
+# Create a recipe for preprocessing training data
+cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>% 
+  step_smote(cuisine)
+
+# Print recipe
+cuisines_recipe
+
## 
+
## ── Recipe ──────────────────────────────────────────────────────────────────────
+
## 
+
## ── Inputs
+
## Number of variables by role
+
## outcome:     1
+## predictor: 380
+
## 
+
## ── Operations
+
## • SMOTE based on: cuisine
+

You can of course go ahead and confirm (using prep+bake) that the +recipe works as you expect: every cuisine label should end up with +559 observations.

+
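If you would like to try the prep + bake check on something small first, the sketch below applies the same kind of recipe to a deliberately imbalanced subset of R's built-in `iris` data. The `toy` subset and the object names are assumptions made purely for illustration. With the default settings of `step_smote()`, each class should be brought up to the size of the largest class.

```r
library(tidymodels)
library(themis)

set.seed(2056)

# Imbalanced toy data: 50 setosa, 30 versicolor, 15 virginica
toy <- iris[c(1:50, 51:80, 101:115), ]

# Same shape of recipe as in the lesson, on the toy outcome
toy_recipe <- recipe(Species ~ ., data = toy) %>%
  step_smote(Species)

# prep() estimates the recipe; bake(new_data = NULL) returns the
# processed training data, including the synthetic rows
balanced <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# Class counts before and after
count(toy, Species)
count(balanced, Species)
```

The same two calls, `prep()` followed by `bake(new_data = NULL)`, can be applied to the lesson's own recipe to inspect the balanced training data.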

Since we’ll be using this recipe as a preprocessor for modeling, a +workflow() will do all the prep and bake for us, so we +won’t have to manually estimate the recipe.

+

Now we are ready to train a model 👩‍💻👨‍💻!

+
+
+

3. Choosing your classifier

+
+
Artwork by @allison_horst
+
+

Now we have to decide which algorithm to use for the job 🤔.

+

In Tidymodels, the parsnip package +provides a consistent interface for working with models across different +engines (packages). Please see the parsnip documentation to explore model types & +engines and their corresponding model +arguments. The variety is quite bewildering at first sight. For +instance, the following methods all include classification +techniques:

+
    +
  • C5.0 Rule-Based Classification Models

  • +
  • Flexible Discriminant Models

  • +
  • Linear Discriminant Models

  • +
  • Regularized Discriminant Models

  • +
  • Logistic Regression Models

  • +
  • Multinomial Regression Models

  • +
  • Naive Bayes Models

  • +
  • Support Vector Machines

  • +
  • Nearest Neighbors

  • +
  • Decision Trees

  • +
  • Ensemble methods

  • +
  • Neural Networks

  • +
+

The list goes on!

+
+

What classifier to go with?

+

So, which classifier should you choose? Often, running through +several and looking for a good result is a way to test.

+
+

AutoML solves this problem neatly by running these comparisons in the +cloud, allowing you to choose the best algorithm for your data. Try it +here

+
+

The choice of classifier also depends on the problem. For instance, +when the outcome can be categorized into +more than two classes, like in our case, you must use a +multiclass classification algorithm as opposed to +binary classification.

+
+
+

A better approach

+

A better way than wildly guessing, however, is to follow the ideas on +this downloadable ML +Cheat sheet. Here, we discover that, for our multiclass problem, we +have some choices:

+
+
A section of Microsoft’s Algorithm Cheat Sheet, +detailing multiclass classification options
+
+
+
+

Reasoning

+

Let’s see if we can reason our way through different approaches given +the constraints we have:

+
    +
  • Deep Neural networks are too heavy. Given our +clean, but minimal dataset, and the fact that we are running training +locally via notebooks, deep neural networks are too heavyweight for this +task.

  • +
  • No two-class classifier. We do not use a +two-class classifier, so that rules out one-vs-all.

  • +
  • Decision tree or logistic regression could work. +A decision tree might work, or multinomial regression/multiclass +logistic regression for multiclass data.

  • +
  • Multiclass Boosted Decision Trees solve a different +problem. The multiclass boosted decision tree is most suitable +for nonparametric tasks, e.g. tasks designed to build rankings, so it is +not useful for us.

  • +
+

Also, before embarking on more complex machine learning +models, e.g. ensemble methods, it’s normally a good idea to build the simplest +possible model to get an idea of what is going on. So for this lesson, +we’ll start with a multinomial logistic regression +model.

+
+

Logistic regression is a technique used when the outcome variable is +categorical (or nominal). For binary logistic regression the outcome +variable has two categories, whereas for +multinomial logistic regression it has more than two. See Advanced +Regression Methods for further reading.

+
+
+
+
+

4. Train and evaluate a Multinomial logistic regression model.

+

In Tidymodels, parsnip::multinom_reg() defines a model +that uses linear predictors to predict multiclass data using the +multinomial distribution. See ?multinom_reg() for the +different ways/engines you can use to fit this model.

+

For this example, we’ll fit a Multinomial regression model via the +default nnet +engine.

+
+

I picked a value for penalty sort of randomly. There are +better ways to choose this value, namely by using +resampling and tuning the model, which we’ll +discuss later.

+

See Tidymodels: +Get Started in case you want to learn more on how to tune model +hyperparameters.

+
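As a rough preview of what tuning could look like, here is a minimal sketch that searches over a few `penalty` values with 5-fold cross-validation. It runs on R's built-in `iris` data so that it finishes quickly. The data set, the four-value grid and the object names are all assumptions chosen for illustration, not the settings this lesson would use for the cuisines data.

```r
library(tidymodels)

set.seed(2056)

# Model specification with penalty marked for tuning
tune_spec <- multinom_reg(penalty = tune()) %>%
  set_engine("nnet") %>%
  set_mode("classification")

# Bundle a simple formula preprocessor with the model
tune_wf <- workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(tune_spec)

# 5-fold cross-validation, stratified by the outcome
folds <- vfold_cv(iris, v = 5, strata = Species)

# A small, hypothetical grid of candidate penalties
penalty_grid <- tibble(penalty = 10^seq(-3, 0, length.out = 4))

# Fit every candidate on every fold and record accuracy
tune_res <- tune_grid(
  tune_wf,
  resamples = folds,
  grid = penalty_grid,
  metrics = metric_set(accuracy)
)

# Candidate with the best mean accuracy across folds
show_best(tune_res, metric = "accuracy", n = 1)
```

For the cuisines data the same pattern would apply with the lesson's recipe in place of the formula, although with 380 predictors each fit takes noticeably longer.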
+
# Create a multinomial regression model specification
+mr_spec <- multinom_reg(penalty = 1) %>% 
+  set_engine("nnet", MaxNWts = 2086) %>% 
+  set_mode("classification")
+
+# Print model specification
+mr_spec
+
## Multinomial Regression Model Specification (classification)
+## 
+## Main Arguments:
+##   penalty = 1
+## 
+## Engine-Specific Arguments:
+##   MaxNWts = 2086
+## 
+## Computational engine: nnet
+

Great job 🥳! Now that we have a recipe and a model specification, we +need to find a way of bundling them together into an object that will +first preprocess the data, then fit the model on the preprocessed data, +and also allow for potential post-processing activities. In Tidymodels, +this object is called a workflow and +conveniently holds your modeling components! This is what we’d call +a pipeline in Python.

+

So let’s bundle everything up into a workflow!📦

+
# Bundle recipe and model specification
+mr_wf <- workflow() %>% 
+  add_recipe(cuisines_recipe) %>% 
+  add_model(mr_spec)
+
+# Print out workflow
+mr_wf
+
## ══ Workflow ════════════════════════════════════════════════════════════════════
+## Preprocessor: Recipe
+## Model: multinom_reg()
+## 
+## ── Preprocessor ────────────────────────────────────────────────────────────────
+## 1 Recipe Step
+## 
+## • step_smote()
+## 
+## ── Model ───────────────────────────────────────────────────────────────────────
+## Multinomial Regression Model Specification (classification)
+## 
+## Main Arguments:
+##   penalty = 1
+## 
+## Engine-Specific Arguments:
+##   MaxNWts = 2086
+## 
+## Computational engine: nnet
+

Workflows 👌👌! A workflow() can be fit +in much the same way a model can. So, time to train a model!

+
# Train a multinomial regression model
+mr_fit <- fit(object = mr_wf, data = cuisines_train)
+
+mr_fit
+
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
+## Preprocessor: Recipe
+## Model: multinom_reg()
+## 
+## ── Preprocessor ────────────────────────────────────────────────────────────────
+## 1 Recipe Step
+## 
+## • step_smote()
+## 
+## ── Model ───────────────────────────────────────────────────────────────────────
+## Call:
+## nnet::multinom(formula = ..y ~ ., data = data, decay = ~1, MaxNWts = ~2086, 
+##     trace = FALSE)
+## 
+## Coefficients:
+##          (Intercept)     almond angelica         anise anise_seed       apple
+## indian    0.19723325  0.2409661        0 -5.004955e-05 -0.1657635 -0.05769734
+## japanese  0.13961959 -0.6262400        0 -1.169155e-04 -0.4893596 -0.08585717
+## korean    0.22377347 -0.1833485        0 -5.560395e-05 -0.2489401 -0.15657804
+## thai     -0.04336577 -0.6106258        0  4.903828e-04 -0.5782866  0.63451105
+##          apple_brandy     apricot armagnac   artemisia artichoke   asparagus
+## indian              0  0.37042636        0 -0.09122797         0 -0.27181970
+## japanese            0  0.28895643        0 -0.12651100         0  0.14054037
+## korean              0 -0.07981259        0  0.55756709         0 -0.66979948
+## thai                0 -0.33160904        0 -0.10725182         0 -0.02602152
+##              avocado       bacon baked_potato balm     banana     barley
+## indian   -0.46624197  0.16008055            0    0 -0.2838796  0.2230625
+## japanese  0.90341344  0.02932727            0    0 -0.4142787  2.0953906
+## korean   -0.06925382 -0.35804134            0    0 -0.2686963 -0.7233404
+## thai     -0.21473955 -0.75594439            0    0  0.6784880 -0.4363320
+##          bartlett_pear      basil        bay       bean         beech
+## indian               0 -0.7128756  0.1011587 -0.8777275 -0.0004380795
+## japanese             0  0.1288697  0.9425626 -0.2380748  0.3373437611
+## korean               0 -0.2445193 -0.4744318 -0.8957870 -0.0048784496
+## thai                 0  1.5365848  0.1333256  0.2196970 -0.0113078024
+##                beef beef_broth   beef_liver         beer        beet
+## indian   -0.7985278  0.2430186 -0.035598065 -0.002173738  0.01005813
+## japanese  0.2241875 -0.3653020 -0.139551027  0.128905553  0.04923911
+## korean    0.5366515 -0.6153237  0.213455197 -0.010828645  0.27325423
+## thai      0.1570012 -0.9364154 -0.008032213 -0.035063746 -0.28279823
+##          bell_pepper bergamot       berry bitter_orange black_bean
+## indian    0.49074330        0  0.58947607   0.191256164 -0.1945233
+## japanese  0.09074167        0 -0.25917977  -0.118915977 -0.3442400
+## korean   -0.57876763        0 -0.07874180  -0.007729435 -0.5220672
+## thai      0.92554006        0 -0.07210196  -0.002983296 -0.4614426
+##          black_currant black_mustard_seed_oil black_pepper black_raspberry
+## indian               0             0.38935801   -0.4453495               0
+## japanese             0            -0.05452887   -0.5440869               0
+## korean               0            -0.03929970    0.8025454               0
+## thai                 0            -0.21498372   -0.9854806               0
+##          black_sesame_seed  black_tea   blackberry blackberry_brandy
+## indian          -0.2759246  0.3079977  0.191256164                 0
+## japanese        -0.6101687 -0.1671913 -0.118915977                 0
+## korean           1.5197674 -0.3036261 -0.007729435                 0
+## thai            -0.1755656 -0.1487033 -0.002983296                 0
+##          blue_cheese    blueberry   bone_oil bourbon_whiskey      brandy
+## indian             0  0.216164294 -0.2276744               0  0.22427587
+## japanese           0 -0.119186087  0.3913019               0 -0.15595599
+## korean             0 -0.007821986  0.2854487               0 -0.02562342
+## thai               0 -0.004947048 -0.0253658               0 -0.05715244
+## 
+## ...
+## and 308 more lines.
+

The output shows the coefficients that the model learned during +training.

+
+

Evaluate the Trained Model

+

It’s time to see how the model performed 📏 by evaluating it on a +test set! Let’s begin by making predictions on the test set.

+
# Make predictions on the test set
+results <- cuisines_test %>% select(cuisine) %>% 
+  bind_cols(mr_fit %>% predict(new_data = cuisines_test))
+
+# Print out results
+results %>% 
+  slice_head(n = 5)
+
+ +
+

Great job! In Tidymodels, evaluating model performance can be done +using yardstick - a +package used to measure the effectiveness of models using performance +metrics. As we did in our logistic regression lesson, let’s begin by +computing a confusion matrix.

+
# Confusion matrix for categorical data
+conf_mat(data = results, truth = cuisine, estimate = .pred_class)
+
##           Truth
+## Prediction chinese indian japanese korean thai
+##   chinese       83      1        8     15   10
+##   indian         4    163        1      2    6
+##   japanese      21      5       73     25    1
+##   korean        15      0       11    191    0
+##   thai          10     11        3      7   70
+

When dealing with multiple classes, it’s generally more intuitive to +visualize this as a heat map, like this:

+
update_geom_defaults(geom = "tile", new = list(color = "black", alpha = 0.7))
+# Visualize confusion matrix
+results %>% 
+  conf_mat(cuisine, .pred_class) %>% 
+  autoplot(type = "heatmap")
+

+

The darker squares in the confusion matrix plot indicate high numbers +of cases, and you can hopefully see a diagonal line of darker squares +indicating cases where the predicted and actual label are the same.

+

Let’s now calculate summary statistics for the confusion matrix.

+
# Summary stats for confusion matrix
+conf_mat(data = results, truth = cuisine, estimate = .pred_class) %>% summary()
+
+ +
+

If we narrow down to some metrics such as accuracy, sensitivity, ppv, +we are not badly off for a start 🥳!

+
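These headline numbers can be checked by hand from the confusion matrix printed earlier. The base R sketch below re-enters that matrix (rows are predictions, columns are the truth) and computes accuracy plus per-class sensitivity and ppv. The final line averages the per-class values with equal weight, which is my understanding of the macro averaging that yardstick applies to multiclass metrics by default; treat that as an assumption and compare it against the `summary()` output.

```r
classes <- c("chinese", "indian", "japanese", "korean", "thai")

# Confusion matrix copied from the output above
cm <- matrix(c(83,   1,  8,  15, 10,
                4, 163,  1,   2,  6,
               21,   5, 73,  25,  1,
               15,   0, 11, 191,  0,
               10,  11,  3,   7, 70),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Truth = classes))

# Share of all test cases on the diagonal: 580 of 736
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity (recall): correct predictions / all true cases of the class
sensitivity <- diag(cm) / colSums(cm)

# ppv (precision): correct predictions / all predictions of the class
ppv <- diag(cm) / rowSums(cm)

round(accuracy, 3)
round(rbind(sensitivity, ppv), 3)

# Equal-weight (macro) averages across the five cuisines
round(c(macro_sens = mean(sensitivity), macro_ppv = mean(ppv)), 3)
```

The per-class rows also show where the model struggles: a class with low sensitivity is often missed, while a class with low ppv is often predicted wrongly.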
+
+
+

5. Digging Deeper

+

Let’s ask one subtle question: What criterion is used to settle on a +given type of cuisine as the predicted outcome?

+

Well, statistical machine learning algorithms, like logistic +regression, are based on probability; so what actually gets +predicted by a classifier is a probability distribution over a set of +possible outcomes. The class with the highest probability is then chosen +as the most likely outcome for the given observation.

+
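Here is a tiny base R illustration of that last step. The `scores` vector is made up for this example and stands in for the linear predictor of each cuisine for a single recipe. A softmax normalization, the usual link for multinomial logistic regression, turns the scores into probabilities, and the hard class prediction is simply the label with the largest one.

```r
# Convert a vector of scores into probabilities that sum to 1
softmax <- function(z) {
  e <- exp(z - max(z))  # subtracting the max avoids numerical overflow
  e / sum(e)
}

# Hypothetical per-class scores for one observation
scores <- c(chinese = 0.2, indian = -1.0, japanese = 0.1,
            korean = 0.4, thai = 2.5)

# Probability distribution over the five cuisines
probs <- softmax(scores)
round(probs, 3)

# Hard class prediction: the label with the highest probability
names(probs)[which.max(probs)]
```

Because `thai` has the largest score here, it also gets the largest probability and becomes the predicted class.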

Let’s see this in action by making both hard class predictions and +probabilities.

+
# Make hard class prediction and probabilities
+results_prob <- cuisines_test %>%
+  select(cuisine) %>% 
+  bind_cols(mr_fit %>% predict(new_data = cuisines_test)) %>% 
+  bind_cols(mr_fit %>% predict(new_data = cuisines_test, type = "prob"))
+
+# Print out results
+results_prob %>% 
+  slice_head(n = 5)
+
+ +
+

Much better!

+

✅ Can you explain why the model is pretty sure that the first +observation is Thai?

+
+
+

🚀Challenge

+

In this lesson, you used your cleaned data to build a machine +learning model that can predict a national cuisine based on a series of +ingredients. Take some time to read through the many options +Tidymodels provides to classify data and other +ways to fit multinomial regression.

+
+

THANK YOU TO:

+

Allison Horst +for creating the amazing illustrations that make R more welcoming and +engaging. Find more illustrations at her gallery.

+

Cassie Breviu and Jen Looper for creating the +original Python version of this module ♥️

+

Happy Learning,

+

Eric, Gold Microsoft Learn +Student Ambassador.

+
+
+ +

+ + +
+
+ +
+ + + + + + + + + + + + + + + + +