Cuisine classifiers 2

In this second classification lesson, we will explore more ways to classify categorical data. We will also learn about the ramifications of choosing one classifier over another.

Prerequisite

We assume that you have completed the previous lessons since we will be carrying forward some concepts we learned before.

For this lesson, we’ll require the following packages: tidyverse, tidymodels, themis, kernlab, ranger, xgboost and kknn.

You can install them with:

install.packages(c("tidyverse", "tidymodels", "kernlab", "themis", "ranger", "xgboost", "kknn"))

Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.

suppressWarnings(if (!require("pacman")) install.packages("pacman"))

pacman::p_load(tidyverse, tidymodels, themis, kernlab, ranger, xgboost, kknn)
## 
## The downloaded binary packages are in
##  /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmpE2TSCy/downloaded_packages

Now, let’s hit the ground running!

1. A classification map

In our previous lesson, we tried to address the question: how do we choose between multiple models? To a great extent, it depends on the characteristics of the data and the type of problem we want to solve (for instance, classification or regression).

Previously, we learned about the various options you have when classifying data using Microsoft’s cheat sheet. Python’s Machine Learning framework, Scikit-learn, offers a similar but more granular cheat sheet that can further help narrow down your estimators (another term for classifiers):


Tip: visit this map online and click along the path to read documentation.

The Tidymodels reference site also provides excellent documentation on the different types of models.

The plan 🗺️

This map is very helpful once you have a clear grasp of your data, as you can ‘walk’ along its paths to a decision:

  • We have >50 samples

  • We want to predict a category

  • We have labeled data

  • We have fewer than 100K samples

  • ✨ We can choose a Linear SVC

  • If that doesn’t work, since we have numeric data

    • We can try a ✨ KNeighbors Classifier

      • If that doesn’t work, try ✨ SVC and ✨ Ensemble Classifiers

This is a very helpful trail to follow. Now, let’s get right into it using the tidymodels modelling framework: a consistent and flexible collection of R packages developed to encourage good statistical practice 😊.

2. Split the data and deal with an imbalanced data set

From our previous lessons, we learnt that there was a set of common ingredients across our cuisines. Also, the cuisines were quite unequally represented in the data.

We’ll deal with these by:

  • Dropping the most common ingredients that create confusion between distinct cuisines, using dplyr::select().

  • Using a recipe that preprocesses the data to get it ready for modelling by applying an over-sampling algorithm.

We already looked at the above in the previous lesson so this should be a breeze 🥳!

# Load the core Tidyverse and Tidymodels packages
library(tidyverse)
library(tidymodels)

# Load the original cuisines data
df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")
## New names:
## Rows: 2448 Columns: 385
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): cuisine dbl (384): ...1, almond, angelica, anise, anise_seed, apple,
## apple_brandy, a...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Drop id column, rice, garlic and ginger from our original data set
df_select <- df %>% 
  select(-c(1, rice, garlic, ginger)) %>%
  # Encode cuisine column as categorical
  mutate(cuisine = factor(cuisine))


# Create data split specification
set.seed(2056)
cuisines_split <- initial_split(data = df_select,
                                strata = cuisine,
                                prop = 0.7)

# Extract the data in each split
cuisines_train <- training(cuisines_split)
cuisines_test <- testing(cuisines_split)

# Display distribution of cuisines in the training set
cuisines_train %>% 
  count(cuisine) %>% 
  arrange(desc(n))

Deal with imbalanced data

Imbalanced data often has negative effects on model performance. Many models perform best when the number of observations in each class is roughly equal and, thus, tend to struggle with unbalanced data.

There are two main ways of dealing with imbalanced data sets:

  • adding observations to the minority class: Over-sampling, e.g. using the SMOTE algorithm, which synthetically generates new examples of the minority class using nearest neighbors of these cases.

  • removing observations from majority class: Under-sampling
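To make the over-sampling idea more concrete, here is a minimal base R sketch of a SMOTE-style step. It is only an illustration under simplifying assumptions: the `minority` matrix and the `smote_one()` helper are hypothetical names made up for this example, and this is not how themis implements `step_smote()` internally.

```r
# Minimal SMOTE-style sketch in base R (illustrative only)
set.seed(2056)

# Hypothetical toy minority class: 5 observations, 2 numeric predictors
minority <- matrix(c(1.0, 1.2,
                     1.5, 0.9,
                     0.8, 1.6,
                     1.3, 1.4,
                     1.1, 0.7),
                   ncol = 2, byrow = TRUE)

smote_one <- function(x, k = 3) {
  # Pick a random minority observation
  i <- sample(nrow(x), 1)
  # Euclidean distances from observation i to every minority observation
  d <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))
  # Indices of its k nearest neighbours (position 1 is the observation itself)
  nn <- order(d)[2:(k + 1)]
  # Pick one neighbour and interpolate at a random point on the segment
  j <- nn[sample(length(nn), 1)]
  gap <- runif(1)
  x[i, ] + gap * (x[j, ] - x[i, ])
}

# Generate 10 synthetic minority observations (one per row)
synthetic <- t(replicate(10, smote_one(minority)))
synthetic
```

Because each synthetic point lies on a line segment between two real minority observations, the new examples stay inside the region already occupied by that class.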

In our previous lesson, we demonstrated how to deal with imbalanced data sets using a recipe. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set in order to get it ready for data analysis. In our case, we want to have an equal distribution in the number of our cuisines for our training set. Let’s get right into it.

# Load themis package for dealing with imbalanced data
library(themis)

# Create a recipe for preprocessing training data
cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%
  step_smote(cuisine) 

# Print recipe
cuisines_recipe
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:     1
## predictor: 380
## 
## ── Operations
## • SMOTE based on: cuisine

Now we are ready to train models 👩‍💻👨‍💻!

3. Beyond multinomial regression models

In our previous lesson, we looked at multinomial regression models. Let’s explore some more flexible models for classification.

Support Vector Machines

In the context of classification, Support Vector Machines is a machine learning technique that tries to find a hyperplane that “best” separates the classes. Let’s look at a simple example:

By User:ZackWeinberg:This file was derived from: https://commons.wikimedia.org/w/index.php?curid=22877598

H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximal margin.
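The difference between these three hyperplanes can be sketched numerically. The toy data and the three candidate hyperplanes below are assumptions chosen for illustration (they are not the ones in the figure), and `margin()` is a hypothetical helper.

```r
# Hypothetical toy data: two linearly separable classes in 2-D
X <- matrix(c(1, 1,
              2, 1,
              1, 2,
              4, 4,
              5, 4,
              4, 5),
            ncol = 2, byrow = TRUE)
y <- c(-1, -1, -1, 1, 1, 1)

# Signed geometric margin of the hyperplane w'x + b = 0:
# the smallest value of y * (w'x + b) / ||w|| over all observations.
# A negative value means at least one point is on the wrong side.
margin <- function(w, b) {
  min(y * as.vector(X %*% w + b)) / sqrt(sum(w^2))
}

# Three candidate hyperplanes, loosely playing the roles of H1, H2 and H3
h1 <- margin(w = c(1, -1), b = 0)     # x1 = x2: does not separate the classes
h2 <- margin(w = c(1, 0),  b = -2.5)  # x1 = 2.5: separates, small margin
h3 <- margin(w = c(1, 1),  b = -5.5)  # x1 + x2 = 5.5: separates, larger margin
c(H1 = h1, H2 = h2, H3 = h3)
```

A support vector machine searches for the separating hyperplane with the largest such margin.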

Linear Support Vector Classifier

A Support Vector Classifier (SVC) is a member of the Support Vector Machines family of ML techniques. In SVC, the hyperplane is chosen to correctly separate most of the training observations, but may misclassify a few of them. By allowing some points to be on the wrong side, the classifier becomes more robust to outliers and hence generalizes better to new data. The parameter that regulates this violation is referred to as cost, which has a default value of 1 (see help("svm_poly")).
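One way to see what cost does is through the standard soft-margin objective, 0.5 * ||w||^2 + cost * sum(hinge losses). The sketch below evaluates that textbook formula on hypothetical toy data for a fixed hyperplane. It is an assumption-laden illustration, not the optimisation routine that kernlab actually runs, and `soft_margin_objective()` is a made-up helper.

```r
# Hypothetical toy data; the last point is an outlier labelled as class -1
X <- matrix(c(1,   1,
              2,   1,
              4,   4,
              5,   4,
              4.2, 3.8),
            ncol = 2, byrow = TRUE)
y <- c(-1, -1, 1, 1, -1)

soft_margin_objective <- function(w, b, cost) {
  # Hinge loss is zero for points beyond the margin on the correct side
  # and grows linearly with the size of a margin violation
  hinge <- pmax(0, 1 - y * as.vector(X %*% w + b))
  0.5 * sum(w^2) + cost * sum(hinge)
}

# Same hyperplane, three different values of cost
w <- c(1, 1)
b <- -5.5
objectives <- sapply(c(0.1, 1, 10), function(cost) soft_margin_objective(w, b, cost))
objectives
```

With a small cost the single violation barely matters, so a wide margin is preferred; with a large cost the same violation dominates the objective and the fit bends towards classifying every training point correctly.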

Let’s create a linear SVC by setting degree = 1 in a polynomial SVM model.

# Make a linear SVC specification
svc_linear_spec <- svm_poly(degree = 1) %>% 
  set_engine("kernlab") %>% 
  set_mode("classification")

# Bundle specification and recipe into a workflow
svc_linear_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(svc_linear_spec)

# Print out workflow
svc_linear_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_poly()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_smote()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Polynomial Support Vector Machine Model Specification (classification)
## 
## Main Arguments:
##   degree = 1
## 
## Computational engine: kernlab

Now that we have captured the preprocessing steps and model specification into a workflow, we can go ahead and train the linear SVC and evaluate results while at it. For performance metrics, let’s create a metric set that will evaluate accuracy, sensitivity, Positive Predictive Value and F Measure.

augment() will add column(s) for predictions to the given data.

# Train a linear SVC model
svc_linear_fit <- svc_linear_wf %>% 
  fit(data = cuisines_train)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
# Create a metric set
eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)


# Make predictions and Evaluate model performance
svc_linear_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)
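To see what these four metrics measure, the base R sketch below computes them by hand from a confusion matrix. The labels are hypothetical, and the per-class values are combined with macro averaging (an unweighted mean across classes), which is assumed here to match yardstick's default for multiclass problems.

```r
# Hypothetical true and predicted labels for a 3-class problem
truth <- factor(c("thai", "thai", "thai", "indian", "indian",
                  "korean", "korean", "korean"))
pred  <- factor(c("thai", "thai", "indian", "indian", "indian",
                  "korean", "thai", "korean"),
                levels = levels(truth))

# Confusion matrix: rows = predicted class, columns = true class
cm <- table(pred, truth)

accuracy <- sum(diag(cm)) / sum(cm)
sens_cls <- diag(cm) / colSums(cm)  # per-class sensitivity (recall)
ppv_cls  <- diag(cm) / rowSums(cm)  # per-class positive predictive value
f_cls    <- 2 * ppv_cls * sens_cls / (ppv_cls + sens_cls)

# Macro averaging: unweighted mean of the per-class values
manual_metrics <- c(accuracy = accuracy,
                    sens     = mean(sens_cls),
                    ppv      = mean(ppv_cls),
                    f_meas   = mean(f_cls))
manual_metrics
```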

Support Vector Machine

The support vector machine (SVM) is an extension of the support vector classifier in order to accommodate a non-linear boundary between the classes. In essence, SVMs use the kernel trick to enlarge the feature space to adapt to nonlinear relationships between classes. One popular and extremely flexible kernel function used by SVMs is the Radial basis function. Let’s see how it will perform on our data.

set.seed(2056)

# Make an RBF SVM specification
svm_rbf_spec <- svm_rbf() %>% 
  set_engine("kernlab") %>% 
  set_mode("classification")

# Bundle specification and recipe into a workflow
svm_rbf_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(svm_rbf_spec)


# Train an RBF model
svm_rbf_fit <- svm_rbf_wf %>% 
  fit(data = cuisines_train)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
# Make predictions and Evaluate model performance
svm_rbf_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)

Much better 🤩!
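For intuition about the radial basis function, the sketch below computes the kernel directly as k(x, z) = exp(-sigma * ||x - z||^2). The `rbf_kernel()` helper and the three indicator vectors are hypothetical, and sigma = 1 is an arbitrary choice made for this example rather than the value used in the fit above.

```r
# RBF kernel: similarity decays with the squared Euclidean distance
rbf_kernel <- function(x, z, sigma = 1) {
  exp(-sigma * sum((x - z)^2))
}

# Hypothetical ingredient indicator vectors
x      <- c(1, 0, 1)
z_near <- c(1, 0, 0)  # differs from x in one ingredient
z_far  <- c(0, 1, 0)  # differs from x in three ingredients

k_same <- rbf_kernel(x, x)       # identical points: exp(0)
k_near <- rbf_kernel(x, z_near)  # squared distance 1: exp(-1)
k_far  <- rbf_kernel(x, z_far)   # squared distance 3: exp(-3)
c(same = k_same, near = k_near, far = k_far)
```

Observations that are close together get a similarity near 1 and distant ones a similarity near 0, which is what lets the SVM draw non-linear boundaries.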

✅ Please see:

for further reading.

Nearest Neighbor classifiers

K-nearest neighbor (KNN) is an algorithm in which each observation is predicted based on its similarity to other observations.
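The idea can be sketched in a few lines of base R. This is a plain majority-vote KNN on hypothetical toy data, with `knn_predict()` as a made-up helper; the kknn engine is more elaborate (it can, for example, weight neighbours by distance), so treat this only as an illustration of the principle.

```r
# Hypothetical toy training data: 2 numeric predictors, 2 classes
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    5, 5,
                    6, 5,
                    5, 6),
                  ncol = 2, byrow = TRUE)
train_y <- c("a", "a", "a", "b", "b", "b")

knn_predict <- function(new_x, k = 3) {
  # Euclidean distance from the new observation to every training observation
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training observations
  votes <- train_y[order(d)[1:k]]
  # Majority vote among those neighbours
  names(which.max(table(votes)))
}

knn_predict(c(1.5, 1.5))  # sits inside the "a" cluster
knn_predict(c(5.5, 5.0))  # sits inside the "b" cluster
```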

Let’s fit one to our data.

# Make a KNN specification
knn_spec <- nearest_neighbor() %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

# Bundle recipe and model specification into a workflow
knn_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(knn_spec)

# Train a KNN model
knn_wf_fit <- knn_wf %>% 
  fit(data = cuisines_train)


# Make predictions and Evaluate model performance
knn_wf_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)

It appears that this model is not performing that well. Changing the model’s arguments (see help("nearest_neighbor")) will probably improve model performance. Be sure to try it out.

✅ Please see:

to learn more about K-Nearest Neighbors classifiers.

Ensemble classifiers

Ensemble algorithms work by combining multiple base estimators to produce an optimal model either by:

  • bagging: applying an averaging function to a collection of base models

  • boosting: building a sequence of models that build on one another to improve predictive performance.
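As a rough illustration of bagging, the base R sketch below fits a very simple threshold rule on many bootstrap resamples and combines the results by majority vote. The toy data, the threshold rule (a stand-in for a decision tree) and the helper names are all assumptions made for this example.

```r
set.seed(2056)

# Hypothetical toy data: one numeric predictor, two classes
x <- c(rnorm(30, mean = 0), rnorm(30, mean = 3))
y <- rep(c("a", "b"), each = 30)

# Base model: a threshold placed at the midpoint of the two class means
fit_stump <- function(x, y) {
  mean(c(mean(x[y == "a"]), mean(x[y == "b"])))
}
predict_stump <- function(threshold, new_x) {
  ifelse(new_x < threshold, "a", "b")
}

# Bagging: fit each base model on a bootstrap resample of the training data
thresholds <- replicate(25, {
  idx <- sample(length(x), replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Aggregate by majority vote across the 25 base models
bagged_predict <- function(new_x) {
  votes <- sapply(thresholds, predict_stump, new_x = new_x)
  names(which.max(table(votes)))
}

bagged_predict(-1)
bagged_predict(4)
```

Each base model sees a slightly different version of the data, so their individual quirks tend to cancel out when the votes are combined.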

Let’s start by trying out a Random Forest model, which builds a large collection of decision trees and then applies an averaging function to them for a better overall model.

# Make a random forest specification
rf_spec <- rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

# Bundle recipe and model specification into a workflow
rf_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(rf_spec)

# Train a random forest model
rf_wf_fit <- rf_wf %>% 
  fit(data = cuisines_train)


# Make predictions and Evaluate model performance
rf_wf_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)

Good job 👏!

Let’s also experiment with a Boosted Tree model.

Boosted Tree defines an ensemble method that creates a series of sequential decision trees where each tree depends on the results of previous trees in an attempt to incrementally reduce the error. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.
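The sequential idea can be sketched with a tiny gradient-boosting loop in base R. To keep it short, this example is a regression problem with squared error and one-split trees (stumps), not the multiclass classifier fitted in this lesson; the data, the learning rate of 0.3 and the helper names are all assumptions.

```r
# Hypothetical toy regression data
x <- seq(0, 1, length.out = 40)
y <- sin(2 * pi * x)

# Fit a stump to the current residuals: choose the split minimising squared error
fit_stump <- function(x, r) {
  splits <- (head(x, -1) + tail(x, -1)) / 2
  sse <- sapply(splits, function(s) {
    sum((r[x < s] - mean(r[x < s]))^2) + sum((r[x >= s] - mean(r[x >= s]))^2)
  })
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}
predict_stump <- function(stump, x) {
  ifelse(x < stump$split, stump$left, stump$right)
}

learn_rate <- 0.3
pred <- rep(mean(y), length(x))  # start from a constant prediction
mse <- numeric(50)

for (m in 1:50) {
  residuals <- y - pred                # what the previous trees got wrong
  stump <- fit_stump(x, residuals)     # the next tree targets those errors
  pred <- pred + learn_rate * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)
}

# Training error after 1, 10 and 50 boosting rounds
mse[c(1, 10, 50)]
```

Each round adds a small correction aimed at the remaining errors, so the training error shrinks step by step.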

There are different ways to fit this model (see help("boost_tree")). In this example, we’ll fit boosted trees via the xgboost engine.

# Make a boosted tree specification
boost_spec <- boost_tree(trees = 200) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

# Bundle recipe and model specification into a workflow
boost_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(boost_spec)

# Train a boosted tree model
boost_wf_fit <- boost_wf %>% 
  fit(data = cuisines_train)


# Make predictions and Evaluate model performance
boost_wf_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)

✅ Please see:

to learn more about Ensemble classifiers.

4. Extra - comparing multiple models

We have fitted quite a number of models in this lab 🙌. It can become tedious or onerous to create a lot of workflows from different sets of preprocessors and/or model specifications and then calculate the performance metrics one by one.

Let’s see if we can address this by creating a function that fits a list of workflows on the training set then returns the performance metrics based on the test set. We’ll get to use map() and map_dfr() from the purrr package to apply functions to each element in a list.

map() functions allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for data science.
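As a small, hedged illustration of that claim, the sketch below computes the same result with a for loop and with map(), then uses map_dfr() to row-bind per-element summaries. The `values` list is hypothetical, and the example assumes purrr and dplyr are installed (both come with the tidyverse).

```r
library(purrr)

# Hypothetical named list of numeric vectors
values <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# A for loop that computes the mean of each element
means_loop <- vector("list", length(values))
names(means_loop) <- names(values)
for (nm in names(values)) {
  means_loop[[nm]] <- mean(values[[nm]])
}

# The same computation with map(): one call, one named list out
means_map <- map(values, mean)

# map_dfr() row-binds data frame results; .id records each element's name
summary_tbl <- map_dfr(values,
                       function(v) data.frame(n = length(v), mean = mean(v)),
                       .id = "name")
summary_tbl
```

This is the same pattern the function below relies on: map() applies fit to every workflow in a named list, and map_dfr() stacks the predictions into one data frame with a column identifying the model.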

set.seed(2056)

# Create a metric set
eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)

# Define a function that returns performance metrics
compare_models <- function(workflow_list, train_set, test_set){
  
  suppressWarnings(
    # Fit each model to the train_set
    map(workflow_list, fit, data = train_set) %>% 
    # Make predictions on the test set
      map_dfr(augment, new_data = test_set, .id = "model") %>%
    # Select desired columns
      select(model, cuisine, .pred_class) %>% 
    # Evaluate model performance
      group_by(model) %>% 
      eval_metrics(truth = cuisine, estimate = .pred_class) %>% 
      ungroup()
  )
  
} # End of function

Let’s call our function and compare the accuracy across the models.

# Make a list of workflows
workflow_list <- list(
  "svc" = svc_linear_wf,
  "svm" = svm_rbf_wf,
  "knn" = knn_wf,
  "random_forest" = rf_wf,
  "xgboost" = boost_wf)

# Call the function
set.seed(2056)
perf_metrics <- compare_models(workflow_list = workflow_list, train_set = cuisines_train, test_set = cuisines_test)

# Print out performance metrics
perf_metrics %>% 
  group_by(.metric) %>% 
  arrange(desc(.estimate)) %>% 
  slice_head(n=7)
# Compare accuracy
perf_metrics %>% 
  filter(.metric == "accuracy") %>% 
  arrange(desc(.estimate))

The workflowsets package allows users to create and easily fit a large number of models, but it is mostly designed to work with resampling techniques such as cross-validation, an approach we are yet to cover.

🚀Challenge

Each of these techniques has a large number of parameters that you can tweak, for instance cost in SVMs, neighbors in KNN and mtry (Randomly Selected Predictors) in Random Forest.

Research each one’s default parameters and think about what tweaking these parameters would mean for the model’s quality.

To find out more about a particular model and its parameters, use help("model"), e.g. help("rand_forest").

In practice, we usually estimate the best values for these by training many models on a simulated data set and measuring how well all these models perform. This process is called tuning.

Post-lecture quiz

Review & Self Study

There’s a lot of jargon in these lessons, so take a minute to review this list of useful terminology!

THANK YOU TO:

Allison Horst for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her gallery.

Cassie Breviu and Jen Looper for creating the original Python version of this module ♥️

Happy Learning,

Eric, Gold Microsoft Learn Student Ambassador.

Artwork by @allison_horst
