Cuisine classifiers 2
In this second classification lesson, we will explore more ways
to classify categorical data. We will also learn
about the ramifications of choosing one classifier over another.
Prerequisite
We assume that you have completed the previous lessons since we will
be carrying forward some concepts we learned before.
For this lesson, we’ll require the following packages: tidyverse, tidymodels, kernlab, themis, ranger, xgboost and kknn.
You can install them with:
install.packages(c("tidyverse", "tidymodels", "kernlab", "themis", "ranger", "xgboost", "kknn"))
Alternatively, the script below checks whether you have the packages
required to complete this module and installs them for you in case they
are missing.
suppressWarnings(if (!require("pacman")) install.packages("pacman"))
pacman::p_load(tidyverse, tidymodels, themis, kernlab, ranger, xgboost, kknn)
##
## The downloaded binary packages are in
## /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmpE2TSCy/downloaded_packages
Now, let’s hit the ground running!
1. A classification map
In our previous lesson, we tried to address the question: how do we choose between
multiple models? To a great extent, it depends on the characteristics of
the data and the type of problem we want to solve (for instance,
classification or regression).
Previously, we learned about the various options you have when
classifying data using Microsoft’s cheat sheet. Python’s Machine
Learning framework, Scikit-learn, offers a similar but more granular
cheat sheet that can further help narrow down your estimators (another
term for classifiers):

Tip: visit
this map online and click along the path to read documentation.
The Tidymodels reference site also provides excellent documentation about
different types of models.
The plan 🗺️
This map is very helpful once you have a clear grasp of your data, as
you can ‘walk’ along its paths to a decision:
We have >50 samples
We want to predict a category
We have labeled data
We have fewer than 100K samples
✨ We can choose a Linear SVC
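If that doesn’t work, since we have numeric data, we can try a ✨ KNeighbors Classifier, and if that doesn’t work either, ✨ SVC and ✨ Ensemble Classifiers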
This is a very helpful trail to follow. Now, let’s get right into it
using the tidymodels modelling
framework: a consistent and flexible collection of R packages developed
to encourage good statistical practice 😊.
2. Split the data and deal with the imbalanced data set
From our previous lessons, we learnt that there was a set of common
ingredients across our cuisines. Also, the cuisines were quite unequally
represented in the data.
We’ll deal with these by:
Dropping the most common ingredients that create confusion between distinct cuisines, using dplyr::select().
Using a recipe that preprocesses the data to get it ready for modelling by applying an over-sampling algorithm.
We already looked at the above in the previous lesson so this should
be a breeze 🥳!
# Load the core Tidyverse and Tidymodels packages
library(tidyverse)
library(tidymodels)
# Load the original cuisines data
df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")
## New names:
## • `` -> `...1`
## Rows: 2448 Columns: 385
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): cuisine
## dbl (384): ...1, almond, angelica, anise, anise_seed, apple, apple_brandy, a...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop id column, rice, garlic and ginger from our original data set
df_select <- df %>%
  select(-c(1, rice, garlic, ginger)) %>%
  # Encode cuisine column as categorical
  mutate(cuisine = factor(cuisine))
# Create data split specification
set.seed(2056)
cuisines_split <- initial_split(data = df_select,
                                strata = cuisine,
                                prop = 0.7)
# Extract the data in each split
cuisines_train <- training(cuisines_split)
cuisines_test <- testing(cuisines_split)
# Display distribution of cuisines in the training set
cuisines_train %>%
  count(cuisine) %>%
  arrange(desc(n))
Deal with imbalanced data
Imbalanced data often has negative effects on model performance.
Many models perform best when the number of observations in each class is roughly equal and,
thus, tend to struggle with unbalanced data.
There are two main ways of dealing with imbalanced data sets:
Over-sampling: adding observations to the minority class, e.g. using a SMOTE algorithm, which synthetically generates new examples of the minority class using the nearest neighbors of these cases.
Under-sampling: removing observations from the majority class.
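To make the two ideas concrete, here is a minimal base R sketch on a made-up toy data set (the `toy` data frame and its values are hypothetical, not part of the cuisines data). Note that it uses plain random resampling: SMOTE goes a step further and creates new synthetic rows between neighboring minority cases rather than duplicating existing ones.

```r
# Toy imbalanced data set (hypothetical, for illustration only)
set.seed(2056)
toy <- data.frame(
  x = rnorm(12),
  cuisine = factor(c(rep("korean", 9), rep("thai", 3)))
)
table(toy$cuisine)

# Random over-sampling: keep every row and add randomly duplicated rows
# until each class is as large as the largest class
n_max <- max(table(toy$cuisine))
over <- do.call(rbind, lapply(split(toy, toy$cuisine), function(d) {
  extra <- sample(nrow(d), n_max - nrow(d), replace = TRUE)
  d[c(seq_len(nrow(d)), extra), ]
}))

# Random under-sampling: keep only as many rows per class
# as the smallest class has
n_min <- min(table(toy$cuisine))
under <- do.call(rbind, lapply(split(toy, toy$cuisine), function(d) {
  d[sample(nrow(d), n_min), ]
}))

table(over$cuisine)
table(under$cuisine)
```

Over-sampling keeps all the original information at the cost of a larger training set, while under-sampling throws away observations from the majority class.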
In our previous lesson, we demonstrated how to deal with imbalanced
data sets using a recipe. A recipe can be thought of as a
blueprint that describes what steps should be applied to a data set in
order to get it ready for data analysis. In our case, we want to have an
equal distribution in the number of our cuisines for our training set. Let’s get right into it.
# Load themis package for dealing with imbalanced data
library(themis)
# Create a recipe for preprocessing training data
cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%
  step_smote(cuisine)
# Print recipe
cuisines_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 380
##
## ── Operations
## • SMOTE based on: cuisine
Now we are ready to train models 👩💻👨💻!
3. Beyond multinomial regression models
In our previous lesson, we looked at multinomial regression models.
Let’s explore some more flexible models for classification.
Support Vector Machines.
In the context of classification, Support Vector Machines is a machine learning technique
that tries to find a hyperplane that “best” separates the
classes. Let’s look at a simple example:
H1 does not separate the classes. H2 does, but
only with a small margin. H3 separates them with the maximal
margin.
Linear Support Vector Classifier
The Support Vector Classifier (SVC) is a member of the Support Vector
Machines family of ML techniques. In SVC, the hyperplane is chosen to
correctly separate most of the training observations, but
may misclassify a few observations. By allowing some points
to be on the wrong side, the classifier becomes more robust to outliers and hence
generalizes better to new data. The parameter that regulates this
violation is referred to as cost, which has a default value
of 1 (see help("svm_poly")).
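To get a feel for what cost does, here is a small base R sketch of the textbook soft-margin objective: a penalty on the size of the weights plus cost times the total margin violation (hinge loss). The function `svc_objective()` and the toy values are hypothetical and only for illustration; the kernlab engine may differ in its exact internal formulation.

```r
# Textbook soft-margin objective for a linear classifier with
# weights w and intercept b. Classes in y must be coded as -1 / +1.
svc_objective <- function(X, y, w, b, cost = 1) {
  margins <- y * (X %*% w + b)      # >= 1: correct side, outside the margin
  hinge   <- pmax(0, 1 - margins)   # size of each margin violation
  0.5 * sum(w^2) + cost * sum(hinge)
}

# Four toy observations; the last one sits on the "wrong" side
X <- matrix(c( 2.0,  2.0,
               3.0,  3.0,
              -2.0, -2.0,
               0.5,  0.5), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)
w <- c(0.5, 0.5)
b <- 0

svc_objective(X, y, w, b, cost = 1)
svc_objective(X, y, w, b, cost = 10)
```

The same misclassified point is penalized ten times more heavily with cost = 10, so a larger cost pushes the model to tolerate fewer violations, while a smaller cost allows a wider, more forgiving margin.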
Let’s create a linear SVC by setting degree = 1 in a
polynomial SVM model.
# Make a linear SVC specification
svc_linear_spec <- svm_poly(degree = 1) %>%
  set_engine("kernlab") %>%
  set_mode("classification")
# Bundle specification and recipe into a workflow
svc_linear_wf <- workflow() %>%
  add_recipe(cuisines_recipe) %>%
  add_model(svc_linear_spec)
# Print out workflow
svc_linear_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_poly()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_smote()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Polynomial Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## degree = 1
##
## Computational engine: kernlab
Now that we have captured the preprocessing steps and model
specification into a workflow, we can go ahead and train the
linear SVC and evaluate the results while at it. For performance metrics,
let’s create a metric set that will evaluate accuracy, sensitivity, Positive Predictive Value and F Measure.
augment() will add column(s) for predictions to the
given data.
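If you are wondering what these four metrics actually measure, the base R sketch below computes them by hand from a confusion matrix. The `truth` and `estimate` vectors are made up for illustration. The per-class values are averaged with equal weight per class (macro averaging), which, as far as we know, is what yardstick does by default for multiclass outcomes.

```r
# Hypothetical true and predicted cuisines for eight test observations
truth    <- factor(c("thai", "thai", "thai", "korean", "korean",
                     "indian", "indian", "indian"))
estimate <- factor(c("thai", "thai", "korean", "korean", "korean",
                     "indian", "thai", "indian"),
                   levels = levels(truth))

# Rows are predictions, columns are the true classes
conf_mat <- table(estimate = estimate, truth = truth)

# Accuracy: share of all predictions that are correct
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class sensitivity (recall): correct / all actual cases of the class
sens_by_class <- diag(conf_mat) / colSums(conf_mat)

# Per-class positive predictive value (precision):
# correct / all predictions of the class
ppv_by_class <- diag(conf_mat) / rowSums(conf_mat)

# Per-class F measure: harmonic mean of precision and recall
f_by_class <- 2 * ppv_by_class * sens_by_class / (ppv_by_class + sens_by_class)

# Macro average: every class counts equally
c(accuracy = accuracy,
  sens     = mean(sens_by_class),
  ppv      = mean(ppv_by_class),
  f_meas   = mean(f_by_class))
```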
# Train a linear SVC model
svc_linear_fit <- svc_linear_wf %>%
  fit(data = cuisines_train)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
# Create a metric set
eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)
# Make predictions and Evaluate model performance
svc_linear_fit %>%
  augment(new_data = cuisines_test) %>%
  eval_metrics(truth = cuisine, estimate = .pred_class)
Support Vector Machine
The support vector machine (SVM) is an extension of the support
vector classifier in order to accommodate a non-linear boundary between
the classes. In essence, SVMs use the kernel trick to enlarge
the feature space to adapt to nonlinear relationships between classes.
One popular and extremely flexible kernel function used by SVMs is the
Radial basis function. Let’s see how it will perform on our
data.
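As a quick illustration of what the radial basis function measures, here is a base R sketch of the kernel in the form exp(-sigma * squared distance), which is how the kernlab documentation defines it. The helper `rbf_kernel()` is our own hypothetical function, not part of any package.

```r
# RBF kernel: similarity between two observations that decays
# with their squared Euclidean distance
rbf_kernel <- function(x, y, sigma = 1) {
  exp(-sigma * sum((x - y)^2))
}

# Identical points are maximally similar
rbf_kernel(c(0, 0), c(0, 0))

# Similarity shrinks quickly as points move apart
rbf_kernel(c(0, 0), c(1, 0), sigma = 0.5)
rbf_kernel(c(0, 0), c(3, 4), sigma = 0.5)
```

A larger sigma makes the similarity drop off faster, so each training point only influences its close neighborhood and the decision boundary can become more wiggly.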
set.seed(2056)
# Make an RBF SVM specification
svm_rbf_spec <- svm_rbf() %>%
  set_engine("kernlab") %>%
  set_mode("classification")
# Bundle specification and recipe into a workflow
svm_rbf_wf <- workflow() %>%
  add_recipe(cuisines_recipe) %>%
  add_model(svm_rbf_spec)
# Train an RBF model
svm_rbf_fit <- svm_rbf_wf %>%
  fit(data = cuisines_train)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
# Make predictions and Evaluate model performance
svm_rbf_fit %>%
  augment(new_data = cuisines_test) %>%
  eval_metrics(truth = cuisine, estimate = .pred_class)
Much better 🤩!
✅ Please see:
for further reading.
Nearest Neighbor classifiers
K-nearest neighbor (KNN) is an algorithm in which each
observation is predicted based on its similarity to other
observations.
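The idea can be sketched in a few lines of base R: measure the distance from a new observation to every training observation, pick the k closest ones, and let them vote. The function `knn_predict()` and the toy data are hypothetical. This sketch uses a plain, unweighted majority vote, whereas the kknn engine used in this lesson is a weighted variant, so results will not match exactly.

```r
# Minimal unweighted k-nearest-neighbor classifier
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Euclidean distance from the new observation to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Majority vote among their labels
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Two well-separated toy groups
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    8, 8,
                    8, 9,
                    9, 8), ncol = 2, byrow = TRUE)
train_y <- factor(c("korean", "korean", "korean", "thai", "thai", "thai"))

knn_predict(train_x, train_y, new_x = c(2, 2), k = 3)
knn_predict(train_x, train_y, new_x = c(7, 9), k = 3)
```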
Let’s fit one to our data.
# Make a KNN specification
knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
# Bundle recipe and model specification into a workflow
knn_wf <- workflow() %>%
  add_recipe(cuisines_recipe) %>%
  add_model(knn_spec)
# Train a KNN model
knn_wf_fit <- knn_wf %>%
  fit(data = cuisines_train)
# Make predictions and Evaluate model performance
knn_wf_fit %>%
  augment(new_data = cuisines_test) %>%
  eval_metrics(truth = cuisine, estimate = .pred_class)
It appears that this model is not performing that well. Changing the
model’s arguments (see help("nearest_neighbor")) may improve model performance. Be
sure to try it out.
✅ Please see:
to learn more about K-Nearest Neighbors classifiers.
Ensemble classifiers
Ensemble algorithms work by combining multiple base estimators to
produce an optimal model, either by:
bagging: applying an averaging function to a collection of base models
boosting: building a sequence of models that build on one another to improve predictive performance.
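To see bagging in action on a small scale, here is a base R sketch that fits a very simple base model (a one-split "decision stump") on many bootstrap samples and combines them by majority vote. The toy data and the helpers `fit_stump()` and `bagged_predict()` are hypothetical. A real Random Forest goes further, for example by growing full trees and also sampling predictors at each split.

```r
set.seed(2056)
# Toy data: one numeric feature, two classes (hypothetical)
x <- c(rnorm(30, mean = 0), rnorm(30, mean = 3))
y <- rep(c("a", "b"), each = 30)

# Base model: a decision stump that predicts "b" when x exceeds a threshold.
# Fitting means choosing the threshold with the lowest training error.
fit_stump <- function(x, y) {
  candidates <- sort(unique(x))
  errors <- sapply(candidates, function(t) mean(ifelse(x > t, "b", "a") != y))
  candidates[which.min(errors)]
}

# Bagging: fit one stump per bootstrap sample of the training data
thresholds <- replicate(25, {
  idx <- sample(length(x), replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Aggregate: majority vote over the 25 stumps
bagged_predict <- function(new_x) {
  votes <- sapply(thresholds, function(t) new_x > t)
  ifelse(mean(votes) > 0.5, "b", "a")
}

bagged_predict(-1)
bagged_predict(4)
```

Each individual stump depends on the particular bootstrap sample it saw, but the vote across all of them is more stable than any single one.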
Let’s start by trying out a Random Forest model, which builds a large
collection of decision trees and then applies an averaging function to them for a
better overall model.
# Make a random forest specification
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
# Bundle recipe and model specification into a workflow
rf_wf <- workflow() %>%
  add_recipe(cuisines_recipe) %>%
  add_model(rf_spec)
# Train a random forest model
rf_wf_fit <- rf_wf %>%
  fit(data = cuisines_train)
# Make predictions and Evaluate model performance
rf_wf_fit %>%
  augment(new_data = cuisines_test) %>%
  eval_metrics(truth = cuisine, estimate = .pred_class)
Good job 👏!
Let’s also experiment with a Boosted Tree model.
Boosted Tree defines an ensemble method that creates a series of
sequential decision trees where each tree depends on the results of
previous trees in an attempt to incrementally reduce the error. It
focuses on the weights of incorrectly classified items and adjusts the
fit for the next classifier to correct them.
There are different ways to fit this model (see help("boost_tree")). In this example, we’ll fit boosted
trees via the xgboost engine.
# Make a boosted tree specification
boost_spec <- boost_tree(trees = 200) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
# Bundle recipe and model specification into a workflow
boost_wf <- workflow() %>%
  add_recipe(cuisines_recipe) %>%
  add_model(boost_spec)
# Train a boosted tree model
boost_wf_fit <- boost_wf %>%
  fit(data = cuisines_train)
# Make predictions and Evaluate model performance
boost_wf_fit %>%
  augment(new_data = cuisines_test) %>%
  eval_metrics(truth = cuisine, estimate = .pred_class)
✅ Please see:
to learn more about Ensemble classifiers.
🚀Challenge
Each of these techniques has a large number of parameters that you
can tweak, for instance cost in SVMs, neighbors in KNN, and mtry
(Randomly Selected Predictors) in Random Forest.
Research each one’s default parameters and think about what tweaking
these parameters would mean for the model’s quality.
To find out more about a particular model and its parameters, use
help("model"), e.g. help("rand_forest").
In practice, we usually estimate the best values
for these by training many models on a simulated data set
and measuring how well all these models perform. This process is called
tuning.
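As a taste of what tuning can look like in tidymodels, here is a minimal sketch that tries a few values of neighbors for a KNN model using 5-fold cross-validation. To keep it self-contained it uses R's built-in iris data as a stand-in rather than the cuisines data, and the grid values are arbitrary choices for illustration. It assumes the tidymodels and kknn packages from this lesson are installed.

```r
library(tidymodels)

set.seed(2056)
# Resamples: 5-fold cross-validation on the stand-in iris data
folds <- vfold_cv(iris, v = 5, strata = Species)

# Mark the number of neighbors as a parameter to be tuned
knn_tune_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Bundle a formula preprocessor and the model into a workflow
knn_tune_wf <- workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(knn_tune_spec)

# Candidate values to try (arbitrary, for illustration)
knn_grid <- tibble(neighbors = c(3, 5, 7, 9))

# Fit every candidate on every fold and record accuracy
knn_res <- tune_grid(
  knn_tune_wf,
  resamples = folds,
  grid = knn_grid,
  metrics = metric_set(accuracy)
)

# Compare the candidates and pick the best one
collect_metrics(knn_res)
select_best(knn_res, metric = "accuracy")
```

The same pattern should carry over to the workflows in this lesson: replace a fixed argument with tune(), supply resamples of the training set, and let tune_grid() do the comparison.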
Review & Self Study
There’s a lot of jargon in these lessons, so take a minute to review
this
list of useful terminology!
THANK YOU TO:
Allison Horst
for creating the amazing illustrations that make R more welcoming and
engaging. Find more illustrations at her gallery.
Cassie Breviu and Jen Looper for creating the
original Python version of this module ♥️
Happy Learning,
Eric, Gold Microsoft Learn
Student Ambassador.
Artwork by @allison_horst
