Cuisine classifiers 1
In this lesson, we’ll explore a variety of classifiers to predict
a given national cuisine based on a group of ingredients. While
doing so, we’ll learn more about some of the ways that algorithms can be
leveraged for classification tasks.
Preparation
This lesson builds on our previous lesson, where we:
Made a gentle introduction to classification using a dataset
about all the brilliant cuisines of Asia and India 😋.
Explored some dplyr verbs to prep and clean our data.
Made beautiful visualizations using ggplot2.
Demonstrated how to deal with imbalanced data by preprocessing it
using recipes.
Demonstrated how to prep and bake our recipe to confirm that it
works as intended.
Prerequisite
For this lesson, we’ll require the following packages to clean, prep
and visualize our data:
tidyverse: The tidyverse is a collection of R packages
designed to make data science faster, easier and more fun!
tidymodels: The tidymodels framework is a collection of packages
for modeling and machine learning.
DataExplorer: The DataExplorer package is meant to simplify and
automate the EDA process and report generation.
themis: The themis package provides extra recipe steps for
dealing with unbalanced data.
nnet: The nnet package provides functions for estimating
feed-forward neural networks with a single hidden layer, and for
multinomial logistic regression models.
You can install them with:
install.packages(c("tidyverse", "tidymodels", "DataExplorer", "themis", "nnet", "here"))
Alternatively, the script below checks whether you have the packages
required to complete this module and installs them for you in case they
are missing.
suppressWarnings(if (!require("pacman"))install.packages("pacman"))
pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)
Now, let’s hit the ground running!
1. Split the data into training and test sets.
We’ll start by revisiting a few steps from our previous lesson.
Drop the most common ingredients that create confusion between
distinct cuisines, using dplyr::select().
Everyone loves rice, garlic and ginger!
# Load the original cuisines data
df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")
## New names:
## • `` -> `...1`
## Rows: 2448 Columns: 385
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): cuisine
## dbl (384): ...1, almond, angelica, anise, anise_seed, apple, apple_brandy, a...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop id column, rice, garlic and ginger from our original data set
df_select <- df %>%
  select(-c(1, rice, garlic, ginger)) %>%
  # Encode cuisine column as categorical
  mutate(cuisine = factor(cuisine))
# Display new data set
df_select %>%
  slice_head(n = 5)
# Display distribution of cuisines
df_select %>%
  count(cuisine) %>%
  arrange(desc(n))
Perfect! Now, time to split the data such that 70% of the data goes
to training and 30% goes to testing. We’ll also apply a
stratification technique when splitting the data to maintain the
proportion of each cuisine in the training and test datasets.
rsample, a package in
Tidymodels, provides infrastructure for efficient data splitting and
resampling:
# Load the core Tidymodels packages into R session
library(tidymodels)
# Create split specification
set.seed(2056)
cuisines_split <- initial_split(data = df_select,
                                strata = cuisine,
                                prop = 0.7)
# Extract the data in each split
cuisines_train <- training(cuisines_split)
cuisines_test <- testing(cuisines_split)
# Print the number of cases in each split
cat("Training cases: ", nrow(cuisines_train), "\n",
    "Test cases: ", nrow(cuisines_test), sep = "")
## Training cases: 1712
## Test cases: 736
# Display the first few rows of the training set
cuisines_train %>%
  slice_head(n = 5)
# Display distribution of cuisines in the training set
cuisines_train %>%
  count(cuisine) %>%
  arrange(desc(n))
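To see what stratification buys us, here is a minimal base R sketch on made-up labels. It is not how rsample implements initial_split(); it only illustrates the idea of sampling 70% within each class so that both splits keep the same class proportions. The label counts are hypothetical.

```r
# Hypothetical, imbalanced labels (not the real cuisines data)
set.seed(2056)
labels <- factor(rep(c("korean", "indian", "thai"), times = c(60, 30, 10)))

# Group the row indices by class, then sample 70% of the rows within each class
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class proportions in each split
train_prop <- prop.table(table(train_labels))
test_prop  <- prop.table(table(test_labels))
print(train_prop)
print(test_prop)
```

With a purely random split, a rare class such as thai could end up over- or under-represented in the test set just by chance; sampling within each class avoids that.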
2. Deal with imbalanced data
As you might have noticed in the original data set as well as in our
training set, the cuisines are quite unequally represented: there are
almost 3 times as many Korean recipes as Thai recipes.
Imbalanced data often has negative effects on model performance.
Many models perform best when each class has roughly the same number of
observations and, thus, tend to struggle with unbalanced data.
There are two main ways of dealing with imbalanced data sets:
Over-sampling: adding observations to the minority class, e.g. using
a SMOTE algorithm, which synthetically generates new examples of the
minority class using nearest neighbors of these cases.
Under-sampling: removing observations from the majority class.
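The core idea behind SMOTE can be sketched in a few lines of base R. This is a simplified illustration, not the themis implementation: the toy matrix, the helper name smote_one() and the choice of k = 3 are all assumptions made for the example. Each synthetic point is placed at a random position on the line segment between a minority observation and one of its nearest minority neighbors.

```r
set.seed(2056)

# Hypothetical minority class: 5 observations with 2 numeric features
minority <- matrix(c(1.0, 1.0,
                     1.2, 0.9,
                     0.8, 1.1,
                     1.1, 1.3,
                     0.9, 0.8), ncol = 2, byrow = TRUE)

# Generate one synthetic observation (illustrative helper, not a themis function)
smote_one <- function(x, k = 3) {
  i <- sample.int(nrow(x), 1)                # pick a minority observation
  d <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))  # Euclidean distance to every row
  d[i] <- Inf                                # exclude the observation itself
  neighbors <- order(d)[seq_len(k)]          # indices of its k nearest neighbors
  j <- neighbors[sample.int(k, 1)]           # choose one neighbor at random
  gap <- runif(1)                            # random position along the segment
  x[i, ] + gap * (x[j, ] - x[i, ])           # interpolate between the two points
}

# Ten synthetic minority observations, one per row
synthetic <- t(replicate(10, smote_one(minority)))
print(synthetic)
```

Because every new point is an interpolation between two existing minority points, the synthetic observations stay inside the region already occupied by that class.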
In our previous lesson, we demonstrated how to deal with imbalanced
data sets using a recipe. A recipe can be thought of as a
blueprint that describes what steps should be applied to a data set in
order to get it ready for data analysis. In our case, we want to have an
equal distribution in the number of our cuisines for our
training set. Let’s get right into it.
# Load themis package for dealing with imbalanced data
library(themis)
# Create a recipe for preprocessing training data
cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%
  step_smote(cuisine)
# Print recipe
cuisines_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 380
##
## ── Operations
## • SMOTE based on: cuisine
You can of course go ahead and confirm (using prep + bake) that the
recipe works as you expect, with every cuisine label having 559
observations.
Since we’ll be using this recipe as a preprocessor for modeling, a
workflow() will do all the prep and bake for us, so we
won’t have to manually estimate the recipe.
Now we are ready to train a model 👩‍💻👨‍💻!
3. Choosing your classifier
Artwork by @allison_horst
Now we have to decide which algorithm to use for the job 🤔.
In Tidymodels, the parsnip package
provides a consistent interface for working with models across different
engines (packages). Please see the parsnip documentation to explore model types &
engines and their corresponding model
arguments. The variety is quite bewildering at first sight. For
instance, the following methods all include classification
techniques:
C5.0 Rule-Based Classification Models
Flexible Discriminant Models
Linear Discriminant Models
Regularized Discriminant Models
Logistic Regression Models
Multinomial Regression Models
Naive Bayes Models
Support Vector Machines
Nearest Neighbors
Decision Trees
Ensemble methods
Neural Networks
The list goes on!
What classifier to go with?
So, which classifier should you choose? Often, running through
several and looking for a good result is a way to test.
AutoML solves this problem neatly by running these comparisons in the
cloud, allowing you to choose the best algorithm for your data. Try it
here
The choice of classifier also depends on our problem. For instance,
when the outcome can be categorized into more than two classes,
like in our case, you must use a multiclass classification algorithm
as opposed to binary classification.
A better approach
A better way than wildly guessing, however, is to follow the ideas on
this downloadable ML
Cheat sheet. Here, we discover that, for our multiclass problem, we
have some choices:
A section of Microsoft’s Algorithm Cheat Sheet,
detailing multiclass classification options
Reasoning
Let’s see if we can reason our way through different approaches given
the constraints we have:
Deep Neural networks are too heavy. Given our
clean, but minimal dataset, and the fact that we are running training
locally via notebooks, deep neural networks are too heavyweight for this
task.
No two-class classifier. We do not use a
two-class classifier, so that rules out one-vs-all.
Decision tree or logistic regression could work.
A decision tree might work, or multinomial regression/multiclass
logistic regression for multiclass data.
Multiclass Boosted Decision Trees solve a different
problem. The multiclass boosted decision tree is most suitable
for nonparametric tasks, e.g. tasks designed to build rankings, so it is
not useful for us.
Also, before embarking on more complex machine learning
models, e.g. ensemble methods, it’s normally a good idea to build the
simplest possible model to get an idea of what is going on. So for this
lesson, we’ll start with a multinomial logistic regression model.
Logistic regression is a technique used when the outcome variable is
categorical (or nominal). In binary logistic regression the outcome
variable has two categories, whereas in multinomial logistic regression
it has more than two. See Advanced
Regression Methods for further reading.
4. Train and evaluate a Multinomial logistic regression model.
In Tidymodels, parsnip::multinom_reg() defines a model
that uses linear predictors to predict multiclass data using the
multinomial distribution. See ?multinom_reg() for the
different ways/engines you can use to fit this model.
For this example, we’ll fit a multinomial regression model via the
default nnet engine.
I picked a value for penalty more or less arbitrarily. There are
better ways to choose this value, namely by using resampling and
tuning the model, which we’ll discuss later.
See Tidymodels: Get Started in case you want to learn more on how to
tune model hyperparameters.
# Create a multinomial regression model specification
mr_spec <- multinom_reg(penalty = 1) %>%
  set_engine("nnet", MaxNWts = 2086) %>%
  set_mode("classification")
# Print model specification
mr_spec
## Multinomial Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 1
##
## Engine-Specific Arguments:
## MaxNWts = 2086
##
## Computational engine: nnet
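One way to picture what the penalty argument does is the rough base R sketch below. It assumes the nnet engine treats penalty as weight decay (the decay = ~1 in the fitted model call printed later in this lesson suggests the value is passed along that way), so that large coefficients add a cost proportional to the sum of their squares. The function name and all numbers are hypothetical and only illustrate the idea; this is not nnet’s internal code.

```r
# Penalised loss: how badly the data are fit, plus penalty * sum of squared weights
penalized_loss <- function(prob_true_class, weights, penalty) {
  -sum(log(prob_true_class)) + penalty * sum(weights^2)
}

# Hypothetical values: the probability the model gave to the correct cuisine
# for three recipes, and a small vector of model coefficients
p_true  <- c(0.9, 0.6, 0.7)
weights <- c(0.5, -1.2, 2.0)

loss_no_penalty   <- penalized_loss(p_true, weights, penalty = 0)  # data fit only
loss_with_penalty <- penalized_loss(p_true, weights, penalty = 1)  # coefficients now cost extra

print(c(no_penalty = loss_no_penalty, with_penalty = loss_with_penalty))
```

Under this view, a larger penalty pushes the fitted coefficients toward zero, which can help when there are many predictors relative to the number of observations.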
Great job 🥳! Now that we have a recipe and a model specification, we
need to find a way of bundling them together into an object that will
first preprocess the data, then fit the model on the preprocessed data,
and also allow for potential post-processing activities. In Tidymodels,
this convenient object is called a workflow and
holds your modeling components! This is what we’d call
pipelines in Python.
So let’s bundle everything up into a workflow!📦
# Bundle recipe and model specification
mr_wf <- workflow() %>%
  add_recipe(cuisines_recipe) %>%
  add_model(mr_spec)
# Print out workflow
mr_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: multinom_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_smote()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Multinomial Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 1
##
## Engine-Specific Arguments:
## MaxNWts = 2086
##
## Computational engine: nnet
Workflows 👌👌! A workflow() can be fit
in much the same way a model can. So, time to train a model!
# Train a multinomial regression model
mr_fit <- fit(object = mr_wf, data = cuisines_train)
mr_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: multinom_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_smote()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## nnet::multinom(formula = ..y ~ ., data = data, decay = ~1, MaxNWts = ~2086,
## trace = FALSE)
##
## Coefficients:
## (Intercept) almond angelica anise anise_seed apple
## indian 0.19723325 0.2409661 0 -5.004955e-05 -0.1657635 -0.05769734
## japanese 0.13961959 -0.6262400 0 -1.169155e-04 -0.4893596 -0.08585717
## korean 0.22377347 -0.1833485 0 -5.560395e-05 -0.2489401 -0.15657804
## thai -0.04336577 -0.6106258 0 4.903828e-04 -0.5782866 0.63451105
## apple_brandy apricot armagnac artemisia artichoke asparagus
## indian 0 0.37042636 0 -0.09122797 0 -0.27181970
## japanese 0 0.28895643 0 -0.12651100 0 0.14054037
## korean 0 -0.07981259 0 0.55756709 0 -0.66979948
## thai 0 -0.33160904 0 -0.10725182 0 -0.02602152
## avocado bacon baked_potato balm banana barley
## indian -0.46624197 0.16008055 0 0 -0.2838796 0.2230625
## japanese 0.90341344 0.02932727 0 0 -0.4142787 2.0953906
## korean -0.06925382 -0.35804134 0 0 -0.2686963 -0.7233404
## thai -0.21473955 -0.75594439 0 0 0.6784880 -0.4363320
## bartlett_pear basil bay bean beech
## indian 0 -0.7128756 0.1011587 -0.8777275 -0.0004380795
## japanese 0 0.1288697 0.9425626 -0.2380748 0.3373437611
## korean 0 -0.2445193 -0.4744318 -0.8957870 -0.0048784496
## thai 0 1.5365848 0.1333256 0.2196970 -0.0113078024
## beef beef_broth beef_liver beer beet
## indian -0.7985278 0.2430186 -0.035598065 -0.002173738 0.01005813
## japanese 0.2241875 -0.3653020 -0.139551027 0.128905553 0.04923911
## korean 0.5366515 -0.6153237 0.213455197 -0.010828645 0.27325423
## thai 0.1570012 -0.9364154 -0.008032213 -0.035063746 -0.28279823
## bell_pepper bergamot berry bitter_orange black_bean
## indian 0.49074330 0 0.58947607 0.191256164 -0.1945233
## japanese 0.09074167 0 -0.25917977 -0.118915977 -0.3442400
## korean -0.57876763 0 -0.07874180 -0.007729435 -0.5220672
## thai 0.92554006 0 -0.07210196 -0.002983296 -0.4614426
## black_currant black_mustard_seed_oil black_pepper black_raspberry
## indian 0 0.38935801 -0.4453495 0
## japanese 0 -0.05452887 -0.5440869 0
## korean 0 -0.03929970 0.8025454 0
## thai 0 -0.21498372 -0.9854806 0
## black_sesame_seed black_tea blackberry blackberry_brandy
## indian -0.2759246 0.3079977 0.191256164 0
## japanese -0.6101687 -0.1671913 -0.118915977 0
## korean 1.5197674 -0.3036261 -0.007729435 0
## thai -0.1755656 -0.1487033 -0.002983296 0
## blue_cheese blueberry bone_oil bourbon_whiskey brandy
## indian 0 0.216164294 -0.2276744 0 0.22427587
## japanese 0 -0.119186087 0.3913019 0 -0.15595599
## korean 0 -0.007821986 0.2854487 0 -0.02562342
## thai 0 -0.004947048 -0.0253658 0 -0.05715244
##
## ...
## and 308 more lines.
The output shows the coefficients that the model learned during
training.
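To make these coefficients a little more concrete, here is a minimal base R sketch of how a multinomial model turns them into probabilities. It copies only the intercept and basil values from the printed output, and it assumes chinese is the reference class with coefficients fixed at 0 (it is the one cuisine missing from the coefficient table). The input is a hypothetical recipe that contains basil and nothing else, so every other ingredient contributes nothing to the scores.

```r
# Intercept and basil coefficients copied from the printed model output;
# the chinese row is an assumption: reference class with coefficients of 0
coefs <- matrix(
  c( 0,           0,
     0.19723325, -0.7128756,
     0.13961959,  0.1288697,
     0.22377347, -0.2445193,
    -0.04336577,  1.5365848),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("chinese", "indian", "japanese", "korean", "thai"),
                  c("(Intercept)", "basil"))
)

# Hypothetical recipe: basil present (1), all other ingredients absent
x <- c("(Intercept)" = 1, basil = 1)

scores <- drop(coefs %*% x)                # one linear score per cuisine
probs  <- exp(scores) / sum(exp(scores))   # softmax: scores -> probabilities

print(round(probs, 3))
print(names(which.max(probs)))
```

For this basil-only recipe, thai gets the highest probability, which matches the large positive basil coefficient in the thai row.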
Evaluate the Trained Model
It’s time to see how the model performed 📏 by evaluating it on a
test set! Let’s begin by making predictions on the test set.
# Make predictions on the test set
results <- cuisines_test %>% select(cuisine) %>%
  bind_cols(mr_fit %>% predict(new_data = cuisines_test))
# Print out results
results %>%
  slice_head(n = 5)
Great job! In Tidymodels, evaluating model performance can be done
using yardstick - a
package used to measure the effectiveness of models using performance
metrics. As we did in our logistic regression lesson, let’s begin by
computing a confusion matrix.
# Confusion matrix for categorical data
conf_mat(data = results, truth = cuisine, estimate = .pred_class)
## Truth
## Prediction chinese indian japanese korean thai
## chinese 83 1 8 15 10
## indian 4 163 1 2 6
## japanese 21 5 73 25 1
## korean 15 0 11 191 0
## thai 10 11 3 7 70
When dealing with multiple classes, it’s generally more intuitive to
visualize this as a heat map, like this:
update_geom_defaults(geom = "tile", new = list(color = "black", alpha = 0.7))
# Visualize confusion matrix
results %>%
  conf_mat(cuisine, .pred_class) %>%
  autoplot(type = "heatmap")

The darker squares in the confusion matrix plot indicate high numbers
of cases, and you can hopefully see a diagonal line of darker squares
indicating cases where the predicted and actual label are the same.
Let’s now calculate summary statistics for the confusion matrix.
# Summary stats for confusion matrix
conf_mat(data = results, truth = cuisine, estimate = .pred_class) %>% summary()
If we narrow down to some metrics such as accuracy, sensitivity and ppv,
we are not doing badly for a start 🥳!
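If you would like to see where such numbers come from, the base R sketch below recomputes a few of them by hand from the confusion matrix printed earlier. The counts are copied from that output; the macro averages are plain unweighted means over the five cuisines, which is an assumption about how to summarise the per-class values rather than a statement about how yardstick computes its summary.

```r
cuisines <- c("chinese", "indian", "japanese", "korean", "thai")

# Counts copied from the confusion matrix above (rows = prediction, columns = truth)
cm <- matrix(
  c(83,   1,  8,  15, 10,
     4, 163,  1,   2,  6,
    21,   5, 73,  25,  1,
    15,   0, 11, 191,  0,
    10,  11,  3,   7, 70),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = cuisines, Truth = cuisines)
)

accuracy    <- sum(diag(cm)) / sum(cm)  # share of all test recipes predicted correctly
sensitivity <- diag(cm) / colSums(cm)   # per cuisine: correct / all recipes truly of that cuisine
ppv         <- diag(cm) / rowSums(cm)   # per cuisine: correct / all recipes predicted as that cuisine

print(round(accuracy, 3))
print(round(sensitivity, 3))
print(round(ppv, 3))
print(round(c(macro_sensitivity = mean(sensitivity), macro_ppv = mean(ppv)), 3))
```

The per-class values make it easy to spot which cuisines the model confuses: chinese has the lowest sensitivity, while japanese has the lowest ppv.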
5. Digging Deeper
Let’s ask one subtle question: what criterion is used to settle on a
given type of cuisine as the predicted outcome?
Well, statistical machine learning algorithms, like logistic
regression, are based on probability; so what actually gets
predicted by a classifier is a probability distribution over a set of
possible outcomes. The class with the highest probability is then chosen
as the most likely outcome for the given observation.
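The rule “pick the class with the highest probability” can be sketched in base R as follows. The probabilities are made up for illustration, and the .pred_ column names only mimic the naming style of tidymodels probability predictions.

```r
# Hypothetical class probabilities for two recipes (each row sums to 1)
probs <- matrix(
  c(0.05, 0.10, 0.05, 0.10, 0.70,
    0.40, 0.05, 0.35, 0.15, 0.05),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c(".pred_chinese", ".pred_indian", ".pred_japanese",
                          ".pred_korean", ".pred_thai"))
)

# For each row, find the column holding the largest probability
best_col <- max.col(probs, ties.method = "first")

# Turn the winning column name into a class label
pred_class <- sub("^\\.pred_", "", colnames(probs)[best_col])
print(pred_class)
```

Note that the second recipe is a much closer call (0.40 for chinese against 0.35 for japanese) than the first; the hard class prediction hides that uncertainty, which is why looking at the probabilities is informative.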
Let’s see this in action by making both hard class predictions and
probabilities.
# Make hard class prediction and probabilities
results_prob <- cuisines_test %>%
  select(cuisine) %>%
  bind_cols(mr_fit %>% predict(new_data = cuisines_test)) %>%
  bind_cols(mr_fit %>% predict(new_data = cuisines_test, type = "prob"))
# Print out results
results_prob %>%
  slice_head(n = 5)
Much better!
✅ Can you explain why the model is pretty sure that the first
observation is Thai?
🚀Challenge
In this lesson, you used your cleaned data to build a machine
learning model that can predict a national cuisine based on a series of
ingredients. Take some time to read through the many options
Tidymodels provides to classify data and other
ways to fit multinomial regression.
THANK YOU TO:
Allison Horst
for creating the amazing illustrations that make R more welcoming and
engaging. Find more illustrations at her gallery.
Cassie Breviu and Jen Looper for creating the
original Python version of this module ♥️
Happy Learning,
Eric, Gold Microsoft Learn
Student Ambassador.
