# Build a classification model: Delicious Asian and Indian Cuisines


## Cuisine classifiers 2

In this second classification lesson, we will explore `additional methods` for classifying categorical data. We will also discuss the implications of choosing one classifier over another.

### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/23/)

### **Prerequisite**

We assume that you have completed the previous lessons since we will be building on concepts introduced earlier.

For this lesson, the following packages will be required:

-   `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [set of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more enjoyable!

-   `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.

-   `themis`: The [themis package](https://themis.tidymodels.org/) provides additional recipe steps for handling imbalanced data.

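You can install them, together with the engine packages used later in this lesson (`kernlab`, `ranger`, `xgboost`, and `kknn`), using the following command: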

`install.packages(c("tidyverse", "tidymodels", "kernlab", "themis", "ranger", "xgboost", "kknn"))`

Alternatively, the script below checks whether the required packages for this module are installed and installs any missing ones for you.


In [None]:
suppressWarnings(if (!require("pacman")) install.packages("pacman"))

pacman::p_load(tidyverse, tidymodels, themis, kernlab, ranger, xgboost, kknn)

## **1. A classification map**

In our [previous lesson](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification/2-Classifiers-1), we explored the question: how do we decide between different models? To a large extent, the choice depends on the characteristics of the data and the type of problem we aim to solve (e.g., classification or regression).

Earlier, we learned about the various options available for classifying data using Microsoft's cheat sheet. Python's Machine Learning framework, Scikit-learn, provides a similar but more detailed cheat sheet that can help further refine your choice of estimators (another term for classifiers):

<p >
   <img src="../../images/map.png"
   width="700"/>
   <figcaption></figcaption>


> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read documentation.
>
> The [Tidymodels reference site](https://www.tidymodels.org/find/parsnip/#models) also provides excellent documentation about different types of models.

### **The plan** 🗺️

This map is very useful once you have a solid understanding of your data, as you can 'navigate' its paths to make a decision:

-   We have \>50 samples

-   We want to predict a category

-   We have labeled data

-   We have fewer than 100K samples

-   ✨ We can choose a Linear SVC

-   If that doesn't work, since we have numeric data

    -   We can try a ✨ KNeighbors Classifier

        -   If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers

This is a great path to follow. Now, let's dive right into it using the [tidymodels](https://www.tidymodels.org/) modeling framework: a consistent and flexible collection of R packages designed to promote good statistical practices 😊.

## 2. Split the data and handle imbalanced datasets.

From our previous lessons, we learned that there was a set of common ingredients across our cuisines. Additionally, there was a significant imbalance in the distribution of cuisines.

We'll address these issues by:

-   Dropping the most common ingredients that cause confusion between distinct cuisines, using `dplyr::select()`.

-   Using a `recipe` to preprocess the data and prepare it for modeling by applying an `over-sampling` algorithm.

We already covered this in the previous lesson, so this should be a piece of cake 🥳!


In [None]:
# Load the core Tidyverse and Tidymodels packages
library(tidyverse)
library(tidymodels)

# Load the original cuisines data
df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")

# Drop id column, rice, garlic and ginger from our original data set
df_select <- df %>% 
  select(-c(1, rice, garlic, ginger)) %>%
  # Encode cuisine column as categorical
  mutate(cuisine = factor(cuisine))


# Create data split specification
set.seed(2056)
cuisines_split <- initial_split(data = df_select,
                                strata = cuisine,
                                prop = 0.7)

# Extract the data in each split
cuisines_train <- training(cuisines_split)
cuisines_test <- testing(cuisines_split)

# Display distribution of cuisines in the training set
cuisines_train %>% 
  count(cuisine) %>% 
  arrange(desc(n))
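
To see what stratifying on the outcome buys us, here is a minimal base R sketch on hypothetical toy data (not the cuisines data). It samples 70% of the rows *within* each class, which is the idea behind `strata = cuisine`; it is an illustration only and not how `initial_split()` is implemented.

```r
# Toy data with an imbalanced outcome (hypothetical, not the cuisines data)
set.seed(2056)
toy <- data.frame(
  cuisine = factor(rep(c("korean", "thai", "indian"), times = c(100, 40, 60))),
  x = rnorm(200)
)

# Sample 70% of the row indices *within* each class
rows_by_class <- split(seq_len(nrow(toy)), toy$cuisine)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Compare class proportions in the full data and in each piece
prop_all   <- prop.table(table(toy$cuisine))
prop_train <- prop.table(table(toy_train$cuisine))
prop_test  <- prop.table(table(toy_test$cuisine))
round(rbind(all = prop_all, train = prop_train, test = prop_test), 2)
```

Because the sampling happens class by class, the training and test sets keep the same class mix as the full data, so a rare cuisine cannot end up missing from one of them by chance.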

### Addressing Imbalanced Data

Imbalanced data can often negatively impact model performance. Many models work best when the number of observations is balanced, and they tend to struggle when faced with unbalanced data.

There are primarily two approaches to handle imbalanced datasets:

-   Adding observations to the minority class: `Over-sampling`, for example, using the SMOTE algorithm, which synthetically generates new examples for the minority class by leveraging the nearest neighbors of those cases.

-   Removing observations from the majority class: `Under-sampling`
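
The interpolation idea behind SMOTE can be sketched in a few lines of base R. This is a deliberately simplified illustration on hypothetical data: the helper `smote_one()` and the choice of `k = 3` are assumptions made for the example, and the code is not the `themis` implementation.

```r
set.seed(2056)

# Hypothetical minority class: 6 points with two numeric features
minority <- matrix(rnorm(12), ncol = 2, dimnames = list(NULL, c("x1", "x2")))

# Simplified SMOTE-style generator: interpolate between a point and one of
# its k nearest minority-class neighbours
smote_one <- function(X, k = 3) {
  i  <- sample.int(nrow(X), 1)              # pick a minority point at random
  d  <- sqrt(colSums((t(X) - X[i, ])^2))    # Euclidean distance to every point
  nn <- order(d)[2:(k + 1)]                 # k nearest neighbours (skip the point itself)
  j  <- nn[sample.int(k, 1)]                # pick one of those neighbours
  X[i, ] + runif(1) * (X[j, ] - X[i, ])     # random point on the segment between them
}

# Generate four synthetic minority observations
synthetic <- t(replicate(4, smote_one(minority)))
synthetic
```

Each synthetic row lies on the line segment between two real minority observations, which is why over-sampling of this kind needs numeric predictors.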

In our previous lesson, we demonstrated how to handle imbalanced datasets using a `recipe`. A recipe can be thought of as a blueprint that outlines the steps to be applied to a dataset to prepare it for analysis. In this case, we aim to achieve an equal distribution of cuisines in our `training set`. Let's dive in!


In [None]:
# Load themis package for dealing with imbalanced data
library(themis)

# Create a recipe for preprocessing training data
cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%
  step_smote(cuisine) 

# Print recipe
cuisines_recipe

Now we are ready to train models 👩‍💻👨‍💻!

## 3. Beyond multinomial regression models

In our previous lesson, we explored multinomial regression models. Let's dive into some more flexible models for classification.

### Support Vector Machines

In classification tasks, `Support Vector Machines` is a machine learning technique that aims to find a *hyperplane* that "optimally" separates the classes. Here's a simple example:

<p >
   <img src="../../images/svm.png"
   width="300"/>
   <figcaption>https://commons.wikimedia.org/w/index.php?curid=22877598</figcaption>


H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximal margin.
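
The notion of a margin can be made concrete with a small base R sketch. The six points and the three candidate hyperplanes below are hypothetical and chosen only to mimic the three cases in the figure; `margin()` is a helper defined for this example.

```r
# Toy, linearly separable data: class -1 on the left, class +1 on the right
X <- rbind(c(1, 1), c(1, 3), c(2, 2),    # class -1
           c(5, 1), c(5, 3), c(4, 2))    # class +1
y <- c(-1, -1, -1, 1, 1, 1)

# Signed distance of each point to the hyperplane w . x + b = 0.
# The margin is the smallest of these; it is negative if any point
# falls on the wrong side.
margin <- function(w, b) min(y * (X %*% w + b) / sqrt(sum(w^2)))

margin(w = c(0, 1), b = -2)     # line x2 = 2: does not separate the classes
margin(w = c(1, 0), b = -2.2)   # line x1 = 2.2: separates, but with a small margin
margin(w = c(1, 0), b = -3)     # line x1 = 3: separates with the widest margin here
```

A support vector classifier searches over `w` and `b` for the hyperplane that makes this smallest distance as large as possible.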

#### Linear Support Vector Classifier

The Support Vector Classifier (SVC) belongs to the Support Vector Machines family of machine learning techniques. In SVC, the hyperplane is chosen to correctly separate `most` of the training observations, but it `may misclassify` a few of them. Allowing some points to fall on the wrong side makes the model more robust to outliers, which improves its ability to generalize to new data. The parameter that controls this tolerance is called `cost`, which has a default value of 1 (see `help("svm_poly")`).

Let's create a linear SVC by setting `degree = 1` in a polynomial SVM model.


In [None]:
# Make a linear SVC specification
svc_linear_spec <- svm_poly(degree = 1) %>% 
  set_engine("kernlab") %>% 
  set_mode("classification")

# Bundle specification and recipe into a workflow
svc_linear_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(svc_linear_spec)

# Print out workflow
svc_linear_wf

Now that we have bundled the preprocessing steps and the model specification into a *workflow*, we can train the linear SVC and evaluate the results in one go. For performance metrics, let's create a metric set that evaluates `accuracy`, `sensitivity`, `Positive Predictive Value`, and `F Measure`.

> `augment()` will append column(s) containing predictions to the provided data.


In [None]:
# Train a linear SVC model
svc_linear_fit <- svc_linear_wf %>% 
  fit(data = cuisines_train)

# Create a metric set
eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)


# Make predictions and Evaluate model performance
svc_linear_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)
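
To make the four metrics less of a black box, the sketch below computes them by hand from a confusion matrix in base R. The labels are hypothetical, and the per-class values are combined with a macro average (an unweighted mean over classes), which to my understanding is what `yardstick` does by default for multiclass outcomes; treat that as an assumption and check `help("sens")` if it matters.

```r
# Hypothetical true and predicted labels for a 3-class problem
truth    <- factor(c("thai", "thai", "thai", "korean", "korean",
                     "indian", "indian", "indian"))
estimate <- factor(c("thai", "thai", "korean", "korean", "indian",
                     "indian", "indian", "thai"),
                   levels = levels(truth))

# Confusion matrix: rows are predictions, columns are the true classes
cm <- table(estimate = estimate, truth = truth)

tp       <- diag(cm)            # correctly predicted cases per class
sens_cls <- tp / colSums(cm)    # per-class sensitivity (recall)
ppv_cls  <- tp / rowSums(cm)    # per-class positive predictive value (precision)
f_cls    <- 2 * ppv_cls * sens_cls / (ppv_cls + sens_cls)

# Overall accuracy plus macro averages of the per-class metrics
metrics <- c(accuracy = sum(tp) / sum(cm),
             sens     = mean(sens_cls),
             ppv      = mean(ppv_cls),
             f_meas   = mean(f_cls))
round(metrics, 3)
```

Accuracy looks at all predictions at once, while the other three are computed per class first, so they react more strongly when one cuisine is predicted poorly.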

#### Support Vector Machine

The support vector machine (SVM) is an extension of the support vector classifier that accommodates non-linear boundaries between the classes. In essence, SVMs use the *kernel trick* to enlarge the feature space so that non-linear relationships between classes can be modeled. One popular and extremely flexible kernel function used by SVMs is the *radial basis function* (RBF). Let's see how it performs on our data.


In [None]:
set.seed(2056)

# Make an RBF SVM specification
svm_rbf_spec <- svm_rbf() %>% 
  set_engine("kernlab") %>% 
  set_mode("classification")

# Bundle specification and recipe into a workflow
svm_rbf_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(svm_rbf_spec)


# Train an RBF model
svm_rbf_fit <- svm_rbf_wf %>% 
  fit(data = cuisines_train)


# Make predictions and Evaluate model performance
svm_rbf_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)

Much better 🤩!
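
For intuition about what the radial basis function measures, here is a minimal base R sketch. It uses the common form exp(-sigma * ||x - z||^2); the helper `rbf_kernel()`, the value `sigma = 0.5`, and the three points are assumptions made for illustration rather than the values `kernlab` picks for our data.

```r
# RBF kernel: similarity decays with squared Euclidean distance.
# `sigma` is a hypothetical value chosen for illustration.
rbf_kernel <- function(x, z, sigma = 0.5) exp(-sigma * sum((x - z)^2))

a <- c(0, 0)
b <- c(0.1, 0.1)   # close to a
d <- c(3, 3)       # far from a

rbf_kernel(a, a)   # identical points: similarity is exactly 1
rbf_kernel(a, b)   # nearby points: close to 1
rbf_kernel(a, d)   # distant points: close to 0

# Kernel (Gram) matrix holding the similarity of every pair of points
pts <- rbind(a, b, d)
K <- outer(seq_len(nrow(pts)), seq_len(nrow(pts)),
           Vectorize(function(i, j) rbf_kernel(pts[i, ], pts[j, ])))
round(K, 3)
```

The SVM only ever needs these pairwise similarities, which is how it can draw a non-linear boundary without explicitly constructing the enlarged feature space.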

> ✅ Please see:
>
> -   [*Support Vector Machines*](https://bradleyboehmke.github.io/HOML/svm.html), Hands-on Machine Learning with R
>
> -   [*Support Vector Machines*](https://www.statlearning.com/), An Introduction to Statistical Learning with Applications in R
>
> for further reading.

### Nearest Neighbor classifiers

*K*-nearest neighbor (KNN) is an algorithm where each observation is predicted based on its *similarity* to other observations.

Let's fit one to our data.


In [None]:
# Make a KNN specification
knn_spec <- nearest_neighbor() %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

# Bundle recipe and model specification into a workflow
knn_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(knn_spec)

# Train a KNN model
knn_wf_fit <- knn_wf %>% 
  fit(data = cuisines_train)


# Make predictions and Evaluate model performance
knn_wf_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)

It seems that this model is not performing very well. Adjusting the model's arguments (see `help("nearest_neighbor")`) might improve its performance. Make sure to give it a try.
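
To see how the number of neighbors changes predictions, here is a minimal KNN classifier written from scratch in base R. It is a sketch under stated assumptions: the built-in `iris` data stands in for the cuisines data, `knn_predict()` is a hypothetical helper, and it uses plain majority voting, whereas the `kknn` engine also supports weighted votes.

```r
# Minimal KNN classifier: majority vote among the k closest training rows
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  apply(new_x, 1, function(p) {
    d     <- sqrt(rowSums(sweep(train_x, 2, p)^2))   # distance to every training row
    votes <- table(train_y[order(d)[seq_len(k)]])    # labels of the k nearest rows
    names(votes)[which.max(votes)]                   # most common label (first one on ties)
  })
}

# Built-in iris data stands in for the cuisines data here
set.seed(2056)
idx     <- sample(nrow(iris), 100)
train_x <- as.matrix(iris[idx, 1:4]);  train_y <- iris$Species[idx]
test_x  <- as.matrix(iris[-idx, 1:4]); test_y  <- iris$Species[-idx]

# Test accuracy for a few values of k
ks  <- c(1, 5, 15)
acc <- sapply(ks, function(k) {
  mean(knn_predict(train_x, train_y, test_x, k = k) == test_y)
})
data.frame(neighbors = ks, accuracy = acc)
```

In tidymodels, the same knob is the `neighbors` argument of `nearest_neighbor()`.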

> ✅ Please refer to:
>
> -   [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)
>
> -   [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)
>
> to learn more about *K*-Nearest Neighbors classifiers.

### Ensemble classifiers

Ensemble algorithms work by combining multiple base estimators to produce a model that is better than any single estimator, either by:

-   `bagging`: applying an *averaging function* to a collection of base models

-   `boosting`: building a sequence of models that build on one another to improve predictive performance.
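
Bagging can be sketched by hand in a few lines. The example below is an illustration under assumptions: it uses the built-in `iris` data and the `rpart` package (which ships with standard R installations) for the base trees, and it aggregates by majority vote, the classification counterpart of averaging.

```r
library(rpart)   # decision trees; ships with standard R installations

set.seed(2056)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit 25 trees, each on a bootstrap resample of the training rows
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- rpart(Species ~ ., data = boot)
  as.character(predict(fit, newdata = test, type = "class"))
})

# Aggregate: each test row gets the class that most trees voted for
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == test$Species)   # test accuracy of the bagged ensemble
```

A random forest follows the same recipe and additionally considers only a random subset of the predictors at each split, which makes the individual trees less alike.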

Let's begin by experimenting with a Random Forest model, which constructs a large collection of decision trees and then applies an averaging function to achieve a better overall model.


In [None]:
# Make a random forest specification
rf_spec <- rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

# Bundle recipe and model specification into a workflow
rf_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(rf_spec)

# Train a random forest model
rf_wf_fit <- rf_wf %>% 
  fit(data = cuisines_train)


# Make predictions and Evaluate model performance
rf_wf_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)

Good job 👏!

Let's also experiment with a Boosted Tree model.

Boosted Tree is an ensemble method that builds a sequence of decision trees, where each tree relies on the outcomes of the previous ones to gradually minimize errors. It emphasizes the weights of misclassified items and adjusts the next classifier to improve accuracy.

There are various approaches to fitting this model (refer to `help("boost_tree")`). In this example, we'll fit Boosted Trees using the `xgboost` engine.


In [None]:
# Make a boosted tree specification
boost_spec <- boost_tree(trees = 200) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

# Bundle recipe and model specification into a workflow
boost_wf <- workflow() %>% 
  add_recipe(cuisines_recipe) %>% 
  add_model(boost_spec)

# Train a boosted tree model
boost_wf_fit <- boost_wf %>% 
  fit(data = cuisines_train)


# Make predictions and Evaluate model performance
boost_wf_fit %>% 
  augment(new_data = cuisines_test) %>% 
  eval_metrics(truth = cuisine, estimate = .pred_class)
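
The "each tree corrects the previous ones" idea can be illustrated with a hand-rolled boosting loop. This is a simplified sketch and not what `xgboost` does internally: it boosts regression stumps on squared error using simulated data, and the learning rate and number of trees are assumed values chosen for the example.

```r
library(rpart)   # used for the one-split trees (stumps)

# Simulated regression data: a noisy sine wave
set.seed(2056)
x   <- seq(0, 2 * pi, length.out = 200)
dat <- data.frame(x = x, y = sin(x) + rnorm(200, sd = 0.2))

learn_rate <- 0.1                            # assumed value, for illustration
n_trees    <- 100                            # assumed value, for illustration
pred       <- rep(mean(dat$y), nrow(dat))    # start from a constant prediction
mse        <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  dat$resid <- dat$y - pred                  # what the ensemble still gets wrong
  stump <- rpart(resid ~ x, data = dat,
                 control = rpart.control(maxdepth = 1))   # a very weak learner
  pred   <- pred + learn_rate * predict(stump, newdata = dat)  # small corrective step
  mse[m] <- mean((dat$y - pred)^2)
}

mse[c(1, 10, 100)]   # training error after 1, 10 and 100 trees
```

Each stump is fitted to the residuals left by the ensemble so far, so the training error can only go down as trees are added; `trees` and the learning rate in `boost_tree()` control the same trade-off.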

> ✅ Please see:
>
> -   [Machine Learning for Social Scientists](https://cimentadaj.github.io/ml_socsci/tree-based-methods.html#random-forests)
>
> -   [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)
>
> -   [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)
>
> -   <https://algotech.netlify.app/blog/xgboost/> - Explores the AdaBoost model which is a good alternative to xgboost.
>
> to learn more about Ensemble classifiers.

## 4. Extra - comparing multiple models

We've worked with quite a few models in this lab 🙌. Creating workflows for different combinations of preprocessors and/or model specifications, and then calculating performance metrics for each one individually, can quickly become tedious and time-consuming.

Let's tackle this by building a function that fits a list of workflows on the training set and then returns the performance metrics based on the test set. To achieve this, we'll use `map()` and `map_dfr()` from the [purrr](https://purrr.tidyverse.org/) package to apply functions to each element in a list.

> [`map()`](https://purrr.tidyverse.org/reference/map.html) functions let you replace many for loops with code that is more concise and easier to read. The best resource for learning about [`map()`](https://purrr.tidyverse.org/reference/map.html) functions is the [iteration chapter](http://r4ds.had.co.nz/iteration.html) in R for Data Science.
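
As a small illustration of that point, the sketch below contrasts a for loop with `map_dbl()` and shows what the `.id` argument of `map_dfr()` does. The `scores` list is hypothetical toy data, not output from this lesson's models, and `map_dfr()` needs the `dplyr` package to be installed.

```r
library(purrr)

# A named list of small data frames (hypothetical toy data)
scores <- list(
  svc = data.frame(fold = 1:3, accuracy = c(0.78, 0.80, 0.79)),
  knn = data.frame(fold = 1:3, accuracy = c(0.70, 0.74, 0.72))
)

# for-loop version: pre-allocate, then fill element by element
means_loop <- numeric(length(scores))
for (i in seq_along(scores)) means_loop[i] <- mean(scores[[i]]$accuracy)

# map version: one call, and the list names are carried over
means_map <- map_dbl(scores, function(df) mean(df$accuracy))
means_map

# map_dfr() row-binds the results; `.id` stores each list name in a new column
all_scores <- map_dfr(scores, function(df) df, .id = "model")
all_scores
```

The `.id = "model"` column is what later lets us group the combined predictions by model.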


In [None]:
set.seed(2056)

# Create a metric set
eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)

# Define a function that returns performance metrics
compare_models <- function(workflow_list, train_set, test_set){
  
  suppressWarnings(
    # Fit each model to the train_set
    map(workflow_list, fit, data = train_set) %>% 
    # Make predictions on the test set
      map_dfr(augment, new_data = test_set, .id = "model") %>%
    # Select desired columns
      select(model, cuisine, .pred_class) %>% 
    # Evaluate model performance
      group_by(model) %>% 
      eval_metrics(truth = cuisine, estimate = .pred_class) %>% 
      ungroup()
  )
  
} # End of function

In [None]:
# Make a list of workflows
workflow_list <- list(
  "svc" = svc_linear_wf,
  "svm" = svm_rbf_wf,
  "knn" = knn_wf,
  "random_forest" = rf_wf,
  "xgboost" = boost_wf)

# Call the function
set.seed(2056)
perf_metrics <- compare_models(workflow_list = workflow_list, train_set = cuisines_train, test_set = cuisines_test)

# Print out performance metrics
perf_metrics %>% 
  group_by(.metric) %>% 
  arrange(desc(.estimate)) %>% 
  slice_head(n=7)

# Compare accuracy
perf_metrics %>% 
  filter(.metric == "accuracy") %>% 
  arrange(desc(.estimate))


The [**workflowsets**](https://workflowsets.tidymodels.org/) package allows users to create and easily fit a large number of models, but it is primarily designed to work with resampling techniques such as `cross-validation`, an approach we have yet to cover.

## **🚀Challenge**

Each of these techniques has a variety of parameters that you can adjust, such as `cost` in SVMs, `neighbors` in KNN, and `mtry` (Randomly Selected Predictors) in Random Forest.

Research the default parameters for each and consider what changing these parameters might mean for the quality of the model.

To learn more about a specific model and its parameters, use: `help("model")`, e.g., `help("rand_forest")`.

> In practice, we usually *estimate* the *best values* for these parameters by training many models on `resampled` versions of the training data and measuring how well each one performs. This process is called **tuning**.
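
A bare-bones version of tuning can be written with a manual 5-fold cross-validation loop. The sketch below is an illustration under assumptions: it tunes the number of neighbors for `knn()` from the `class` package (which ships with standard R installations) on the built-in `iris` data, and the candidate values are arbitrary. The tidymodels way of doing this relies on resampling tools that this lesson has not covered.

```r
library(class)   # provides knn(); ships with standard R installations

set.seed(2056)
x <- as.matrix(iris[, 1:4])    # built-in data as a stand-in for the cuisines data
y <- iris$Species
folds <- sample(rep(1:5, length.out = nrow(x)))   # assign each row to one of 5 folds

# For each candidate k, average the held-out accuracy across the folds
candidates <- c(1, 3, 5, 9, 15)
cv_accuracy <- sapply(candidates, function(k) {
  mean(sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  }))
})

data.frame(neighbors = candidates, cv_accuracy = round(cv_accuracy, 3))
candidates[which.max(cv_accuracy)]   # the value of k we would keep
```

Every candidate is scored only on rows it was not trained on, so the comparison reflects how each setting is likely to behave on new data.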

### [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/24/)

### **Review & Self Study**

There's a lot of technical terminology in these lessons, so take a moment to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful terms!

#### THANK YOU TO:

[`Allison Horst`](https://twitter.com/allison_horst/) for creating the wonderful illustrations that make R more approachable and engaging. You can find more of her work in her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).

[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️

Happy Learning,

[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.

<p >
   <img src="../../images/r_learners_sm.jpeg"
   width="569"/>
   <figcaption>Artwork by @allison_horst</figcaption>



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
