# Build a classification model: Delicious Asian and Indian Cuisines


## Introduction to classification: Clean, prep, and visualize your data

In these four lessons, you'll dive into one of the core areas of classic machine learning: *classification*. We'll explore various classification algorithms using a dataset about the diverse and delicious cuisines of Asia and India. Get ready to whet your appetite!

<p>
   <img src="../../images/pinch.png"
   width="600"/>
   <figcaption>Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper</figcaption>

Classification is a type of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. In classification, you train a model to predict which `category` an item belongs to. If machine learning is all about using datasets to predict values or to assign names to things, then classification problems generally fall into two groups: *binary classification* and *multiclass classification*.
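
To make the distinction concrete, here is a minimal sketch with made-up labels. The names `binary_labels` and `multiclass_labels` are hypothetical and are not part of the lesson's data.

```r
# Hypothetical toy labels, not taken from the cuisines data
binary_labels     <- factor(c("spam", "not_spam", "spam"))
multiclass_labels <- factor(c("thai", "korean", "indian", "thai"))

# The number of distinct classes tells the two problem types apart
nlevels(binary_labels)      # two classes: a binary problem
nlevels(multiclass_labels)  # more than two classes: a multiclass problem
```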

Keep in mind:

-   **Linear regression** helped you model relationships between variables and predict where a new data point would fall in relation to the fitted line. For example, you could predict numeric values like *the price of a pumpkin in September versus December*.

-   **Logistic regression** helped you identify "binary categories": at a certain price point, *is this pumpkin orange or not-orange*?

Classification uses various algorithms to determine other ways of assigning a label or class to a data point. In this lesson, we'll use cuisine data to see if we can predict the cuisine of origin based on a set of ingredients.

### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)

### **Introduction**

Classification is a fundamental task for machine learning researchers and data scientists. From simple binary classification ("is this email spam or not?") to complex image classification and segmentation using computer vision, the ability to sort data into classes and analyze it is invaluable.

To put it in more scientific terms, classification involves creating a predictive model that maps the relationship between input variables and output variables.

<p>
   <img src="../../images/binary-multiclass.png"
   width="600"/>
   <figcaption>Binary vs. multiclass problems for classification algorithms to handle. Infographic by Jen Looper</figcaption>

Before we dive into cleaning, visualizing, and preparing our data for machine learning tasks, let's explore the different ways machine learning can be used to classify data.

Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification in classic machine learning uses features like `smoker`, `weight`, and `age` to predict *the likelihood of developing a certain disease*. As a supervised learning technique similar to the regression exercises you've done before, classification uses labeled data and machine learning algorithms to predict the class (or label) of each observation and assign it to a specific group or outcome.

✅ Take a moment to imagine a dataset about cuisines. What kinds of questions could a multiclass model answer? What about a binary model? For instance, could you predict whether a given cuisine is likely to use fenugreek? Or, if you were handed a grocery bag containing star anise, artichokes, cauliflower, and horseradish, could you determine whether you could create a typical Indian dish?

### **Hello 'classifier'**

The question we want to answer with this cuisine dataset is a **multiclass question**, as we have several possible national cuisines to consider. Based on a set of ingredients, which of these many classes does the data belong to?

Tidymodels provides several algorithms for classifying data, depending on the type of problem you're trying to solve. In the next two lessons, you'll learn about some of these algorithms.

#### **Prerequisite**

For this lesson, we'll need the following packages to clean, prepare, and visualize our data:

-   `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more enjoyable.

-   `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.

-   `DataExplorer`: The [DataExplorer package](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) simplifies and automates the exploratory data analysis (EDA) process and report generation.

-   `themis`: The [themis package](https://themis.tidymodels.org/) provides additional recipe steps for handling unbalanced data.

You can install them using:

`install.packages(c("tidyverse", "tidymodels", "DataExplorer", "themis", "here"))`

Alternatively, the script below checks whether the required packages for this module are installed and installs them for you if they're missing.


In [None]:
# Install pacman if it is missing, then use it to install and load the rest
suppressWarnings(if (!require("pacman")) install.packages("pacman"))

pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)

Later on, we'll load these packages to make them available in the current R session. (This is purely for illustration; `pacman::p_load()` has already done that for you.)


## Exercise - clean and balance your data

The first task before starting this project is to clean and **balance** your data to achieve better results.

Let's take a look at the data! 🕵️


In [None]:
# Import data
df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")

# View the first 5 rows
df %>% 
  slice_head(n = 5)


Interesting! From the looks of it, the first column is a kind of `id` column. Let's get a little more information about the data.


In [None]:
# Basic information about the data
df %>%
  introduce()

# Visualize basic information above
df %>% 
  plot_intro(ggtheme = theme_light())

From the output, we can immediately see that we have `2448` rows and `385` columns and `0` missing values. We also have 1 discrete column, *cuisine*.
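
As a cross-check on `introduce()`, the same facts can be computed by hand. The sketch below runs on a hypothetical toy tibble (`toy_df` and its values are made up for illustration) rather than on the real cuisines data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the cuisines data frame
toy_df <- tibble(
  id      = 1:4,
  cuisine = c("thai", "korean", "thai", "indian"),
  rice    = c(1, 0, 1, NA),
  garlic  = c(0, 1, 1, 1)
)

# Number of rows and columns
dim(toy_df)

# Total number of missing values across all columns
n_missing <- sum(is.na(toy_df))

# Number of discrete (character or factor) columns
n_discrete <- toy_df %>%
  select(where(~ is.character(.x) || is.factor(.x))) %>%
  ncol()

n_missing
n_discrete
```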

## Exercise - exploring cuisines

Now the task gets more engaging. Let's analyze the data distribution by cuisine.


In [None]:
# Count observations per cuisine
df %>% 
  count(cuisine) %>% 
  arrange(n)

# Plot the distribution
theme_set(theme_light())
df %>% 
  count(cuisine) %>% 
  ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +
  geom_col(fill = "midnightblue", alpha = 0.7) +
  ylab("cuisine")

There are a limited number of cuisines, but the data distribution is imbalanced. You can address this! Before proceeding, take some time to explore further.
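
One way to put a number on the imbalance is to compute each class's share of the data and its size relative to the smallest class. This is a minimal sketch on hypothetical toy labels (`toy_labels` is made up); the same pipeline could be applied to `df` after `count(cuisine)`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy labels, deliberately imbalanced
toy_labels <- tibble(
  cuisine = c(rep("korean", 6), rep("indian", 3), rep("thai", 1))
)

# Share of each class and its size relative to the smallest class
class_share <- toy_labels %>%
  count(cuisine) %>%
  mutate(
    prop = n / sum(n),
    ratio_to_smallest = n / min(n)
  ) %>%
  arrange(desc(n))

class_share
```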

Next, let's separate each cuisine into its own tibble and determine the amount of data available (rows, columns) for each cuisine.

> A [tibble](https://tibble.tidyverse.org/) is a modern version of a data frame.

<p >
   <img src="../../images/dplyr_filter.jpg"
   width="600"/>
   <figcaption>Illustration by @allison_horst</figcaption>


In [None]:
# Create individual tibble for the cuisines
thai_df <- df %>% 
  filter(cuisine == "thai")
japanese_df <- df %>% 
  filter(cuisine == "japanese")
chinese_df <- df %>% 
  filter(cuisine == "chinese")
indian_df <- df %>% 
  filter(cuisine == "indian")
korean_df <- df %>% 
  filter(cuisine == "korean")


# Find out how much data is available per cuisine
cat(" thai df:", dim(thai_df), "\n",
    "japanese df:", dim(japanese_df), "\n",
    "chinese_df:", dim(chinese_df), "\n",
    "indian_df:", dim(indian_df), "\n",
    "korean_df:", dim(korean_df))

## **Exercise - Discovering top ingredients by cuisine using dplyr**

Now you can dive deeper into the data and explore the typical ingredients for each cuisine. You'll need to clean up recurring data that causes confusion between cuisines, so let's tackle this issue.

Create a function `create_ingredient()` in R that returns a dataframe of ingredients. This function will begin by removing an unhelpful column and then sort the ingredients based on their count.

The basic structure of a function in R is:

`myFunction <- function(arglist){`

**`...`**

**`return`**`(value)`

`}`

A concise introduction to R functions can be found [here](https://skirmer.github.io/presentations/functions_with_r.html#1).
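
As a warm-up, here is that structure filled in with a small, hypothetical helper. `count_nonzero()` is made up for illustration and is unrelated to the lesson's `create_ingredient()`.

```r
# A hypothetical helper that follows the structure above
count_nonzero <- function(x) {
  # Count how many entries of a numeric vector are not zero
  n_nonzero <- sum(x != 0)
  return(n_nonzero)
}

# Call the function on a small vector
count_nonzero(c(0, 1, 1, 0, 1))
```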

Let's jump right in! We'll use [dplyr verbs](https://dplyr.tidyverse.org/) that we've been learning in previous lessons, plus one reshaping function from tidyr. As a quick refresher:

-   `dplyr::select()`: helps you choose which **columns** to keep or exclude.

-   `tidyr::pivot_longer()`: allows you to "lengthen" data, increasing the number of rows while reducing the number of columns.

-   `dplyr::group_by()` and `dplyr::summarise()`: enable you to calculate summary statistics for different groups and organize them into a neat table.

-   `dplyr::filter()`: creates a subset of the data containing only rows that meet your conditions.

-   `dplyr::mutate()`: lets you create or modify columns.

Check out this [*art*-filled learnr tutorial](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) by Allison Horst, which introduces some handy data wrangling functions in dplyr *(part of the Tidyverse)*.
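
Before writing the real function, it may help to see these verbs chained together on a tiny example. The tibble `toy_recipes` and its columns are hypothetical; only the verbs themselves come from the refresher above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical one-hot recipe data (1 = ingredient present)
toy_recipes <- tibble(
  id      = 1:3,
  cuisine = c("thai", "thai", "korean"),
  rice    = c(1, 1, 1),
  basil   = c(1, 0, 0),
  kimchi  = c(0, 0, 1)
)

toy_counts <- toy_recipes %>%
  # select(): drop the id column
  select(-id) %>%
  # pivot_longer(): one row per recipe-ingredient pair
  pivot_longer(!cuisine, names_to = "ingredient", values_to = "present") %>%
  # filter(): keep only ingredients that are actually used
  filter(present == 1) %>%
  # group_by() + summarise(): count recipes per cuisine and ingredient
  group_by(cuisine, ingredient) %>%
  summarise(n_recipes = sum(present), .groups = "drop") %>%
  # mutate(): modify an existing column
  mutate(ingredient = toupper(ingredient))

toy_counts
```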


In [None]:
# Create a function that returns the top ingredients by class

create_ingredient <- function(df){
  
  # Drop the id column, which is the first column
  ingredient_df <- df %>% select(-1) %>% 
  # Reshape the data to a long format: one row per recipe-ingredient pair
    pivot_longer(!cuisine, names_to = "ingredients", values_to = "count") %>% 
  # Count how often each ingredient appears and drop unused ingredients
    group_by(ingredients) %>% 
    summarise(n_instances = sum(count)) %>% 
    filter(n_instances != 0) %>% 
  # Arrange in descending order and keep that order as the factor levels
    arrange(desc(n_instances)) %>% 
    mutate(ingredients = factor(ingredients) %>% fct_inorder())
  
  
  return(ingredient_df)
} # End of function

Now we can use the function to get an idea of the top ten most popular ingredients by cuisine. Let's test it with `thai_df`.


In [None]:
# Call create_ingredient and display popular ingredients
thai_ingredient_df <- create_ingredient(df = thai_df)

thai_ingredient_df %>% 
  slice_head(n = 10)

In the previous section, we used `geom_col()`. Let's see how you can also use `geom_bar()` to create bar charts. Use `?geom_bar` for further reading.


In [None]:
# Make a bar chart of the most popular Thai ingredients
thai_ingredient_df %>% 
  slice_head(n = 10) %>% 
  ggplot(aes(x = n_instances, y = ingredients)) +
  geom_bar(stat = "identity", width = 0.5, fill = "steelblue") +
  xlab("") + ylab("")

Let's do the same for the Japanese data.


In [None]:
# Get popular ingredients for Japanese cuisines and make bar chart
create_ingredient(df = japanese_df) %>% 
  slice_head(n = 10) %>%
  ggplot(aes(x = n_instances, y = ingredients)) +
  geom_bar(stat = "identity", width = 0.5, fill = "darkorange", alpha = 0.8) +
  xlab("") + ylab("")


What about the Chinese cuisine?


In [None]:
# Get popular ingredients for Chinese cuisines and make bar chart
create_ingredient(df = chinese_df) %>% 
  slice_head(n = 10) %>%
  ggplot(aes(x = n_instances, y = ingredients)) +
  geom_bar(stat = "identity", width = 0.5, fill = "cyan4", alpha = 0.8) +
  xlab("") + ylab("")

Let's take a look at the Indian cuisines 🌶️.


In [None]:
# Get popular ingredients for Indian cuisines and make bar chart
create_ingredient(df = indian_df) %>% 
  slice_head(n = 10) %>%
  ggplot(aes(x = n_instances, y = ingredients)) +
  geom_bar(stat = "identity", width = 0.5, fill = "#041E42FF", alpha = 0.8) +
  xlab("") + ylab("")

Finally, plot the Korean ingredients.


In [None]:
# Get popular ingredients for Korean cuisines and make bar chart
create_ingredient(df = korean_df) %>% 
  slice_head(n = 10) %>%
  ggplot(aes(x = n_instances, y = ingredients)) +
  geom_bar(stat = "identity", width = 0.5, fill = "#852419FF", alpha = 0.8) +
  xlab("") + ylab("")

From the data visualizations, we can now exclude the most common ingredients that cause confusion between different cuisines, using `dplyr::select()`.

Everyone loves rice, garlic, and ginger!


In [None]:
# Drop id column, rice, garlic and ginger from our original data set
df_select <- df %>% 
  select(-c(1, rice, garlic, ginger))

# Display new data set
df_select %>% 
  slice_head(n = 5)

## Preprocessing data using recipes 👩‍🍳👨‍🍳 - Dealing with imbalanced data ⚖️

<p >
   <img src="../../images/recipes.png"
   width="600"/>
   <figcaption>Artwork by @allison_horst</figcaption>

Since this lesson is about cuisines, we need to frame `recipes` in the right context.

Tidymodels offers another handy package: `recipes` - a package designed for data preprocessing.


Let's take a look at the distribution of our cuisines again.


In [None]:
# Distribution of cuisines
old_label_count <- df_select %>% 
  count(cuisine) %>% 
  arrange(desc(n))

old_label_count

As you can see, the number of recipes per cuisine is quite unequal: there are almost three times as many Korean recipes as Thai recipes. Imbalanced data often has a negative effect on model performance. Consider a binary classification: if most of your data belongs to one class, a model will tend to predict that class more often, simply because it has seen more data for it. Balancing the data removes this skew. Many models perform best when each class has a similar number of observations and tend to struggle with unbalanced data.

There are primarily two approaches to handling imbalanced data sets:

-   Adding observations to the minority class: `Over-sampling`, for example, using a SMOTE algorithm.

-   Removing observations from the majority class: `Under-sampling`.
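
The two approaches can be sketched with plain dplyr on a hypothetical toy data set (`toy_df` is made up). Note that the over-sampling shown here simply redraws existing rows with replacement; it is a simpler stand-in for illustration, whereas SMOTE creates new synthetic rows.

```r
library(dplyr)
library(tibble)

set.seed(2021)

# Hypothetical imbalanced data: 6 "korean" rows and 2 "thai" rows
toy_df <- tibble(
  cuisine = c(rep("korean", 6), rep("thai", 2)),
  x       = 1:8
)

# Under-sampling: shrink every class to the size of the smallest one
n_min <- toy_df %>% count(cuisine) %>% pull(n) %>% min()
under_df <- toy_df %>%
  group_by(cuisine) %>%
  slice_sample(n = n_min) %>%
  ungroup()

# Random over-sampling: grow every class to the size of the largest one
# by drawing rows with replacement
n_max <- toy_df %>% count(cuisine) %>% pull(n) %>% max()
over_df <- toy_df %>%
  group_by(cuisine) %>%
  slice_sample(n = n_max, replace = TRUE) %>%
  ungroup()

count(under_df, cuisine)
count(over_df, cuisine)
```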

Now, let's demonstrate how to handle imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that outlines the steps to be applied to a data set to prepare it for data analysis.


In [None]:
# Load themis package for dealing with imbalanced data
library(themis)

# Create a recipe for preprocessing data
cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% 
  step_smote(cuisine)

cuisines_recipe

Let's break down our preprocessing steps.

-   The call to `recipe()` with a formula specifies the *roles* of the variables using the `df_select` data as a reference. For example, the `cuisine` column is assigned the `outcome` role, while the other columns are assigned the `predictor` role.

-   [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) defines a *step* in the recipe that synthetically generates new examples for the minority class using the nearest neighbors of those cases.
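
The core idea behind a SMOTE-style synthetic example can be illustrated in a few lines of base R: a new point is placed at a random position on the line segment between a minority observation and one of its nearest minority-class neighbors. This is a simplified sketch of the idea with made-up points, not the code that `themis` runs internally.

```r
set.seed(42)

# Two hypothetical minority-class points in a 2-D feature space
x_i        <- c(1, 2)   # a minority observation
x_neighbor <- c(3, 6)   # one of its nearest minority-class neighbors

# Pick a random point on the segment between the two
u     <- runif(1)                        # random weight between 0 and 1
x_new <- x_i + u * (x_neighbor - x_i)    # the synthetic observation

x_new
```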

Now, if we want to view the preprocessed data, we need to [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) and [**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) the recipe.

`prep()`: calculates the necessary parameters from a training set, which can then be applied to other datasets.

`bake()`: applies the operations from a prepped recipe to any dataset.
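
The split between estimating and applying is easiest to see with a step that learns parameters from the data. The sketch below uses `step_normalize()` on hypothetical toy data (`train_df` and `new_df` are made up) instead of the lesson's SMOTE step.

```r
library(recipes)
library(dplyr)
library(tibble)

# Hypothetical training data and new data
train_df <- tibble(y = c("a", "b", "a", "b"), x = c(1, 2, 3, 4))
new_df   <- tibble(y = c("a", "b"), x = c(2.5, 10))

# Define the blueprint: center and scale the predictor
toy_recipe <- recipe(y ~ x, data = train_df) %>%
  step_normalize(all_predictors())

# prep() estimates the mean and standard deviation of x from train_df
toy_prepped <- prep(toy_recipe, training = train_df)

# bake() applies those stored estimates to a data set
baked_train <- bake(toy_prepped, new_data = NULL)    # processed training data
baked_new   <- bake(toy_prepped, new_data = new_df)  # new data, same parameters

baked_train
baked_new
```

Because the mean and standard deviation come from `train_df` only, the value `2.5` in `new_df` (the training mean) is mapped to zero.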


In [None]:
# Prep and bake the recipe
preprocessed_df <- cuisines_recipe %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  relocate(cuisine)

# Display data
preprocessed_df %>% 
  slice_head(n = 5)

# Quick summary stats
preprocessed_df %>% 
  introduce()

Let's now check the distribution of our cuisines and compare them with the imbalanced data.


In [None]:
# Distribution of cuisines
new_label_count <- preprocessed_df %>% 
  count(cuisine) %>% 
  arrange(desc(n))

list(new_label_count = new_label_count,
     old_label_count = old_label_count)

Yum! The data is nice and clean, balanced, and very delicious 😋!

> Typically, a recipe is used as a preprocessor for modeling, where it specifies the steps to be applied to a dataset to prepare it for modeling. In such cases, a `workflow()` is generally used (as we've seen in previous lessons) instead of manually processing a recipe.
>
> Therefore, you usually don't need to **`prep()`** and **`bake()`** recipes when working with tidymodels. However, these functions are useful tools for verifying that recipes are performing as expected, as in our example.
>
> When you **`bake()`** a prepped recipe with **`new_data = NULL`**, you retrieve the original data you provided when defining the recipe, but with the preprocessing steps applied.
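
For comparison, here is a rough sketch of how a recipe is bundled with a model in a `workflow()`. Everything in it is an assumption made for illustration: it uses the built-in `iris` data as a stand-in for the cuisines data, and a multinomial regression with the `nnet` engine as one possible multiclass model.

```r
library(tidymodels)

# Hypothetical stand-in data: iris has a three-class outcome (Species)
toy_recipe <- recipe(Species ~ ., data = iris) %>%
  step_normalize(all_predictors())

# One possible multiclass model specification
toy_spec <- multinom_reg(penalty = 0) %>%
  set_engine("nnet") %>%
  set_mode("classification")

# Bundle the preprocessing and the model
toy_wf <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(toy_spec)

# fit() preps the recipe and trains the model in one call
toy_fit <- fit(toy_wf, data = iris)

# predict() applies the prepped recipe to new data before predicting
toy_preds <- predict(toy_fit, new_data = head(iris, 3))
toy_preds
```

With this pattern there is no explicit call to `prep()` or `bake()`; the workflow handles both.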

Now, let's save a copy of this data for use in future lessons:


In [None]:
# Save preprocessed data
write_csv(preprocessed_df, "../../../data/cleaned_cuisines_R.csv")

This fresh CSV can now be found in the root data folder.

**🚀Challenge**

This curriculum contains several interesting datasets. Explore the `data` folders and see if any of them include datasets suitable for binary or multi-class classification. What questions could you ask about this dataset?

## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)

## **Review & Self Study**

-   Check out [package themis](https://github.com/tidymodels/themis). What other techniques can be used to address imbalanced data?

-   Tidymodels [reference website](https://www.tidymodels.org/start/).

-   H. Wickham and G. Grolemund, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).

#### THANK YOU TO:

[`Allison Horst`](https://twitter.com/allison_horst/) for creating the wonderful illustrations that make R more approachable and engaging. You can find more of her work in her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).

[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for developing the original Python version of this module ♥️

<p >
   <img src="../../images/r_learners_sm.jpeg"
   width="600"/>
   <figcaption>Artwork by @allison_horst</figcaption>



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
