Merge pull request #313 from R-icntay/main

R resources for lesson 11: Introduction to classifiers
pull/315/head
Jen Looper 3 years ago committed by GitHub
commit 343f6ec5b1

@ -103,8 +103,8 @@
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n",
"\n",
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n",
"\r\n",
"pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
],
"outputs": [],
@ -138,12 +138,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Import data\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\n",
"\n",
"# View the first 5 rows\n",
"df %>% \n",
" slice_head(n = 5)\n"
"# Import data\r\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
"\r\n",
"# View the first 5 rows\r\n",
"df %>% \r\n",
" slice_head(n = 5)\r\n"
],
"outputs": [],
"metadata": {
@ -163,12 +163,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Basic information about the data\n",
"df %>%\n",
" introduce()\n",
"\n",
"# Visualize basic information above\n",
"df %>% \n",
"# Basic information about the data\r\n",
"df %>%\r\n",
" introduce()\r\n",
"\r\n",
"# Visualize basic information above\r\n",
"df %>% \r\n",
" plot_intro(ggtheme = theme_light())"
],
"outputs": [],
@ -193,17 +193,17 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Count observations per cuisine\n",
"df %>% \n",
" count(cuisine) %>% \n",
" arrange(n)\n",
"\n",
"# Plot the distribution\n",
"theme_set(theme_light())\n",
"df %>% \n",
" count(cuisine) %>% \n",
" ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
"# Count observations per cuisine\r\n",
"df %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(n)\r\n",
"\r\n",
"# Plot the distribution\r\n",
"theme_set(theme_light())\r\n",
"df %>% \r\n",
" count(cuisine) %>% \r\n",
" ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
" ylab(\"cuisine\")"
],
"outputs": [],
@ -214,15 +214,17 @@
{
"cell_type": "markdown",
"source": [
"There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.\n",
"\n",
"Next, let's assign each cuisine into its individual table and find out how much data is available (rows, columns) per cuisine.\n",
"\n",
"<p >\n",
" <img src=\"../../images/dplyr_filter.jpg\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n"
"There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.\r\n",
"\r\n",
"Next, let's assign each cuisine into its individual tibble and find out how much data is available (rows, columns) per cuisine.\r\n",
"\r\n",
"> A [tibble](https://tibble.tidyverse.org/) is a modern data frame.\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../../images/dplyr_filter.jpg\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n"
],
"metadata": {
"id": "vVvyDb1kG2in"
@ -232,24 +234,24 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Create individual tables for the cuisines\n",
"thai_df <- df %>% \n",
" filter(cuisine == \"thai\")\n",
"japanese_df <- df %>% \n",
" filter(cuisine == \"japanese\")\n",
"chinese_df <- df %>% \n",
" filter(cuisine == \"chinese\")\n",
"indian_df <- df %>% \n",
" filter(cuisine == \"indian\")\n",
"korean_df <- df %>% \n",
" filter(cuisine == \"korean\")\n",
"\n",
"\n",
"# Find out how much data is avilable per cuisine\n",
"cat(\" thai df:\", dim(thai_df), \"\\n\",\n",
" \"japanese df:\", dim(japanese_df), \"\\n\",\n",
" \"chinese_df:\", dim(chinese_df), \"\\n\",\n",
" \"indian_df:\", dim(indian_df), \"\\n\",\n",
"# Create individual tibble for the cuisines\r\n",
"thai_df <- df %>% \r\n",
" filter(cuisine == \"thai\")\r\n",
"japanese_df <- df %>% \r\n",
" filter(cuisine == \"japanese\")\r\n",
"chinese_df <- df %>% \r\n",
" filter(cuisine == \"chinese\")\r\n",
"indian_df <- df %>% \r\n",
" filter(cuisine == \"indian\")\r\n",
"korean_df <- df %>% \r\n",
" filter(cuisine == \"korean\")\r\n",
"\r\n",
"\r\n",
"# Find out how much data is avilable per cuisine\r\n",
"cat(\" thai df:\", dim(thai_df), \"\\n\",\r\n",
" \"japanese df:\", dim(japanese_df), \"\\n\",\r\n",
" \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
" \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
" \"korean_df:\", dim(korean_df))"
],
"outputs": [],
@ -303,24 +305,24 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Creates a functions that returns the top ingredients by class\n",
"\n",
"create_ingredient <- function(df){\n",
" \n",
" # Drop the id column which is the first colum\n",
" ingredient_df = df %>% select(-1) %>% \n",
" # Transpose data to a long format\n",
" pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \n",
" # Find the top most ingredients for a particular cuisine\n",
" group_by(ingredients) %>% \n",
" summarise(n_instances = sum(count)) %>% \n",
" filter(n_instances != 0) %>% \n",
" # Arrange by descending order\n",
" arrange(desc(n_instances)) %>% \n",
" mutate(ingredients = factor(ingredients) %>% fct_inorder())\n",
" \n",
" \n",
" return(ingredient_df)\n",
"# Creates a functions that returns the top ingredients by class\r\n",
"\r\n",
"create_ingredient <- function(df){\r\n",
" \r\n",
" # Drop the id column which is the first colum\r\n",
" ingredient_df = df %>% select(-1) %>% \r\n",
" # Transpose data to a long format\r\n",
" pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
" # Find the top most ingredients for a particular cuisine\r\n",
" group_by(ingredients) %>% \r\n",
" summarise(n_instances = sum(count)) %>% \r\n",
" filter(n_instances != 0) %>% \r\n",
" # Arrange by descending order\r\n",
" arrange(desc(n_instances)) %>% \r\n",
" mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
" \r\n",
" \r\n",
" return(ingredient_df)\r\n",
"} # End of function"
],
"outputs": [],
@ -341,10 +343,10 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Call create_ingredient and display popular ingredients\n",
"thai_ingredient_df <- create_ingredient(df = thai_df)\n",
"\n",
"thai_ingredient_df %>% \n",
"# Call create_ingredient and display popular ingredients\r\n",
"thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
"\r\n",
"thai_ingredient_df %>% \r\n",
" slice_head(n = 10)"
],
"outputs": [],
@ -365,11 +367,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a bar chart for popular thai cuisines\n",
"thai_ingredient_df %>% \n",
" slice_head(n = 10) %>% \n",
" ggplot(aes(x = n_instances, y = ingredients)) +\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\n",
"# Make a bar chart for popular thai cuisines\r\n",
"thai_ingredient_df %>% \r\n",
" slice_head(n = 10) %>% \r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
@ -390,12 +392,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Japanese cuisines and make bar chart\n",
"create_ingredient(df = japanese_df) %>% \n",
" slice_head(n = 10) %>%\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\n",
" xlab(\"\") + ylab(\"\")\n"
"# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
"create_ingredient(df = japanese_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")\r\n"
],
"outputs": [],
"metadata": {
@ -415,11 +417,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Chinese cuisines and make bar chart\n",
"create_ingredient(df = chinese_df) %>% \n",
" slice_head(n = 10) %>%\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\n",
"# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
"create_ingredient(df = chinese_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
@ -440,11 +442,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Indian cuisines and make bar chart\n",
"create_ingredient(df = indian_df) %>% \n",
" slice_head(n = 10) %>%\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\n",
"# Get popular ingredients for Indian cuisines and make bar chart\r\n",
"create_ingredient(df = indian_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
@ -465,11 +467,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Korean cuisines and make bar chart\n",
"create_ingredient(df = korean_df) %>% \n",
" slice_head(n = 10) %>%\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\n",
"# Get popular ingredients for Korean cuisines and make bar chart\r\n",
"create_ingredient(df = korean_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
@ -492,12 +494,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Drop id column, rice, garlic and ginger from our original data set\n",
"df_select <- df %>% \n",
" select(-c(1, rice, garlic, ginger))\n",
"\n",
"# Display new data set\n",
"df_select %>% \n",
"# Drop id column, rice, garlic and ginger from our original data set\r\n",
"df_select <- df %>% \r\n",
" select(-c(1, rice, garlic, ginger))\r\n",
"\r\n",
"# Display new data set\r\n",
"df_select %>% \r\n",
" slice_head(n = 5)"
],
"outputs": [],
@ -508,16 +510,16 @@
{
"cell_type": "markdown",
"source": [
"## Preprocessing data using recipes 👩‍🍳👨‍🍳 - Dealing with imbalanced data ⚖️\n",
"\n",
"<p >\n",
" <img src=\"../../images/recipes.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"Given that this lesson is about cuisines, we have to put `recipes` into context .\n",
"\n",
"Tidymodels provides yet another neat package: `recipes`- a package for preprocessing data.\n"
"## Preprocessing data using recipes 👩‍🍳👨‍🍳 - Dealing with imbalanced data ⚖️\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../../images/recipes.png\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n",
"\r\n",
"Given that this lesson is about cuisines, we have to put `recipes` into context .\r\n",
"\r\n",
"Tidymodels provides yet another neat package: `recipes`- a package for preprocessing data.\r\n"
],
"metadata": {
"id": "kkFd-JxdIaL6"
@ -536,11 +538,11 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Distribution of cuisines\n",
"old_label_count <- df_select %>% \n",
" count(cuisine) %>% \n",
" arrange(desc(n))\n",
"\n",
"# Distribution of cuisines\r\n",
"old_label_count <- df_select %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(desc(n))\r\n",
"\r\n",
"old_label_count"
],
"outputs": [],
@ -570,13 +572,13 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Load themis package for dealing with imbalanced data\n",
"library(themis)\n",
"\n",
"# Create a recipe for preprocessing data\n",
"cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \n",
" step_smote(cuisine)\n",
"\n",
"# Load themis package for dealing with imbalanced data\r\n",
"library(themis)\r\n",
"\r\n",
"# Create a recipe for preprocessing data\r\n",
"cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
" step_smote(cuisine)\r\n",
"\r\n",
"cuisines_recipe"
],
"outputs": [],
@ -607,18 +609,18 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Prep and bake the recipe\n",
"preprocessed_df <- cuisines_recipe %>% \n",
" prep() %>% \n",
" bake(new_data = NULL) %>% \n",
" relocate(cuisine)\n",
"\n",
"# Display data\n",
"preprocessed_df %>% \n",
" slice_head(n = 5)\n",
"\n",
"# Quick summary stats\n",
"preprocessed_df %>% \n",
"# Prep and bake the recipe\r\n",
"preprocessed_df <- cuisines_recipe %>% \r\n",
" prep() %>% \r\n",
" bake(new_data = NULL) %>% \r\n",
" relocate(cuisine)\r\n",
"\r\n",
"# Display data\r\n",
"preprocessed_df %>% \r\n",
" slice_head(n = 5)\r\n",
"\r\n",
"# Quick summary stats\r\n",
"preprocessed_df %>% \r\n",
" introduce()"
],
"outputs": [],
@ -639,12 +641,12 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Distribution of cuisines\n",
"new_label_count <- preprocessed_df %>% \n",
" count(cuisine) %>% \n",
" arrange(desc(n))\n",
"\n",
"list(new_label_count = new_label_count,\n",
"# Distribution of cuisines\r\n",
"new_label_count <- preprocessed_df %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(desc(n))\r\n",
"\r\n",
"list(new_label_count = new_label_count,\r\n",
" old_label_count = old_label_count)"
],
"outputs": [],
@ -673,7 +675,7 @@
"cell_type": "code",
"execution_count": null,
"source": [
"# Save preprocessed data\n",
"# Save preprocessed data\r\n",
"write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
],
"outputs": [],
@ -684,32 +686,32 @@
{
"cell_type": "markdown",
"source": [
"This fresh CSV can now be found in the root data folder.\n",
"\n",
"**🚀Challenge**\n",
"\n",
"This curriculum contains several interesting datasets. Dig through the `data` folders and see if any contain datasets that would be appropriate for binary or multi-class classification? What questions would you ask of this dataset?\n",
"\n",
"## [**Post-lecture quiz**](https://white-water-09ec41f0f.azurestaticapps.net/quiz/20/)\n",
"\n",
"## **Review & Self Study**\n",
"\n",
"- Check out [package themis](https://github.com/tidymodels/themis). What other techniques could we use to deal with imbalanced data?\n",
"\n",
"- Tidy models [reference website](https://www.tidymodels.org/start/).\n",
"\n",
"- H. Wickham and G. Grolemund, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).\n",
"\n",
"#### THANK YOU TO:\n",
"\n",
"[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
"\n",
"[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
"\n",
"<p >\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
"This fresh CSV can now be found in the root data folder.\r\n",
"\r\n",
"**🚀Challenge**\r\n",
"\r\n",
"This curriculum contains several interesting datasets. Dig through the `data` folders and see if any contain datasets that would be appropriate for binary or multi-class classification? What questions would you ask of this dataset?\r\n",
"\r\n",
"## [**Post-lecture quiz**](https://white-water-09ec41f0f.azurestaticapps.net/quiz/20/)\r\n",
"\r\n",
"## **Review & Self Study**\r\n",
"\r\n",
"- Check out [package themis](https://github.com/tidymodels/themis). What other techniques could we use to deal with imbalanced data?\r\n",
"\r\n",
"- Tidy models [reference website](https://www.tidymodels.org/start/).\r\n",
"\r\n",
"- H. Wickham and G. Grolemund, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).\r\n",
"\r\n",
"#### THANK YOU TO:\r\n",
"\r\n",
"[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\r\n",
"\r\n",
"[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\r\n",
"\r\n",
"<p >\r\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\r\n",
" width=\"600\"/>\r\n",
" <figcaption>Artwork by @allison_horst</figcaption>\r\n"
],
"metadata": {
"id": "WQs5621pMGwf"

@ -34,7 +34,7 @@ Classification is one of the fundamental activities of the machine learning rese
To state the process in a more scientific way, your classification method creates a predictive model that enables you to map the relationship between input variables to output variables.
![Binary vs. multiclass problems for classification algorithms to handle. Infographic by Jen Looper](../images/binary-multiclass.png)
![Binary vs. multiclass problems for classification algorithms to handle. Infographic by Jen Looper](../../images/binary-multiclass.png){width="500"}
Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.
@ -127,7 +127,9 @@ There are a finite number of cuisines, but the distribution of data is uneven. Y
2. Next, let's assign each cuisine into its individual tibble and find out how much data is available (rows, columns) per cuisine.
![Artwork by \@allison_horst](../images/dplyr_filter.jpg)
> A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not.
![Artwork by \@allison_horst](../../images/dplyr_filter.jpg)
```{r cuisine_df}
# Create individual tibbles for the cuisines
@ -297,7 +299,7 @@ df_select %>%
## Preprocessing data using recipes 👩‍🍳👨‍🍳 - Dealing with imbalanced data ⚖️
![Artwork by \@allison_horst](../images/recipes.png)
![Artwork by \@allison_horst](../../images/recipes.png)
Given that this lesson is about cuisines, we have to put `recipes` into context .

Binary file not shown.


File diff suppressed because one or more lines are too long

@ -0,0 +1,349 @@
---
title: 'Build a classification model: Delicious Asian and Indian Cuisines'
output:
  html_document:
    df_print: paged
    theme: flatly
    highlight: breezedark
    toc: yes
    toc_float: yes
    code_download: yes
---
## Cuisine classifiers 1
In this lesson, we'll explore a variety of classifiers to *predict a given national cuisine based on a group of ingredients.* While doing so, we'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
### [**Pre-lecture quiz**](https://white-water-09ec41f0f.azurestaticapps.net/quiz/21/)
### **Preparation**
This lesson builds on our [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/1-Introduction/solution/lesson_10-R.ipynb) where we:
- Made a gentle introduction to classification using a dataset about all the brilliant cuisines of Asia and India 😋.
- Explored some [dplyr verbs](https://dplyr.tidyverse.org/) to prep and clean our data.
- Made beautiful visualizations using ggplot2.
- Demonstrated how to deal with imbalanced data by preprocessing it using [recipes](https://recipes.tidymodels.org/articles/Simple_Example.html).
- Demonstrated how to `prep` and `bake` our recipe to confirm that it works as intended.
#### **Prerequisite**
For this lesson, we'll require the following packages to clean, prep and visualize our data:
- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!
- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.
- `DataExplorer`: The [DataExplorer package](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) is meant to simplify and automate the EDA process and report generation.
- `themis`: The [themis package](https://themis.tidymodels.org/) provides Extra Recipes Steps for Dealing with Unbalanced Data.
- `nnet`: The [nnet package](https://cran.r-project.org/web/packages/nnet/nnet.pdf) provides functions for estimating feed-forward neural networks with a single hidden layer, and for multinomial logistic regression models.
You can install them with:
`install.packages(c("tidyverse", "tidymodels", "DataExplorer", "themis", "nnet", "here"))`
Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.
```{r, message=F, warning=F}
suppressWarnings(if (!require("pacman"))install.packages("pacman"))
pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)
```
Now, let's hit the ground running!
## 1. Split the data into training and test sets.
We'll start by picking a few steps from our previous lesson.
### Drop the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.
Everyone loves rice, garlic and ginger!
```{r recap_drop}
# Load the original cuisines data
df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")
# Drop id column, rice, garlic and ginger from our original data set
df_select <- df %>%
select(-c(1, rice, garlic, ginger)) %>%
# Encode cuisine column as categorical
mutate(cuisine = factor(cuisine))
# Display new data set
df_select %>%
slice_head(n = 5)
# Display distribution of cuisines
df_select %>%
count(cuisine) %>%
arrange(desc(n))
```
Perfect! Now it's time to split the data so that 70% goes to training and 30% goes to testing. We'll also apply `stratification` when splitting the data to `maintain the proportion of each cuisine` in the training and test datasets.
[rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
```{r data_split}
# Load the core Tidymodels packages into R session
library(tidymodels)
# Create split specification
set.seed(2056)
cuisines_split <- initial_split(data = df_select,
strata = cuisine,
prop = 0.7)
# Extract the data in each split
cuisines_train <- training(cuisines_split)
cuisines_test <- testing(cuisines_split)
# Print the number of cases in each split
cat("Training cases: ", nrow(cuisines_train), "\n",
"Test cases: ", nrow(cuisines_test), sep = "")
# Display the first few rows of the training set
cuisines_train %>%
slice_head(n = 5)
# Display distribution of cuisines in the training set
cuisines_train %>%
count(cuisine) %>%
arrange(desc(n))
```
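To make the idea of stratification concrete, here is a minimal base R sketch on made-up labels. It is an illustration of the principle only, not how `rsample` implements `initial_split()`; the toy counts and the names `labels`, `train_idx`, `train_labels` and `test_labels` are assumptions introduced for this example.

```r
set.seed(2056)

# Toy labels with an uneven class distribution (made-up counts)
labels <- factor(rep(c("korean", "indian", "thai"), times = c(60, 30, 10)))

# Sample 70% of the row indices *within each class*
train_idx <- unname(unlist(lapply(
  split(seq_along(labels), labels),
  function(idx) idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
)))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class proportions before and after the split
prop.table(table(labels))
prop.table(table(train_labels))
```

Because the sampling happens separately inside each class, the training set keeps the same class proportions as the full data, which is the property the `strata` argument is meant to protect.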
## 2. Deal with imbalanced data
As you might have noticed in the original data set as well as in our training set, the cuisines are quite unequally distributed: there are *almost* 3 times as many Korean observations as Thai ones. Imbalanced data often hurts model performance. Many models perform best when the number of observations per class is roughly equal and, thus, tend to struggle with unbalanced data.
There are two main ways of dealing with imbalanced data sets:
- adding observations to the minority class: `Over-sampling`, e.g. using a SMOTE algorithm, which synthetically generates new examples of the minority class using nearest neighbors of these cases.
- removing observations from the majority class: `Under-sampling`
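The over-sampling idea above can be sketched in a few lines of base R. This is a simplified, SMOTE-like illustration on made-up points, not the `themis::step_smote()` implementation; the helper `smote_one()`, its argument `k` and the toy matrix `minority` are hypothetical names introduced here.

```r
# Toy minority-class observations with two numeric features (made-up values)
minority <- matrix(c(1, 1,
                     2, 1,
                     1, 2,
                     2, 2), ncol = 2, byrow = TRUE)

# Hypothetical helper: build one synthetic point starting from row i
smote_one <- function(x, i, k = 2) {
  # Euclidean distance from row i to every row of x
  centre <- matrix(x[i, ], nrow(x), ncol(x), byrow = TRUE)
  d <- sqrt(rowSums((x - centre)^2))
  d[i] <- Inf                           # exclude the point itself
  neighbours <- order(d)[seq_len(k)]    # indices of the k nearest neighbours
  j <- neighbours[sample.int(k, 1)]     # pick one neighbour at random
  gap <- runif(1)                       # position along the segment, in [0, 1]
  x[i, ] + gap * (x[j, ] - x[i, ])      # interpolate between the two points
}

set.seed(2056)
seeds <- sample.int(nrow(minority), 3, replace = TRUE)
synthetic <- t(sapply(seeds, function(i) smote_one(minority, i)))
synthetic
```

Each synthetic row lies on the line segment between a real minority observation and one of its nearest neighbours, so the new examples stay inside the region the minority class already occupies.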
In our previous lesson, we demonstrated how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set in order to get it ready for data analysis. In our case, we want to have an equal distribution in the number of our cuisines for our `training set`. Let's get right into it.
```{r recap_balance}
# Load themis package for dealing with imbalanced data
library(themis)
# Create a recipe for preprocessing training data
cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%
step_smote(cuisine)
# Print recipe
cuisines_recipe
```
You can of course go ahead and confirm (using prep+bake) that the recipe works as you expect, with every cuisine label having `559` observations.
Since we'll be using this recipe as a preprocessor for modeling, a `workflow()` will do all the prep and bake for us, so we won't have to manually estimate the recipe.
Now we are ready to train a model 👩‍💻👨‍💻!
## 3. Choosing your classifier
![Artwork by \@allison_horst](../../images/parsnip.jpg){width="600"}
Now we have to decide which algorithm to use for the job 🤔.
In Tidymodels, the [`parsnip package`](https://parsnip.tidymodels.org/index.html) provides a consistent interface for working with models across different engines (packages). Please see the parsnip documentation to explore [model types & engines](https://www.tidymodels.org/find/parsnip/#models) and their corresponding [model arguments](https://www.tidymodels.org/find/parsnip/#model-args). The variety is quite bewildering at first sight. For instance, the following methods all include classification techniques:
- C5.0 Rule-Based Classification Models
- Flexible Discriminant Models
- Linear Discriminant Models
- Regularized Discriminant Models
- Logistic Regression Models
- Multinomial Regression Models
- Naive Bayes Models
- Support Vector Machines
- Nearest Neighbors
- Decision Trees
- Ensemble methods
- Neural Networks
The list goes on!
### **What classifier to go with?**
So, which classifier should you choose? Often, running through several and looking for a good result is a way to test.
> AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-15963-cxa)
Also, the choice of classifier depends on our problem. For instance, when the outcome can be categorized into `more than two classes`, like in our case, you must use a `multiclass classification algorithm` as opposed to `binary classification.`
### **A better approach**
A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-15963-cxa). Here, we discover that, for our multiclass problem, we have some choices:
![A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options](../../images/cheatsheet.png){width="500"}
### **Reasoning**
Let's see if we can reason our way through different approaches given the constraints we have:
- **Deep Neural networks are too heavy**. Given our clean, but minimal dataset, and the fact that we are running training locally via notebooks, deep neural networks are too heavyweight for this task.
- **No two-class classifier**. We do not use a two-class classifier, so that rules out one-vs-all.
- **Decision tree or logistic regression could work**. A decision tree might work, or multinomial regression/multiclass logistic regression for multiclass data.
- **Multiclass Boosted Decision Trees solve a different problem**. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.
Also, before embarking on more complex machine learning models, e.g. ensemble methods, it's normally a good idea to build the simplest possible model to get an idea of what is going on. So for this lesson, we'll start with a `multinomial logistic regression` model.
> Logistic regression is a technique used when the outcome variable is categorical (or nominal). For binary logistic regression the outcome variable has two categories, whereas for multinomial logistic regression it has more than two. See [Advanced Regression Methods](https://bookdown.org/chua/ber642_advanced_regression/multinomial-logistic-regression.html) for further reading.
## 4. Train and evaluate a Multinomial logistic regression model.
In Tidymodels, `parsnip::multinom_reg()` defines a model that uses linear predictors to predict multiclass data using the multinomial distribution. See `?multinom_reg()` for the different ways/engines you can use to fit this model.
For this example, we'll fit a Multinomial regression model via the default [nnet](https://cran.r-project.org/web/packages/nnet/nnet.pdf) engine.
> I picked a value for `penalty` more or less arbitrarily. There are better ways to choose this value, namely `resampling` and `tuning` the model, which we'll discuss later.
>
> See [Tidymodels: Get Started](https://www.tidymodels.org/start/tuning/) in case you want to learn more on how to tune model hyperparameters.
```{r multinorm_reg}
# Create a multinomial regression model specification
mr_spec <- multinom_reg(penalty = 1) %>%
set_engine("nnet", MaxNWts = 2086) %>%
set_mode("classification")
# Print model specification
mr_spec
```
Great job 🥳! Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data then fit the model on the preprocessed data and also allow for potential post-processing activities. In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/) and conveniently holds your modeling components! This is what we'd call *pipelines* in *Python*.
So let's bundle everything up into a workflow!📦
```{r workflow}
# Bundle recipe and model specification
mr_wf <- workflow() %>%
add_recipe(cuisines_recipe) %>%
add_model(mr_spec)
# Print out workflow
mr_wf
```
Workflows 👌👌! A **`workflow()`** can be fit in much the same way a model can. So, time to train a model!
```{r train}
# Train a multinomial regression model
mr_fit <- fit(object = mr_wf, data = cuisines_train)
mr_fit
```
The output shows the coefficients that the model learned during training.
### Evaluate the Trained Model
It's time to see how the model performed 📏 by evaluating it on a test set! Let's begin by making predictions on the test set.
```{r test}
# Make predictions on the test set
results <- cuisines_test %>% select(cuisine) %>%
bind_cols(mr_fit %>% predict(new_data = cuisines_test))
# Print out results
results %>%
slice_head(n = 5)
```
Great job! In Tidymodels, evaluating model performance can be done using [yardstick](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics. As we did in our logistic regression lesson, let's begin by computing a confusion matrix.
```{r conf_mat}
# Confusion matrix for categorical data
conf_mat(data = results, truth = cuisine, estimate = .pred_class)
```
When dealing with multiple classes, it's generally more intuitive to visualize this as a heat map, like this:
```{r conf_viz}
update_geom_defaults(geom = "tile", new = list(color = "black", alpha = 0.7))
# Visualize confusion matrix
results %>%
conf_mat(cuisine, .pred_class) %>%
autoplot(type = "heatmap")
```
The darker squares in the confusion matrix plot indicate high numbers of cases, and you can hopefully see a diagonal line of darker squares indicating cases where the predicted and actual label are the same.
Let's now calculate summary statistics for the confusion matrix.
```{r conf_stats}
# Summary stats for confusion matrix
conf_mat(data = results, truth = cuisine, estimate = .pred_class) %>% summary()
```
If we narrow down to some metrics such as accuracy, sensitivity, ppv, we are not badly off for a start 🥳!
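To see where metrics such as accuracy, sensitivity and ppv come from, here is a minimal base R sketch that derives them by hand from a confusion matrix. The counts are made up for illustration and are not the lesson's results; the layout (predictions in rows, true classes in columns) follows the way `conf_mat()` prints its table, and the macro average at the end is shown as one common way to summarise per-class values.

```r
# Toy confusion matrix: rows are predictions, columns are true classes (made-up counts)
cm <- matrix(c(8, 1, 1,
               2, 7, 0,
               0, 2, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(prediction = c("indian", "korean", "thai"),
                             truth      = c("indian", "korean", "thai")))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)    # per class: correct / all true cases
ppv         <- diag(cm) / rowSums(cm)    # per class: correct / all predicted cases

accuracy
round(sensitivity, 3)
round(ppv, 3)

# Macro average: the unweighted mean of the per-class values
mean(sensitivity)
mean(ppv)
```

Sensitivity reads down a column (how many of the true cases of a cuisine were found), while ppv reads across a row (how many predictions of a cuisine were right).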
## 5. Digging Deeper
Let's ask one subtle question: What criteria are used to settle on a given type of cuisine as the predicted outcome?
Well, statistical machine learning algorithms, like logistic regression, are based on `probability`; so what actually gets predicted by a classifier is a probability distribution over a set of possible outcomes. The class with the highest probability is then chosen as the most likely outcome for the given observation.
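As a rough illustration of that last step, the base R sketch below turns a set of per-class scores into probabilities with the softmax function, which is the usual link in multinomial logistic regression, and then picks the most probable class. The scores are made-up numbers, not output from the fitted model, and `softmax()` is a helper defined here for the example.

```r
# Hypothetical linear scores for one observation, one per cuisine (made-up numbers)
scores <- c(chinese = 0.2, indian = -1.0, japanese = 0.5, korean = 0.1, thai = 2.3)

# Softmax: exponentiate and normalise so that the values sum to 1
softmax <- function(z) {
  z <- z - max(z)          # subtract the maximum for numerical stability
  exp(z) / sum(exp(z))
}

probs <- softmax(scores)
round(probs, 3)

# The hard class prediction is the class with the highest probability
predicted_class <- names(probs)[which.max(probs)]
predicted_class
```

The probabilities carry more information than the hard label alone: two observations can receive the same predicted class while the model is far more certain about one than the other.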
Let's see this in action by making both hard class predictions and probabilities.
```{r pred_prob}
# Make hard class prediction and probabilities
results_prob <- cuisines_test %>%
select(cuisine) %>%
bind_cols(mr_fit %>% predict(new_data = cuisines_test)) %>%
bind_cols(mr_fit %>% predict(new_data = cuisines_test, type = "prob"))
# Print out results
results_prob %>%
slice_head(n = 5)
```
Much better!
✅ Can you explain why the model is pretty sure that the first observation is Thai?
## **🚀Challenge**
In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the [many options](https://www.tidymodels.org/find/parsnip/#models) Tidymodels provides to classify data and [other ways](https://parsnip.tidymodels.org/articles/articles/Examples.html#multinom_reg-models) to fit multinomial regression.
#### THANK YOU TO:
[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).
[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️
Happy Learning,
[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.