{
 "nbformat": 4,
 "nbformat_minor": 2,
 "metadata": {
  "colab": {
   "name": "lesson_10-R.ipynb",
   "provenance": [],
   "collapsed_sections": []
  },
  "kernelspec": {
   "name": "ir",
   "display_name": "R"
  },
  "language_info": {
   "name": "R"
  },
  "coopTranslator": {
   "original_hash": "2621e24705e8100893c9bf84e0fc8aef",
   "translation_date": "2025-11-18T19:24:19+00:00",
   "source_file": "4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb",
   "language_code": "pcm"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Build a classification model: Delicious Asian and Indian Cuisines\n"
   ],
   "metadata": {
    "id": "ItETB4tSFprR"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Introduction to classification: Clean, prep, and visualize your data\n",
    "\n",
    "In these four lessons, you will explore a fundamental focus of classic machine learning: *classification*. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../../../../../translated_images/pinch.1b035ec9ba7e0d40.pcm.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper</figcaption>\n",
    "\n",
    "Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. In classification, you train a model to predict which `category` an item belongs to. If machine learning is all about predicting values or names for things by using datasets, then classification generally falls into two groups: *binary classification* and *multiclass classification*.\n",
    "\n",
    "Remember:\n",
    "\n",
    "-   **Linear regression** helped you predict relationships between variables and make accurate predictions about where a new data point would fall in relation to that line. For example, you could predict numeric values such as *what the price of a pumpkin would be in September vs. December*.\n",
    "\n",
    "-   **Logistic regression** helped you discover \"binary categories\": at this price point, *is this pumpkin orange or not orange*?\n",
    "\n",
    "Classification uses various algorithms to determine a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.\n",
    "\n",
    "### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)\n",
    "\n",
    "### **Introduction**\n",
    "\n",
    "Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value (\"is this email spam or not?\") to complex image classification and segmentation using computer vision, it is always useful to be able to sort data into classes and ask questions of it.\n",
    "\n",
    "To state the process more scientifically, your classification method creates a predictive model that lets you map the relationship between input variables and output variables.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../../../../../translated_images/binary-multiclass.b56d0c86c81105a6.pcm.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Binary vs. multiclass problems for classification algorithms to handle. Infographic by Jen Looper</figcaption>\n",
    "\n",
    "Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be used to classify data.\n",
    "\n",
    "Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age`, to determine *the likelihood of developing X disease*. As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled, and the ML algorithms use those labels to classify and predict classes (or 'features') of a dataset and assign them to a group or outcome.\n",
    "\n",
    "✅ Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see if, given a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?\n",
    "\n",
    "### **Hello 'classifier'**\n",
    "\n",
    "The question we want to ask of this cuisine dataset is actually a **multiclass question**, because we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?\n",
    "\n",
    "Tidymodels offers several different algorithms for classifying data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.\n",
    "\n",
    "#### **Prerequisite**\n",
    "\n",
    "For this lesson, we'll need the following packages to clean, prep and visualize our data:\n",
    "\n",
    "-   `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n",
    "\n",
    "-   `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
    "\n",
    "-   `DataExplorer`: The [DataExplorer package](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) simplifies and automates the EDA process and report generation.\n",
    "\n",
    "-   `themis`: The [themis package](https://themis.tidymodels.org/) provides extra recipe steps for dealing with unbalanced data.\n",
    "\n",
    "You can install them like this:\n",
    "\n",
    "`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"themis\", \"here\"))`\n",
    "\n",
    "Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n"
   ],
   "metadata": {
    "id": "ri5bQxZ-Fz_0"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n",
    "\r\n",
    "pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
   ],
   "outputs": [],
   "metadata": {
    "id": "KIPxa4elGAPI"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We'll later load these packages and make them available in our current R session. (This is for illustration only; `pacman::p_load()` has already done that for you.)\n"
   ],
   "metadata": {
    "id": "YkKAxOJvGD4C"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Exercise - clean and balance your data\n",
    "\n",
    "The first task at hand, before starting this project, is to clean and **balance** your data to get better results.\n",
    "\n",
    "Let's meet the data!🕵️\n"
   ],
   "metadata": {
    "id": "PFkQDlk0GN5O"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Import data\r\n",
    "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
    "\r\n",
    "# View the first 5 rows\r\n",
    "df %>% \r\n",
    "  slice_head(n = 5)\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "Qccw7okxGT0S"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Interesting! From the looks of it, the first column is a kind of `id` column. Let's get a little more information about the data.\n"
   ],
   "metadata": {
    "id": "XrWnlgSrGVmR"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Basic information about the data\r\n",
    "df %>%\r\n",
    "  introduce()\r\n",
    "\r\n",
    "# Visualize basic information above\r\n",
    "df %>% \r\n",
    "  plot_intro(ggtheme = theme_light())"
   ],
   "outputs": [],
   "metadata": {
    "id": "4UcGmxRxGieA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "From the output, we can see that we have `2448` rows, `385` columns and `0` missing values. We also have 1 discrete column, *cuisine*.\n",
    "\n",
    "## Exercise - learn about cuisines\n",
    "\n",
    "Now the work starts to become more interesting. Let's discover the distribution of data per cuisine.\n"
   ],
   "metadata": {
    "id": "AaPubl__GmH5"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Count observations per cuisine\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(n)\r\n",
    "\r\n",
    "# Plot the distribution\r\n",
    "theme_set(theme_light())\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
    "  geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
    "  ylab(\"cuisine\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "FRsBVy5eGrrv"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "There is a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.\n",
    "\n",
    "Next, let's assign each cuisine to its own tibble and find out how much data is available (rows, columns) per cuisine.\n",
    "\n",
    "> A [tibble](https://tibble.tidyverse.org/) is a modern data frame.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../../../../../translated_images/dplyr_filter.b480b264b03439ff.pcm.jpg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "vVvyDb1kG2in"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create individual tibble for the cuisines\r\n",
    "thai_df <- df %>% \r\n",
    "  filter(cuisine == \"thai\")\r\n",
    "japanese_df <- df %>% \r\n",
    "  filter(cuisine == \"japanese\")\r\n",
    "chinese_df <- df %>% \r\n",
    "  filter(cuisine == \"chinese\")\r\n",
    "indian_df <- df %>% \r\n",
    "  filter(cuisine == \"indian\")\r\n",
    "korean_df <- df %>% \r\n",
    "  filter(cuisine == \"korean\")\r\n",
    "\r\n",
    "\r\n",
    "# Find out how much data is available per cuisine\r\n",
    "cat(\" thai df:\", dim(thai_df), \"\\n\",\r\n",
    "    \"japanese df:\", dim(japanese_df), \"\\n\",\r\n",
    "    \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
    "    \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
    "    \"korean_df:\", dim(korean_df))"
   ],
   "outputs": [],
   "metadata": {
    "id": "0TvXUxD3G8Bk"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## **Exercise - Discover top ingredients by cuisine using dplyr**\n",
    "\n",
    "Now you can dig deeper into the data and learn what the typical ingredients per cuisine are. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.\n",
    "\n",
    "Create a function `create_ingredient()` in R that returns an ingredient dataframe. This function starts by dropping an unhelpful column and then sorts the ingredients by how often they appear.\n",
    "\n",
    "The basic structure of a function in R is:\n",
    "\n",
    "`myFunction <- function(arglist){`\n",
    "\n",
    "**`...`**\n",
    "\n",
    "**`return`**`(value)`\n",
    "\n",
    "`}`\n",
    "\n",
    "If you want to learn more about R functions, check out this [tidy introduction](https://skirmer.github.io/presentations/functions_with_r.html#1).\n",
    "\n",
    "Let's get right into it! We'll use [dplyr verbs](https://dplyr.tidyverse.org/) (plus one from tidyr) that we have been learning in our previous lessons. As a recap:\n",
    "\n",
    "-   `dplyr::select()`: helps you pick which **columns** to keep or exclude.\n",
    "\n",
    "-   `tidyr::pivot_longer()`: helps you \"lengthen\" data, increasing the number of rows and decreasing the number of columns.\n",
    "\n",
    "-   `dplyr::group_by()` and `dplyr::summarise()`: help you find summary statistics for different groups and put them in a tidy table.\n",
    "\n",
    "-   `dplyr::filter()`: creates a subset of the data containing only the rows that satisfy your conditions.\n",
    "\n",
    "-   `dplyr::mutate()`: helps you create or modify columns.\n",
    "\n",
    "Check out this [*art*-filled learnr tutorial](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) by Allison Horst, which introduces some useful data wrangling functions in dplyr *(part of the Tidyverse)*.\n"
   ],
   "metadata": {
    "id": "K3RF5bSCHC76"
   }
  },
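  {
   "cell_type": "markdown",
   "source": [
    "As a quick illustration of the verbs recapped above, here is a minimal sketch on a tiny made-up tibble. The `toy` data, its column names and the `share` column are hypothetical and exist only for this example; the pipeline shows the general shape of an ingredient count, not a definitive implementation.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Illustrative sketch only: `toy` is made-up data, not the cuisines dataset\n",
    "library(dplyr)\n",
    "library(tidyr)\n",
    "\n",
    "# One row per recipe, one 0/1 indicator column per ingredient\n",
    "toy <- tibble(\n",
    "  id = 1:3,\n",
    "  cuisine = c(\"thai\", \"thai\", \"indian\"),\n",
    "  basil = c(1, 1, 0),\n",
    "  cumin = c(0, 0, 1),\n",
    "  yeast = c(0, 0, 0)\n",
    ")\n",
    "\n",
    "toy %>%\n",
    "  # select(): drop the id column\n",
    "  select(-id) %>%\n",
    "  # pivot_longer(): one row per recipe-ingredient pair\n",
    "  pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>%\n",
    "  # group_by() + summarise(): total uses of each ingredient\n",
    "  group_by(ingredients) %>%\n",
    "  summarise(n_instances = sum(count)) %>%\n",
    "  # filter(): keep only ingredients that were actually used\n",
    "  filter(n_instances != 0) %>%\n",
    "  # mutate(): add a hypothetical share-of-total column\n",
    "  mutate(share = n_instances / sum(n_instances))"
   ],
   "outputs": [],
   "metadata": {}
  },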
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create a function that returns the top ingredients by class\r\n",
    "\r\n",
    "create_ingredient <- function(df){\r\n",
    "  \r\n",
    "  # Drop the id column, which is the first column\r\n",
    "  ingredient_df <- df %>% select(-1) %>% \r\n",
    "  # Transpose data to a long format\r\n",
    "    pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
    "  # Find the top most ingredients for a particular cuisine\r\n",
    "    group_by(ingredients) %>% \r\n",
    "    summarise(n_instances = sum(count)) %>% \r\n",
    "    filter(n_instances != 0) %>% \r\n",
    "  # Arrange by descending order\r\n",
    "    arrange(desc(n_instances)) %>% \r\n",
    "    mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
    "  \r\n",
    "  \r\n",
    "  return(ingredient_df)\r\n",
    "} # End of function"
   ],
   "outputs": [],
   "metadata": {
    "id": "uB_0JR82HTPa"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now we can use the function to get an idea of the top ten most popular ingredients by cuisine. Let's try it out with `thai_df`.\n"
   ],
   "metadata": {
    "id": "h9794WF8HWmc"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Call create_ingredient and display popular ingredients\r\n",
    "thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
    "\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10)"
   ],
   "outputs": [],
   "metadata": {
    "id": "agQ-1HrcHaEA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "In the previous section, we used `geom_col()`. Let's see how you can use `geom_bar()` too, to create bar charts. Use `?geom_bar` for further reading.\n"
   ],
   "metadata": {
    "id": "kHu9ffGjHdcX"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Make a bar chart of popular Thai ingredients\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10) %>% \r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "fb3Bx_3DHj6e"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's do the same for the Japanese data.\n"
   ],
   "metadata": {
    "id": "RHP_xgdkHnvM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
    "create_ingredient(df = japanese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "019v8F0XHrRU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "What about the Chinese cuisines?\n"
   ],
   "metadata": {
    "id": "iIGM7vO8Hu3v"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
    "create_ingredient(df = chinese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lHd9_gd2HyzU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's take a look at the Indian cuisines 🌶️.\n"
   ],
   "metadata": {
    "id": "ir8qyQbNH1c7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Indian cuisines and make bar chart\r\n",
    "create_ingredient(df = indian_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "ApukQtKjH5FO"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Finally, plot the Korean ingredients.\n"
   ],
   "metadata": {
    "id": "qv30cwY1H-FM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Korean cuisines and make bar chart\r\n",
    "create_ingredient(df = korean_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lumgk9cHIBie"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "From the data visualizations, we can now drop the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.\n",
    "\n",
    "Everyone loves rice, garlic and ginger!\n"
   ],
   "metadata": {
    "id": "iO4veMXuIEta"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Drop id column, rice, garlic and ginger from our original data set\r\n",
    "df_select <- df %>% \r\n",
    "  select(-c(1, rice, garlic, ginger))\r\n",
    "\r\n",
    "# Display new data set\r\n",
    "df_select %>% \r\n",
    "  slice_head(n = 5)"
   ],
   "outputs": [],
   "metadata": {
    "id": "iHJPiG6rIUcK"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Dealing with imbalanced data ⚖️ - Using recipes 👩‍🍳👨‍🍳\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../../../../../translated_images/recipes.186acfa8ed2e8f00.pcm.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n",
    "\n",
    "Given that this lesson is about cuisines, it is only fitting that we use `recipes`.\n",
    "\n",
    "Tidymodels provides yet another neat package: `recipes`, a package for preprocessing data.\n"
   ],
   "metadata": {
    "id": "kkFd-JxdIaL6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's take a look at the distribution of our cuisines again.\n"
   ],
   "metadata": {
    "id": "6l2ubtTPJAhY"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "old_label_count <- df_select %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "old_label_count"
   ],
   "outputs": [],
   "metadata": {
    "id": "1e-E9cb7JDVi"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "As you can see, the number of observations per cuisine is quite unequal. There are almost 3 times as many Korean observations as Thai ones. Imbalanced data often hurts model performance. Think about a binary classification: if most of your data belongs to one class, an ML model will predict that class more often, simply because there is more data for it. Balancing the data takes any skewed data and helps remove this imbalance. Many models perform best when the number of observations per class is equal, and tend to struggle with unbalanced data.\n",
    "\n",
    "There are two main ways of dealing with imbalanced data sets:\n",
    "\n",
    "-   Adding observations to the minority class: `Over-sampling`, e.g. using a SMOTE algorithm\n",
    "\n",
    "-   Removing observations from the majority class: `Under-sampling`\n",
    "\n",
    "Let's now demonstrate how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set to get it ready for data analysis.\n"
   ],
   "metadata": {
    "id": "soAw6826JKx9"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Load themis package for dealing with imbalanced data\r\n",
    "library(themis)\r\n",
    "\r\n",
    "# Create a recipe for preprocessing data\r\n",
    "cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
    "  step_smote(cuisine)\r\n",
    "\r\n",
    "cuisines_recipe"
   ],
   "outputs": [],
   "metadata": {
    "id": "HS41brUIJVJy"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's break down our preprocessing steps.\n",
    "\n",
    "-   The call to `recipe()` with a formula tells the recipe the *roles* of the variables, using the `df_select` data as the reference. For instance, the `cuisine` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.\n",
    "\n",
    "-   [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) creates a *specification* of a recipe step that synthetically generates new examples of the minority class using the nearest neighbors of these cases.\n",
    "\n",
    "Now, if we want to see the preprocessed data, we have to [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) and [**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) our recipe.\n",
    "\n",
    "`prep()`: estimates the required parameters from a training set so that they can later be applied to other data sets.\n",
    "\n",
    "`bake()`: takes a prepped recipe and applies the operations to any data set.\n"
   ],
   "metadata": {
    "id": "Yb-7t7XcJaC8"
   }
  },
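  {
   "cell_type": "markdown",
   "source": [
    "To make the difference between `prep()` and `bake()` concrete, here is a small sketch that is separate from the cuisine recipe. It assumes a made-up training tibble `train_toy` and new data `new_toy` (both hypothetical) and uses `step_normalize()` from recipes as the example step: `prep()` learns the mean and standard deviation of `x` from the training data, and `bake()` then applies those same values to whichever data set it is given.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Illustrative sketch only: the toy data and object names are assumptions\n",
    "library(tidymodels)\n",
    "\n",
    "train_toy <- tibble(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))\n",
    "new_toy <- tibble(y = c(5, 6), x = c(25, 50))\n",
    "\n",
    "# prep() estimates the centering and scaling values from train_toy\n",
    "prepped_toy <- recipe(y ~ x, data = train_toy) %>%\n",
    "  step_normalize(x) %>%\n",
    "  prep()\n",
    "\n",
    "# new_data = NULL returns the processed training data\n",
    "bake(prepped_toy, new_data = NULL)\n",
    "\n",
    "# The statistics learned from train_toy are reused on new_toy\n",
    "bake(prepped_toy, new_data = new_toy)"
   ],
   "outputs": [],
   "metadata": {}
  },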
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Prep and bake the recipe\r\n",
    "preprocessed_df <- cuisines_recipe %>% \r\n",
    "  prep() %>% \r\n",
    "  bake(new_data = NULL) %>% \r\n",
    "  relocate(cuisine)\r\n",
    "\r\n",
    "# Display data\r\n",
    "preprocessed_df %>% \r\n",
    "  slice_head(n = 5)\r\n",
    "\r\n",
    "# Quick summary stats\r\n",
    "preprocessed_df %>% \r\n",
    "  introduce()"
   ],
   "outputs": [],
   "metadata": {
    "id": "9QhSgdpxJl44"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's now check the distribution of our cuisines and compare it with the imbalanced data.\n"
   ],
   "metadata": {
    "id": "dmidELh_LdV7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "new_label_count <- preprocessed_df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "list(new_label_count = new_label_count,\r\n",
    "     old_label_count = old_label_count)"
   ],
   "outputs": [],
   "metadata": {
    "id": "aSh23klBLwDz"
   }
  },
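  {
   "cell_type": "markdown",
   "source": [
    "As an aside before moving on, the under-sampling option listed earlier can be sketched in the same way. The example below is an illustration under stated assumptions rather than part of this lesson's pipeline: it uses a made-up tibble `toy_df` with a hypothetical `label` column and `step_downsample()` from themis, which by default removes rows from the larger classes until they match the smallest one.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Illustrative sketch only: toy_df and its columns are made up\n",
    "library(tidymodels)\n",
    "library(themis)\n",
    "\n",
    "# 8 rows of class \"a\" and 3 rows of class \"b\"\n",
    "toy_df <- tibble(\n",
    "  label = factor(c(rep(\"a\", 8), rep(\"b\", 3))),\n",
    "  x = seq_len(11)\n",
    ")\n",
    "\n",
    "# Remove rows from the majority class instead of adding synthetic ones\n",
    "toy_recipe <- recipe(label ~ x, data = toy_df) %>%\n",
    "  step_downsample(label)\n",
    "\n",
    "# Count rows per class after down-sampling\n",
    "toy_recipe %>%\n",
    "  prep() %>%\n",
    "  bake(new_data = NULL) %>%\n",
    "  count(label)"
   ],
   "outputs": [],
   "metadata": {}
  },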
  {
   "cell_type": "markdown",
   "source": [
    "Yum! The cuisine data is nice and clean, balanced, and very delicious 😋!\n",
    "\n",
    "> Normally, a recipe is used as a preprocessor for modelling, where it defines what steps should be applied to a data set to get it ready for modelling. In that case, a `workflow()` is typically used (as we have already seen in our previous lessons) instead of manually estimating a recipe.\n",
    ">\n",
    "> As such, you don't typically need to **`prep()`** and **`bake()`** recipes when you use tidymodels, but they are helpful functions to have in your toolkit for confirming that recipes are doing what you expect, as in our case.\n",
    ">\n",
    "> When you **`bake()`** a prepped recipe with **`new_data = NULL`**, you get back the data that you provided when defining the recipe, but with the preprocessing steps applied.\n",
    "\n",
    "Let's now save a copy of this data for use in future lessons:\n"
   ],
   "metadata": {
    "id": "HEu80HZ8L7ae"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Save preprocessed data\r\n",
    "write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "cBmCbIgrMOI6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The fresh CSV can now be found in the root data folder.\n",
    "\n",
    "**🚀Challenge**\n",
    "\n",
    "This curriculum contains several interesting datasets. Dig through the `data` folders and see if any contain datasets that would be appropriate for binary or multi-class classification. What questions would you ask of such a dataset?\n",
    "\n",
    "## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)\n",
    "\n",
    "## **Review & Self Study**\n",
    "\n",
    "-   Check out [package themis](https://github.com/tidymodels/themis). What other techniques could we use to deal with imbalanced data?\n",
    "\n",
    "-   Tidymodels [reference website](https://www.tidymodels.org/start/).\n",
    "\n",
    "-   H. Wickham and G. Grolemund, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).\n",
    "\n",
    "#### THANK YOU TO:\n",
    "\n",
    "[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations in her [gallery](https://github.com/allisonhorst/stats-illustrations).\n",
    "\n",
    "[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../../../../../translated_images/r_learners_sm.cd14eb3581a9f28d.pcm.jpeg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "WQs5621pMGwf"
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n\n<!-- CO-OP TRANSLATOR DISCLAIMER START -->\n**Disclaimer**:  \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n<!-- CO-OP TRANSLATOR DISCLAIMER END -->\n"
   ]
  }
 ]
}