ML-For-Beginners/translations/hk/4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb

{
 "nbformat": 4,
 "nbformat_minor": 2,
 "metadata": {
  "colab": {
   "name": "lesson_10-R.ipynb",
   "provenance": [],
   "collapsed_sections": []
  },
  "kernelspec": {
   "name": "ir",
   "display_name": "R"
  },
  "language_info": {
   "name": "R"
  },
  "coopTranslator": {
   "original_hash": "2621e24705e8100893c9bf84e0fc8aef",
   "translation_date": "2025-09-03T20:41:29+00:00",
   "source_file": "4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb",
   "language_code": "hk"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# 建立分類模型：美味的亞洲和印度美食\n"
   ],
   "metadata": {
    "id": "ItETB4tSFprR"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 分類簡介：清理、準備和視覺化數據\n",
    "\n",
    "在這四節課中，你將探索經典機器學習的一個基本重點——*分類*。我們將使用一個關於亞洲和印度各種美食的數據集，逐步了解如何使用不同的分類算法。希望你已經準備好大快朵頤了！\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/pinch.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>在這些課程中一起慶祝泛亞洲美食吧！圖片由 Jen Looper 提供</figcaption>\n",
    "\n",
    "分類是一種[監督式學習](https://wikipedia.org/wiki/Supervised_learning)，與回歸技術有許多相似之處。在分類中，你訓練一個模型來預測某個項目屬於哪個`類別`。如果說機器學習是通過數據集來預測值或名稱，那麼分類通常分為兩類：*二元分類*和*多類分類*。\n",
    "\n",
    "請記住：\n",
    "\n",
    "-   **線性回歸**幫助你預測變量之間的關係，並準確預測新數據點在該線性關係中的位置。例如，你可以預測南瓜在九月和十二月的價格。\n",
    "\n",
    "-   **邏輯回歸**幫助你發現「二元類別」：在這個價格範圍內，*這個南瓜是橙色還是不是橙色*？\n",
    "\n",
    "分類使用各種算法來確定數據點的標籤或類別。讓我們使用這個美食數據集，看看通過觀察一組食材，是否可以確定其美食的來源。\n",
    "\n",
    "### [**課前測驗**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)\n",
    "\n",
    "### **簡介**\n",
    "\n",
    "分類是機器學習研究人員和數據科學家的基本活動之一。從基本的二元值分類（「這封郵件是垃圾郵件還是不是？」），到使用計算機視覺進行的複雜圖像分類和分割，能夠將數據分類並對其進行提問始終是非常有用的。\n",
    "\n",
    "用更科學的方式來描述這個過程，你的分類方法會創建一個預測模型，幫助你將輸入變量與輸出變量之間的關係映射出來。\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/binary-multiclass.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>分類算法處理的二元與多類問題。信息圖由 Jen Looper 提供</figcaption>\n",
    "\n",
    "在開始清理數據、可視化數據並為機器學習任務準備數據之前，讓我們先了解一下機器學習如何用於分類數據的各種方式。\n",
    "\n",
    "分類源於[統計學](https://wikipedia.org/wiki/Statistical_classification)，使用經典機器學習進行分類時，會利用特徵（例如`smoker`、`weight`和`age`）來確定*患某種疾病的可能性*。作為一種與之前進行的回歸練習類似的監督式學習技術，你的數據是帶標籤的，機器學習算法使用這些標籤來分類和預測數據集的類別（或「特徵」），並將它們分配到某個組或結果中。\n",
    "\n",
    "✅ 花點時間想像一個關於美食的數據集。一個多類模型能回答什麼問題？一個二元模型能回答什麼問題？如果你想確定某種美食是否可能使用葫蘆巴，該怎麼辦？如果你想知道，假如收到一袋包含八角、洋薊、花椰菜和辣根的雜貨，你是否能做出一道典型的印度菜呢？\n",
    "\n",
    "### **你好，分類器**\n",
    "\n",
    "我們想要從這個美食數據集中提出的問題實際上是一個**多類問題**，因為我們有多個潛在的國家美食類別可供選擇。給定一組食材，這些數據會屬於哪一類？\n",
    "\n",
    "Tidymodels 提供了多種算法來分類數據，具體取決於你想解決的問題類型。在接下來的兩節課中，你將學習其中幾種算法。\n",
    "\n",
    "#### **前置要求**\n",
    "\n",
    "在這節課中，我們需要以下套件來清理、準備和可視化數據：\n",
    "\n",
    "-   `tidyverse`： [tidyverse](https://www.tidyverse.org/) 是一個[由 R 套件組成的集合](https://www.tidyverse.org/packages)，旨在讓數據科學更快速、更簡單、更有趣！\n",
    "\n",
    "-   `tidymodels`： [tidymodels](https://www.tidymodels.org/) 框架是一個[建模和機器學習的套件集合](https://www.tidymodels.org/packages/)。\n",
    "\n",
    "-   `DataExplorer`： [DataExplorer 套件](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html)旨在簡化和自動化探索性數據分析（EDA）過程和報告生成。\n",
    "\n",
    "-   `themis`： [themis 套件](https://themis.tidymodels.org/) 提供了處理不平衡數據的額外 Recipes 步驟。\n",
    "\n",
    "你可以通過以下方式安裝它們：\n",
    "\n",
    "`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"here\"))`\n",
    "\n",
    "或者，以下腳本會檢查你是否已安裝完成本模組所需的套件，並在缺失時為你安裝。\n"
   ],
   "metadata": {
    "id": "ri5bQxZ-Fz_0"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n",
    "\r\n",
    "pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
   ],
   "outputs": [],
   "metadata": {
    "id": "KIPxa4elGAPI"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "我們稍後會載入這些很棒的套件，並使它們在我們目前的 R 工作環境中可用。（這只是為了說明，`pacman::p_load()` 已經為你完成了這個步驟）\n"
   ],
   "metadata": {
    "id": "YkKAxOJvGD4C"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 練習 - 清理及平衡你的數據\n",
    "\n",
    "在開始這個項目之前，第一項任務是清理並**平衡**你的數據，以獲得更好的結果。\n",
    "\n",
    "來認識一下這些數據吧！🕵️\n"
   ],
   "metadata": {
    "id": "PFkQDlk0GN5O"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Import data\r\n",
    "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
    "\r\n",
    "# View the first 5 rows\r\n",
    "df %>% \r\n",
    "  slice_head(n = 5)\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "Qccw7okxGT0S"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "有趣！看來第一列是一種 `id` 列。我們來了解更多關於這些數據的信息。\n"
   ],
   "metadata": {
    "id": "XrWnlgSrGVmR"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Basic information about the data\r\n",
    "df %>%\r\n",
    "  introduce()\r\n",
    "\r\n",
    "# Visualize basic information above\r\n",
    "df %>% \r\n",
    "  plot_intro(ggtheme = theme_light())"
   ],
   "outputs": [],
   "metadata": {
    "id": "4UcGmxRxGieA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "從輸出中，我們可以立即看到我們有 `2448` 行和 `385` 列，並且沒有缺失值。我們還有一個離散欄位，*cuisine*。\n",
    "\n",
    "## 練習 - 了解菜系\n",
    "\n",
    "現在工作開始變得更有趣了。讓我們探索每種菜系的數據分佈。\n"
   ],
   "metadata": {
    "id": "AaPubl__GmH5"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Count observations per cuisine\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(n)\r\n",
    "\r\n",
    "# Plot the distribution\r\n",
    "theme_set(theme_light())\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
    "  geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
    "  ylab(\"cuisine\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "FRsBVy5eGrrv"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "有各種不同的菜系，但數據的分佈並不平均。你可以改變這種情況！在此之前，先進一步探索一下。\n",
    "\n",
    "接下來，讓我們將每種菜系分配到各自的 tibble，並找出每種菜系的數據量（行數和列數）。\n",
    "\n",
    "> [tibble](https://tibble.tidyverse.org/) 是一種現代化的數據框。\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/dplyr_filter.jpg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>插圖由 @allison_horst 提供</figcaption>\n"
   ],
   "metadata": {
    "id": "vVvyDb1kG2in"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create individual tibble for the cuisines\r\n",
    "thai_df <- df %>% \r\n",
    "  filter(cuisine == \"thai\")\r\n",
    "japanese_df <- df %>% \r\n",
    "  filter(cuisine == \"japanese\")\r\n",
    "chinese_df <- df %>% \r\n",
    "  filter(cuisine == \"chinese\")\r\n",
    "indian_df <- df %>% \r\n",
    "  filter(cuisine == \"indian\")\r\n",
    "korean_df <- df %>% \r\n",
    "  filter(cuisine == \"korean\")\r\n",
    "\r\n",
    "\r\n",
    "# Find out how much data is available per cuisine\r\n",
    "cat(\" thai df:\", dim(thai_df), \"\\n\",\r\n",
    "    \"japanese df:\", dim(japanese_df), \"\\n\",\r\n",
    "    \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
    "    \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
    "    \"korean_df:\", dim(korean_df))"
   ],
   "outputs": [],
   "metadata": {
    "id": "0TvXUxD3G8Bk"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## **練習 - 使用 dplyr 探索不同菜系的主要食材**\n",
    "\n",
    "現在你可以深入研究數據，了解每種菜系的典型食材。你需要清理一些重複的數據，這些數據可能會在菜系之間造成混淆，因此讓我們來了解這個問題。\n",
    "\n",
    "在 R 中創建一個名為 `create_ingredient()` 的函數，該函數會返回一個食材的數據框。這個函數將首先刪除一個無用的列，然後根據食材的數量進行排序。\n",
    "\n",
    "R 中函數的基本結構如下：\n",
    "\n",
    "`myFunction <- function(arglist){`\n",
    "\n",
    "**`...`**\n",
    "\n",
    "**`return`**`(value)`\n",
    "\n",
    "`}`\n",
    "\n",
    "可以在[這裡](https://skirmer.github.io/presentations/functions_with_r.html#1)找到一個簡潔的 R 函數入門介紹。\n",
    "\n",
    "讓我們直接開始吧！我們將使用 [dplyr 動詞](https://dplyr.tidyverse.org/)，這些動詞我們在之前的課程中已經學過。以下是回顧：\n",
    "\n",
    "-   `dplyr::select()`: 幫助你選擇要保留或排除的**列**。\n",
    "\n",
    "-   `dplyr::pivot_longer()`: 幫助你將數據“拉長”，增加行數並減少列數。\n",
    "\n",
    "-   `dplyr::group_by()` 和 `dplyr::summarise()`: 幫助你找到不同組的統計摘要，並將它們放入一個整齊的表格中。\n",
    "\n",
    "-   `dplyr::filter()`: 創建一個僅包含滿足條件的行的數據子集。\n",
    "\n",
    "-   `dplyr::mutate()`: 幫助你創建或修改列。\n",
    "\n",
    "查看這個由 Allison Horst 製作的[充滿藝術感的 learnr 教程](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome)，它介紹了一些在 dplyr *(Tidyverse 的一部分)* 中非常有用的數據整理函數。\n"
   ],
   "metadata": {
    "id": "K3RF5bSCHC76"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Creates a functions that returns the top ingredients by class\r\n",
    "\r\n",
    "create_ingredient <- function(df){\r\n",
    "  \r\n",
    "  # Drop the id column which is the first colum\r\n",
    "  ingredient_df = df %>% select(-1) %>% \r\n",
    "  # Transpose data to a long format\r\n",
    "    pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
    "  # Find the top most ingredients for a particular cuisine\r\n",
    "    group_by(ingredients) %>% \r\n",
    "    summarise(n_instances = sum(count)) %>% \r\n",
    "    filter(n_instances != 0) %>% \r\n",
    "  # Arrange by descending order\r\n",
    "    arrange(desc(n_instances)) %>% \r\n",
    "    mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
    "  \r\n",
    "  \r\n",
    "  return(ingredient_df)\r\n",
    "} # End of function"
   ],
   "outputs": [],
   "metadata": {
    "id": "uB_0JR82HTPa"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "現在我們可以使用這個函數來了解按菜系分類的十大最受歡迎食材。讓我們用 `thai_df` 試試看吧。\n"
   ],
   "metadata": {
    "id": "h9794WF8HWmc"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Call create_ingredient and display popular ingredients\r\n",
    "thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
    "\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10)"
   ],
   "outputs": [],
   "metadata": {
    "id": "agQ-1HrcHaEA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "在上一節中，我們使用了 `geom_col()`，現在讓我們看看如何使用 `geom_bar` 來製作柱狀圖。使用 `?geom_bar` 了解更多資訊。\n"
   ],
   "metadata": {
    "id": "kHu9ffGjHdcX"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Make a bar chart for popular thai cuisines\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10) %>% \r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "fb3Bx_3DHj6e"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "RHP_xgdkHnvM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
    "create_ingredient(df = japanese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "019v8F0XHrRU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "關於中國菜呢？\n"
   ],
   "metadata": {
    "id": "iIGM7vO8Hu3v"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
    "create_ingredient(df = chinese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lHd9_gd2HyzU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "ir8qyQbNH1c7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Indian cuisines and make bar chart\r\n",
    "create_ingredient(df = indian_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "ApukQtKjH5FO"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "最後，繪製韓國食材。\n"
   ],
   "metadata": {
    "id": "qv30cwY1H-FM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Korean cuisines and make bar chart\r\n",
    "create_ingredient(df = korean_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lumgk9cHIBie"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "從數據可視化中，我們現在可以使用 `dplyr::select()` 去掉那些在不同菜系之間容易引起混淆的最常見食材。\n",
    "\n",
    "大家都喜歡米飯、大蒜和薑！\n"
   ],
   "metadata": {
    "id": "iO4veMXuIEta"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Drop id column, rice, garlic and ginger from our original data set\r\n",
    "df_select <- df %>% \r\n",
    "  select(-c(1, rice, garlic, ginger))\r\n",
    "\r\n",
    "# Display new data set\r\n",
    "df_select %>% \r\n",
    "  slice_head(n = 5)"
   ],
   "outputs": [],
   "metadata": {
    "id": "iHJPiG6rIUcK"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 使用食譜預處理數據 👩‍🍳👨‍🍳 - 處理不平衡數據 ⚖️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/recipes.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>插圖由 @allison_horst 提供</figcaption>\n",
    "\n",
    "既然這節課是關於烹飪，我們需要將 `recipes` 放到合適的背景中。\n",
    "\n",
    "Tidymodels 提供了另一個非常方便的套件：`recipes`——一個用於預處理數據的套件。\n"
   ],
   "metadata": {
    "id": "kkFd-JxdIaL6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "讓我們再次看看我們菜式的分佈情況。\n"
   ],
   "metadata": {
    "id": "6l2ubtTPJAhY"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "old_label_count <- df_select %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "old_label_count"
   ],
   "outputs": [],
   "metadata": {
    "id": "1e-E9cb7JDVi"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "如你所見，菜式的數量分佈非常不均衡。韓國菜的數量幾乎是泰國菜的三倍。不均衡的數據通常會對模型的表現產生負面影響。試想一個二元分類問題，如果你的大部分數據都屬於某一個類別，機器學習模型就會更頻繁地預測該類別，僅僅因為它有更多的數據可供學習。平衡數據可以修正任何偏斜的數據，幫助消除這種不均衡。許多模型在觀測數量相等時表現最佳，因此在處理不均衡數據時往往會遇到困難。\n",
    "\n",
    "處理不均衡數據集主要有兩種方法：\n",
    "\n",
    "-   為少數類別添加觀測值：`過採樣`，例如使用 SMOTE 演算法\n",
    "\n",
    "-   從多數類別移除觀測值：`欠採樣`\n",
    "\n",
    "現在讓我們演示如何使用 `recipe` 來處理不均衡數據集。`recipe` 可以被視為一個藍圖，描述了應該對數據集應用哪些步驟，以使其準備好進行數據分析。\n"
   ],
   "metadata": {
    "id": "soAw6826JKx9"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Load themis package for dealing with imbalanced data\r\n",
    "library(themis)\r\n",
    "\r\n",
    "# Create a recipe for preprocessing data\r\n",
    "cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
    "  step_smote(cuisine)\r\n",
    "\r\n",
    "cuisines_recipe"
   ],
   "outputs": [],
   "metadata": {
    "id": "HS41brUIJVJy"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "讓我們逐步了解預處理的步驟。\n",
    "\n",
    "-   使用公式調用 `recipe()` 時，會根據 `df_select` 數據作為參考，告訴 recipe 變數的*角色*。例如，`cuisine` 列被分配了 `outcome` 角色，而其他列則被分配了 `predictor` 角色。\n",
    "\n",
    "-   [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) 創建了一個 recipe 步驟的*規範*，使用這些案例的最近鄰居合成生成少數類別的新樣本。\n",
    "\n",
    "現在，如果我們想查看預處理後的數據，就需要使用 [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) 和 [**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) 來處理我們的 recipe。\n",
    "\n",
    "`prep()`：從訓練集估算所需的參數，這些參數可以稍後應用到其他數據集。\n",
    "\n",
    "`bake()`：使用已準備好的 recipe，並將操作應用到任何數據集。\n"
   ],
   "metadata": {
    "id": "Yb-7t7XcJaC8"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Prep and bake the recipe\r\n",
    "preprocessed_df <- cuisines_recipe %>% \r\n",
    "  prep() %>% \r\n",
    "  bake(new_data = NULL) %>% \r\n",
    "  relocate(cuisine)\r\n",
    "\r\n",
    "# Display data\r\n",
    "preprocessed_df %>% \r\n",
    "  slice_head(n = 5)\r\n",
    "\r\n",
    "# Quick summary stats\r\n",
    "preprocessed_df %>% \r\n",
    "  introduce()"
   ],
   "outputs": [],
   "metadata": {
    "id": "9QhSgdpxJl44"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "現在讓我們檢查我們的菜式分佈，並將其與不平衡的數據進行比較。\n"
   ],
   "metadata": {
    "id": "dmidELh_LdV7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "new_label_count <- preprocessed_df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "list(new_label_count = new_label_count,\r\n",
    "     old_label_count = old_label_count)"
   ],
   "outputs": [],
   "metadata": {
    "id": "aSh23klBLwDz"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "好棒！數據既乾淨又平衡，簡直美味可口 😋！\n",
    "\n",
    "> 通常，配方（recipe）通常用作建模的預處理器，定義了需要對數據集進行哪些步驟以準備好進行建模。在這種情況下，通常會使用 `workflow()`（正如我們在之前的課程中已經看到的），而不是手動估算配方。\n",
    ">\n",
    "> 因此，當使用 tidymodels 時，通常不需要使用 **`prep()`** 和 **`bake()`** 來處理配方，但這些函數在工具箱中是非常有用的，可以用來確認配方是否按照預期運行，就像我們的情況一樣。\n",
    ">\n",
    "> 當你使用 **`new_data = NULL`** 來 **`bake()`** 一個已準備好的配方時，你會得到在定義配方時提供的數據，但這些數據已經經過了預處理步驟。\n",
    "\n",
    "現在讓我們保存一份這些數據的副本，以便在未來的課程中使用：\n"
   ],
   "metadata": {
    "id": "HEu80HZ8L7ae"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Save preprocessed data\r\n",
    "write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "cBmCbIgrMOI6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "這個新的 CSV 現在可以在根目錄的資料夾中找到。\n",
    "\n",
    "**🚀挑戰**\n",
    "\n",
    "這份課程包含多個有趣的數據集。深入探索 `data` 資料夾，看看是否有任何數據集適合用於二元或多類別分類？你會對這些數據集提出什麼問題？\n",
    "\n",
    "## [**課後測驗**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)\n",
    "\n",
    "## **回顧與自學**\n",
    "\n",
    "-   查看 [themis 套件](https://github.com/tidymodels/themis)。我們還可以使用哪些其他技術來處理不平衡數據？\n",
    "\n",
    "-   Tidy models [參考網站](https://www.tidymodels.org/start/)。\n",
    "\n",
    "-   H. Wickham 和 G. Grolemund，[*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/)。\n",
    "\n",
    "#### 特別感謝：\n",
    "\n",
    "[`Allison Horst`](https://twitter.com/allison_horst/) 創作了令人驚嘆的插圖，使 R 更加親切和吸引人。可以在她的 [畫廊](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM) 中找到更多插圖。\n",
    "\n",
    "[Cassie Breviu](https://www.twitter.com/cassieview) 和 [Jen Looper](https://www.twitter.com/jenlooper) 創作了這個模組的原始 Python 版本 ♥️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/r_learners_sm.jpeg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>插圖由 @allison_horst 提供</figcaption>\n"
   ],
   "metadata": {
    "id": "WQs5621pMGwf"
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**免責聲明**：  \n本文件已使用人工智能翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。儘管我們致力於提供準確的翻譯，但請注意，自動翻譯可能包含錯誤或不準確之處。原始語言的文件應被視為權威來源。對於重要資訊，建議使用專業的人類翻譯。我們對因使用此翻譯而引起的任何誤解或錯誤解釋不承擔責任。\n"
   ]
  }
 ]
}