ML-For-Beginners/translations/zh/4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb

{
 "nbformat": 4,
 "nbformat_minor": 2,
 "metadata": {
  "colab": {
   "name": "lesson_10-R.ipynb",
   "provenance": [],
   "collapsed_sections": []
  },
  "kernelspec": {
   "name": "ir",
   "display_name": "R"
  },
  "language_info": {
   "name": "R"
  },
  "coopTranslator": {
   "original_hash": "2621e24705e8100893c9bf84e0fc8aef",
   "translation_date": "2025-09-03T20:40:23+00:00",
   "source_file": "4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb",
   "language_code": "zh"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Build a classification model: Delicious Asian and Indian Cuisines\n"
   ],
   "metadata": {
    "id": "ItETB4tSFprR"
   }
  },
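  {
   "cell_type": "markdown",
   "source": [
    "## Introduction to classification: Clean, prep, and visualize your data\n",
    "\n",
    "In these four lessons, you will explore a fundamental focus of classic machine learning: *classification*. We will walk through using various classification algorithms with a dataset about the cuisines of Asia and India. Hope you're hungry!\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/pinch.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper</figcaption>\n",
    "\n",
    "Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. In classification, you train a model to predict which `category` an item belongs to. If machine learning is about predicting values or names for things by using datasets, then classification generally falls into two groups: *binary classification* and *multiclass classification*.\n",
    "\n",
    "Remember:\n",
    "\n",
    "-   **Linear regression** helped you predict relationships between variables and make accurate predictions about where a new data point would fall in relation to that line. For example, you could predict numeric values such as *what price a pumpkin would be in September vs. December*.\n",
    "\n",
    "-   **Logistic regression** helped you discover \"binary categories\": at this price point, *is this pumpkin orange or not orange*?\n",
    "\n",
    "Classification uses various algorithms to determine a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.\n",
    "\n",
    "### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)\n",
    "\n",
    "### **Introduction**\n",
    "\n",
    "Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value (\"is this email spam or not?\") to complex image classification and segmentation using computer vision, it is always useful to be able to sort data into classes and ask questions of it.\n",
    "\n",
    "To state the process in a more scientific way, your classification method creates a predictive model that lets you map the relationship between input variables and output variables.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/binary-multiclass.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Binary vs. multiclass problems for classification algorithms to handle. Infographic by Jen Looper</figcaption>\n",
    "\n",
    "Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be used to classify data.\n",
    "\n",
    "Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age`, to determine the *likelihood of developing X disease*. As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled, and the ML algorithms use those labels to predict the class of each observation and assign it to a group or outcome.\n",
    "\n",
    "✅ Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see whether, given a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?\n",
    "\n",
    "### **Hello 'classifier'**\n",
    "\n",
    "The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?\n",
    "\n",
    "Tidymodels offers several different algorithms for classifying data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.\n",
    "\n",
    "#### **Prerequisite**\n",
    "\n",
    "For this lesson, we'll require the following packages to clean, prep and visualize our data:\n",
    "\n",
    "-   `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n",
    "\n",
    "-   `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
    "\n",
    "-   `DataExplorer`: The [DataExplorer package](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) is meant to simplify and automate the exploratory data analysis process and report generation.\n",
    "\n",
    "-   `themis`: The [themis package](https://themis.tidymodels.org/) provides extra recipe steps for dealing with unbalanced data.\n",
    "\n",
    "You can have them installed as:\n",
    "\n",
    "`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"themis\", \"here\"))`\n",
    "\n",
    "Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.\n"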
   ],
   "metadata": {
    "id": "ri5bQxZ-Fz_0"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n",
    "\r\n",
    "pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
   ],
   "outputs": [],
   "metadata": {
    "id": "KIPxa4elGAPI"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "`pacman::p_load()` has already loaded these packages and made them available in the current R session, so you do not need to call `library()` for each of them separately.\n"
   ],
   "metadata": {
    "id": "YkKAxOJvGD4C"
   }
  },
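  {
   "cell_type": "markdown",
   "source": [
    "## Exercise - Clean and balance your data\n",
    "\n",
    "The first task at hand, before starting this project, is to clean and **balance** your data to get better results.\n",
    "\n",
    "Let's meet the data! 🕵️\n"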
   ],
   "metadata": {
    "id": "PFkQDlk0GN5O"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Import data\r\n",
    "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
    "\r\n",
    "# View the first 5 rows\r\n",
    "df %>% \r\n",
    "  slice_head(n = 5)\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "Qccw7okxGT0S"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Interesting! From the looks of it, the first column is a kind of `id` column. Let's get a little more information about the data.\n"
   ],
   "metadata": {
    "id": "XrWnlgSrGVmR"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Basic information about the data\r\n",
    "df %>%\r\n",
    "  introduce()\r\n",
    "\r\n",
    "# Visualize basic information above\r\n",
    "df %>% \r\n",
    "  plot_intro(ggtheme = theme_light())"
   ],
   "outputs": [],
   "metadata": {
    "id": "4UcGmxRxGieA"
   }
  },
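  {
   "cell_type": "markdown",
   "source": [
    "From the output, we can immediately see that we have `2448` rows, `385` columns, and no missing values. We also have one discrete column, *cuisine*.\n",
    "\n",
    "## Exercise - Learning about cuisines\n",
    "\n",
    "Now the work starts to become more interesting. Let's discover the distribution of data per cuisine.\n"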
   ],
   "metadata": {
    "id": "AaPubl__GmH5"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Count observations per cuisine\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(n)\r\n",
    "\r\n",
    "# Plot the distribution\r\n",
    "theme_set(theme_light())\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
    "  geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
    "  ylab(\"cuisine\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "FRsBVy5eGrrv"
   }
  },
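  {
   "cell_type": "markdown",
   "source": [
    "There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.\n",
    "\n",
    "Next, let's assign each cuisine to its own tibble and find out how much data is available (rows, columns) per cuisine.\n",
    "\n",
    "> A [tibble](https://tibble.tidyverse.org/) is a modern data frame.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/dplyr_filter.jpg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n"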
   ],
   "metadata": {
    "id": "vVvyDb1kG2in"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create individual tibble for the cuisines\r\n",
    "thai_df <- df %>% \r\n",
    "  filter(cuisine == \"thai\")\r\n",
    "japanese_df <- df %>% \r\n",
    "  filter(cuisine == \"japanese\")\r\n",
    "chinese_df <- df %>% \r\n",
    "  filter(cuisine == \"chinese\")\r\n",
    "indian_df <- df %>% \r\n",
    "  filter(cuisine == \"indian\")\r\n",
    "korean_df <- df %>% \r\n",
    "  filter(cuisine == \"korean\")\r\n",
    "\r\n",
    "\r\n",
    "# Find out how much data is available per cuisine\r\n",
    "cat(\" thai df:\", dim(thai_df), \"\\n\",\r\n",
    "    \"japanese df:\", dim(japanese_df), \"\\n\",\r\n",
    "    \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
    "    \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
    "    \"korean_df:\", dim(korean_df))"
   ],
   "outputs": [],
   "metadata": {
    "id": "0TvXUxD3G8Bk"
   }
  },
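  {
   "cell_type": "markdown",
   "source": [
    "## **Exercise - Discovering top ingredients by cuisine using dplyr**\n",
    "\n",
    "Now you can dig deeper into the data and learn what the typical ingredients per cuisine are. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.\n",
    "\n",
    "Create a function `create_ingredient()` in R that returns an ingredient data frame. This function starts by dropping an unhelpful column and then sorts the ingredients by their count.\n",
    "\n",
    "The basic structure of a function in R is:\n",
    "\n",
    "`myFunction <- function(arglist){`\n",
    "\n",
    "**`...`**\n",
    "\n",
    "**`return`**`(value)`\n",
    "\n",
    "`}`\n",
    "\n",
    "A tidy introduction to R functions can be found [here](https://skirmer.github.io/presentations/functions_with_r.html#1).\n",
    "\n",
    "Let's get right into it! We'll make use of the [dplyr verbs](https://dplyr.tidyverse.org/) we have been learning in our previous lessons, together with one reshaping helper from tidyr. As a recap:\n",
    "\n",
    "-   `dplyr::select()`: helps you pick which **columns** to keep or exclude.\n",
    "\n",
    "-   `tidyr::pivot_longer()`: helps you \"lengthen\" data, increasing the number of rows and decreasing the number of columns.\n",
    "\n",
    "-   `dplyr::group_by()` and `dplyr::summarise()`: help you compute summary statistics for different groups and put them in a nice table.\n",
    "\n",
    "-   `dplyr::filter()`: creates a subset of the data that contains only the rows satisfying your conditions.\n",
    "\n",
    "-   `dplyr::mutate()`: helps you create or modify columns.\n",
    "\n",
    "Check out this [*art*-filled](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) learnr tutorial by Allison Horst, which introduces some useful data wrangling functions in dplyr (part of the Tidyverse).\n"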
   ],
   "metadata": {
    "id": "K3RF5bSCHC76"
   }
  },
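  {
   "cell_type": "markdown",
   "source": [
    "Before building `create_ingredient()`, here is a minimal sketch of how the function skeleton and the verbs above fit together. The `toy_df` tibble, its values, and the `count_ingredients()` helper are hypothetical and exist only for this illustration; they are not part of the lesson's dataset.\n",
    "\n",
    "```r\n",
    "library(dplyr)\n",
    "library(tidyr)\n",
    "\n",
    "# Hypothetical toy data: one row per dish, one 0/1 column per ingredient\n",
    "toy_df <- tibble(\n",
    "  id      = 1:4,\n",
    "  cuisine = c(\"thai\", \"thai\", \"korean\", \"korean\"),\n",
    "  basil   = c(1, 1, 0, 0),\n",
    "  kimchi  = c(0, 0, 1, 1),\n",
    "  garlic  = c(1, 0, 1, 1)\n",
    ")\n",
    "\n",
    "# Same skeleton as above: name <- function(arglist) { ...; return(value) }\n",
    "count_ingredients <- function(data) {\n",
    "  counts <- data %>%\n",
    "    # select(): drop the id column\n",
    "    select(-id) %>%\n",
    "    # pivot_longer(): one row per dish/ingredient pair\n",
    "    pivot_longer(!cuisine, names_to = \"ingredient\", values_to = \"present\") %>%\n",
    "    # group_by() + summarise(): total uses per cuisine and ingredient\n",
    "    group_by(cuisine, ingredient) %>%\n",
    "    summarise(n = sum(present), .groups = \"drop\") %>%\n",
    "    # filter(): keep only ingredients that were actually used\n",
    "    filter(n > 0) %>%\n",
    "    # mutate(): add each row's share of all recorded uses\n",
    "    mutate(share = n / sum(n))\n",
    "  return(counts)\n",
    "}\n",
    "\n",
    "count_ingredients(toy_df)\n",
    "```\n",
    "\n",
    "On this toy data the result should keep four rows: basil and garlic for the Thai dishes, garlic and kimchi for the Korean dishes. The two unused combinations are filtered out because their totals are zero.\n"
   ],
   "metadata": {}
  },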
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create a function that returns the top ingredients for a cuisine\r\n",
    "\r\n",
    "create_ingredient <- function(df){\r\n",
    "  \r\n",
    "  # Drop the id column, which is the first column\r\n",
    "  ingredient_df <- df %>% select(-1) %>% \r\n",
    "  # Reshape the data to a long format\r\n",
    "    pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
    "  # Total the occurrences of each ingredient and drop unused ones\r\n",
    "    group_by(ingredients) %>% \r\n",
    "    summarise(n_instances = sum(count)) %>% \r\n",
    "    filter(n_instances != 0) %>% \r\n",
    "  # Sort in descending order and keep that order as the factor levels\r\n",
    "    arrange(desc(n_instances)) %>% \r\n",
    "    mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
    "  \r\n",
    "  \r\n",
    "  return(ingredient_df)\r\n",
    "} # End of function"
   ],
   "outputs": [],
   "metadata": {
    "id": "uB_0JR82HTPa"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now we can use the function to get an idea of the ten most popular ingredients by cuisine. Let's take it out for a spin with `thai_df`.\n"
   ],
   "metadata": {
    "id": "h9794WF8HWmc"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Call create_ingredient and display popular ingredients\r\n",
    "thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
    "\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10)"
   ],
   "outputs": [],
   "metadata": {
    "id": "agQ-1HrcHaEA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "In the previous section, we used `geom_col()`. Let's see how you can also use `geom_bar()` to create bar charts. Use `?geom_bar` for further reading.\n"
   ],
   "metadata": {
    "id": "kHu9ffGjHdcX"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Make a bar chart of the most popular Thai ingredients\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10) %>% \r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "fb3Bx_3DHj6e"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's do the same for the Japanese data.\n"
   ],
   "metadata": {
    "id": "RHP_xgdkHnvM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
    "create_ingredient(df = japanese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "019v8F0XHrRU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "What about the Chinese cuisines?\n"
   ],
   "metadata": {
    "id": "iIGM7vO8Hu3v"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
    "create_ingredient(df = chinese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lHd9_gd2HyzU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now take a look at the Indian cuisines.\n"
   ],
   "metadata": {
    "id": "ir8qyQbNH1c7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Indian cuisines and make bar chart\r\n",
    "create_ingredient(df = indian_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "ApukQtKjH5FO"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Finally, plot the Korean ingredients.\n"
   ],
   "metadata": {
    "id": "qv30cwY1H-FM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Korean cuisines and make bar chart\r\n",
    "create_ingredient(df = korean_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lumgk9cHIBie"
   }
  },
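  {
   "cell_type": "markdown",
   "source": [
    "From the data visualizations, we can now use `dplyr::select()` to drop the most common ingredients that create confusion between distinct cuisines.\n",
    "\n",
    "Everyone loves rice, garlic and ginger!\n"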
   ],
   "metadata": {
    "id": "iO4veMXuIEta"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Drop id column, rice, garlic and ginger from our original data set\r\n",
    "df_select <- df %>% \r\n",
    "  select(-c(1, rice, garlic, ginger))\r\n",
    "\r\n",
    "# Display new data set\r\n",
    "df_select %>% \r\n",
    "  slice_head(n = 5)"
   ],
   "outputs": [],
   "metadata": {
    "id": "iHJPiG6rIUcK"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Preprocessing data using recipes 👩‍🍳👨‍🍳 - Dealing with imbalanced data ⚖️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/recipes.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n",
    "\n",
    "Given that this lesson is about cuisines, we have to put `recipes` into context.\n",
    "\n",
    "Tidymodels provides yet another neat package: `recipes`, a package for preprocessing data.\n"
   ],
   "metadata": {
    "id": "kkFd-JxdIaL6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's take a look at the distribution of our cuisines again.\n"
   ],
   "metadata": {
    "id": "6l2ubtTPJAhY"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "old_label_count <- df_select %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "old_label_count"
   ],
   "outputs": [],
   "metadata": {
    "id": "1e-E9cb7JDVi"
   }
  },
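  {
   "cell_type": "markdown",
   "source": [
    "As you can see, the number of observations per cuisine is quite unequal. There are almost three times as many Korean observations as Thai ones. Imbalanced data often has negative effects on model performance. Think about a binary classification problem: if most of your data belongs to one class, an ML model is likely to predict that class more frequently, simply because there is more data for it. Balancing the data corrects this skew. Many models perform best when the number of observations per class is equal, and so tend to struggle with unbalanced data.\n",
    "\n",
    "There are two main ways of dealing with imbalanced data sets:\n",
    "\n",
    "-   adding observations to the minority class: `over-sampling`, e.g. using a SMOTE algorithm\n",
    "\n",
    "-   removing observations from the majority class: `under-sampling`\n",
    "\n",
    "Let's now demonstrate how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes which steps should be applied to a data set in order to get it ready for data analysis.\n"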
   ],
   "metadata": {
    "id": "soAw6826JKx9"
   }
  },
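  {
   "cell_type": "markdown",
   "source": [
    "As a brief aside before the recipe, the two strategies listed above can be sketched by hand with dplyr. This is plain random resampling on a made-up `toy_labels` tibble (a hypothetical example, not the lesson's data), so the over-sampled rows are duplicates rather than the synthetic rows that SMOTE produces.\n",
    "\n",
    "```r\n",
    "library(dplyr)\n",
    "\n",
    "set.seed(2021)  # arbitrary seed so the resampling is repeatable\n",
    "\n",
    "# Hypothetical imbalanced labels: 6 \"korean\" rows vs. 2 \"thai\" rows\n",
    "toy_labels <- tibble(\n",
    "  cuisine = c(rep(\"korean\", 6), rep(\"thai\", 2)),\n",
    "  x       = 1:8\n",
    ")\n",
    "\n",
    "class_sizes <- count(toy_labels, cuisine)$n\n",
    "\n",
    "# Under-sampling: shrink every class to the size of the smallest one\n",
    "under <- toy_labels %>%\n",
    "  group_by(cuisine) %>%\n",
    "  slice_sample(n = min(class_sizes)) %>%\n",
    "  ungroup()\n",
    "\n",
    "# Over-sampling: grow every class to the size of the largest one\n",
    "# by drawing rows with replacement (duplicates, not synthetic rows)\n",
    "over <- toy_labels %>%\n",
    "  group_by(cuisine) %>%\n",
    "  slice_sample(n = max(class_sizes), replace = TRUE) %>%\n",
    "  ungroup()\n",
    "\n",
    "count(under, cuisine)\n",
    "count(over, cuisine)\n",
    "```\n",
    "\n",
    "Both `count()` calls should show equal class sizes: two rows per cuisine after under-sampling and six per cuisine after over-sampling.\n"
   ],
   "metadata": {}
  },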
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Load themis package for dealing with imbalanced data\r\n",
    "library(themis)\r\n",
    "\r\n",
    "# Create a recipe for preprocessing data\r\n",
    "cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
    "  step_smote(cuisine)\r\n",
    "\r\n",
    "cuisines_recipe"
   ],
   "outputs": [],
   "metadata": {
    "id": "HS41brUIJVJy"
   }
  },
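  {
   "cell_type": "markdown",
   "source": [
    "The recipe above delegates the over-sampling to `step_smote()`. As a rough intuition for what a \"synthetic\" example means, the sketch below places one new point on the line segment between a minority example and its nearest minority neighbour. It is a simplified base R illustration of the interpolation idea with made-up points, not the themis implementation.\n",
    "\n",
    "```r\n",
    "set.seed(42)  # arbitrary seed for a repeatable draw\n",
    "\n",
    "# Hypothetical minority class: three points in two dimensions\n",
    "minority <- matrix(c(1, 1,\n",
    "                     2, 1,\n",
    "                     1, 3), ncol = 2, byrow = TRUE)\n",
    "\n",
    "i <- 1                                # pick one minority point\n",
    "d <- as.matrix(dist(minority))[i, ]   # distances from point i to every point\n",
    "d[i] <- Inf                           # ignore the point itself\n",
    "j <- which.min(d)                     # index of its nearest minority neighbour\n",
    "\n",
    "u <- runif(1)                         # random position along the segment\n",
    "synthetic <- minority[i, ] + u * (minority[j, ] - minority[i, ])\n",
    "synthetic\n",
    "```\n",
    "\n",
    "Here the nearest neighbour of the point (1, 1) is (2, 1), so the new point lies somewhere between them: its first coordinate falls between 1 and 2 and its second coordinate stays at 1.\n"
   ],
   "metadata": {}
  },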
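  {
   "cell_type": "markdown",
   "source": [
    "Let's break down our preprocessing steps.\n",
    "\n",
    "-   The call to `recipe()` with a formula tells the recipe the *roles* of the variables, using the `df_select` data as the reference. For instance, the `cuisine` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.\n",
    "\n",
    "-   [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) creates a *specification* of a recipe step that synthetically generates new examples of the minority class using the nearest neighbors of these cases.\n",
    "\n",
    "Now, if we want to see the preprocessed data, we have to [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) and [**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) our recipe.\n",
    "\n",
    "`prep()`: estimates the required parameters from a training set so that they can later be applied to other data sets.\n",
    "\n",
    "`bake()`: takes a prepped recipe and applies its operations to any data set.\n"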
   ],
   "metadata": {
    "id": "Yb-7t7XcJaC8"
   }
  },
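  {
   "cell_type": "markdown",
   "source": [
    "To make the division of labour between `prep()` and `bake()` concrete, here is a small sketch on the built-in `mtcars` data. The row split, the choice of `hp` and `wt` as predictors, and the use of `step_normalize()` are assumptions made for this illustration only; the lesson's own recipe uses `step_smote()` instead.\n",
    "\n",
    "```r\n",
    "library(recipes)\n",
    "\n",
    "# Built-in mtcars data, split arbitrarily for illustration\n",
    "train <- mtcars[1:20, ]\n",
    "other <- mtcars[21:32, ]\n",
    "\n",
    "# Specification only: nothing is computed yet\n",
    "toy_recipe <- recipe(mpg ~ hp + wt, data = train)\n",
    "toy_recipe <- step_normalize(toy_recipe, all_numeric_predictors())\n",
    "\n",
    "# prep(): estimate the means and standard deviations from the training rows\n",
    "toy_prepped <- prep(toy_recipe, training = train)\n",
    "\n",
    "# bake(new_data = NULL): return the processed training data\n",
    "baked_train <- bake(toy_prepped, new_data = NULL)\n",
    "\n",
    "# bake(new_data = other): apply the training means and sds to different rows\n",
    "baked_other <- bake(toy_prepped, new_data = other)\n",
    "\n",
    "colMeans(baked_train[, c(\"hp\", \"wt\")])\n",
    "colMeans(baked_other[, c(\"hp\", \"wt\")])\n",
    "```\n",
    "\n",
    "The training columns come out centred at (numerically) zero because their own means were used, while the second data set is scaled with the training statistics and so is generally not centred at zero.\n"
   ],
   "metadata": {}
  },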
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Prep and bake the recipe\r\n",
    "preprocessed_df <- cuisines_recipe %>% \r\n",
    "  prep() %>% \r\n",
    "  bake(new_data = NULL) %>% \r\n",
    "  relocate(cuisine)\r\n",
    "\r\n",
    "# Display data\r\n",
    "preprocessed_df %>% \r\n",
    "  slice_head(n = 5)\r\n",
    "\r\n",
    "# Quick summary stats\r\n",
    "preprocessed_df %>% \r\n",
    "  introduce()"
   ],
   "outputs": [],
   "metadata": {
    "id": "9QhSgdpxJl44"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's now check the distribution of our cuisines and compare it with the imbalanced data.\n"
   ],
   "metadata": {
    "id": "dmidELh_LdV7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "new_label_count <- preprocessed_df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "list(new_label_count = new_label_count,\r\n",
    "     old_label_count = old_label_count)"
   ],
   "outputs": [],
   "metadata": {
    "id": "aSh23klBLwDz"
   }
  },
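  {
   "cell_type": "markdown",
   "source": [
    "Yum! The data is nice and clean, balanced, and very delicious 😋!\n",
    "\n",
    "> Typically, a recipe is used as a preprocessor for modelling, where it defines which steps should be applied to a data set to get it ready for modelling. In that case, a `workflow()` is typically used (as we have already seen in our previous lessons) instead of manually estimating a recipe.\n",
    ">\n",
    "> As such, you don't usually need to **`prep()`** and **`bake()`** recipes yourself when you use tidymodels, but they are helpful functions to have in your toolkit for confirming that a recipe is doing what you expect, as in our case.\n",
    ">\n",
    "> When you **`bake()`** a prepped recipe with **`new_data = NULL`**, you get back the data that you provided when defining the recipe, but having undergone the preprocessing steps.\n",
    "\n",
    "Let's now save a copy of this data for use in future lessons:\n"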
   ],
   "metadata": {
    "id": "HEu80HZ8L7ae"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Save preprocessed data\r\n",
    "write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "cBmCbIgrMOI6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "This fresh CSV can now be found in the root data folder.\n",
    "\n",
    "**🚀Challenge**\n",
    "\n",
    "This curriculum contains several interesting datasets. Dig through the `data` folders and see if any contain datasets that would be appropriate for binary or multiclass classification. What questions would you ask of this dataset?\n",
    "\n",
    "## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)\n",
    "\n",
    "## **Review & Self Study**\n",
    "\n",
    "-   Check out the [themis package](https://github.com/tidymodels/themis). What other techniques could we use to deal with imbalanced data?\n",
    "\n",
    "-   The tidymodels [reference website](https://www.tidymodels.org/start/).\n",
    "\n",
    "-   H. Wickham and G. Grolemund, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).\n",
    "\n",
    "#### THANK YOU TO:\n",
    "\n",
    "[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
    "\n",
    "[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/r_learners_sm.jpeg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "WQs5621pMGwf"
   }
  },
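  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**Disclaimer**:  \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"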
   ]
  }
 ]
}