{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_10-R.ipynb",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
},
"coopTranslator": {
"original_hash": "2621e24705e8100893c9bf84e0fc8aef",
"translation_date": "2025-09-03T20:41:29+00:00",
"source_file": "4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb",
"language_code": "hk"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# 建立分類模型:美味的亞洲和印度美食\n"
],
"metadata": {
"id": "ItETB4tSFprR"
}
},
{
"cell_type": "markdown",
"source": [
"## 分類簡介:清理、準備和視覺化數據\n",
"\n",
"在這四節課中,你將探索經典機器學習的一個基本重點——*分類*。我們將使用一個關於亞洲和印度各種美食的數據集,逐步了解如何使用不同的分類算法。希望你已經準備好大快朵頤了!\n",
"\n",
"<p >\n",
" <img src=\"../../images/pinch.png\"\n",
" width=\"600\"/>\n",
" <figcaption>在這些課程中一起慶祝泛亞洲美食吧!圖片由 Jen Looper 提供</figcaption>\n",
"\n",
"分類是一種[監督式學習](https://wikipedia.org/wiki/Supervised_learning),與回歸技術有許多相似之處。在分類中,你訓練一個模型來預測某個項目屬於哪個`類別`。如果說機器學習是通過數據集來預測值或名稱,那麼分類通常分為兩類:*二元分類*和*多類分類*。\n",
"\n",
"請記住:\n",
"\n",
"- **線性回歸**幫助你預測變量之間的關係,並準確預測新數據點在該線性關係中的位置。例如,你可以預測南瓜在九月和十二月的價格。\n",
"\n",
"- **邏輯回歸**幫助你發現「二元類別」:在這個價格範圍內,*這個南瓜是橙色還是不是橙色*\n",
"\n",
"分類使用各種算法來確定數據點的標籤或類別。讓我們使用這個美食數據集,看看通過觀察一組食材,是否可以確定其美食的來源。\n",
"\n",
"### [**課前測驗**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)\n",
"\n",
"### **簡介**\n",
"\n",
"分類是機器學習研究人員和數據科學家的基本活動之一。從基本的二元值分類(「這封郵件是垃圾郵件還是不是?」),到使用計算機視覺進行的複雜圖像分類和分割,能夠將數據分類並對其進行提問始終是非常有用的。\n",
"\n",
"用更科學的方式來描述這個過程,你的分類方法會創建一個預測模型,幫助你將輸入變量與輸出變量之間的關係映射出來。\n",
"\n",
"<p >\n",
" <img src=\"../../images/binary-multiclass.png\"\n",
" width=\"600\"/>\n",
" <figcaption>分類算法處理的二元與多類問題。信息圖由 Jen Looper 提供</figcaption>\n",
"\n",
"在開始清理數據、可視化數據並為機器學習任務準備數據之前,讓我們先了解一下機器學習如何用於分類數據的各種方式。\n",
"\n",
"分類源於[統計學](https://wikipedia.org/wiki/Statistical_classification),使用經典機器學習進行分類時,會利用特徵(例如`smoker`、`weight`和`age`)來確定*患某種疾病的可能性*。作為一種與之前進行的回歸練習類似的監督式學習技術,你的數據是帶標籤的,機器學習算法使用這些標籤來分類和預測數據集的類別(或「特徵」),並將它們分配到某個組或結果中。\n",
"\n",
"✅ 花點時間想像一個關於美食的數據集。一個多類模型能回答什麼問題?一個二元模型能回答什麼問題?如果你想確定某種美食是否可能使用葫蘆巴,該怎麼辦?如果你想知道,假如收到一袋包含八角、洋薊、花椰菜和辣根的雜貨,你是否能做出一道典型的印度菜呢?\n",
"\n",
"### **你好,分類器**\n",
"\n",
"我們想要從這個美食數據集中提出的問題實際上是一個**多類問題**,因為我們有多個潛在的國家美食類別可供選擇。給定一組食材,這些數據會屬於哪一類?\n",
"\n",
"Tidymodels 提供了多種算法來分類數據,具體取決於你想解決的問題類型。在接下來的兩節課中,你將學習其中幾種算法。\n",
"\n",
"#### **前置要求**\n",
"\n",
"在這節課中,我們需要以下套件來清理、準備和可視化數據:\n",
"\n",
"- `tidyverse` [tidyverse](https://www.tidyverse.org/) 是一個[由 R 套件組成的集合](https://www.tidyverse.org/packages),旨在讓數據科學更快速、更簡單、更有趣!\n",
"\n",
"- `tidymodels` [tidymodels](https://www.tidymodels.org/) 框架是一個[建模和機器學習的套件集合](https://www.tidymodels.org/packages/)。\n",
"\n",
"- `DataExplorer` [DataExplorer 套件](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html)旨在簡化和自動化探索性數據分析EDA過程和報告生成。\n",
"\n",
"- `themis` [themis 套件](https://themis.tidymodels.org/) 提供了處理不平衡數據的額外 Recipes 步驟。\n",
"\n",
"你可以通過以下方式安裝它們:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"here\"))`\n",
"\n",
"或者,以下腳本會檢查你是否已安裝完成本模組所需的套件,並在缺失時為你安裝。\n"
],
"metadata": {
"id": "ri5bQxZ-Fz_0"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n",
"\r\n",
"pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
],
"outputs": [],
"metadata": {
"id": "KIPxa4elGAPI"
}
},
{
"cell_type": "markdown",
"source": [
"我們稍後會載入這些很棒的套件,並使它們在我們目前的 R 工作環境中可用。(這只是為了說明,`pacman::p_load()` 已經為你完成了這個步驟)\n"
],
"metadata": {
"id": "YkKAxOJvGD4C"
}
},
{
"cell_type": "markdown",
"source": [
"## 練習 - 清理及平衡你的數據\n",
"\n",
"在開始這個項目之前,第一項任務是清理並**平衡**你的數據,以獲得更好的結果。\n",
"\n",
"來認識一下這些數據吧!🕵️\n"
],
"metadata": {
"id": "PFkQDlk0GN5O"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Import data\r\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
"\r\n",
"# View the first 5 rows\r\n",
"df %>% \r\n",
" slice_head(n = 5)\r\n"
],
"outputs": [],
"metadata": {
"id": "Qccw7okxGT0S"
}
},
{
"cell_type": "markdown",
"source": [
"有趣!看來第一列是一種 `id` 列。我們來了解更多關於這些數據的信息。\n"
],
"metadata": {
"id": "XrWnlgSrGVmR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Basic information about the data\r\n",
"df %>%\r\n",
" introduce()\r\n",
"\r\n",
"# Visualize basic information above\r\n",
"df %>% \r\n",
" plot_intro(ggtheme = theme_light())"
],
"outputs": [],
"metadata": {
"id": "4UcGmxRxGieA"
}
},
{
"cell_type": "markdown",
"source": [
"從輸出中,我們可以立即看到我們有 `2448` 行和 `385` 列,並且沒有缺失值。我們還有一個離散欄位,*cuisine*。\n",
"\n",
"## 練習 - 了解菜系\n",
"\n",
"現在工作開始變得更有趣了。讓我們探索每種菜系的數據分佈。\n"
],
"metadata": {
"id": "AaPubl__GmH5"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Count observations per cuisine\r\n",
"df %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(n)\r\n",
"\r\n",
"# Plot the distribution\r\n",
"theme_set(theme_light())\r\n",
"df %>% \r\n",
" count(cuisine) %>% \r\n",
" ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
" ylab(\"cuisine\")"
],
"outputs": [],
"metadata": {
"id": "FRsBVy5eGrrv"
}
},
{
"cell_type": "markdown",
"source": [
"有各種不同的菜系,但數據的分佈並不平均。你可以改變這種情況!在此之前,先進一步探索一下。\n",
"\n",
"接下來,讓我們將每種菜系分配到各自的 tibble並找出每種菜系的數據量行數和列數。\n",
"\n",
"> [tibble](https://tibble.tidyverse.org/) 是一種現代化的數據框。\n",
"\n",
"<p >\n",
" <img src=\"../../images/dplyr_filter.jpg\"\n",
" width=\"600\"/>\n",
" <figcaption>插圖由 @allison_horst 提供</figcaption>\n"
],
"metadata": {
"id": "vVvyDb1kG2in"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create individual tibble for the cuisines\r\n",
"thai_df <- df %>% \r\n",
" filter(cuisine == \"thai\")\r\n",
"japanese_df <- df %>% \r\n",
" filter(cuisine == \"japanese\")\r\n",
"chinese_df <- df %>% \r\n",
" filter(cuisine == \"chinese\")\r\n",
"indian_df <- df %>% \r\n",
" filter(cuisine == \"indian\")\r\n",
"korean_df <- df %>% \r\n",
" filter(cuisine == \"korean\")\r\n",
"\r\n",
"\r\n",
"# Find out how much data is available per cuisine\r\n",
"cat(\" thai df:\", dim(thai_df), \"\\n\",\r\n",
" \"japanese df:\", dim(japanese_df), \"\\n\",\r\n",
" \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
" \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
" \"korean_df:\", dim(korean_df))"
],
"outputs": [],
"metadata": {
"id": "0TvXUxD3G8Bk"
}
},
{
"cell_type": "markdown",
"source": [
"## **練習 - 使用 dplyr 探索不同菜系的主要食材**\n",
"\n",
"現在你可以深入研究數據,了解每種菜系的典型食材。你需要清理一些重複的數據,這些數據可能會在菜系之間造成混淆,因此讓我們來了解這個問題。\n",
"\n",
"在 R 中創建一個名為 `create_ingredient()` 的函數,該函數會返回一個食材的數據框。這個函數將首先刪除一個無用的列,然後根據食材的數量進行排序。\n",
"\n",
"R 中函數的基本結構如下:\n",
"\n",
"`myFunction <- function(arglist){`\n",
"\n",
"**`...`**\n",
"\n",
"**`return`**`(value)`\n",
"\n",
"`}`\n",
"\n",
"可以在[這裡](https://skirmer.github.io/presentations/functions_with_r.html#1)找到一個簡潔的 R 函數入門介紹。\n",
"\n",
"讓我們直接開始吧!我們將使用 [dplyr 動詞](https://dplyr.tidyverse.org/),這些動詞我們在之前的課程中已經學過。以下是回顧:\n",
"\n",
"- `dplyr::select()`: 幫助你選擇要保留或排除的**列**。\n",
"\n",
"- `dplyr::pivot_longer()`: 幫助你將數據“拉長”,增加行數並減少列數。\n",
"\n",
"- `dplyr::group_by()` 和 `dplyr::summarise()`: 幫助你找到不同組的統計摘要,並將它們放入一個整齊的表格中。\n",
"\n",
"- `dplyr::filter()`: 創建一個僅包含滿足條件的行的數據子集。\n",
"\n",
"- `dplyr::mutate()`: 幫助你創建或修改列。\n",
"\n",
"查看這個由 Allison Horst 製作的[充滿藝術感的 learnr 教程](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome),它介紹了一些在 dplyr *(Tidyverse 的一部分)* 中非常有用的數據整理函數。\n"
],
"metadata": {
"id": "K3RF5bSCHC76"
}
},
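{
"cell_type": "markdown",
"source": [
"As a warm-up, the verbs listed above can be tried on a tiny made-up tibble before they are combined into `create_ingredient()`. This is only a sketch: `toy_df`, its column names and its values are invented for illustration and are not part of the cuisines data.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"library(dplyr)\n",
"library(tidyr)\n",
"\n",
"# Toy data (made up): one row per recipe, 0/1 flags per ingredient\n",
"toy_df <- tibble(\n",
"  id      = 1:4,\n",
"  cuisine = c(\"thai\", \"thai\", \"korean\", \"korean\"),\n",
"  basil   = c(1, 1, 0, 0),\n",
"  soy     = c(0, 1, 1, 1),\n",
"  durian  = c(0, 0, 0, 0)\n",
")\n",
"\n",
"# select() drops the id column; pivot_longer() turns the ingredient\n",
"# columns into one row per recipe-ingredient pair (4 x 3 = 12 rows)\n",
"long_df <- toy_df %>%\n",
"  select(-id) %>%\n",
"  pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\")\n",
"long_df\n",
"\n",
"# group_by() + summarise() total each ingredient, filter() removes\n",
"# ingredients that never occur, arrange() sorts by frequency\n",
"toy_counts <- long_df %>%\n",
"  group_by(ingredients) %>%\n",
"  summarise(n_instances = sum(count)) %>%\n",
"  filter(n_instances != 0) %>%\n",
"  arrange(desc(n_instances))\n",
"toy_counts"
],
"outputs": [],
"metadata": {}
},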
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Creates a functions that returns the top ingredients by class\r\n",
"\r\n",
"create_ingredient <- function(df){\r\n",
" \r\n",
" # Drop the id column which is the first colum\r\n",
" ingredient_df = df %>% select(-1) %>% \r\n",
" # Transpose data to a long format\r\n",
" pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
" # Find the top most ingredients for a particular cuisine\r\n",
" group_by(ingredients) %>% \r\n",
" summarise(n_instances = sum(count)) %>% \r\n",
" filter(n_instances != 0) %>% \r\n",
" # Arrange by descending order\r\n",
" arrange(desc(n_instances)) %>% \r\n",
" mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
" \r\n",
" \r\n",
" return(ingredient_df)\r\n",
"} # End of function"
],
"outputs": [],
"metadata": {
"id": "uB_0JR82HTPa"
}
},
{
"cell_type": "markdown",
"source": [
"現在我們可以使用這個函數來了解按菜系分類的十大最受歡迎食材。讓我們用 `thai_df` 試試看吧。\n"
],
"metadata": {
"id": "h9794WF8HWmc"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Call create_ingredient and display popular ingredients\r\n",
"thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
"\r\n",
"thai_ingredient_df %>% \r\n",
" slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "agQ-1HrcHaEA"
}
},
{
"cell_type": "markdown",
"source": [
"在上一節中,我們使用了 `geom_col()`,現在讓我們看看如何使用 `geom_bar` 來製作柱狀圖。使用 `?geom_bar` 了解更多資訊。\n"
],
"metadata": {
"id": "kHu9ffGjHdcX"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a bar chart for popular thai cuisines\r\n",
"thai_ingredient_df %>% \r\n",
" slice_head(n = 10) %>% \r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
"metadata": {
"id": "fb3Bx_3DHj6e"
}
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"id": "RHP_xgdkHnvM"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
"create_ingredient(df = japanese_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")\r\n"
],
"outputs": [],
"metadata": {
"id": "019v8F0XHrRU"
}
},
{
"cell_type": "markdown",
"source": [
"關於中國菜呢?\n"
],
"metadata": {
"id": "iIGM7vO8Hu3v"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
"create_ingredient(df = chinese_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
"metadata": {
"id": "lHd9_gd2HyzU"
}
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"id": "ir8qyQbNH1c7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Indian cuisines and make bar chart\r\n",
"create_ingredient(df = indian_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
"metadata": {
"id": "ApukQtKjH5FO"
}
},
{
"cell_type": "markdown",
"source": [
"最後,繪製韓國食材。\n"
],
"metadata": {
"id": "qv30cwY1H-FM"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Korean cuisines and make bar chart\r\n",
"create_ingredient(df = korean_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
"metadata": {
"id": "lumgk9cHIBie"
}
},
{
"cell_type": "markdown",
"source": [
"從數據可視化中,我們現在可以使用 `dplyr::select()` 去掉那些在不同菜系之間容易引起混淆的最常見食材。\n",
"\n",
"大家都喜歡米飯、大蒜和薑!\n"
],
"metadata": {
"id": "iO4veMXuIEta"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Drop id column, rice, garlic and ginger from our original data set\r\n",
"df_select <- df %>% \r\n",
" select(-c(1, rice, garlic, ginger))\r\n",
"\r\n",
"# Display new data set\r\n",
"df_select %>% \r\n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "iHJPiG6rIUcK"
}
},
{
"cell_type": "markdown",
"source": [
"## 使用食譜預處理數據 👩‍🍳👨‍🍳 - 處理不平衡數據 ⚖️\n",
"\n",
"<p >\n",
" <img src=\"../../images/recipes.png\"\n",
" width=\"600\"/>\n",
" <figcaption>插圖由 @allison_horst 提供</figcaption>\n",
"\n",
"既然這節課是關於烹飪,我們需要將 `recipes` 放到合適的背景中。\n",
"\n",
"Tidymodels 提供了另一個非常方便的套件:`recipes`——一個用於預處理數據的套件。\n"
],
"metadata": {
"id": "kkFd-JxdIaL6"
}
},
{
"cell_type": "markdown",
"source": [
"讓我們再次看看我們菜式的分佈情況。\n"
],
"metadata": {
"id": "6l2ubtTPJAhY"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Distribution of cuisines\r\n",
"old_label_count <- df_select %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(desc(n))\r\n",
"\r\n",
"old_label_count"
],
"outputs": [],
"metadata": {
"id": "1e-E9cb7JDVi"
}
},
{
"cell_type": "markdown",
"source": [
"如你所見,菜式的數量分佈非常不均衡。韓國菜的數量幾乎是泰國菜的三倍。不均衡的數據通常會對模型的表現產生負面影響。試想一個二元分類問題,如果你的大部分數據都屬於某一個類別,機器學習模型就會更頻繁地預測該類別,僅僅因為它有更多的數據可供學習。平衡數據可以修正任何偏斜的數據,幫助消除這種不均衡。許多模型在觀測數量相等時表現最佳,因此在處理不均衡數據時往往會遇到困難。\n",
"\n",
"處理不均衡數據集主要有兩種方法:\n",
"\n",
"- 為少數類別添加觀測值:`過採樣`,例如使用 SMOTE 演算法\n",
"\n",
"- 從多數類別移除觀測值:`欠採樣`\n",
"\n",
"現在讓我們演示如何使用 `recipe` 來處理不均衡數據集。`recipe` 可以被視為一個藍圖,描述了應該對數據集應用哪些步驟,以使其準備好進行數據分析。\n"
],
"metadata": {
"id": "soAw6826JKx9"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load themis package for dealing with imbalanced data\r\n",
"library(themis)\r\n",
"\r\n",
"# Create a recipe for preprocessing data\r\n",
"cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
" step_smote(cuisine)\r\n",
"\r\n",
"cuisines_recipe"
],
"outputs": [],
"metadata": {
"id": "HS41brUIJVJy"
}
},
{
"cell_type": "markdown",
"source": [
"讓我們逐步了解預處理的步驟。\n",
"\n",
"- 使用公式調用 `recipe()` 時,會根據 `df_select` 數據作為參考,告訴 recipe 變數的*角色*。例如,`cuisine` 列被分配了 `outcome` 角色,而其他列則被分配了 `predictor` 角色。\n",
"\n",
"- [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) 創建了一個 recipe 步驟的*規範*,使用這些案例的最近鄰居合成生成少數類別的新樣本。\n",
"\n",
"現在,如果我們想查看預處理後的數據,就需要使用 [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) 和 [**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) 來處理我們的 recipe。\n",
"\n",
"`prep()`:從訓練集估算所需的參數,這些參數可以稍後應用到其他數據集。\n",
"\n",
"`bake()`:使用已準備好的 recipe並將操作應用到任何數據集。\n"
],
"metadata": {
"id": "Yb-7t7XcJaC8"
}
},
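{
"cell_type": "markdown",
"source": [
"The division of labour between `prep()` and `bake()` described above can be sketched on made-up data. The data frames `train_df` and `new_df` and the choice of `step_normalize()` are assumptions made for this illustration only; they are not part of the lesson's recipe. The point is that the mean and standard deviation are estimated once from the training data and then reused on new rows.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"library(recipes)\n",
"\n",
"# Made-up training data and new data (illustration only)\n",
"train_df <- data.frame(y = c(\"a\", \"b\", \"a\", \"b\"), x = c(1, 2, 3, 4))\n",
"new_df   <- data.frame(y = c(\"a\", \"b\"), x = c(5, 6))\n",
"\n",
"# Blueprint: centre and scale the numeric predictor\n",
"toy_recipe <- recipe(y ~ x, data = train_df) %>%\n",
"  step_normalize(all_numeric_predictors())\n",
"\n",
"# prep() estimates the mean and sd of x from train_df only\n",
"toy_prepped <- prep(toy_recipe, training = train_df)\n",
"\n",
"# bake(new_data = NULL) returns the preprocessed training data\n",
"baked_train <- bake(toy_prepped, new_data = NULL)\n",
"\n",
"# bake(new_data = new_df) applies the training mean and sd to new rows\n",
"baked_new <- bake(toy_prepped, new_data = new_df)\n",
"\n",
"baked_train\n",
"baked_new"
],
"outputs": [],
"metadata": {}
},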
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Prep and bake the recipe\r\n",
"preprocessed_df <- cuisines_recipe %>% \r\n",
" prep() %>% \r\n",
" bake(new_data = NULL) %>% \r\n",
" relocate(cuisine)\r\n",
"\r\n",
"# Display data\r\n",
"preprocessed_df %>% \r\n",
" slice_head(n = 5)\r\n",
"\r\n",
"# Quick summary stats\r\n",
"preprocessed_df %>% \r\n",
" introduce()"
],
"outputs": [],
"metadata": {
"id": "9QhSgdpxJl44"
}
},
{
"cell_type": "markdown",
"source": [
"現在讓我們檢查我們的菜式分佈,並將其與不平衡的數據進行比較。\n"
],
"metadata": {
"id": "dmidELh_LdV7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Distribution of cuisines\r\n",
"new_label_count <- preprocessed_df %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(desc(n))\r\n",
"\r\n",
"list(new_label_count = new_label_count,\r\n",
" old_label_count = old_label_count)"
],
"outputs": [],
"metadata": {
"id": "aSh23klBLwDz"
}
},
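{
"cell_type": "markdown",
"source": [
"For comparison with the over-sampled result above, the second approach mentioned earlier, under-sampling, can be sketched with `themis::step_downsample()`. The data frame `toy_imbalanced` and its values are made up for illustration. With the default settings, rows of the majority class are randomly removed until it matches the size of the minority class; the lesson itself continues with the SMOTE-balanced data.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"library(recipes)\n",
"library(themis)\n",
"\n",
"set.seed(2056)\n",
"\n",
"# Made-up imbalanced data: 8 \"korean\" rows and 3 \"thai\" rows\n",
"toy_imbalanced <- data.frame(\n",
"  cuisine = factor(c(rep(\"korean\", 8), rep(\"thai\", 3))),\n",
"  soy     = c(1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0),\n",
"  basil   = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1)\n",
")\n",
"\n",
"# Recipe that under-samples the majority class\n",
"downsample_recipe <- recipe(cuisine ~ ., data = toy_imbalanced) %>%\n",
"  step_downsample(cuisine)\n",
"\n",
"# Prep and bake to inspect the balanced training data\n",
"downsampled_df <- downsample_recipe %>%\n",
"  prep() %>%\n",
"  bake(new_data = NULL)\n",
"\n",
"# Each cuisine is now left with 3 rows\n",
"downsampled_df %>%\n",
"  count(cuisine)"
],
"outputs": [],
"metadata": {}
},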
{
"cell_type": "markdown",
"source": [
"好棒!數據既乾淨又平衡,簡直美味可口 😋!\n",
"\n",
"> 通常配方recipe通常用作建模的預處理器定義了需要對數據集進行哪些步驟以準備好進行建模。在這種情況下通常會使用 `workflow()`(正如我們在之前的課程中已經看到的),而不是手動估算配方。\n",
">\n",
"> 因此,當使用 tidymodels 時,通常不需要使用 **`prep()`** 和 **`bake()`** 來處理配方,但這些函數在工具箱中是非常有用的,可以用來確認配方是否按照預期運行,就像我們的情況一樣。\n",
">\n",
"> 當你使用 **`new_data = NULL`** 來 **`bake()`** 一個已準備好的配方時,你會得到在定義配方時提供的數據,但這些數據已經經過了預處理步驟。\n",
"\n",
"現在讓我們保存一份這些數據的副本,以便在未來的課程中使用:\n"
],
"metadata": {
"id": "HEu80HZ8L7ae"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Save preprocessed data\r\n",
"write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
],
"outputs": [],
"metadata": {
"id": "cBmCbIgrMOI6"
}
},
{
"cell_type": "markdown",
"source": [
"這個新的 CSV 現在可以在根目錄的資料夾中找到。\n",
"\n",
"**🚀挑戰**\n",
"\n",
"這份課程包含多個有趣的數據集。深入探索 `data` 資料夾,看看是否有任何數據集適合用於二元或多類別分類?你會對這些數據集提出什麼問題?\n",
"\n",
"## [**課後測驗**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)\n",
"\n",
"## **回顧與自學**\n",
"\n",
"- 查看 [themis 套件](https://github.com/tidymodels/themis)。我們還可以使用哪些其他技術來處理不平衡數據?\n",
"\n",
"- Tidy models [參考網站](https://www.tidymodels.org/start/)。\n",
"\n",
"- H. Wickham 和 G. Grolemund[*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/)。\n",
"\n",
"#### 特別感謝:\n",
"\n",
"[`Allison Horst`](https://twitter.com/allison_horst/) 創作了令人驚嘆的插圖,使 R 更加親切和吸引人。可以在她的 [畫廊](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM) 中找到更多插圖。\n",
"\n",
"[Cassie Breviu](https://www.twitter.com/cassieview) 和 [Jen Looper](https://www.twitter.com/jenlooper) 創作了這個模組的原始 Python 版本 ♥️\n",
"\n",
"<p >\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\n",
" width=\"600\"/>\n",
" <figcaption>插圖由 @allison_horst 提供</figcaption>\n"
],
"metadata": {
"id": "WQs5621pMGwf"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**免責聲明** \n本文件已使用人工智能翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。儘管我們致力於提供準確的翻譯,但請注意,自動翻譯可能包含錯誤或不準確之處。原始語言的文件應被視為權威來源。對於重要資訊,建議使用專業的人類翻譯。我們對因使用此翻譯而引起的任何誤解或錯誤解釋不承擔責任。\n"
]
}
]
}