{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_10-R.ipynb",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
},
"coopTranslator": {
"original_hash": "2621e24705e8100893c9bf84e0fc8aef",
"translation_date": "2025-08-29T23:58:17+00:00",
"source_file": "4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb",
"language_code": "mo"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [],
"metadata": {
"id": "ItETB4tSFprR"
}
},
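{
"cell_type": "markdown",
"source": [
"## Introduction to classification: Clean, prep, and visualize your data\n",
"\n",
"In these four lessons, you will explore a fundamental focus of classic machine learning: *classification*. We will walk through using various classification algorithms with a dataset about the cuisines of Asia and India. Hope you're hungry!\n",
"\n",
"<p >\n",
" <img src=\"../../images/pinch.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper</figcaption>\n",
"\n",
"Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. In classification, you train a model to predict which `category` an item belongs to. If machine learning is all about predicting the values or names of things from datasets, then classification generally falls into two groups: *binary classification* and *multiclass classification*.\n",
"\n",
"Remember:\n",
"\n",
"- **Linear regression** helped you model the relationship between variables and predict where a new data point would fall relative to the fitted line. For example, you could predict numeric values such as *the price of a pumpkin in September versus December*.\n",
"\n",
"- **Logistic regression** helped you discover \"binary categories\": at this price point, *is this pumpkin orange or not orange*?\n",
"\n",
"Classification uses various algorithms to determine a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.\n",
"\n",
"### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)\n",
"\n",
"### **Introduction**\n",
"\n",
"Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value (\"is this email spam or not?\") to complex image classification and segmentation using computer vision, it is always useful to be able to sort data into classes and ask questions of it.\n",
"\n",
"To state the process more scientifically, your classification method creates a predictive model that maps the relationship between input variables and output variables.\n",
"\n",
"<p >\n",
" <img src=\"../../images/binary-multiclass.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Binary vs. multiclass problems for classification algorithms to handle. Infographic by Jen Looper</figcaption>\n",
"\n",
"Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be used to classify data.\n",
"\n",
"Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age`, to determine the *likelihood of developing a given disease*. As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled, and the ML algorithms use those labels to learn to predict the class of each item in a dataset and assign it to a group or outcome.\n",
"\n",
"✅ Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see whether, given a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?\n",
"\n",
"### **Hello 'classifier'**\n",
"\n",
"The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?\n",
"\n",
"Tidymodels offers several different algorithms for classifying data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.\n",
"\n",
"#### **Prerequisite**\n",
"\n",
"For this lesson, we'll require the following packages to clean, prep, and visualize our data:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `DataExplorer`: The [DataExplorer package](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) is meant to simplify and automate the exploratory data analysis (EDA) process and report generation.\n",
"\n",
"- `themis`: The [themis package](https://themis.tidymodels.org/) provides extra recipe steps for dealing with unbalanced data.\n",
"\n",
"You can have them installed as:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"themis\", \"here\"))`\n",
"\n",
"Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you if any are missing.\n"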
],
"metadata": {
"id": "ri5bQxZ-Fz_0"
}
},
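{
"cell_type": "code",
"execution_count": null,
"source": [
"# Install pacman if it is missing, then use it to install (if needed) and load the lesson's packages\r\n",
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\r\n",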
"\r\n",
"pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
],
"outputs": [],
"metadata": {
"id": "KIPxa4elGAPI"
}
},
{
"cell_type": "markdown",
"source": [
"We'll later load these awesome packages and make them available in our current R session. (This is for mere illustration; `pacman::p_load()` already did that for you.)\n"
],
"metadata": {
"id": "YkKAxOJvGD4C"
}
},
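{
"cell_type": "markdown",
"source": [
"## Exercise - Clean and balance your data\n",
"\n",
"The first task at hand, before starting this project, is to clean and **balance** your data to get better results.\n",
"\n",
"Let's meet the data! 🕵️\n"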
],
"metadata": {
"id": "PFkQDlk0GN5O"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Import data\r\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
"\r\n",
"# View the first 5 rows\r\n",
"df %>% \r\n",
" slice_head(n = 5)\r\n"
],
"outputs": [],
"metadata": {
"id": "Qccw7okxGT0S"
}
},
{
"cell_type": "markdown",
"source": [
"Interesting! From the looks of it, the first column is a kind of `id` column. Let's get a little more information about the data.\n"
],
"metadata": {
"id": "XrWnlgSrGVmR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Basic information about the data\r\n",
"df %>%\r\n",
" introduce()\r\n",
"\r\n",
"# Visualize basic information above\r\n",
"df %>% \r\n",
" plot_intro(ggtheme = theme_light())"
],
"outputs": [],
"metadata": {
"id": "4UcGmxRxGieA"
}
},
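{
"cell_type": "markdown",
"source": [
"From the output, we can immediately see that we have `2448` rows and `385` columns, with no missing values. We also have one discrete column, *cuisine*.\n",
"\n",
"## Exercise - Learning about cuisines\n",
"\n",
"Now the work starts to become more interesting. Let's discover the distribution of data per cuisine.\n"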
],
"metadata": {
"id": "AaPubl__GmH5"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Count observations per cuisine\r\n",
"df %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(n)\r\n",
"\r\n",
"# Plot the distribution\r\n",
"theme_set(theme_light())\r\n",
"df %>% \r\n",
" count(cuisine) %>% \r\n",
" ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
" ylab(\"cuisine\")"
],
"outputs": [],
"metadata": {
"id": "FRsBVy5eGrrv"
}
},
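{
"cell_type": "markdown",
"source": [
"There are a number of different cuisines, but the data is unevenly distributed across them. You can fix that! Before doing so, explore a little more.\n",
"\n",
"Next, let's assign each cuisine to its own tibble and find out how much data is available (rows, columns) per cuisine.\n",
"\n",
"> A [tibble](https://tibble.tidyverse.org/) is a modern data frame.\n",
"\n",
"<p >\n",
" <img src=\"../../images/dplyr_filter.jpg\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"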
],
"metadata": {
"id": "vVvyDb1kG2in"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create an individual tibble for each cuisine\r\n",
"thai_df <- df %>% \r\n",
" filter(cuisine == \"thai\")\r\n",
"japanese_df <- df %>% \r\n",
" filter(cuisine == \"japanese\")\r\n",
"chinese_df <- df %>% \r\n",
" filter(cuisine == \"chinese\")\r\n",
"indian_df <- df %>% \r\n",
" filter(cuisine == \"indian\")\r\n",
"korean_df <- df %>% \r\n",
" filter(cuisine == \"korean\")\r\n",
"\r\n",
"\r\n",
"# Find out how much data (rows, columns) is available per cuisine\r\n",
"cat(\" thai_df:\", dim(thai_df), \"\\n\",\r\n",
" \"japanese_df:\", dim(japanese_df), \"\\n\",\r\n",
" \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
" \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
" \"korean_df:\", dim(korean_df))"
],
"outputs": [],
"metadata": {
"id": "0TvXUxD3G8Bk"
}
},
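{
"cell_type": "markdown",
"source": [
"## **Exercise - Discovering top ingredients by cuisine using dplyr**\n",
"\n",
"Now you can dig deeper into the data and learn what the typical ingredients per cuisine are. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.\n",
"\n",
"Create a function `create_ingredient()` in R that returns an ingredient data frame. This function starts by dropping an unhelpful column and then sorts ingredients by their count.\n",
"\n",
"The basic structure of a function in R is:\n",
"\n",
"`myFunction <- function(arglist){`\n",
"\n",
"**`...`**\n",
"\n",
"**`return`**`(value)`\n",
"\n",
"`}`\n",
"\n",
"A tidy introduction to R functions can be found [here](https://skirmer.github.io/presentations/functions_with_r.html#1).\n",
"\n",
"Let's get right into it! We'll make use of [dplyr verbs](https://dplyr.tidyverse.org/), which we have been learning about in previous lessons, along with one verb from tidyr. As a recap:\n",
"\n",
"- `dplyr::select()`: helps you pick which **columns** to keep or exclude.\n",
"\n",
"- `tidyr::pivot_longer()`: helps you \"lengthen\" data, increasing the number of rows and decreasing the number of columns.\n",
"\n",
"- `dplyr::group_by()` and `dplyr::summarise()`: help you find summary statistics for different groups and put them in a nice table.\n",
"\n",
"- `dplyr::filter()`: creates a subset of the data containing only the rows that satisfy your conditions.\n",
"\n",
"- `dplyr::mutate()`: helps you create or modify columns.\n",
"\n",
"Check out this [*art*-filled learnr tutorial](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) by Allison Horst, which introduces some useful data wrangling functions in dplyr *(part of the Tidyverse)*.\n"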
],
"metadata": {
"id": "K3RF5bSCHC76"
}
},
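{
"cell_type": "markdown",
"source": [
"As a warm-up, the verbs recapped above can be tried on a tiny made-up tibble before they are combined into a function. This is only a sketch: the `toy` data, its column names, and the derived `share` column are assumptions for illustration and are not part of the cuisines dataset.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"library(tidyr)\n",
"\n",
"# Hypothetical toy data: one row per recipe, one 0/1 column per ingredient\n",
"toy <- tibble(\n",
"  id      = 1:4,\n",
"  cuisine = c(\"thai\", \"thai\", \"indian\", \"indian\"),\n",
"  basil   = c(1, 1, 0, 0),\n",
"  cumin   = c(0, 0, 1, 1),\n",
"  rice    = c(1, 0, 1, 1)\n",
")\n",
"\n",
"toy_counts <- toy %>%\n",
"  # Drop the id column\n",
"  select(-id) %>%\n",
"  # Lengthen: one row per (recipe, ingredient) pair\n",
"  pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>%\n",
"  # Total occurrences of each ingredient within each cuisine\n",
"  group_by(cuisine, ingredients) %>%\n",
"  summarise(n_instances = sum(count), .groups = \"drop\") %>%\n",
"  # Keep only ingredients that actually occur\n",
"  filter(n_instances != 0) %>%\n",
"  # Add a derived column: share of all counted occurrences\n",
"  mutate(share = n_instances / sum(n_instances))\n",
"\n",
"toy_counts\n",
"```\n",
"\n",
"Grouping by `cuisine` here is a choice made for the toy data; the lesson's own function works on a single-cuisine tibble and therefore only needs to group by ingredient.\n"
],
"metadata": {}
},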
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a function that returns the top ingredients by class\r\n",
"\r\n",
"create_ingredient <- function(df){\r\n",
" \r\n",
" # Drop the id column, which is the first column\r\n",
" ingredient_df <- df %>% select(-1) %>% \r\n",
" # Reshape the data to a long format\r\n",
" pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
" # Count how often each ingredient appears in the given cuisine\r\n",
" group_by(ingredients) %>% \r\n",
" summarise(n_instances = sum(count)) %>% \r\n",
" filter(n_instances != 0) %>% \r\n",
" # Arrange by descending order\r\n",
" arrange(desc(n_instances)) %>% \r\n",
" mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
" \r\n",
" \r\n",
" return(ingredient_df)\r\n",
"} # End of function"
],
"outputs": [],
"metadata": {
"id": "uB_0JR82HTPa"
}
},
{
"cell_type": "markdown",
"source": [
"Now we can use the function to get an idea of the ten most popular ingredients by cuisine. Let's take it for a spin with `thai_df`!\n"
],
"metadata": {
"id": "h9794WF8HWmc"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Call create_ingredient and display popular ingredients\r\n",
"thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
"\r\n",
"thai_ingredient_df %>% \r\n",
" slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "agQ-1HrcHaEA"
}
},
{
"cell_type": "markdown",
"source": [
"In the previous section, we used `geom_col()`; now let's see how you can use `geom_bar` to create bar charts. Use `?geom_bar` for further reading.\n"
],
"metadata": {
"id": "kHu9ffGjHdcX"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a bar chart of the most popular ingredients in Thai cuisine\r\n",
"thai_ingredient_df %>% \r\n",
" slice_head(n = 10) %>% \r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
"metadata": {
"id": "fb3Bx_3DHj6e"
}
},
{
"cell_type": "markdown",
"source": [
"Let's do the same for the Japanese data.\n"
],
"metadata": {
"id": "RHP_xgdkHnvM"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
"create_ingredient(df = japanese_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")\r\n"
],
"outputs": [],
"metadata": {
"id": "019v8F0XHrRU"
}
},
{
"cell_type": "markdown",
"source": [
"What about the Chinese cuisines?\n"
],
"metadata": {
"id": "iIGM7vO8Hu3v"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
"create_ingredient(df = chinese_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
"metadata": {
"id": "lHd9_gd2HyzU"
}
},
{
"cell_type": "markdown",
"source": [
"Next, let's take a look at the Indian cuisines.\n"
],
"metadata": {
"id": "ir8qyQbNH1c7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Indian cuisines and make bar chart\r\n",
"create_ingredient(df = indian_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
"metadata": {
"id": "ApukQtKjH5FO"
}
},
{
"cell_type": "markdown",
"source": [
"Finally, let's plot the Korean ingredients.\n"
],
"metadata": {
"id": "qv30cwY1H-FM"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Get popular ingredients for Korean cuisines and make bar chart\r\n",
"create_ingredient(df = korean_df) %>% \r\n",
" slice_head(n = 10) %>%\r\n",
" ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
" geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
" xlab(\"\") + ylab(\"\")"
],
"outputs": [],
"metadata": {
"id": "lumgk9cHIBie"
}
},
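{
"cell_type": "markdown",
"source": [
"From the data visualizations, we can now drop the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.\n",
"\n",
"Everyone loves rice, garlic, and ginger!\n"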
],
"metadata": {
"id": "iO4veMXuIEta"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Drop id column, rice, garlic and ginger from our original data set\r\n",
"df_select <- df %>% \r\n",
" select(-c(1, rice, garlic, ginger))\r\n",
"\r\n",
"# Display new data set\r\n",
"df_select %>% \r\n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "iHJPiG6rIUcK"
}
},
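{
"cell_type": "markdown",
"source": [
"The column-dropping idiom used above mixes a column position with column names inside a single negative selection. A minimal sketch on made-up data shows the effect; the `toy` tibble and its columns are assumptions for illustration only.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"\n",
"# Hypothetical toy data with an id column in first position\n",
"toy <- tibble(\n",
"  id      = 1:3,\n",
"  cuisine = c(\"thai\", \"indian\", \"korean\"),\n",
"  rice    = c(1, 1, 0),\n",
"  basil   = c(1, 0, 0)\n",
")\n",
"\n",
"# Drop the first column (by position) and rice (by name) in one call\n",
"toy_select <- toy %>%\n",
"  select(-c(1, rice))\n",
"\n",
"names(toy_select)\n",
"```\n",
"\n",
"Selecting by position is convenient when a column has an awkward or empty name, but it silently depends on column order, so selecting by name is the safer choice whenever the name is known.\n"
],
"metadata": {}
},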
{
"cell_type": "markdown",
"source": [
"## Preprocessing data using recipes 👩‍🍳👨‍🍳 - Dealing with imbalanced data ⚖️\n",
"\n",
"<p >\n",
" <img src=\"../../images/recipes.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"Given that this lesson is about cuisines, we have to put `recipes` into context.\n",
"\n",
"Tidymodels provides yet another neat package: `recipes`, a package for preprocessing data.\n"
],
"metadata": {
"id": "kkFd-JxdIaL6"
}
},
{
"cell_type": "markdown",
"source": [
"Let's take a look at the distribution of our cuisines again.\n"
],
"metadata": {
"id": "6l2ubtTPJAhY"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Distribution of cuisines\r\n",
"old_label_count <- df_select %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(desc(n))\r\n",
"\r\n",
"old_label_count"
],
"outputs": [],
"metadata": {
"id": "1e-E9cb7JDVi"
}
},
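{
"cell_type": "markdown",
"source": [
"As you can see, the number of observations per cuisine is quite uneven. There are almost three times as many Korean recipes as Thai recipes. Imbalanced data often has negative effects on model performance. Think about a binary classification: if most of your data belongs to one class, an ML model will tend to predict that class more frequently, simply because there is more data for it. Balancing the data corrects this skew and helps remove the imbalance. Many models perform best when the number of observations per class is equal, and thus tend to struggle with unbalanced data.\n",
"\n",
"There are two main ways of dealing with imbalanced data sets:\n",
"\n",
"- adding observations to the minority class: `over-sampling`, for example with the SMOTE algorithm\n",
"\n",
"- removing observations from the majority class: `under-sampling`\n",
"\n",
"Let's now demonstrate how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set in order to get it ready for data analysis.\n"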
],
"metadata": {
"id": "soAw6826JKx9"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load themis package for dealing with imbalanced data\r\n",
"library(themis)\r\n",
"\r\n",
"# Create a recipe for preprocessing data\r\n",
"cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
" step_smote(cuisine)\r\n",
"\r\n",
"cuisines_recipe"
],
"outputs": [],
"metadata": {
"id": "HS41brUIJVJy"
}
},
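{
"cell_type": "markdown",
"source": [
"For comparison, the under-sampling route listed earlier can be sketched in the same style with `themis::step_downsample()`. The toy data below (`toy`, `x1`, `x2`) is made up for illustration, and this is not the approach this lesson takes; the lesson continues with the SMOTE recipe defined above.\n",
"\n",
"```r\n",
"library(recipes)\n",
"library(themis)\n",
"library(dplyr)\n",
"\n",
"set.seed(2021)\n",
"\n",
"# Hypothetical imbalanced data: 9 korean rows versus 3 thai rows\n",
"toy <- tibble(\n",
"  cuisine = factor(rep(c(\"korean\", \"thai\"), times = c(9, 3))),\n",
"  x1 = rnorm(12),\n",
"  x2 = rnorm(12)\n",
")\n",
"\n",
"# Randomly drop majority-class rows until both classes have the same size\n",
"balanced <- recipe(cuisine ~ ., data = toy) %>%\n",
"  step_downsample(cuisine) %>%\n",
"  prep() %>%\n",
"  bake(new_data = NULL)\n",
"\n",
"# Each cuisine is now left with 3 rows\n",
"balanced %>%\n",
"  count(cuisine)\n",
"```\n",
"\n",
"Under-sampling discards real observations, which is one reason over-sampling is often preferred when the minority classes are small, as they are in the cuisines data.\n"
],
"metadata": {}
},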
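{
"cell_type": "markdown",
"source": [
"Let's break down our preprocessing steps.\n",
"\n",
"- The call to `recipe()` with a formula tells the recipe the *roles* of the variables, using the `df_select` data as the reference. For instance, the `cuisine` column has been assigned an `outcome` role, while the rest of the columns have been assigned a `predictor` role.\n",
"\n",
"- [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) creates a *specification* of a recipe step that synthetically generates new examples of the minority class using the nearest neighbors of these cases.\n",
"\n",
"Now, if we want to see the preprocessed data, we have to [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) and [**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) our recipe.\n",
"\n",
"`prep()`: estimates the required parameters from a training set, so that they can later be applied to other data sets.\n",
"\n",
"`bake()`: takes a prepped recipe and applies its operations to any data set.\n"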
],
"metadata": {
"id": "Yb-7t7XcJaC8"
}
},
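{
"cell_type": "markdown",
"source": [
"To make the split between estimating and applying more concrete, here is a small sketch on made-up data. The `train` and `test` tibbles and the use of `step_normalize()` are assumptions chosen for illustration; they are not part of this lesson's recipe.\n",
"\n",
"```r\n",
"library(recipes)\n",
"library(dplyr)\n",
"\n",
"# Hypothetical training and new data\n",
"train <- tibble(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))\n",
"test  <- tibble(y = c(5, 6),       x = c(25, 50))\n",
"\n",
"# Specification only: nothing is computed yet\n",
"rec <- recipe(y ~ x, data = train) %>%\n",
"  step_normalize(all_numeric_predictors())\n",
"\n",
"# prep() estimates the mean and standard deviation of x from train only\n",
"rec_prepped <- prep(rec)\n",
"\n",
"# bake() applies those stored estimates\n",
"baked_train <- bake(rec_prepped, new_data = NULL)  # the processed training data\n",
"baked_test  <- bake(rec_prepped, new_data = test)  # new data, scaled with the training mean and sd\n",
"\n",
"baked_test\n",
"```\n",
"\n",
"Because the training mean of `x` is 25, the first test row is centered to 0: the new data is transformed with the training estimates rather than with its own statistics.\n"
],
"metadata": {}
},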
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Prep and bake the recipe\r\n",
"preprocessed_df <- cuisines_recipe %>% \r\n",
" prep() %>% \r\n",
" bake(new_data = NULL) %>% \r\n",
" relocate(cuisine)\r\n",
"\r\n",
"# Display data\r\n",
"preprocessed_df %>% \r\n",
" slice_head(n = 5)\r\n",
"\r\n",
"# Quick summary stats\r\n",
"preprocessed_df %>% \r\n",
" introduce()"
],
"outputs": [],
"metadata": {
"id": "9QhSgdpxJl44"
}
},
{
"cell_type": "markdown",
"source": [
"Let's now check the distribution of our cuisines and compare it with the imbalanced data.\n"
],
"metadata": {
"id": "dmidELh_LdV7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Distribution of cuisines\r\n",
"new_label_count <- preprocessed_df %>% \r\n",
" count(cuisine) %>% \r\n",
" arrange(desc(n))\r\n",
"\r\n",
"list(new_label_count = new_label_count,\r\n",
" old_label_count = old_label_count)"
],
"outputs": [],
"metadata": {
"id": "aSh23klBLwDz"
}
},
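{
"cell_type": "markdown",
"source": [
"Yum! The data is nice and clean, balanced, and very delicious 😋!\n",
"\n",
"> Normally, a recipe is used as a preprocessor for modeling, where it defines what steps should be applied to a data set in order to get it ready for modeling. In that case, a `workflow()` is typically used (as we have already seen in our previous lessons) instead of manually estimating a recipe.\n",
">\n",
"> As such, you don't typically need to **`prep()`** and **`bake()`** recipes when you use tidymodels, but they are helpful functions to have in your toolkit for confirming that recipes are doing what you expect, as in our case.\n",
">\n",
"> When you **`bake()`** a prepped recipe with **`new_data = NULL`**, you get back the data that you provided when defining the recipe, but with the preprocessing steps applied.\n",
"\n",
"Let's now save a copy of this data for use in future lessons:\n"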
],
"metadata": {
"id": "HEu80HZ8L7ae"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Save preprocessed data\r\n",
"write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
],
"outputs": [],
"metadata": {
"id": "cBmCbIgrMOI6"
}
},
{
"cell_type": "markdown",
"source": [
"This fresh CSV can now be found in the root data folder.\n",
"\n",
"**🚀Challenge**\n",
"\n",
"This curriculum contains several interesting datasets. Dig through the `data` folders and see whether any contain datasets that would be appropriate for binary or multiclass classification. What questions would you ask of these datasets?\n",
"\n",
"## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)\n",
"\n",
"## **Review & Self Study**\n",
"\n",
"- Check out the [themis package](https://github.com/tidymodels/themis). What other techniques could we use to deal with imbalanced data?\n",
"\n",
"- Tidymodels [reference website](https://www.tidymodels.org/start/).\n",
"\n",
"- H. Wickham and G. Grolemund, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).\n",
"\n",
"#### THANK YOU TO:\n",
"\n",
"[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations in her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
"\n",
"[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
"\n",
"<p >\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\n",
" width=\"600\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
],
"metadata": {
"id": "WQs5621pMGwf"
}
},
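{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer** \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"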
]
}
]
}