{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "lesson_12-R.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "fab50046ca413a38939d579f8432274f", "translation_date": "2025-08-29T23:50:53+00:00", "source_file": "4-Classification/3-Classifiers-2/solution/R/lesson_12-R.ipynb", "language_code": "mo" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "jsFutf_ygqSx" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "HD54bEefgtNO" }, "source": [ "## Cuisine classifiers 2\n", "\n", "In this second classification lesson, we will explore `more ways` to classify categorical data. We will also learn about the ramifications of choosing one classifier over another.\n", "\n", "### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/23/)\n", "\n", "### **Prerequisite**\n", "\n", "We assume that you have completed the previous lessons, since we will be carrying forward some of the concepts covered there.\n", "\n", "For this lesson, we'll require the following packages:\n", "\n", "- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n", "\n", "- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages) for modeling and machine learning.\n", "\n", "- `themis`: The [themis package](https://themis.tidymodels.org/) provides extra recipe steps for dealing with unbalanced data.\n", "\n", "You can install them with:\n", "\n", "`install.packages(c(\"tidyverse\", \"tidymodels\", \"kernlab\", \"themis\", \"ranger\", \"xgboost\", \"kknn\"))`\n", "\n", "Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n" ] }, { "cell_type": "code", "metadata": { "id": "vZ57IuUxgyQt" }, "source": [ "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n", "\n", "pacman::p_load(tidyverse, tidymodels, themis, kernlab, ranger, xgboost, kknn)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "z22M-pj4g07x" }, "source": [ "## **1. 
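A classification map**\n", "\n", "In our [previous lesson](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification/2-Classifiers-1), we tried to address the question: how do we choose between multiple models? To a great extent, it depends on the characteristics of the data and the type of problem we want to solve (for instance, classification or regression).\n", "\n", "Previously, we learned about the various options you have when classifying data using Microsoft's cheat sheet. Scikit-learn, Python's machine learning framework, offers a similar but more granular cheat sheet that can further help narrow down your estimators (another term for classifiers):\n", "\n", "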
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "u1i3xRIVg7vG" }, "source": [ "> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read the documentation.\n", ">\n", "> The [Tidymodels reference site](https://www.tidymodels.org/find/parsnip/#models) also provides excellent documentation about the different types of model.\n", "\n", "### **The plan** 🗺️\n", "\n", "This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:\n", "\n", "- We have more than 50 samples\n", "\n", "- We want to predict a category\n", "\n", "- We have labeled data\n", "\n", "- We have fewer than 100K samples\n", "\n", "- ✨ We can choose a Linear SVC\n", "\n", "- If that doesn't work, since we have numeric data\n", "\n", "  - We can try a ✨ KNeighbors Classifier\n", "\n", "  - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers\n", "\n", "This is a very helpful trail to follow. Now, let's get right into it using the [tidymodels](https://www.tidymodels.org/) modelling framework: a consistent and flexible collection of R packages developed to encourage good statistical practice 😊.\n", "\n", "## 2. Split the data and deal with an imbalanced data set\n", "\n", "From our previous lessons, we learned that there is a set of common ingredients across the cuisines. We also saw that the number of observations per cuisine is quite unevenly distributed.\n", "\n", "We'll deal with these by:\n", "\n", "- Dropping the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.\n", "\n", "- Using a `recipe` that preprocesses the data to get it ready for modelling by applying an `over-sampling` algorithm.\n", "\n", "We already looked at the above in the previous lesson, so this should be a breeze 🥳!\n" ] }, { "cell_type": "code", "metadata": { "id": "6tj_rN00hClA" }, "source": [ "# Load the core Tidyverse and Tidymodels packages\n", "library(tidyverse)\n", "library(tidymodels)\n", "\n", "# Load the original cuisines data\n", "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\n", "\n", "# Drop id column, rice, garlic and ginger from our original data set\n", "df_select <- df %>% \n", "  select(-c(1, rice, garlic, ginger)) %>%\n", "  # Encode cuisine column as categorical\n", "  mutate(cuisine = factor(cuisine))\n", "\n", "\n", "# Create data split specification\n", "set.seed(2056)\n", "cuisines_split <- initial_split(data = df_select,\n", "                                strata = cuisine,\n", "                                prop = 0.7)\n", "\n", "# Extract the data in each split\n", "cuisines_train <- training(cuisines_split)\n", "cuisines_test <- testing(cuisines_split)\n", "\n", "# Display distribution of cuisines in the training 
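set\n", "cuisines_train %>% \n", "  count(cuisine) %>% \n", "  arrange(desc(n))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zFin5yw3hHb1" }, "source": [ "### Deal with imbalanced data\n", "\n", "Imbalanced data often has negative effects on model performance. Many models perform best when the number of observations in each class is equal and, thus, tend to struggle with unbalanced data.\n", "\n", "There are two main ways of dealing with imbalanced data sets:\n", "\n", "- adding observations to the minority class: `over-sampling`, e.g. using the SMOTE algorithm, which synthetically generates new examples of the minority class using the nearest neighbors of these cases.\n", "\n", "- removing observations from the majority class: `under-sampling`\n", "\n", "In our previous lesson, we demonstrated how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes which steps should be applied to a data set in order to get it ready for data analysis. In our case, we want an equal distribution in the number of our cuisines for our `training set`. Let's get right into it.\n" ] }, { "cell_type": "code", "metadata": { "id": "cRzTnHolhLWd" }, "source": [ "# Load themis package for dealing with imbalanced data\n", "library(themis)\n", "\n", "# Create a recipe for preprocessing training data\n", "cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%\n", "  step_smote(cuisine) \n", "\n", "# Print recipe\n", "cuisines_recipe" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "KxOQ2ORhhO81" }, "source": [ "Now we are ready to train models 👩‍💻👨‍💻!\n", "\n", "## 3. Beyond multinomial regression models\n", "\n", "In our previous lesson, we looked at multinomial regression models. Let's explore some more flexible models for classification.\n", "\n", "### Support Vector Machines\n", "\n", "In the context of classification, `Support Vector Machines` is a machine learning technique that tries to find a *hyperplane* that \"best\" separates the classes. Let's look at a simple example:\n", "\n", "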
https://commons.wikimedia.org/w/index.php?curid=22877598
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "C4Wsd0vZhXYu" }, "source": [ "H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximal margin.\n", "\n", "#### Linear Support Vector Classifier\n", "\n", "The Support Vector Classifier (SVC) is a machine learning technique in the Support Vector Machines (SVM) family. In SVC, the hyperplane is chosen to correctly separate `most` of the training observations, but `may misclassify` a few observations. By allowing some points to be on the wrong side, the SVM becomes more robust to outliers and hence generalizes better to new data. The parameter that regulates this violation is referred to as `cost`, which has a default value of 1 (see `help(\"svm_poly\")`).\n", "\n", "Let's create a linear SVC by setting `degree = 1` in a polynomial SVM model.\n" ] }, { "cell_type": "code", "metadata": { "id": "vJpp6nuChlBz" }, "source": [ "# Make a linear SVC specification\n", "svc_linear_spec <- svm_poly(degree = 1) %>% \n", "  set_engine(\"kernlab\") %>% \n", "  set_mode(\"classification\")\n", "\n", "# Bundle specification and recipe into a workflow\n", "svc_linear_wf <- workflow() %>% \n", "  add_recipe(cuisines_recipe) %>% \n", "  add_model(svc_linear_spec)\n", "\n", "# Print out workflow\n", "svc_linear_wf" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "rDs8cWNkhoqu" }, "source": [ "Now that we have captured the preprocessing steps and model specification in a *workflow*, we can go ahead and train the linear SVC and evaluate the results while we are at it. For performance metrics, let's create a metric set that will evaluate: `accuracy`, `sensitivity`, `Positive Predictive Value` and `F Measure`.\n", "\n", "> `augment()` will add column(s) for predictions to the given data.\n" ] }, { "cell_type": "code", "metadata": { "id": "81wiqcwuhrnq" }, "source": [ "# Train a linear SVC model\n", "svc_linear_fit <- svc_linear_wf %>% \n", "  fit(data = cuisines_train)\n", "\n", "# Create a metric set\n", "eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)\n", "\n", "\n", "# Make predictions and Evaluate model performance\n", "svc_linear_fit %>% \n", "  augment(new_data = cuisines_test) %>% \n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0UFQvHf-huo3" }, "source": [ "#### Support Vector Machine\n", "\n", "The support vector machine (SVM) is an extension of the support vector classifier that accommodates a non-linear boundary between the classes. In essence, SVMs use the *kernel trick* to enlarge the feature space to adapt to non-linear relationships between classes. One popular and extremely flexible kernel function is the *radial basis function*. Let's see how it performs on our data.\n" ] }, { "cell_type": "code", "metadata": { "id": "-KX4S8mzhzmp" }, "source": [ 
"set.seed(2056)\n", "\n", "# Make an RBF SVM specification\n", "svm_rbf_spec <- svm_rbf() %>% \n", "  set_engine(\"kernlab\") %>% \n", "  set_mode(\"classification\")\n", "\n", "# Bundle specification and recipe into a workflow\n", "svm_rbf_wf <- workflow() %>% \n", "  add_recipe(cuisines_recipe) %>% \n", "  add_model(svm_rbf_spec)\n", "\n", "\n", "# Train an RBF model\n", "svm_rbf_fit <- svm_rbf_wf %>% \n", "  fit(data = cuisines_train)\n", "\n", "\n", "# Make predictions and Evaluate model performance\n", "svm_rbf_fit %>% \n", "  augment(new_data = cuisines_test) %>% \n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "QBFSa7WSh4HQ" }, "source": [ "Excellent 🤩!\n", "\n", "> ✅ Please see:\n", ">\n", "> - [*Support Vector Machines*](https://bradleyboehmke.github.io/HOML/svm.html), Hands-on Machine Learning with R\n", ">\n", "> - [*Support Vector Machines*](https://www.statlearning.com/), An Introduction to Statistical Learning with Applications in R\n", ">\n", "> for further reading.\n", "\n", "### Nearest Neighbor classifiers\n", "\n", "*K*-nearest neighbor (KNN) is an algorithm in which each observation is predicted based on its *similarity* to other observations.\n", "\n", "Let's fit one to our data.\n" ] }, { "cell_type": "code", "metadata": { "id": "k4BxxBcdh9Ka" }, "source": [ "# Make a KNN specification\n", "knn_spec <- nearest_neighbor() %>% \n", "  set_engine(\"kknn\") %>% \n", "  set_mode(\"classification\")\n", "\n", "# Bundle recipe and model specification into a workflow\n", "knn_wf <- workflow() %>% \n", "  add_recipe(cuisines_recipe) %>% \n", "  add_model(knn_spec)\n", "\n", "# Train a KNN model\n", "knn_wf_fit <- knn_wf %>% \n", "  fit(data = cuisines_train)\n", "\n", "\n", "# Make predictions and Evaluate model performance\n", "knn_wf_fit %>% \n", "  augment(new_data = cuisines_test) %>% \n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "HaegQseriAcj" }, "source": [ "It appears that this model is not performing that well. Changing the model's arguments (see `help(\"nearest_neighbor\")`) may improve its performance. Be sure to try it out.\n", "\n", "> ✅ Please see:\n", ">\n", "> - [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)\n", ">\n", "> - [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)\n", ">\n", "> to learn more about *K*-Nearest Neighbors classifiers.\n", "\n", "### Ensemble classifiers\n", "\n", "Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by:\n", "\n", "`bagging`: applying an *averaging function* to a collection of base models\n", "\n", "`boosting`: building a sequence of models that build on one another to improve predictive performance.\n", "\n", "Let's start by trying out a Random Forest model, which builds a large collection of decision trees and then applies an averaging function to them for a better overall model.\n" ] }, { "cell_type": "code", "metadata": { "id": "49DPoVs6iK1M" }, "source": [ "# Make a random forest specification\n", "rf_spec <- rand_forest() %>% \n", "  set_engine(\"ranger\") %>% \n", "  set_mode(\"classification\")\n", "\n", "# Bundle recipe and model specification into a workflow\n", "rf_wf <- workflow() %>% \n", "  add_recipe(cuisines_recipe) %>% \n", "  add_model(rf_spec)\n", "\n", "# Train a random forest model\n", "rf_wf_fit <- rf_wf %>% \n", "  fit(data = cuisines_train)\n", "\n", "\n", "# Make predictions and Evaluate model performance\n", "rf_wf_fit %>% \n", "  augment(new_data = cuisines_test) %>% \n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "RGVYwC_aiUWc" }, "source": [ "Good job 👏!\n", "\n", "Let's also experiment with a Boosted Tree model.\n", "\n", "Boosted Tree defines an ensemble method that creates a series of sequential decision trees, where each tree depends on the results of the previous trees, in an attempt to incrementally reduce the error. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct them.\n", "\n", "There are different ways to fit this model (see `help(\"boost_tree\")`). In this example, we'll fit Boosted trees via the `xgboost` engine.\n" ] }, { "cell_type": "code", "metadata": { "id": "Py1YWo-micWs" }, "source": [ "# Make a boosted tree specification\n", "boost_spec <- boost_tree(trees = 200) %>% \n", "  set_engine(\"xgboost\") %>% \n", "  set_mode(\"classification\")\n", "\n", "# Bundle recipe and model specification into a workflow\n", "boost_wf <- workflow() %>% \n", "  add_recipe(cuisines_recipe) %>% \n", "  add_model(boost_spec)\n", "\n", "# Train a boosted tree model\n", "boost_wf_fit <- 
boost_wf %>% \n", "  fit(data = cuisines_train)\n", "\n", "\n", "# Make predictions and Evaluate model performance\n", "boost_wf_fit %>% \n", "  augment(new_data = cuisines_test) %>% \n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zNQnbuejigZM" }, "source": [ "> ✅ Please see:\n", ">\n", "> - [Machine Learning for Social Scientists](https://cimentadaj.github.io/ml_socsci/tree-based-methods.html#random-forests)\n", ">\n", "> - [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)\n", ">\n", "> - [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)\n", ">\n", "> - - Explores the AdaBoost model, which is a good alternative to xgboost.\n", ">\n", "> to learn more about Ensemble classifiers.\n", "\n", "## 4. Extra - comparing multiple models\n", "\n", "We have fitted quite a number of models in this lab 🙌. It can become tedious or onerous to create a lot of workflows from different sets of preprocessors and/or model specifications and then calculate the performance metrics one by one.\n", "\n", "Let's see if we can address this by creating a function that fits a list of workflows on the training set and then returns the performance metrics based on the test set. We'll use `map()` and `map_dfr()` from the [purrr](https://purrr.tidyverse.org/) package to apply functions to each element of a list.\n", "\n", "> [`map()`](https://purrr.tidyverse.org/reference/map.html) functions allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the [`map()`](https://purrr.tidyverse.org/reference/map.html) functions is the [iteration chapter](http://r4ds.had.co.nz/iteration.html) in R for Data Science.\n" ] }, { "cell_type": "code", "metadata": { "id": "Qzb7LyZnimd2" }, "source": [ "set.seed(2056)\n", "\n", "# Create a metric set\n", "eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)\n", "\n", "# Define a function that returns performance metrics\n", "compare_models <- function(workflow_list, train_set, test_set){\n", "  \n", "  suppressWarnings(\n", "    # Fit each model to the train_set\n", "    map(workflow_list, fit, data = train_set) %>% \n", "    # Make predictions on the test set\n", "      map_dfr(augment, new_data = test_set, .id = \"model\") %>%\n", "    # Select desired columns\n", "      select(model, cuisine, .pred_class) %>% \n", "    # Evaluate model performance\n", "      group_by(model) %>% \n", "      eval_metrics(truth = cuisine, estimate = .pred_class) %>% \n", "      ungroup()\n", "  )\n", "  
\n", "} # End of function" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Fwa712sNisDA" }, "source": [] }, { "cell_type": "code", "metadata": { "id": "3i4VJOi2iu-a" }, "source": [ "# Make a list of workflows\n", "workflow_list <- list(\n", "  \"svc\" = svc_linear_wf,\n", "  \"svm\" = svm_rbf_wf,\n", "  \"knn\" = knn_wf,\n", "  \"random_forest\" = rf_wf,\n", "  \"xgboost\" = boost_wf)\n", "\n", "# Call the function\n", "set.seed(2056)\n", "perf_metrics <- compare_models(workflow_list = workflow_list, train_set = cuisines_train, test_set = cuisines_test)\n", "\n", "# Print out performance metrics\n", "perf_metrics %>% \n", "  group_by(.metric) %>% \n", "  arrange(desc(.estimate)) %>% \n", "  slice_head(n=7)\n", "\n", "# Compare accuracy\n", "perf_metrics %>% \n", "  filter(.metric == \"accuracy\") %>% \n", "  arrange(desc(.estimate))\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "KuWK_lEli4nW" }, "source": [ "The [**workflowsets**](https://workflowsets.tidymodels.org/) package allows users to create and easily fit a large number of models, but it is mostly designed to work with resampling techniques such as `cross-validation`, an approach we have yet to cover.\n", "\n", "## **🚀Challenge**\n", "\n", "Each of these techniques has a large number of parameters that you can tweak, for instance `cost` in SVMs, `neighbors` in KNN and `mtry` (Randomly Selected Predictors) in Random Forest.\n", "\n", "Research each model's default parameters and think about what tweaking these parameters would mean for the model's quality.\n", "\n", "To find out more about a particular model and its parameters, use: `help(\"model\")`, e.g. `help(\"rand_forest\")`\n", "\n", "> In practice, we usually *estimate* the *best values* for these parameters by training many models on a `simulated data set` and measuring how well all these models perform. This process is called **tuning**.\n", "\n", "### [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/24/)\n", "\n", "### **Review & Self Study**\n", "\n", "There's a lot of jargon in these lessons, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful terminology!\n", "\n", "#### THANK YOU TO:\n", "\n", "[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n", "\n", 
"[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n", "\n", "Happy Learning,\n", "\n", "[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n", "\n", "
Artwork by @allison_horst
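\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**Disclaimer**:  \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n" ] } ] }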