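{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "lesson_12-R.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "fab50046ca413a38939d579f8432274f", "translation_date": "2025-09-03T20:31:31+00:00", "source_file": "4-Classification/3-Classifiers-2/solution/R/lesson_12-R.ipynb", "language_code": "zh" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "jsFutf_ygqSx" }, "source": [ "# Build a classification model: Delicious Asian and Indian Cuisines\n" ] }, { "cell_type": "markdown", "metadata": { "id": "HD54bEefgtNO" }, "source": [ "## Cuisine classifiers 2\n", "\n", "In this second classification lesson, we will explore `more ways` to classify categorical data. We will also learn about the ramifications of choosing one classifier over another.\n", "\n", "### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/23/)\n", "\n", "### **Prerequisite**\n", "\n", "We assume that you have completed the previous lessons, since we will carry forward some of the concepts learned there.\n", "\n", "For this lesson, we need the following packages:\n", "\n", "- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n", "\n", "- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of R packages](https://www.tidymodels.org/packages) for modeling and machine learning.\n", "\n", "- `themis`: The [themis package](https://themis.tidymodels.org/) provides extra recipe steps for dealing with unbalanced data.\n", "\n", "You can install them, together with the model engines used later in this lesson (`kernlab`, `ranger`, `xgboost` and `kknn`), with:\n", "\n", "`install.packages(c(\"tidyverse\", \"tidymodels\", \"kernlab\", \"themis\", \"ranger\", \"xgboost\", \"kknn\"))`\n", "\n", "Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n" ] }, { "cell_type": "code", "metadata": { "id": "vZ57IuUxgyQt" }, "source": [ "suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n", "\n", "pacman::p_load(tidyverse, tidymodels, themis, kernlab, ranger, xgboost, kknn)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "z22M-pj4g07x" }, "source": [ "## **1. A classification map**\n", "\n", "In our [previous lesson](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification/2-Classifiers-1), we tried to address the question: how do we choose between multiple models? To a great extent, it depends on the characteristics of the data and the type of problem we want to solve (for instance, classification or regression).\n", "\n", "Previously, we learned about the various options available for classifying data using Microsoft's cheat sheet. Python's machine learning framework, Scikit-learn, offers a similar but more granular cheat sheet that can further help narrow down your choice of estimators (another term for classifiers):\n", "\n", "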

\n", " \n", "

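\n" ] }, { "cell_type": "markdown", "metadata": { "id": "u1i3xRIVg7vG" }, "source": [ "> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read the documentation.\n", ">\n", "> The [Tidymodels reference site](https://www.tidymodels.org/find/parsnip/#models) also provides excellent documentation about different types of models.\n", "\n", "### **The plan** 🗺️\n", "\n", "This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:\n", "\n", "- We have more than 50 samples\n", "\n", "- We want to predict a category\n", "\n", "- We have labeled data\n", "\n", "- We have fewer than 100K samples\n", "\n", "- ✨ We can choose a Linear SVC\n", "\n", "- If that doesn't work, since we have numeric data\n", "\n", "    - We can try a ✨ KNeighbors Classifier\n", "\n", "    - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers\n", "\n", "This is a very helpful trail to follow. Now, let's get right into it using the [tidymodels](https://www.tidymodels.org/) modelling framework: a consistent and flexible collection of R packages developed to encourage good statistical practice 😊.\n", "\n", "## 2. Split the data and deal with an imbalanced data set\n", "\n", "From our previous lessons, we learnt that there is a set of ingredients common to all of our cuisines. We also saw that the number of observations per cuisine is quite unevenly distributed.\n", "\n", "We'll deal with these issues by:\n", "\n", "- Dropping the most common ingredients, which create confusion between distinct cuisines, using `dplyr::select()`.\n", "\n", "- Using a `recipe` that preprocesses the data to get it ready for modelling by applying an `over-sampling` algorithm.\n", "\n", "We already looked at both steps in the previous lesson, so this should be a breeze 🥳!\n" ] }, { "cell_type": "code", "metadata": { "id": "6tj_rN00hClA" }, "source": [ "# Load the core Tidyverse and Tidymodels packages\n", "library(tidyverse)\n", "library(tidymodels)\n", "\n", "# Load the original cuisines data\n", "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\n", "\n", "# Drop id column, rice, garlic and ginger from our original data set\n", "df_select <- df %>%\n", "  select(-c(1, rice, garlic, ginger)) %>%\n", "  # Encode cuisine column as categorical\n", "  mutate(cuisine = factor(cuisine))\n", "\n", "\n", "# Create data split specification\n", "set.seed(2056)\n", "cuisines_split <- initial_split(data = df_select,\n", "                                strata = cuisine,\n", "                                prop = 0.7)\n", "\n", "# Extract the data in each split\n", "cuisines_train <- training(cuisines_split)\n", "cuisines_test <- testing(cuisines_split)\n", "\n", "# Display distribution of cuisines in the training set\n", "cuisines_train %>%\n", "  count(cuisine) %>%\n", "  arrange(desc(n))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zFin5yw3hHb1" }, "source": [ "### Deal with imbalanced data\n", "\n", "Imbalanced data often has negative effects on model performance. Many models perform best when each class has a similar number of observations, and thus tend to struggle with unbalanced data.\n", "\n", "There are two main ways of dealing with imbalanced data sets:\n", "\n", "- Adding observations to the minority class: `over-sampling`, e.g. using the SMOTE algorithm, which synthetically generates new examples of the minority class from the nearest neighbors of existing cases.\n", "\n", "- Removing observations from the majority class: `under-sampling`\n", "\n", "In our previous lesson, we demonstrated how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes which steps should be applied to a data set to get it ready for data analysis. In our case, we want an equal distribution in the number of cuisines in our `training set`. Let's get right into it.\n" ] }, { "cell_type": "code", "metadata": { "id": "cRzTnHolhLWd" }, "source": [ "# Load themis package for dealing with imbalanced data\n", "library(themis)\n", "\n", "# Create a recipe for preprocessing training data\n", "cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%\n", "  step_smote(cuisine)\n", "\n", "# Print recipe\n", "cuisines_recipe" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "KxOQ2ORhhO81" }, "source": [ "Now we are ready to train models 👩‍💻👨‍💻!\n", "\n", "## 3. Beyond multinomial regression models\n", "\n", "In our previous lesson, we looked at multinomial regression models. Let's explore some more flexible models for classification.\n", "\n", "### Support Vector Machines\n", "\n", "In the context of classification, `Support Vector Machines` is a machine learning technique that tries to find a *hyperplane* that 'best' separates the classes. Let's look at a simple example:\n", "\n", "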

\n", " \n", "

https://commons.wikimedia.org/w/index.php?curid=22877598
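\n" ] }, { "cell_type": "markdown", "metadata": { "id": "C4Wsd0vZhXYu" }, "source": [ "H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximal margin.\n", "\n", "#### Linear Support Vector Classifier\n", "\n", "The support vector classifier (SVC) is a member of the support vector machines (SVM) family of machine learning techniques. In SVC, the hyperplane is chosen to correctly separate `most` of the training observations, but it `may misclassify` a few of them. By allowing some points to be on the wrong side, the SVM becomes more robust to outliers and hence generalizes better to new data. The parameter that regulates this violation is referred to as `cost`, which has a default value of 1 (see `help(\"svm_poly\")`).\n", "\n", "Let's create a linear SVC by setting `degree = 1` in a polynomial SVM model.\n" ] }, { "cell_type": "code", "metadata": { "id": "vJpp6nuChlBz" }, "source": [ "# Make a linear SVC specification\n", "svc_linear_spec <- svm_poly(degree = 1) %>%\n", "  set_engine(\"kernlab\") %>%\n", "  set_mode(\"classification\")\n", "\n", "# Bundle specification and recipe into a workflow\n", "svc_linear_wf <- workflow() %>%\n", "  add_recipe(cuisines_recipe) %>%\n", "  add_model(svc_linear_spec)\n", "\n", "# Print out workflow\n", "svc_linear_wf" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "rDs8cWNkhoqu" }, "source": [ "Now that we have captured the preprocessing steps and the model specification in a *workflow*, we can go ahead and train the linear SVC and evaluate the results while at it. For performance metrics, let's create a metric set that evaluates `accuracy`, `sensitivity`, `positive predictive value` and `F measure`.\n", "\n", "> `augment()` adds column(s) for predictions to the given data.\n" ] }, { "cell_type": "code", "metadata": { "id": "81wiqcwuhrnq" }, "source": [ "# Train a linear SVC model\n", "svc_linear_fit <- svc_linear_wf %>%\n", "  fit(data = cuisines_train)\n", "\n", "# Create a metric set\n", "eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)\n", "\n", "\n", "# Make predictions and evaluate model performance\n", "svc_linear_fit %>%\n", "  augment(new_data = cuisines_test) %>%\n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0UFQvHf-huo3" }, "source": [ "#### Support Vector Machine\n", "\n", "The support vector machine (SVM) is an extension of the support vector classifier that accommodates a non-linear boundary between the classes. In essence, SVMs use the *kernel trick* to enlarge the feature space so that it can adapt to non-linear relationships between classes. One popular and extremely flexible kernel function used by SVMs is the *radial basis function*. Let's see how it performs on our data.\n" ] }, { "cell_type": "code", "metadata": { "id": "-KX4S8mzhzmp" }, "source": [ "set.seed(2056)\n", "\n", "# Make an RBF SVM 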
specification\n", "svm_rbf_spec <- svm_rbf() %>%\n", "  set_engine(\"kernlab\") %>%\n", "  set_mode(\"classification\")\n", "\n", "# Bundle specification and recipe into a workflow\n", "svm_rbf_wf <- workflow() %>%\n", "  add_recipe(cuisines_recipe) %>%\n", "  add_model(svm_rbf_spec)\n", "\n", "\n", "# Train an RBF model\n", "svm_rbf_fit <- svm_rbf_wf %>%\n", "  fit(data = cuisines_train)\n", "\n", "\n", "# Make predictions and evaluate model performance\n", "svm_rbf_fit %>%\n", "  augment(new_data = cuisines_test) %>%\n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "QBFSa7WSh4HQ" }, "source": [ "Awesome 🤩!\n", "\n", "> ✅ Please see:\n", ">\n", "> - [*Support Vector Machines*](https://bradleyboehmke.github.io/HOML/svm.html), Hands-on Machine Learning with R\n", ">\n", "> - [*Support Vector Machines*](https://www.statlearning.com/), An Introduction to Statistical Learning with Applications in R\n", ">\n", "> for further reading.\n", "\n", "### Nearest Neighbor classifiers\n", "\n", "*K*-nearest neighbor (KNN) is an algorithm in which each observation is predicted based on its *similarity* to other observations.\n", "\n", "Let's fit one to our data.\n" ] }, { "cell_type": "code", "metadata": { "id": "k4BxxBcdh9Ka" }, "source": [ "# Make a KNN specification\n", "knn_spec <- nearest_neighbor() %>%\n", "  set_engine(\"kknn\") %>%\n", "  set_mode(\"classification\")\n", "\n", "# Bundle recipe and model specification into a workflow\n", "knn_wf <- workflow() %>%\n", "  add_recipe(cuisines_recipe) %>%\n", "  add_model(knn_spec)\n", "\n", "# Train a KNN model\n", "knn_wf_fit <- knn_wf %>%\n", "  fit(data = cuisines_train)\n", "\n", "\n", "# Make predictions and evaluate model performance\n", "knn_wf_fit %>%\n", "  augment(new_data = cuisines_test) %>%\n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "HaegQseriAcj" }, "source": [ "It appears that this model is not performing that well. Changing the model's arguments (see `help(\"nearest_neighbor\")`) may improve its performance. Be sure to try it out.\n", "\n", "> ✅ Please see:\n", ">\n", "> - [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)\n", ">\n", "> - [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)\n", ">\n", "> to learn more about *K*-nearest neighbor classifiers.\n", "\n", "### Ensemble classifiers\n", "\n", "Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by:\n", "\n", "`bagging`: applying an *averaging function* to a collection of base models\n", "\n", "`boosting`: building a sequence of models that build on one another to improve predictive performance.\n", "\n", "Let's start by trying out a random forest model, which builds a large collection of decision trees and then applies an averaging function to them for a better overall model.\n" ] }, { "cell_type": "code", "metadata": { "id": "49DPoVs6iK1M" }, "source": [ "# Make a random forest specification\n", "rf_spec <- rand_forest() %>%\n", "  set_engine(\"ranger\") %>%\n", "  set_mode(\"classification\")\n", "\n", "# Bundle recipe and model specification into a workflow\n", "rf_wf <- workflow() %>%\n", "  add_recipe(cuisines_recipe) %>%\n", "  add_model(rf_spec)\n", "\n", "# Train a random forest model\n", "rf_wf_fit <- rf_wf %>%\n", "  fit(data = cuisines_train)\n", "\n", "\n", "# Make predictions and evaluate model performance\n", "rf_wf_fit %>%\n", "  augment(new_data = cuisines_test) %>%\n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "RGVYwC_aiUWc" }, "source": [ "Good job 👏!\n", "\n", "Let's also experiment with a boosted tree model.\n", "\n", "A boosted tree is an ensemble method that creates a series of sequential decision trees, where each tree depends on the results of the previous trees in an attempt to incrementally reduce the error. It focuses on the weights of incorrectly classified items and adjusts the fit of the next classifier to correct them.\n", "\n", "There are different ways to fit this model (see `help(\"boost_tree\")`). In this example, we'll fit boosted trees via the `xgboost` engine.\n" ] }, { "cell_type": "code", "metadata": { "id": "Py1YWo-micWs" }, "source": [ "# Make a boosted tree specification\n", "boost_spec <- boost_tree(trees = 200) %>%\n", "  set_engine(\"xgboost\") %>%\n", "  set_mode(\"classification\")\n", "\n", "# Bundle recipe and model specification into a workflow\n", "boost_wf <- workflow() %>%\n", "  add_recipe(cuisines_recipe) %>%\n", "  add_model(boost_spec)\n", "\n", "# Train a boosted tree model\n", "boost_wf_fit <- boost_wf %>%\n", "  fit(data = cuisines_train)\n", "\n", "\n", "# Make 
predictions and evaluate model performance\n", "boost_wf_fit %>%\n", "  augment(new_data = cuisines_test) %>%\n", "  eval_metrics(truth = cuisine, estimate = .pred_class)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zNQnbuejigZM" }, "source": [ "> ✅ Please see:\n", ">\n", "> - [Machine Learning for Social Scientists](https://cimentadaj.github.io/ml_socsci/tree-based-methods.html#random-forests)\n", ">\n", "> - [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)\n", ">\n", "> - [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)\n", ">\n", "> - - Explores the AdaBoost model, which is a good alternative to xgboost.\n", ">\n", "> to learn more about ensemble classifiers.\n", "\n", "## 4. Extra - comparing multiple models\n", "\n", "We have fitted quite a number of models in this lab 🙌. It can become tedious or onerous to create many workflows from different sets of preprocessors and/or model specifications and then calculate the performance metrics one by one.\n", "\n", "Let's see if we can address this by creating a function that fits a list of workflows on the training set and then returns the performance metrics based on the test set. We'll use `map()` and `map_dfr()` from the [purrr](https://purrr.tidyverse.org/) package to apply functions to each element of a list.\n", "\n", "> The [`map()`](https://purrr.tidyverse.org/reference/map.html) functions allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the [`map()`](https://purrr.tidyverse.org/reference/map.html) functions is the [iteration chapter](http://r4ds.had.co.nz/iteration.html) in R for Data Science.\n" ] }, { "cell_type": "code", "metadata": { "id": "Qzb7LyZnimd2" }, "source": [ "set.seed(2056)\n", "\n", "# Create a metric set\n", "eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)\n", "\n", "# Define a function that returns performance metrics\n", "compare_models <- function(workflow_list, train_set, test_set){\n", "\n", "  suppressWarnings(\n", "    # Fit each model to the train_set\n", "    map(workflow_list, fit, data = train_set) %>%\n", "      # Make predictions on the test set\n", "      map_dfr(augment, new_data = test_set, .id = \"model\") %>%\n", "      # Select desired columns\n", "      select(model, cuisine, .pred_class) %>%\n", "      # Evaluate model performance\n", "      group_by(model) %>%\n", "      eval_metrics(truth = cuisine, estimate = .pred_class) %>%\n", "      ungroup()\n", "  )\n", "\n", "} # End of function" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Fwa712sNisDA" }, "source": [] }, { "cell_type": "code", "metadata": { "id": "3i4VJOi2iu-a" }, "source": [ "# Make a list of workflows\n", "workflow_list <- list(\n", "  \"svc\" = svc_linear_wf,\n", "  \"svm\" = svm_rbf_wf,\n", "  \"knn\" = knn_wf,\n", "  \"random_forest\" = rf_wf,\n", "  \"xgboost\" = boost_wf)\n", "\n", "# Call the function\n", "set.seed(2056)\n", "perf_metrics <- compare_models(workflow_list = workflow_list, train_set = cuisines_train, test_set = cuisines_test)\n", "\n", "# Print out performance metrics\n", "perf_metrics %>%\n", "  group_by(.metric) %>%\n", "  arrange(desc(.estimate)) %>%\n", "  slice_head(n = 7)\n", "\n", "# Compare accuracy\n", "perf_metrics %>%\n", "  filter(.metric == \"accuracy\") %>%\n", "  arrange(desc(.estimate))\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "KuWK_lEli4nW" }, "source": [ "The [**workflowsets**](https://workflowsets.tidymodels.org/) package allows users to create and easily fit a large number of models, but it is mostly designed to work with resampling techniques such as `cross-validation`, an approach we have yet to cover.\n", "\n", "## **🚀Challenge**\n", "\n", "Each of these techniques has a large number of parameters that you can tweak, for instance `cost` in SVMs, `neighbors` in KNN and `mtry` (randomly selected predictors) in random forest.\n", "\n", "Research each model's default parameters and think about what tweaking these parameters would mean for the model's quality.\n", "\n", "To find out more about a particular model and its parameters, use `help(\"model\")`, e.g. `help(\"rand_forest\")`\n", "\n", "> In practice, we usually *estimate* the *best values* for these parameters by training many models on a `simulated data set` and measuring how well all these models perform. This process is called **tuning**.\n", "\n", "### [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/24/)\n", "\n", "### **Review & Self Study**\n", "\n", "There's a lot of jargon in these lessons, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful terminology!\n", "\n", "#### Thank you to:\n", "\n", "[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations in her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n", "\n", "[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n", "\n", "Happy Learning,\n", "\n", "[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n", "\n", "

\n", " \n", "

Artwork by @allison_horst
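\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**Disclaimer**:  \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The document in its original language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n" ] } ] }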