{
"cells": [
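{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build a logistic regression model - Lesson 4\n",
"\n",
"![Logistic vs. linear regression infographic](../../../../../../translated_images/linear-vs-logistic.ba180bf95e7ee66721ba10ebf2dac2666acbd64a88b003c83928712433a13c7d.zh.png)\n",
"\n",
"#### **[Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n",
"\n",
"#### Introduction\n",
"\n",
"In this final lesson on regression, we will look at a classic machine learning technique: logistic regression. You can use this technique to discover patterns that predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- Techniques for logistic regression\n",
"\n",
"✅ Deepen your understanding of working with this type of regression in this [Learn module](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott).\n",
"\n",
"## Prerequisite\n",
"\n",
"Having worked with the pumpkin data, we are now familiar enough with it to realize that there is one binary category we can work with: `Color`.\n",
"\n",
"Let's build a logistic regression model to predict, given some variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).\n",
"\n",
"> Why are we talking about binary classification in a lesson group about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. You will learn about other ways to classify data in the next lesson group.\n",
"\n",
"For this lesson, we need the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages) for modeling and machine learning.\n",
"\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n",
"\n",
"- `ggbeeswarm`: The [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) provides methods to create beeswarm-style plots using ggplot2.\n",
"\n",
"You can install them with:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n",
"\n",
"Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n"
]
},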
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n"
]
},
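{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Define the question**\n",
"\n",
"For our purposes, we will express the question as a binary: 'White' or 'Not White'. There is also a 'striped' category in our dataset, but it has very few instances, so we will not use it. It disappears once we remove null values from the dataset anyway.\n",
"\n",
"> 🎃 Fun fact: we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones, but they look cool! So we could also reformulate our question as: 'Ghost' or 'Not Ghost'. 👻\n",
"\n",
"## **About logistic regression**\n",
"\n",
"Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\n",
"\n",
"#### **Binary classification**\n",
"\n",
"Logistic regression does not offer the same features as linear regression. The former predicts a `binary category` (for example, 'orange or not orange'), whereas the latter predicts `continuous values`: for example, given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n",
"\n",
"![Infographic by Dasani Madipalli](../../../../../../translated_images/pumpkin-classifier.562771f104ad5436b87d1c67bca02a42a17841133556559325c0a0e348e5b774.zh.png)\n",
"\n",
"### Other classifications\n",
"\n",
"There are other types of logistic regression, including multinomial and ordinal:\n",
"\n",
"- **Multinomial**, which involves more than two categories, for example 'Orange, White, and Striped'.\n",
"\n",
"- **Ordinal**, which involves ordered categories. This is useful when the outcomes have a logical order, like our pumpkins, which are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).\n",
"\n",
"![Multinomial vs ordinal regression](../../../../../../translated_images/multinomial-vs-ordinal.36701b4850e37d86c9dd49f7bef93a2f94dbdb8fe03443eb68f0542f97f28f29.zh.png)\n",
"\n",
"#### **Variables DO NOT have to correlate**\n",
"\n",
"Remember how linear regression worked better with more strongly correlated variables? Logistic regression is the opposite: the variables do not have to be correlated. That suits this data, which has fairly weak correlations.\n",
"\n",
"#### **You need a lot of clean data**\n",
"\n",
"Logistic regression gives more accurate results when you use more data. Our dataset is small, so it is not optimal for this task; keep that in mind.\n",
"\n",
"✅ Think about the types of data that would lend themselves well to logistic regression.\n",
"\n",
"## Exercise - tidy the data\n",
"\n",
"First, clean the data a bit, dropping null values and selecting only some of the columns:\n",
"\n",
"1. Add the following code:\n"
]
},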
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Load the core tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the data and clean column names\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n",
" clean_names()\n",
"\n",
"# Select desired columns\n",
"pumpkins_select <- pumpkins %>% \n",
" select(c(city_name, package, variety, origin, item_size, color)) \n",
"\n",
"# Drop rows containing missing values and encode color as factor (category)\n",
"pumpkins_select <- pumpkins_select %>% \n",
" drop_na() %>% \n",
" mutate(color = factor(color))\n",
"\n",
"# View the first few rows\n",
"pumpkins_select %>% \n",
" slice_head(n = 5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can always take a quick look at the new data frame with the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function, as shown below:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"pumpkins_select %>% \n",
" glimpse()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's confirm that we are actually dealing with a binary classification problem:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Subset distinct observations in outcome column\n",
"pumpkins_select %>% \n",
" distinct(color)\n"
]
},
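{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualization - categorical plot\n",
"By now you have loaded the pumpkin data once again and cleaned it so that it keeps a few variables, including Color. Let's visualize the data frame in the notebook using the ggplot library.\n",
"\n",
"The ggplot library offers some neat ways to visualize your data. For example, you can compare the distributions of the data for each Variety and Color in a categorical plot.\n",
"\n",
"1. Create such a plot with the `geom_bar()` function, using our pumpkin data and specifying a color mapping for each pumpkin category (orange or white):\n"
]
},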
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Specify colors for each value of the hue variable\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# Create the bar plot\n",
"ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n",
" geom_bar(position = \"dodge\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(y = \"Variety\", fill = \"Color\") +\n",
" theme_minimal()"
]
},
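{
"cell_type": "markdown",
"metadata": {},
"source": [
"By observing the data, you can see how the Color data relates to Variety.\n",
"\n",
"✅ Given this categorical plot, what are some interesting explorations you can envision?\n"
]
},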
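{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data pre-processing: feature encoding\n",
"\n",
"All columns of our pumpkins dataset contain string values. Working with categorical data is intuitive for humans but not for machines. Machine learning algorithms work better with numbers. That is why encoding is a very important step in the data pre-processing phase: it lets us turn categorical data into numerical data without losing any information. Good encoding helps us build a good model.\n",
"\n",
"For feature encoding, there are two main types of encoders:\n",
"\n",
"1. **Ordinal encoder**: suits ordinal variables, which are categorical variables whose values follow a logical order, like the `item_size` column in our dataset. It creates a mapping in which each category is represented by a number, namely the position of the category in that order.\n",
"\n",
"2. **Categorical encoder**: suits nominal variables, which are categorical variables whose values do not follow a logical order, like all the features other than `item_size` in our dataset. It is a one-hot encoding, which means that each category is represented by a binary column: the encoded variable equals 1 if the pumpkin belongs to that category and 0 otherwise.\n",
"\n",
"Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/), a package for preprocessing data. We'll define a `recipe` that specifies that all predictor columns should be encoded as a set of integers, `prep` it to estimate the quantities and statistics required by any operations, and finally `bake` it to apply the computations to new data.\n",
"\n",
"> Normally, recipes is used as a preprocessor for modelling, where it defines which steps should be applied to a data set to get it ready for modelling. In that case it is **highly recommended** to use a `workflow()` instead of manually estimating a recipe with prep and bake. We'll cover this in a moment.\n",
">\n",
"> For now, however, we use recipes + prep + bake to specify which steps should be applied to the data set to get it ready for data analysis, and then extract the preprocessed data with those steps applied.\n"
]
},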
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Preprocess and extract data to allow some data analysis\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n",
"  # Define ordering for item_size column\n",
"  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
"  # Convert factors to numbers using the order defined above (Ordinal encoding)\n",
"  step_integer(item_size, zero_based = FALSE) %>%\n",
"  # Encode all other predictors using one hot encoding\n",
"  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n",
"  # Estimate the recipe on the selected data (prep() takes `training`, not `data`)\n",
"  prep(training = pumpkins_select) %>%\n",
"  # new_data = NULL returns the preprocessed training data\n",
"  bake(new_data = NULL)\n",
"\n",
"# Display the first few rows of preprocessed data\n",
"baked_pumpkins %>% \n",
"  slice_head(n = 5)\n"
]
},
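{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the two encodings do in isolation, here is a minimal base R sketch. It is not part of the original lesson: the `toy` data frame, the shortened list of size levels and the variable names are made up for illustration, and it uses `factor()` and `model.matrix()` rather than the recipes steps above.\n",
"\n",
"```r\n",
"# Toy data (made up for illustration; not the pumpkin dataset)\n",
"toy <- data.frame(\n",
"  item_size = c(\"sml\", \"lge\", \"med\", \"sml\"),\n",
"  variety   = c(\"PIE TYPE\", \"MINIATURE\", \"PIE TYPE\", \"HOWDEN TYPE\")\n",
")\n",
"\n",
"# Ordinal encoding: declare the order, then take the integer position\n",
"size_levels <- c(\"sml\", \"med\", \"lge\")\n",
"toy$item_size_int <- as.integer(factor(toy$item_size, levels = size_levels, ordered = TRUE))\n",
"\n",
"# One-hot encoding: one 0/1 column per category (no intercept, so no level is dropped)\n",
"variety_onehot <- model.matrix(~ variety - 1, data = toy)\n",
"\n",
"toy$item_size_int\n",
"colnames(variety_onehot)\n",
"```\n",
"\n",
"Each row of `variety_onehot` contains exactly one 1, while `item_size_int` keeps the size order in a single numeric column.\n"
]
},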
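{
"cell_type": "markdown",
"metadata": {},
"source": [
"✅ What are the advantages of using an ordinal encoder for the Item Size column?\n",
"\n",
"### Analyse relationships between variables\n",
"\n",
"Now that we have pre-processed our data, we can analyse the relationships between the features and the label to get an idea of how well the model will be able to predict the label given the features. The best way to perform this kind of analysis is to plot the data. \n",
"We'll use ggplot's `geom_boxplot()` function to show the relationships between Item Size, Variety and Color in a categorical plot. To plot the data better, we'll use the encoded Item Size column and the unencoded Variety column.\n"
]
},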
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Define the color palette\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# We need the encoded Item Size column to use it as the x-axis values in the plot\n",
"pumpkins_select_plot<-pumpkins_select\n",
"pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n",
"\n",
"# Create the grouped box plot\n",
"ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +\n",
" geom_boxplot() +\n",
" facet_grid(variety ~ ., scales = \"free_x\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(x = \"Item Size\", y = \"\") +\n",
" theme_minimal() +\n",
" theme(strip.text = element_text(size = 12)) +\n",
" theme(axis.text.x = element_text(size = 10)) +\n",
" theme(axis.title.x = element_text(size = 12)) +\n",
" theme(axis.title.y = element_blank()) +\n",
" theme(legend.position = \"bottom\") +\n",
" guides(fill = guide_legend(title = \"Color\")) +\n",
" theme(panel.spacing = unit(0.5, \"lines\"))+\n",
" theme(strip.text.y = element_text(size = 4, hjust = 0)) \n"
]
},
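{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Use a swarm plot\n",
"\n",
"Since Color is a binary category (White or Not), it needs '[a specialized approach](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)' to visualization.\n",
"\n",
"Try a `swarm plot` to show the distribution of color with respect to item_size.\n",
"\n",
"We'll use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm), which provides methods to create beeswarm-style plots using ggplot2. Beeswarm plots are a way of plotting points that would ordinarily overlap so that they fall next to each other instead.\n"
]
},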
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create beeswarm plots of color and item_size\n",
"baked_pumpkins %>% \n",
" mutate(color = factor(color)) %>% \n",
" ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n",
" geom_quasirandom() +\n",
" scale_color_brewer(palette = \"Dark2\", direction = -1) +\n",
" theme(legend.position = \"none\")\n"
]
},
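{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n",
"\n",
"## Build your model\n",
"\n",
"Select the variables you want to use in your classification model and split the data into training and test sets. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:\n"
]
},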
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Split data into 80% for training and 20% for testing\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>% \n",
" slice_head(n = 5)\n"
]
},
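{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split above is a plain random split. As an optional variation that the lesson itself does not use, `initial_split()` also accepts a `strata` argument, which can help when one class is much rarer than the other. The sketch below runs on made-up data; `toy` and `toy_split` are hypothetical names.\n",
"\n",
"```r\n",
"library(rsample)\n",
"\n",
"# Made-up imbalanced data: 80 ORANGE rows and 20 WHITE rows\n",
"set.seed(2056)\n",
"toy <- data.frame(\n",
"  color = factor(rep(c(\"ORANGE\", \"WHITE\"), times = c(80, 20))),\n",
"  x     = rnorm(100)\n",
")\n",
"\n",
"# strata = color samples within each class, keeping proportions similar in both splits\n",
"toy_split <- initial_split(toy, prop = 0.8, strata = color)\n",
"\n",
"table(training(toy_split)$color)\n",
"table(testing(toy_split)$color)\n",
"```\n",
"\n",
"Whether stratification changes the results for the pumpkin data is not something this lesson evaluates; treat it as something to experiment with.\n"
]
},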
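{
"cell_type": "markdown",
"metadata": {},
"source": [
"🙌 We are now ready to train a model by fitting the training features to the training label (color).\n",
"\n",
"We'll begin by creating a recipe that specifies the preprocessing steps to carry out on our data to get it ready for modelling, i.e. encoding categorical variables as a set of integers. Just like `baked_pumpkins`, we create a `pumpkins_recipe`, but we do not `prep` and `bake` it right away, since those steps will be bundled into a workflow, as you will see shortly.\n",
"\n",
"There are quite a number of ways to specify a logistic regression model in Tidymodels. See `?logistic_reg()`. For now, we'll specify a logistic regression model via the default `stats::glm()` engine.\n"
]
},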
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n",
"  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
"  step_integer(item_size, zero_based = FALSE) %>% \n",
"  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n",
"\n",
"# Create a logistic model specification\n",
"log_reg <- logistic_reg() %>% \n",
"  set_engine(\"glm\") %>% \n",
"  set_mode(\"classification\")\n"
]
},
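{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a recipe and a model specification, we need a way to bundle them into a single object. That object will first preprocess the data (prep + bake behind the scenes), then fit the model on the preprocessed data, and also allow for potential post-processing activities.\n",
"\n",
"In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/), and it holds your modeling components.\n"
]
},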
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Bundle modelling components in a workflow\n",
"log_reg_wf <- workflow() %>% \n",
" add_recipe(pumpkins_recipe) %>% \n",
" add_model(log_reg)\n",
"\n",
"# Print out the workflow\n",
"log_reg_wf\n"
]
},
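{
"cell_type": "markdown",
"metadata": {},
"source": [
"After a workflow has been *specified*, the model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow estimates the recipe and preprocesses the data before training, so we do not have to do that manually with prep and bake.\n"
]
},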
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Train the model\n",
"wf_fit <- log_reg_wf %>% \n",
" fit(data = pumpkins_train)\n",
"\n",
"# Print the trained workflow\n",
"wf_fit\n"
]
},
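{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model printout shows the coefficients learned during training.\n",
"\n",
"Now that we have trained the model on the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the predicted probability of `WHITE` is greater than 0.5, the predicted class is `WHITE`; otherwise it is `ORANGE`.\n"
]
},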
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make predictions for color and corresponding probabilities\n",
"results <- pumpkins_test %>% select(color) %>% \n",
" bind_cols(wf_fit %>% \n",
" predict(new_data = pumpkins_test)) %>%\n",
" bind_cols(wf_fit %>%\n",
" predict(new_data = pumpkins_test, type = \"prob\"))\n",
"\n",
"# Compare predictions\n",
"results %>% \n",
" slice_head(n = 10)\n"
]
},
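{
"cell_type": "markdown",
"metadata": {},
"source": [
"The step from learned coefficients to a predicted class can be sketched by hand. The intercept, slope and feature values below are hypothetical and are not the values fitted above; the sketch only illustrates the logistic function followed by the 0.5 threshold.\n",
"\n",
"```r\n",
"# Hypothetical coefficients and feature values, chosen only for illustration\n",
"intercept <- -2.0\n",
"slope     <- 0.8\n",
"item_size <- c(1, 2, 3, 4, 5)\n",
"\n",
"# Linear predictor (log-odds); the logistic function maps it into (0, 1)\n",
"log_odds <- intercept + slope * item_size\n",
"p_white  <- plogis(log_odds)  # same as 1 / (1 + exp(-log_odds))\n",
"\n",
"# Apply the 0.5 threshold to turn probabilities into class labels\n",
"pred_class <- ifelse(p_white > 0.5, \"WHITE\", \"ORANGE\")\n",
"\n",
"data.frame(item_size, p_white = round(p_white, 3), pred_class)\n",
"```\n",
"\n",
"A probability above 0.5 corresponds to positive log-odds, so the threshold is equivalent to checking the sign of the linear predictor.\n"
]
},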
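{
"cell_type": "markdown",
"metadata": {},
"source": [
"Very nice! This gives us more insight into how logistic regression works.\n",
"\n",
"### Better comprehension via a confusion matrix\n",
"\n",
"Comparing each prediction with its corresponding 'ground truth' value is not a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/), a package for measuring the effectiveness of models using performance metrics.\n",
"\n",
"One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs. It tabulates how many examples in each class were correctly classified by the model. In our case, it shows how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; it also shows how many were classified into the **wrong** category.\n",
"\n",
"The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes.\n"
]
},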
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Confusion matrix for prediction results\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)\n"
]
},
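{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's interpret the confusion matrix. Our model is asked to classify pumpkins into two binary categories: category `white` and category `not-white`.\n",
"\n",
"- If your model predicts a pumpkin as white and it belongs to category 'white' in reality, we call it a `true positive`, shown by the top left number.\n",
"\n",
"- If your model predicts a pumpkin as not white and it belongs to category 'white' in reality, we call it a `false negative`, shown by the bottom left number.\n",
"\n",
"- If your model predicts a pumpkin as white and it belongs to category 'not-white' in reality, we call it a `false positive`, shown by the top right number.\n",
"\n",
"- If your model predicts a pumpkin as not white and it belongs to category 'not-white' in reality, we call it a `true negative`, shown by the bottom right number.\n",
"\n",
"| | Truth: WHITE | Truth: ORANGE |\n",
"|---------------|:------:|:-----:|\n",
"| **Predicted: WHITE** | TP | FP |\n",
"| **Predicted: ORANGE** | FN | TN |\n",
"\n",
"> Note: the layout above takes `WHITE` as the positive class for illustration. By default, yardstick treats the first factor level as the positive class, and the levels of `color` are sorted alphabetically, so `conf_mat()` lists `ORANGE` first and the metric functions treat `ORANGE` as the positive class unless you set the `event_level` argument to `\"second\"`.\n",
"\n",
"As you might have guessed, it is preferable to have a larger number of `true positives` and `true negatives` and a lower number of `false positives` and `false negatives`, which implies that the model performs better.\n",
"\n",
"The confusion matrix is helpful because it gives rise to other metrics that help us better evaluate the performance of a classification model. Let's go through some of them:\n",
"\n",
"🎓 Precision: `TP/(TP + FP)`, defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\").\n",
"\n",
"🎓 Recall: `TP/(TP + FN)`, defined as the proportion of actual positives that are correctly predicted as positive. Also known as `sensitivity`.\n",
"\n",
"🎓 Specificity: `TN/(TN + FP)`, defined as the proportion of actual negatives that are correctly predicted as negative.\n",
"\n",
"🎓 Accuracy: `(TP + TN)/(TP + TN + FP + FN)`, the percentage of labels predicted correctly for a sample.\n",
"\n",
"🎓 F Measure: a weighted (harmonic) average of precision and recall, with the best value being 1 and the worst being 0.\n",
"\n",
"Let's calculate these metrics!\n"
]
},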
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Combine metric functions and calculate them all at once\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)\n"
]
},
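{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, the same formulas can be evaluated by hand. The counts below are hypothetical and are not the results of the pumpkin model; the F measure is computed as the harmonic mean of precision and recall, which is what `f_meas()` returns with its default `beta = 1`.\n",
"\n",
"```r\n",
"# Hypothetical confusion-matrix counts (not results from the pumpkin model)\n",
"TP <- 8; FP <- 2; FN <- 1; TN <- 29\n",
"\n",
"precision   <- TP / (TP + FP)\n",
"recall      <- TP / (TP + FN)\n",
"specificity <- TN / (TN + FP)\n",
"accuracy    <- (TP + TN) / (TP + TN + FP + FN)\n",
"f1          <- 2 * precision * recall / (precision + recall)\n",
"\n",
"round(c(precision = precision, recall = recall, specificity = specificity,\n",
"        accuracy = accuracy, f1 = f1), 3)\n",
"```\n",
"\n",
"Note the parentheses around `TP + TN` in the accuracy formula; without them only `TN` would be divided by the total.\n"
]
},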
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the ROC curve of this model\n",
"\n",
"Let's do one more visualization to see the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make a roc_curve\n",
"results %>% \n",
" roc_curve(color, .pred_ORANGE) %>% \n",
" autoplot()\n"
]
},
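{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC curves are often used to view the output of a classifier in terms of its true vs. false positives. They typically show the `True Positive Rate` (sensitivity) on the Y axis and the `False Positive Rate` (1 - specificity) on the X axis. The steepness of the curve and the space between the curve and the diagonal line therefore matter: you want a curve that quickly heads up and over the line. In our case, there are some false positives to start with, and then the curve heads up and over properly.\n",
"\n",
"Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the Curve (AUC). One way of interpreting AUC is as the probability that the model ranks a random positive example higher than a random negative example.\n"
]
},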
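{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before computing it with yardstick, the ranking interpretation of AUC can be checked by hand on a tiny example. The scores below are made up, and `pos_scores`, `neg_scores` and `auc_by_ranking` are hypothetical names; this is a sketch of the interpretation, not of how `roc_auc()` is implemented.\n",
"\n",
"```r\n",
"# Made-up predicted scores for examples of the positive and the negative class\n",
"pos_scores <- c(0.9, 0.8, 0.4)\n",
"neg_scores <- c(0.7, 0.3, 0.2)\n",
"\n",
"# Compare every positive with every negative; ties count as half a win\n",
"wins <- outer(pos_scores, neg_scores, \">\")\n",
"ties <- outer(pos_scores, neg_scores, \"==\")\n",
"auc_by_ranking <- mean(wins + 0.5 * ties)\n",
"\n",
"auc_by_ranking\n",
"```\n",
"\n",
"Here 8 of the 9 positive-negative pairs are ranked correctly, so the value is 8/9.\n"
]
},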
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Calculate area under curve\n",
"results %>% \n",
" roc_auc(color, .pred_ORANGE)\n"
]
},
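{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is around `0.975`. Since the AUC ranges from 0 to 1, you want a high score: a model that is 100% correct in its predictions has an AUC of 1. In this case, the model is *pretty good*.\n",
"\n",
"In future lessons on classification, you will learn how to improve your model's scores (for example, by dealing with imbalanced data in this case).\n",
"\n",
"## 🚀Challenge\n",
"\n",
"There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.\n",
"\n",
"## Review & Self Study\n",
"\n",
"Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about which tasks are better suited to each of the regression types we have studied so far. What would work best?\n"
]
},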
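{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer** \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The document in its original language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"
]
}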
],
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"coopTranslator": {
"original_hash": "feaf125f481a89c468fa115bf2aed580",
"translation_date": "2025-09-03T19:36:22+00:00",
"source_file": "2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb",
"language_code": "zh"
}
},
"nbformat": 4,
"nbformat_minor": 1
}