ML-For-Beginners/translations/zh/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb

{
 "cells": [
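  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build a logistic regression model - Lesson 4\n",
    "\n",
    "![Logistic vs. linear regression infographic](../../../../../../translated_images/linear-vs-logistic.ba180bf95e7ee66721ba10ebf2dac2666acbd64a88b003c83928712433a13c7d.zh.png)\n",
    "\n",
    "#### **[Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n",
    "\n",
    "#### Introduction\n",
    "\n",
    "In this final lesson on regression, we will learn a classic machine learning technique: logistic regression. You can use it to discover patterns that predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n",
    "\n",
    "In this lesson, you will learn:\n",
    "\n",
    "-   Techniques for logistic regression\n",
    "\n",
    "✅ Deepen your understanding of this type of regression in this [Learn module](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott).\n",
    "\n",
    "## Prerequisite\n",
    "\n",
    "Having worked with the pumpkin data in earlier lessons, we are now familiar enough with it to notice that it contains one binary category we can work with: `Color`.\n",
    "\n",
    "Let's build a logistic regression model to predict, given some variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).\n",
    "\n",
    "> Why are we talking about binary classification in a lesson group about regression? Only for linguistic convenience: logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. You will learn about other ways to classify data in the next lesson group.\n",
    "\n",
    "For this lesson, we need the following packages:\n",
    "\n",
    "-   `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more fun!\n",
    "\n",
    "-   `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages) for modelling and machine learning.\n",
    "\n",
    "-   `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n",
    "\n",
    "-   `ggbeeswarm`: The [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) provides methods to create beeswarm-style plots using ggplot2.\n",
    "\n",
    "You can install them with:\n",
    "\n",
    "`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n",
    "\n",
    "Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n"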
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n",
    "\n",
    "pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n"
   ]
  },
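  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **Define the question**\n",
    "\n",
    "For our purposes, we frame the problem as a binary one: \"White\" or \"Not White\". There is also a \"striped\" category in our dataset, but it has few instances, so we will not use it. It disappears once we remove null values from the dataset anyway.\n",
    "\n",
    "> 🎃 Fun fact: we sometimes call white pumpkins \"ghost\" pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones, but they look cool! So we could also rephrase our question as: \"Ghost\" or \"Not Ghost\". 👻\n",
    "\n",
    "## **About logistic regression**\n",
    "\n",
    "Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\n",
    "\n",
    "#### **Binary classification**\n",
    "\n",
    "Logistic regression does not offer the same features as linear regression. The former predicts a `binary category` (for example \"orange or not orange\"), whereas the latter predicts `continuous values`, for example, given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n",
    "\n",
    "![Infographic by Dasani Madipalli](../../../../../../translated_images/pumpkin-classifier.562771f104ad5436b87d1c67bca02a42a17841133556559325c0a0e348e5b774.zh.png)\n",
    "\n",
    "### Other classifications\n",
    "\n",
    "There are other types of logistic regression, including multinomial and ordinal:\n",
    "\n",
    "- **Multinomial**, which involves more than two categories, for example \"orange, white, and striped\".\n",
    "\n",
    "- **Ordinal**, which involves ordered categories. This is useful when we want to order outcomes logically, like pumpkins ordered by a finite number of sizes (mini, small, medium, large, extra-large, extra-extra-large).\n",
    "\n",
    "![Multinomial vs. ordinal regression](../../../../../../translated_images/multinomial-vs-ordinal.36701b4850e37d86c9dd49f7bef93a2f94dbdb8fe03443eb68f0542f97f28f29.zh.png)\n",
    "\n",
    "#### **Variables do not have to correlate**\n",
    "\n",
    "Remember how linear regression worked better with more strongly correlated variables? Logistic regression is the opposite: the variables do not have to align. That suits this data, which has somewhat weak correlations.\n",
    "\n",
    "#### **You need a lot of clean data**\n",
    "\n",
    "Logistic regression gives more accurate results when you use more data. Our dataset is small, so it is not optimal for this task; keep that in mind.\n",
    "\n",
    "✅ Think about the types of data that would lend themselves well to logistic regression.\n",
    "\n",
    "## Exercise - tidy the data\n",
    "\n",
    "First, clean the data a bit by dropping null values and selecting only some of the columns:\n",
    "\n",
    "1. Add the following code:\n"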
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Load the core tidyverse packages\n",
    "library(tidyverse)\n",
    "\n",
    "# Import the data and clean column names\n",
    "pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n",
    "  clean_names()\n",
    "\n",
    "# Select desired columns\n",
    "pumpkins_select <- pumpkins %>% \n",
    "  select(c(city_name, package, variety, origin, item_size, color)) \n",
    "\n",
    "# Drop rows containing missing values and encode color as factor (category)\n",
    "pumpkins_select <- pumpkins_select %>% \n",
    "  drop_na() %>% \n",
    "  mutate(color = factor(color))\n",
    "\n",
    "# View the first few rows\n",
    "pumpkins_select %>% \n",
    "  slice_head(n = 5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can always take a quick look at the new data frame using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function, as shown below:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "pumpkins_select %>% \n",
    "  glimpse()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's confirm that we are actually dealing with a binary classification problem:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Subset distinct observations in outcome column\n",
    "pumpkins_select %>% \n",
    "  distinct(color)\n"
   ]
  },
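  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualization - categorical plot\n",
    "By now you have loaded the pumpkin data once again and cleaned it to keep a dataset with a few variables, including Color. Let's visualize the data frame in the notebook using the ggplot2 library.\n",
    "\n",
    "ggplot2 offers some neat ways to visualize your data. For example, you can compare the distribution of the data for each Variety and Color in a categorical plot.\n",
    "\n",
    "1. Create such a plot with the `geom_bar()` function, using our pumpkin data and specifying a color mapping for each pumpkin category (orange or white):\n"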
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Specify colors for each value of the hue variable\n",
    "palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
    "\n",
    "# Create the bar plot\n",
    "ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n",
    "  geom_bar(position = \"dodge\") +\n",
    "  scale_fill_manual(values = palette) +\n",
    "  labs(y = \"Variety\", fill = \"Color\") +\n",
    "  theme_minimal()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By observing the data, you can see how Color relates to Variety.\n",
    "\n",
    "✅ Given this categorical plot, what interesting explorations can you envision?\n"
   ]
  },
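  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data pre-processing: feature encoding\n",
    "\n",
    "Every column in our pumpkin dataset contains string values. Working with categorical data is intuitive for humans but not for machines. Machine learning algorithms work better with numbers. That is why encoding is a very important step in the data pre-processing phase: it lets us turn categorical data into numerical data without losing any information. Good encoding helps us build a good model.\n",
    "\n",
    "For feature encoding, there are two main types of encoders:\n",
    "\n",
    "1. **Ordinal encoder**: suits ordinal variables, which are categorical variables whose values follow a logical order, like the `item_size` column in our dataset. It creates a mapping in which each category is represented by a number, namely the position of the category in that order.\n",
    "\n",
    "2. **Categorical encoder**: suits nominal variables, which are categorical variables whose values do not follow a logical order, like all the features other than `item_size` in our dataset. It is a one-hot encoding, which means that each category is represented by a binary column: the encoded variable equals 1 if the pumpkin belongs to that category and 0 otherwise.\n",
    "\n",
    "Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/), a package for pre-processing data. We will define a `recipe` that specifies that all predictor columns should be encoded as numbers, `prep` it to estimate the quantities and statistics required by its operations, and finally `bake` it to apply those computations to new data.\n",
    "\n",
    "> Normally, recipes is used as a pre-processor for modelling, where it defines which steps should be applied to a dataset to get it ready for modelling. In that case it is **highly recommended** that you use a `workflow()` instead of manually estimating a recipe with prep and bake. We will see all of this in a moment.\n",
    ">\n",
    "> For now, however, we use recipes + prep + bake to specify which steps should be applied to the dataset to get it ready for data analysis, and then extract the pre-processed data with those steps applied.\n"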
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Preprocess and extract data to allow some data analysis\n",
    "baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n",
    "  # Define ordering for item_size column\n",
    "  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
    "  # Convert factors to numbers using the order defined above (Ordinal encoding)\n",
    "  step_integer(item_size, zero_based = FALSE) %>%\n",
    "  # Encode all other predictors using one hot encoding\n",
    "  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n",
    "  # Estimate the recipe on the same data it was defined with\n",
    "  prep(training = pumpkins_select) %>%\n",
    "  bake(new_data = NULL)\n",
    "\n",
    "# Display the first few rows of preprocessed data\n",
    "baked_pumpkins %>% \n",
    "  slice_head(n = 5)\n"
   ]
  },
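  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the two encoders described above do in isolation, here is a minimal base R sketch. The `toy_sizes` data frame, its values, and the shortened list of size levels are hypothetical and are not taken from the pumpkin dataset; the recipe steps remain the approach this lesson actually uses.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Hypothetical toy data (not taken from the pumpkin dataset)\n",
    "toy_sizes <- data.frame(\n",
    "  item_size = c('sml', 'lge', 'med', 'lge'),\n",
    "  variety   = c('PIE TYPE', 'MINIATURE', 'PIE TYPE', 'HOWDEN TYPE')\n",
    ")\n",
    "\n",
    "# Ordinal encoding: declare the order, then take the integer codes\n",
    "size_levels <- c('sml', 'med', 'lge')\n",
    "toy_sizes$item_size_encoded <- as.integer(\n",
    "  factor(toy_sizes$item_size, levels = size_levels, ordered = TRUE)\n",
    ")\n",
    "\n",
    "# One-hot encoding: one 0/1 column per category\n",
    "# (dropping the intercept with '- 1' keeps a column for every level)\n",
    "one_hot <- model.matrix(~ variety - 1, data = toy_sizes)\n",
    "\n",
    "toy_sizes$item_size_encoded\n",
    "colnames(one_hot)\n"
   ]
  },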
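  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "✅ What are the advantages of using an ordinal encoder for the Item Size column?\n",
    "\n",
    "### Analyse relationships between variables\n",
    "\n",
    "Now that we have pre-processed our data, we can analyse the relationships between the features and the label to get an idea of how well the model will be able to predict the label given the features. The best way to perform this kind of analysis is to plot the data.  \n",
    "We will use ggplot2 again, this time with the `geom_boxplot()` function, to visualize the relationships between Item Size, Variety, and Color in a categorical plot. To plot the data better, we will use the encoded Item Size column and the unencoded Variety column.\n"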
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Define the color palette\n",
    "palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
    "\n",
    "# We need the encoded Item Size column to use it as the x-axis values in the plot\n",
    "pumpkins_select_plot<-pumpkins_select\n",
    "pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n",
    "\n",
    "# Create the grouped box plot\n",
    "ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +\n",
    "  geom_boxplot() +\n",
    "  facet_grid(variety ~ ., scales = \"free_x\") +\n",
    "  scale_fill_manual(values = palette) +\n",
    "  labs(x = \"Item Size\", y = \"\") +\n",
    "  theme_minimal() +\n",
    "  theme(strip.text = element_text(size = 12)) +\n",
    "  theme(axis.text.x = element_text(size = 10)) +\n",
    "  theme(axis.title.x = element_text(size = 12)) +\n",
    "  theme(axis.title.y = element_blank()) +\n",
    "  theme(legend.position = \"bottom\") +\n",
    "  guides(fill = guide_legend(title = \"Color\")) +\n",
    "  theme(panel.spacing = unit(0.5, \"lines\"))+\n",
    "  theme(strip.text.y = element_text(size = 4, hjust = 0)) \n"
   ]
  },
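  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Use a swarm plot\n",
    "\n",
    "Since Color is a binary category (White or Not), it needs \"a [specialized approach](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)\" to visualization.\n",
    "\n",
    "Try a `swarm plot` to show the distribution of color with respect to item_size.\n",
    "\n",
    "We will use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm), which provides methods to create beeswarm-style plots using ggplot2. A beeswarm plot places points that would ordinarily overlap next to each other instead.\n"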
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Create beeswarm plots of color and item_size\n",
    "baked_pumpkins %>% \n",
    "  mutate(color = factor(color)) %>% \n",
    "  ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n",
    "  geom_quasirandom() +\n",
    "  scale_color_brewer(palette = \"Dark2\", direction = -1) +\n",
    "  theme(legend.position = \"none\")\n"
   ]
  },
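  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n",
    "\n",
    "## Build your model\n",
    "\n",
    "Select the variables you want to use in your classification model and split the data into training and test sets. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:\n"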
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Split data into 80% for training and 20% for testing\n",
    "set.seed(2056)\n",
    "pumpkins_split <- pumpkins_select %>% \n",
    "  initial_split(prop = 0.8)\n",
    "\n",
    "# Extract the data in each split\n",
    "pumpkins_train <- training(pumpkins_split)\n",
    "pumpkins_test <- testing(pumpkins_split)\n",
    "\n",
    "# Print out the first 5 rows of the training set\n",
    "pumpkins_train %>% \n",
    "  slice_head(n = 5)\n"
   ]
  },
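  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what an 80/20 split does, the sketch below draws a random 80% of the row indices in base R. The `toy_rows` data frame is hypothetical, and this is not how rsample is implemented. Because the classes here are imbalanced, it may also be worth trying the `strata` argument of `initial_split()`, which this lesson does not use.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Hypothetical data frame standing in for pumpkins_select\n",
    "set.seed(2056)\n",
    "toy_rows <- data.frame(\n",
    "  id    = 1:10,\n",
    "  color = rep(c('ORANGE', 'WHITE'), times = c(8, 2))\n",
    ")\n",
    "\n",
    "# Randomly pick 80% of the row indices for training\n",
    "train_idx <- sample(nrow(toy_rows), size = floor(0.8 * nrow(toy_rows)))\n",
    "\n",
    "# Rows that were picked go to training; the rest go to testing\n",
    "toy_train <- toy_rows[train_idx, ]\n",
    "toy_test  <- toy_rows[-train_idx, ]\n",
    "\n",
    "c(train = nrow(toy_train), test = nrow(toy_test))\n"
   ]
  },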
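  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🙌 We are now ready to train a model by fitting the training features to the training label (color).\n",
    "\n",
    "We will begin by creating a recipe that specifies the pre-processing steps to carry out on the data to get it ready for modelling, for example encoding categorical variables as numbers. Just as with `baked_pumpkins`, we create a `pumpkins_recipe`, but we do not `prep` and `bake` it, since it will be bundled into a workflow, as you will see in just a few steps.\n",
    "\n",
    "There are quite a number of ways to specify a logistic regression model in Tidymodels. See `?logistic_reg()`. For now, we will specify a logistic regression model via the default `stats::glm()` engine.\n"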
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Create a recipe that specifies preprocessing steps for modelling\n",
    "pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n",
    "  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
    "  step_integer(item_size, zero_based = FALSE) %>%\n",
    "  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n",
    "\n",
    "# Create a logistic model specification\n",
    "log_reg <- logistic_reg() %>% \n",
    "  set_engine(\"glm\") %>% \n",
    "  set_mode(\"classification\")\n"
   ]
  },
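  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a recipe and a model specification, we need a way to bundle them into a single object. That object will first pre-process the data (prep + bake behind the scenes), then fit the model on the pre-processed data, and also allow for potential post-processing activities.\n",
    "\n",
    "In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/), and it conveniently holds your modelling components.\n"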
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Bundle modelling components in a workflow\n",
    "log_reg_wf <- workflow() %>% \n",
    "  add_recipe(pumpkins_recipe) %>% \n",
    "  add_model(log_reg)\n",
    "\n",
    "# Print out the workflow\n",
    "log_reg_wf\n"
   ]
  },
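  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow estimates the recipe and pre-processes the data before training, so we do not have to do that manually with prep and bake.\n"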
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Train the model\n",
    "wf_fit <- log_reg_wf %>% \n",
    "  fit(data = pumpkins_train)\n",
    "\n",
    "# Print the trained workflow\n",
    "wf_fit\n"
   ]
  },
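  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The printout of the trained workflow shows the coefficients learned during training.\n",
    "\n",
    "Now that we have trained the model on the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for the test set, along with the probability of each label. When the predicted probability of `WHITE` (the `.pred_WHITE` column) is greater than 0.5, the predicted class is `WHITE`; otherwise it is `ORANGE`.\n"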
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Make predictions for color and corresponding probabilities\n",
    "results <- pumpkins_test %>% select(color) %>% \n",
    "  bind_cols(wf_fit %>% \n",
    "              predict(new_data = pumpkins_test)) %>%\n",
    "  bind_cols(wf_fit %>%\n",
    "              predict(new_data = pumpkins_test, type = \"prob\"))\n",
    "\n",
    "# Compare predictions\n",
    "results %>% \n",
    "  slice_head(n = 10)\n"
   ]
  },
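  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The step from a linear score to a class label can be sketched in a few lines of base R. The `log_odds` values below are hypothetical and are not output from the fitted workflow; they only illustrate how the logistic (sigmoid) function turns a score into a probability and how the 0.5 threshold then picks a class.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Hypothetical linear scores (log-odds of WHITE) for four pumpkins; not model output\n",
    "log_odds <- c(-2.0, -0.1, 0.3, 4.0)\n",
    "\n",
    "# The logistic (sigmoid) function maps log-odds to probabilities in (0, 1)\n",
    "# plogis(x) is the same as 1 / (1 + exp(-x))\n",
    "prob_white <- plogis(log_odds)\n",
    "\n",
    "# Apply the 0.5 decision threshold to pick a class\n",
    "pred_class <- ifelse(prob_white > 0.5, 'WHITE', 'ORANGE')\n",
    "\n",
    "data.frame(log_odds, prob_white = round(prob_white, 3), pred_class)\n"
   ]
  },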
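  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Very nice! This gives us more insight into how logistic regression works.\n",
    "\n",
    "### Better comprehension via a confusion matrix\n",
    "\n",
    "Comparing each prediction with its corresponding \"ground truth\" value is not a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/), a package for measuring the effectiveness of models using performance metrics.\n",
    "\n",
    "One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs. It tabulates how many examples in each class were correctly classified by the model. In our case, it shows how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; it also shows how many were classified into the **wrong** category.\n",
    "\n",
    "The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes.\n"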
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Confusion matrix for prediction results\n",
    "conf_mat(data = results, truth = color, estimate = .pred_class)\n"
   ]
  },
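  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's interpret the confusion matrix. Our model is asked to classify pumpkins into two categories: `white` and `not-white`. The description below treats `white` as the positive class. Note that `conf_mat()` prints classes in factor-level order (here `ORANGE` before `WHITE`), so check the row and column labels when locating these cells in the printed output.\n",
    "\n",
    "- If your model predicts a pumpkin as white and it belongs to category 'white' in reality, we call it a `true positive`, shown by the top left number.\n",
    "\n",
    "- If your model predicts a pumpkin as not white and it belongs to category 'white' in reality, we call it a `false negative`, shown by the bottom left number.\n",
    "\n",
    "- If your model predicts a pumpkin as white and it belongs to category 'not-white' in reality, we call it a `false positive`, shown by the top right number.\n",
    "\n",
    "- If your model predicts a pumpkin as not white and it belongs to category 'not-white' in reality, we call it a `true negative`, shown by the bottom right number.\n",
    "\n",
    "|                      | Truth: WHITE | Truth: ORANGE |\n",
    "|----------------------|--------------|---------------|\n",
    "| **Predicted WHITE**  | TP           | FP            |\n",
    "| **Predicted ORANGE** | FN           | TN            |\n",
    "\n",
    "As you might have guessed, it is preferable to have a larger number of `true positives` and `true negatives` and a lower number of `false positives` and `false negatives`, which implies that the model performs better.\n",
    "\n",
    "The confusion matrix is helpful because it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let's go through some of them:\n",
    "\n",
    "🎓 Precision: `TP/(TP + FP)`, defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\").\n",
    "\n",
    "🎓 Recall: `TP/(TP + FN)`, defined as the proportion of actual positives that are correctly predicted as positive. Also known as `sensitivity`.\n",
    "\n",
    "🎓 Specificity: `TN/(TN + FP)`, defined as the proportion of actual negatives that are correctly predicted as negative.\n",
    "\n",
    "🎓 Accuracy: `(TP + TN)/(TP + TN + FP + FN)`, the percentage of labels predicted correctly for a sample.\n",
    "\n",
    "🎓 F Measure: a weighted harmonic mean of precision and recall, with 1 being the best and 0 the worst.\n",
    "\n",
    "Let's calculate these metrics!\n"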
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Combine metric functions and calculate them all at once\n",
    "eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
    "eval_metrics(data = results, truth = color, estimate = .pred_class)\n"
   ]
  },
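  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To connect the formulas above to numbers, the sketch below computes each metric by hand in base R. The four counts are hypothetical and are not the results of this notebook. The F measure is computed here as the F1 score, the equally weighted harmonic mean of precision and recall, which is what `f_meas()` reports with its default `beta = 1`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Hypothetical confusion-matrix counts (not results from this notebook)\n",
    "TP <- 40; FP <- 5; FN <- 10; TN <- 45\n",
    "\n",
    "precision   <- TP / (TP + FP)                       # predicted positives that are correct\n",
    "recall      <- TP / (TP + FN)                       # actual positives that were found\n",
    "specificity <- TN / (TN + FP)                       # actual negatives that were found\n",
    "accuracy    <- (TP + TN) / (TP + TN + FP + FN)      # all correct predictions\n",
    "f1          <- 2 * precision * recall / (precision + recall)\n",
    "\n",
    "round(c(precision = precision, recall = recall, specificity = specificity,\n",
    "        accuracy = accuracy, f1 = f1), 3)\n"
   ]
  },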
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize the ROC curve of this model\n",
    "\n",
    "Let's do one more visualization to see the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Make a roc_curve\n",
    "results %>% \n",
    "  roc_curve(color, .pred_ORANGE) %>% \n",
    "  autoplot()\n"
   ]
  },
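  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ROC curves are often used to view the output of a classifier in terms of its true versus false positives. A ROC curve typically shows the `True Positive Rate` (sensitivity) on the Y axis and the `False Positive Rate` (1 - specificity) on the X axis. The steepness of the curve and the space between the diagonal line and the curve therefore matter: you want a curve that quickly heads up and over the diagonal. In our case, there are some false positives to start with, and then the curve heads up and over properly.\n",
    "\n",
    "Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the Curve (AUC). One way of interpreting AUC is as the probability that the model ranks a random positive example higher than a random negative example.\n"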
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Calculate area under curve\n",
    "results %>% \n",
    "  roc_auc(color, .pred_ORANGE)\n"
   ]
  },
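  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ranking interpretation of AUC can be checked by hand on a tiny example. The scores below are hypothetical and are not model output; the sketch compares every positive example with every negative one and reports the fraction of pairs ranked correctly, counting ties as half. It illustrates the interpretation only and is not how yardstick computes the value.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Hypothetical predicted scores for the positive class (not model output)\n",
    "pos_scores <- c(0.9, 0.8, 0.4)   # scores given to truly positive examples\n",
    "neg_scores <- c(0.7, 0.3, 0.2)   # scores given to truly negative examples\n",
    "\n",
    "# Compare every positive with every negative; a tie counts as half\n",
    "pair_wins <- outer(pos_scores, neg_scores,\n",
    "                   FUN = function(p, n) (p > n) + 0.5 * (p == n))\n",
    "\n",
    "# AUC = fraction of (positive, negative) pairs ranked correctly\n",
    "auc_by_pairs <- mean(pair_wins)\n",
    "auc_by_pairs\n"
   ]
  },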
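  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is around `0.975`. Given that AUC ranges from 0 to 1, you want a high score, since a model that is 100% correct in its predictions has an AUC of 1; in this case, the model is *pretty good*.\n",
    "\n",
    "In future lessons on classification, you will learn how to improve your model's scores (for example, by dealing with imbalanced data, as in this case).\n",
    "\n",
    "## 🚀Challenge\n",
    "\n",
    "There is a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.\n",
    "\n",
    "## Review & Self Study\n",
    "\n",
    "Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about which tasks are better suited to each of the regression types we have studied so far. Which would work best?\n"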
   ]
  },
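  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**Disclaimer**:  \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The document in its original language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"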
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": "",
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.4.1"
  },
  "coopTranslator": {
   "original_hash": "feaf125f481a89c468fa115bf2aed580",
   "translation_date": "2025-09-03T19:36:22+00:00",
   "source_file": "2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb",
   "language_code": "zh"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}