{
"cells": [
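{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build a logistic regression model - Lesson 4\n",
"\n",
"![Logistic vs. linear regression infographic](../../../../../../translated_images/linear-vs-logistic.ba180bf95e7ee66721ba10ebf2dac2666acbd64a88b003c83928712433a13c7d.mo.png)\n",
"\n",
"#### **[Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n",
"\n",
"#### Introduction\n",
"\n",
"In this final lesson on regression, we look at logistic regression, one of the basic *classic* machine learning techniques. You can use it to discover patterns and predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- Techniques for logistic regression\n",
"\n",
"✅ Deepen your understanding of working with this type of regression in this [Learn module](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott)\n",
"\n",
"## Prerequisite\n",
"\n",
"Having worked with the pumpkin data, we are now familiar enough with it to realize that there is one binary category we can work with: `Color`.\n",
"\n",
"Let's build a logistic regression model to predict, given some variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).\n",
"\n",
"> Why are we talking about binary classification in a group of lessons about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.\n",
"\n",
"For this lesson, we need the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages) for modeling and machine learning.\n",
"\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n",
"\n",
"- `ggbeeswarm`: The [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) provides methods to create beeswarm-style plots using ggplot2.\n",
"\n",
"You can install them with:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n",
"\n",
"Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n"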
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n"
]
},
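{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Define the question**\n",
"\n",
"For our purposes, we express the problem as a binary classification: 'White' or 'Not White'. There is also a 'striped' category in our dataset, but it has few instances, so we will not use it. It disappears anyway once we remove null values from the dataset.\n",
"\n",
"> 🎃 Fun fact: we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones, but they are cool looking! So we could also reformulate our question as: 'Ghost' or 'Not Ghost'. 👻\n",
"\n",
"## **About logistic regression**\n",
"\n",
"Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\n",
"\n",
"#### **Binary classification**\n",
"\n",
"Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"orange or not orange\"), whereas the latter is capable of predicting `continuous values`, for example, given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n",
"\n",
"![Infographic by Dasani Madipalli](../../../../../../translated_images/pumpkin-classifier.562771f104ad5436b87d1c67bca02a42a17841133556559325c0a0e348e5b774.mo.png)\n",
"\n",
"### Other classifications\n",
"\n",
"There are other types of logistic regression, including multinomial and ordinal:\n",
"\n",
"- **Multinomial**, which involves more than two categories, for example \"Orange, White, and Striped\".\n",
"\n",
"- **Ordinal**, which involves ordered categories. It is useful when we want to order the outcomes logically, such as pumpkins ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).\n",
"\n",
"![Multinomial vs ordinal regression](../../../../../../translated_images/multinomial-vs-ordinal.36701b4850e37d86c9dd49f7bef93a2f94dbdb8fe03443eb68f0542f97f28f29.mo.png)\n",
"\n",
"#### **Variables DO NOT have to correlate**\n",
"\n",
"Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite: the variables do not have to be correlated. That suits our data, which has somewhat weak correlations.\n",
"\n",
"#### **You need a lot of clean data**\n",
"\n",
"Logistic regression gives more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.\n",
"\n",
"✅ Think about the types of data that would lend themselves well to logistic regression\n",
"\n",
"## Exercise - tidy the data\n",
"\n",
"First, clean the data a bit, dropping null values and selecting only some of the columns:\n",
"\n",
"1. Add the following code:\n"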
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Load the core tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the data and clean column names\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n",
" clean_names()\n",
"\n",
"# Select desired columns\n",
"pumpkins_select <- pumpkins %>% \n",
" select(c(city_name, package, variety, origin, item_size, color)) \n",
"\n",
"# Drop rows containing missing values and encode color as factor (category)\n",
"pumpkins_select <- pumpkins_select %>% \n",
" drop_na() %>% \n",
" mutate(color = factor(color))\n",
"\n",
"# View the first few rows\n",
"pumpkins_select %>% \n",
" slice_head(n = 5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can always take a peek at your new data frame with the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function, as shown below:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"pumpkins_select %>% \n",
" glimpse()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's confirm that we are actually dealing with a binary classification problem:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Subset distinct observations in outcome column\n",
"pumpkins_select %>% \n",
" distinct(color)\n"
]
},
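{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualization - categorical plot\n",
"By now you have loaded the pumpkin data once again and cleaned it so that it keeps a few variables, including Color. Let's visualize the data frame in the notebook using the ggplot library.\n",
"\n",
"The ggplot library offers some neat ways to visualize your data. For example, you can compare the distribution of the data for each Variety and Color in a categorical plot.\n",
"\n",
"1. Create such a plot with the `geom_bar` function, using our pumpkin data and specifying a color mapping for each pumpkin category (orange or white):\n"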
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Specify colors for each value of the hue variable\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# Create the bar plot\n",
"ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n",
" geom_bar(position = \"dodge\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(y = \"Variety\", fill = \"Color\") +\n",
" theme_minimal()"
]
},
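{
"cell_type": "markdown",
"metadata": {},
"source": [
"By observing the data, you can see how the Color data relates to Variety.\n",
"\n",
"✅ Given this categorical plot, what interesting explorations can you envision?\n"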
]
},
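{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data pre-processing: feature encoding\n",
"\n",
"All the columns in our pumpkins dataset contain string values. Working with categorical data is intuitive for humans but not for machines. Machine learning algorithms work better with numbers. That is why encoding is a very important step in the data pre-processing phase: it turns categorical data into numerical data without losing any information. Good encoding helps build a good model.\n",
"\n",
"For feature encoding there are two main types of encoders:\n",
"\n",
"1. **Ordinal encoder**: it suits ordinal variables, which are categorical variables whose values follow a logical order, like the `item_size` column in our dataset. It creates a mapping in which each category is represented by a number, namely the position of that category in the order.\n",
"\n",
"2. **Categorical encoder**: it suits nominal variables, which are categorical variables whose values have no logical order, like all the features other than `item_size` in our dataset. It is a one-hot encoding, which means that each category is represented by a binary column: the encoded variable is 1 if the pumpkin belongs to that Variety and 0 otherwise.\n",
"\n",
"Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/), a package for preprocessing data. We will define a `recipe` that specifies that all predictor columns should be encoded into a set of integers, `prep` it to estimate the quantities and statistics required by any operations, and finally `bake` it to apply the computations to new data.\n",
"\n",
"> Normally, recipes is used as a preprocessor for modelling, where it defines which steps should be applied to a dataset to get it ready for modelling. In that case it is **highly recommended** that you use a `workflow()` instead of manually estimating a recipe with prep and bake. We will see all this in just a moment.\n",
">\n",
"> For now, however, we use recipes + prep + bake to specify which steps should be applied to a dataset to get it ready for data analysis, and then extract the preprocessed data with the steps applied.\n"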
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Preprocess and extract data to allow some data analysis\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n",
" # Define ordering for item_size column\n",
" step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
" # Convert factors to numbers using the order defined above (Ordinal encoding)\n",
" step_integer(item_size, zero_based = F) %>%\n",
" # Encode all other predictors using one hot encoding\n",
" step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n",
" prep(data = pumpkin_select) %>%\n",
" bake(new_data = NULL)\n",
"\n",
"# Display the first few rows of preprocessed data\n",
"baked_pumpkins %>% \n",
" slice_head(n = 5)\n"
]
},
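{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two encoders described earlier more concrete, here is a minimal base R sketch on a tiny made-up data frame. The `toy` data, the shortened `size_levels` ordering and the variable names are assumptions for illustration only; they are not part of the lesson's dataset or of the recipe above. The integer column keeps the size order, while `model.matrix()` produces one 0/1 column per variety.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Hypothetical toy data (not taken from the pumpkin dataset)\n",
"toy <- data.frame(\n",
"  item_size = c('sml', 'lge', 'med', 'lge'),\n",
"  variety = c('PIE TYPE', 'MINIATURE', 'PIE TYPE', 'HOWDEN TYPE')\n",
")\n",
"\n",
"# Ordinal encoding: fix the level order, then use each level's position\n",
"size_levels <- c('sml', 'med', 'lge')\n",
"toy$item_size_encoded <- as.integer(factor(toy$item_size, levels = size_levels, ordered = TRUE))\n",
"\n",
"# One-hot encoding: one 0/1 indicator column per variety (no intercept column)\n",
"variety_one_hot <- model.matrix(~ variety - 1, data = toy)\n",
"\n",
"# Inspect both encodings\n",
"toy\n",
"variety_one_hot\n"
]
},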
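{
"cell_type": "markdown",
"metadata": {},
"source": [
"✅ What are the advantages of using an ordinal encoder for the Item Size column?\n",
"\n",
"### Analyse relationships between variables\n",
"\n",
"Now that we have pre-processed our data, we can analyse the relationships between the features and the label to get an idea of how well the model will be able to predict the label given the features. The best way to perform this kind of analysis is to plot the data.\n",
"We will again use ggplot, this time the `geom_boxplot` function, to visualize the relationships between Item Size, Variety and Color in a categorical plot. To plot the data better, we will use the encoded Item Size column and the unencoded Variety column.\n"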
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Define the color palette\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# We need the encoded Item Size column to use it as the x-axis values in the plot\n",
"pumpkins_select_plot<-pumpkins_select\n",
"pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n",
"\n",
"# Create the grouped box plot\n",
"ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +\n",
" geom_boxplot() +\n",
" facet_grid(variety ~ ., scales = \"free_x\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(x = \"Item Size\", y = \"\") +\n",
" theme_minimal() +\n",
" theme(strip.text = element_text(size = 12)) +\n",
" theme(axis.text.x = element_text(size = 10)) +\n",
" theme(axis.title.x = element_text(size = 12)) +\n",
" theme(axis.title.y = element_blank()) +\n",
" theme(legend.position = \"bottom\") +\n",
" guides(fill = guide_legend(title = \"Color\")) +\n",
" theme(panel.spacing = unit(0.5, \"lines\"))+\n",
" theme(strip.text.y = element_text(size = 4, hjust = 0)) \n"
]
},
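{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Use a swarm plot\n",
"\n",
"Since Color is a binary category (White or Not), it needs 'a [specialized approach](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf) to visualization'.\n",
"\n",
"Try a `swarm plot` to show the distribution of color with respect to item_size.\n",
"\n",
"We will use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm), which provides methods to create beeswarm-style plots using ggplot2. Beeswarm plots are a way of plotting points that would ordinarily overlap so that they fall next to each other instead.\n"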
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create beeswarm plots of color and item_size\n",
"baked_pumpkins %>% \n",
" mutate(color = factor(color)) %>% \n",
" ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n",
" geom_quasirandom() +\n",
" scale_color_brewer(palette = \"Dark2\", direction = -1) +\n",
" theme(legend.position = \"none\")\n"
]
},
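{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n",
"\n",
"## Build your model\n",
"\n",
"Select the variables you want to use in your classification model and split the data into training and test sets. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:\n"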
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Split data into 80% for training and 20% for testing\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>% \n",
" slice_head(n = 5)\n"
]
},
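{
"cell_type": "markdown",
"metadata": {},
"source": [
"🙌 We are now ready to train a model by fitting the training features to the training label (color).\n",
"\n",
"We will begin by creating a recipe that specifies the preprocessing steps to carry out on our data to get it ready for modelling, for example encoding categorical variables into a set of integers. Just like `baked_pumpkins`, we create a `pumpkins_recipe`, but we do not `prep` and `bake` it, since those steps will be bundled into a workflow, which you will see in just a few steps.\n",
"\n",
"There are quite a number of ways to specify a logistic regression model in Tidymodels. See `?logistic_reg()`. For now, we will specify a logistic regression model via the default `stats::glm()` engine.\n"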
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n",
" step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
" step_integer(item_size, zero_based = FALSE) %>% \n",
" step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n",
"\n",
"# Create a logistic model specification\n",
"log_reg <- logistic_reg() %>% \n",
" set_engine(\"glm\") %>% \n",
" set_mode(\"classification\")\n"
]
},
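{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a recipe and a model specification, we need a way to bundle them into a single object that will first preprocess the data (prep + bake behind the scenes), then fit the model on the preprocessed data, and also allow for potential post-processing activities.\n",
"\n",
"In Tidymodels, this handy object is called a [`workflow`](https://workflows.tidymodels.org/) and conveniently holds your modeling components.\n"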
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Bundle modelling components in a workflow\n",
"log_reg_wf <- workflow() %>% \n",
" add_recipe(pumpkins_recipe) %>% \n",
" add_model(log_reg)\n",
"\n",
"# Print out the workflow\n",
"log_reg_wf\n"
]
},
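{
"cell_type": "markdown",
"metadata": {},
"source": [
"After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow estimates the recipe and preprocesses the data before training, so we do not have to do that manually with prep and bake.\n"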
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Train the model\n",
"wf_fit <- log_reg_wf %>% \n",
" fit(data = pumpkins_train)\n",
"\n",
"# Print the trained workflow\n",
"wf_fit\n"
]
},
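{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model printout shows the coefficients learned during training.\n",
"\n",
"Now that we have trained the model on the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict the labels for our test set and the probabilities for each label. When the predicted probability of `WHITE` (the `.pred_WHITE` column) is greater than 0.5, the predicted class is `WHITE`; otherwise it is `ORANGE`.\n"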
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make predictions for color and corresponding probabilities\n",
"results <- pumpkins_test %>% select(color) %>% \n",
" bind_cols(wf_fit %>% \n",
" predict(new_data = pumpkins_test)) %>%\n",
" bind_cols(wf_fit %>%\n",
" predict(new_data = pumpkins_test, type = \"prob\"))\n",
"\n",
"# Compare predictions\n",
"results %>% \n",
" slice_head(n = 10)\n"
]
},
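{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what happens behind `predict()`: a logistic regression computes a linear score (the log-odds) and passes it through the logistic function to get a probability, which is then compared with 0.5. The `intercept` and `slope` below are made-up values for illustration, not the coefficients of the fitted workflow, and a single predictor is assumed for simplicity.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Hypothetical coefficients, chosen for illustration only (not the fitted model's values)\n",
"intercept <- -4\n",
"slope <- 1.5\n",
"\n",
"# Encoded item sizes from 1 (smallest) to 7 (largest)\n",
"item_size <- 1:7\n",
"\n",
"# Linear predictor (log-odds), mapped into (0, 1) by the logistic function\n",
"log_odds <- intercept + slope * item_size\n",
"prob_white <- plogis(log_odds)  # same as 1 / (1 + exp(-log_odds))\n",
"\n",
"# Apply the 0.5 threshold to turn probabilities into class labels\n",
"pred_class <- ifelse(prob_white > 0.5, 'WHITE', 'ORANGE')\n",
"\n",
"data.frame(item_size, prob_white = round(prob_white, 3), pred_class)\n"
]
},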
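{
"cell_type": "markdown",
"metadata": {},
"source": [
"Very nice! This provides some more insight into how logistic regression works.\n",
"\n",
"### Better comprehension via a confusion matrix\n",
"\n",
"Comparing each prediction with its corresponding \"ground truth\" value is not a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/), a package for measuring the effectiveness of models using performance metrics.\n",
"\n",
"One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were classified correctly. In our case, it shows how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; it also shows how many were classified into the **wrong** category.\n",
"\n",
"The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes.\n"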
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Confusion matrix for prediction results\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)\n"
]
},
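{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's interpret the confusion matrix. Our model is asked to classify pumpkins into two binary categories: `white` and `not-white`.\n",
"\n",
"- If the model predicts a pumpkin as white and it belongs to the 'white' category in reality, we call it a `true positive`, shown by the top left number.\n",
"\n",
"- If the model predicts a pumpkin as not white and it belongs to the 'white' category in reality, we call it a `false negative`, shown by the bottom left number.\n",
"\n",
"- If the model predicts a pumpkin as white and it belongs to the 'not-white' category in reality, we call it a `false positive`, shown by the top right number.\n",
"\n",
"- If the model predicts a pumpkin as not white and it belongs to the 'not-white' category in reality, we call it a `true negative`, shown by the bottom right number.\n",
"\n",
"| | Truth: White | Truth: Orange |\n",
"|---------------|--------|-------|\n",
"| **Predicted: White** | TP | FP |\n",
"| **Predicted: Orange** | FN | TN |\n",
"\n",
"As you might have guessed, it is preferable to have a larger number of true positives and true negatives and a smaller number of false positives and false negatives, which implies that the model performs better.\n",
"\n",
"The confusion matrix is helpful because it gives rise to other metrics that help us evaluate the performance of a classification model. Let's go through some of them:\n",
"\n",
"🎓 Precision: `TP/(TP + FP)`, defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\").\n",
"\n",
"🎓 Recall: `TP/(TP + FN)`, defined as the proportion of actual positives that are correctly predicted as positive. Also known as `sensitivity`.\n",
"\n",
"🎓 Specificity: `TN/(TN + FP)`, defined as the proportion of actual negatives that are correctly predicted as negative.\n",
"\n",
"🎓 Accuracy: `(TP + TN)/(TP + TN + FP + FN)`, the percentage of labels predicted correctly for a sample.\n",
"\n",
"🎓 F Measure: a weighted (harmonic) mean of precision and recall, with the best value being 1 and the worst being 0.\n",
"\n",
"Let's calculate these metrics!\n"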
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Combine metric functions and calculate them all at once\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)\n"
]
},
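{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the formulas above follow from the four cells of a confusion matrix, here is a small base R sketch. The counts `tp`, `fp`, `fn` and `tn` are made-up numbers for illustration and are not the results of the model trained in this notebook; the F measure is assumed to be the balanced F1 score.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Hypothetical confusion-matrix counts, for illustration only\n",
"tp <- 40; fp <- 5; fn <- 10; tn <- 45\n",
"\n",
"# Metrics derived directly from the four counts\n",
"precision <- tp / (tp + fp)\n",
"recall <- tp / (tp + fn)\n",
"specificity <- tn / (tn + fp)\n",
"accuracy <- (tp + tn) / (tp + fp + fn + tn)\n",
"\n",
"# Balanced F measure (F1): harmonic mean of precision and recall\n",
"f1 <- 2 * precision * recall / (precision + recall)\n",
"\n",
"round(c(precision = precision, recall = recall, specificity = specificity,\n",
"        accuracy = accuracy, f1 = f1), 3)\n"
]
},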
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the ROC curve of this model\n",
"\n",
"Let's do one more visualization to see the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make a roc_curve\n",
"results %>% \n",
" roc_curve(color, .pred_ORANGE) %>% \n",
" autoplot()\n"
]
},
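{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC curves are often used to view the output of a classifier in terms of its true versus false positives. They typically show the `True Positive Rate`/Sensitivity on the Y axis and the `False Positive Rate`/1-Specificity on the X axis. The steepness of the curve and the space between the curve and the midpoint line therefore matter: you want a curve that quickly heads up and over the line. In our case, there are some false positives to start with, and then the curve heads up and over properly.\n",
"\n",
"Finally, we can use `yardstick::roc_auc()` to calculate the actual Area Under the Curve (AUC). One way of interpreting AUC is as the probability that the model ranks a random positive example higher than a random negative example.\n"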
]
},
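{
"cell_type": "markdown",
"metadata": {},
"source": [
"That ranking interpretation can be checked by hand on a toy example. The sketch below uses made-up scores (`pos_scores` and `neg_scores` are hypothetical and unrelated to the pumpkin model) and counts, over all positive/negative pairs, how often the positive example gets the higher score, with ties counted as half.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Hypothetical predicted probabilities of the positive class, for illustration only\n",
"pos_scores <- c(0.9, 0.8, 0.4)  # examples that truly are positive\n",
"neg_scores <- c(0.7, 0.3, 0.2)  # examples that truly are negative\n",
"\n",
"# Compare every positive with every negative; ties count as half a win\n",
"wins <- outer(pos_scores, neg_scores, '>')\n",
"ties <- outer(pos_scores, neg_scores, '==')\n",
"\n",
"# Share of pairs in which the positive example is ranked higher\n",
"auc <- mean(wins + 0.5 * ties)\n",
"auc\n"
]
},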
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Calculate area under curve\n",
"results %>% \n",
" roc_auc(color, .pred_ORANGE)\n"
]
},
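{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is around `0.975`. Given that the AUC ranges from 0 to 1, you want a high score, since a model that is 100% correct in its predictions has an AUC of 1. In this case, the model is *pretty good*.\n",
"\n",
"In future lessons on classification, you will learn how to improve your model's scores (for example by dealing with imbalanced data, as in this case).\n",
"\n",
"## 🚀Challenge\n",
"\n",
"There is a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.\n",
"\n",
"## Review & Self Study\n",
"\n",
"Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses of logistic regression. Think about which tasks are better suited to one or the other type of regression we have studied up to this point. What would work best?\n"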
]
},
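{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer** \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"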
]
}
],
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"coopTranslator": {
"original_hash": "feaf125f481a89c468fa115bf2aed580",
"translation_date": "2025-08-29T23:02:49+00:00",
"source_file": "2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb",
"language_code": "mo"
}
},
"nbformat": 4,
"nbformat_minor": 1
}