{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 建立邏輯迴歸模型 - 第四課\n",
"\n",
"![邏輯迴歸與線性迴歸資訊圖](../../../../../../translated_images/linear-vs-logistic.ba180bf95e7ee66721ba10ebf2dac2666acbd64a88b003c83928712433a13c7d.tw.png)\n",
"\n",
"#### **[課前測驗](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n",
"\n",
"#### 介紹\n",
"\n",
"在這最後一課關於迴歸的課程中,我們將探討邏輯迴歸,這是基本的*經典*機器學習技術之一。你可以使用這種技術來發現模式並預測二元分類。例如:這顆糖果是巧克力還是不是?這種疾病是否具有傳染性?這位顧客是否會選擇這個產品?\n",
"\n",
"在本課中,你將學到:\n",
"\n",
"- 邏輯迴歸的技術\n",
"\n",
"✅ 在這個 [學習模組](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott) 中深入了解如何使用這種類型的迴歸。\n",
"\n",
"## 前置條件\n",
"\n",
"在使用南瓜數據後,我們已經足夠熟悉它,並意識到有一個二元分類可以使用:`Color`。\n",
"\n",
"讓我們建立一個邏輯迴歸模型,根據一些變數來預測*某個南瓜可能的顏色*(橙色 🎃 或白色 👻)。\n",
"\n",
"> 為什麼我們在迴歸課程中討論二元分類?僅僅是出於語言上的方便,因為邏輯迴歸[實際上是一種分類方法](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression),儘管它是基於線性的。在下一組課程中,了解其他分類數據的方法。\n",
"\n",
"在本課中,我們需要以下套件:\n",
"\n",
"- `tidyverse` [tidyverse](https://www.tidyverse.org/) 是一個 [R 套件集合](https://www.tidyverse.org/packages),旨在讓數據科學更快速、更簡單、更有趣!\n",
"\n",
"- `tidymodels` [tidymodels](https://www.tidymodels.org/) 框架是一個 [套件集合](https://www.tidymodels.org/packages/) 用於建模和機器學習。\n",
"\n",
"- `janitor` [janitor 套件](https://github.com/sfirke/janitor) 提供簡單的工具來檢查和清理髒數據。\n",
"\n",
"- `ggbeeswarm` [ggbeeswarm 套件](https://github.com/eclarke/ggbeeswarm) 提供使用 ggplot2 創建蜜蜂群樣式圖的方法。\n",
"\n",
"你可以通過以下方式安裝它們:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n",
"\n",
"或者,下面的腳本會檢查你是否已安裝完成本模組所需的套件,並在缺少時為你安裝。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **定義問題**\n",
"\n",
"在我們的情境中,我們將問題表述為二元分類:「白色」或「非白色」。在我們的數據集中還有一個「條紋」類別,但由於其樣本數量很少,我們不會使用它。無論如何,當我們從數據集中移除空值後,它就會消失。\n",
"\n",
"> 🎃 有趣的小知識,我們有時會稱白色南瓜為「幽靈」南瓜。它們不太容易雕刻,因此不像橙色南瓜那麼受歡迎,但它們看起來很酷!所以我們也可以將問題重新表述為:「幽靈」或「非幽靈」。👻\n",
"\n",
"## **關於邏輯回歸**\n",
"\n",
"邏輯回歸與之前學過的線性回歸在幾個重要方面有所不同。\n",
"\n",
"#### **二元分類**\n",
"\n",
"邏輯回歸不提供與線性回歸相同的功能。前者提供對「二元類別」(例如「橙色或非橙色」)的預測,而後者則能預測「連續值」,例如根據南瓜的來源和收穫時間,*價格將上漲多少*。\n",
"\n",
"![Dasani Madipalli 的資訊圖表](../../../../../../translated_images/pumpkin-classifier.562771f104ad5436b87d1c67bca02a42a17841133556559325c0a0e348e5b774.tw.png)\n",
"\n",
"### 其他分類方式\n",
"\n",
"邏輯回歸還有其他類型,包括多項式和序列式:\n",
"\n",
"- **多項式**,涉及多於一個類別——「橙色、白色和條紋」。\n",
"\n",
"- **序列式**,涉及有序的類別,適合我們希望按邏輯順序排列結果的情況,例如南瓜按有限的尺寸(迷你、小、中、大、特大、超大)排序。\n",
"\n",
"![多項式 vs 序列式回歸](../../../../../../translated_images/multinomial-vs-ordinal.36701b4850e37d86c9dd49f7bef93a2f94dbdb8fe03443eb68f0542f97f28f29.tw.png)\n",
"\n",
"#### **變數不必相關**\n",
"\n",
"還記得線性回歸在變數相關性更高時效果更好嗎?邏輯回歸則相反——變數不必相關。這適用於我們的數據,因為它的相關性相對較弱。\n",
"\n",
"#### **需要大量乾淨的數據**\n",
"\n",
"如果使用更多數據,邏輯回歸會給出更準確的結果;我們的小型數據集並不適合這項任務,因此請記住這一點。\n",
"\n",
"✅ 思考哪些類型的數據適合邏輯回歸\n",
"\n",
"## 練習 - 整理數據\n",
"\n",
"首先,稍微清理一下數據,刪除空值並僅選擇部分列:\n",
"\n",
"1. 添加以下程式碼:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Load the core tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the data and clean column names\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n",
" clean_names()\n",
"\n",
"# Select desired columns\n",
"pumpkins_select <- pumpkins %>% \n",
" select(c(city_name, package, variety, origin, item_size, color)) \n",
"\n",
"# Drop rows containing missing values and encode color as factor (category)\n",
"pumpkins_select <- pumpkins_select %>% \n",
" drop_na() %>% \n",
" mutate(color = factor(color))\n",
"\n",
"# View the first few rows\n",
"pumpkins_select %>% \n",
" slice_head(n = 5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"您可以隨時使用 [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) 函數來快速查看您的新資料框,如下所示:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"pumpkins_select %>% \n",
" glimpse()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"讓我們確認一下,我們確實是在處理一個二元分類問題:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Subset distinct observations in outcome column\n",
"pumpkins_select %>% \n",
" distinct(color)\n"
]
},
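{
"cell_type": "markdown",
"metadata": {},
"source": [
"補充示意(非課程原文):除了列出相異值,也可以順手計算每個顏色類別的樣本數;之後課程談到資料不平衡時,這個數字會很有參考價值。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 補充示意:各顏色類別的樣本數,可初步看出類別並不平衡\n",
"pumpkins_select %>% \n",
" count(color)\n"
]
},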
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 視覺化 - 類別圖\n",
"現在,您已再次載入南瓜數據並進行清理,以保留包含一些變數(包括顏色)的數據集。讓我們使用 ggplot2 套件在筆記本中視覺化這個數據框。\n",
"\n",
"ggplot2 套件提供了一些很棒的方法來視覺化您的數據。例如,您可以在類別圖中比較每種品種和顏色的數據分佈。\n",
"\n",
"1. 使用 `geom_bar` 函數創建這樣的圖,使用我們的南瓜數據,並為每個南瓜類別(橙色或白色)指定顏色映射:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Specify colors for each value of the hue variable\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# Create the bar plot\n",
"ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n",
" geom_bar(position = \"dodge\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(y = \"Variety\", fill = \"Color\") +\n",
" theme_minimal()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"透過觀察這些數據,您可以看到顏色數據與品種之間的關聯。\n",
"\n",
"✅ 根據這個分類圖,您能想到哪些有趣的探索方向?\n"
]
},
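{
"cell_type": "markdown",
"metadata": {},
"source": [
"作為一個可能的探索方向(僅為補充示意),可以先用交叉表計算每個品種中各顏色的數量,和上面的長條圖互相印證:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 補充示意:每個品種中各顏色的數量(交叉表)\n",
"pumpkins_select %>% \n",
" count(variety, color) %>% \n",
" pivot_wider(names_from = color, values_from = n, values_fill = 0)\n"
]
},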
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 資料前處理:特徵編碼\n",
"\n",
"我們的南瓜數據集中的所有欄位都包含字串值。對於人類來說,處理類別型數據是直觀的,但對機器來說並非如此。機器學習算法對數字數據的處理效果更佳。因此,編碼是資料前處理階段中非常重要的一步,因為它使我們能夠將類別型數據轉換為數值型數據,同時不丟失任何信息。良好的編碼有助於構建出色的模型。\n",
"\n",
"在特徵編碼中,主要有兩種類型的編碼器:\n",
"\n",
"1. **序數編碼器 (Ordinal encoder)**:適用於序數變數,這類變數是具有邏輯順序的類別型變數,例如我們數據集中的 `item_size` 欄位。序數編碼器會創建一個映射,使每個類別用一個數字表示,這個數字代表該類別在欄位中的順序。\n",
"\n",
"2. **類別編碼器 (Categorical encoder)**:適用於名義變數,這類變數是沒有邏輯順序的類別型變數,例如我們數據集中除了 `item_size` 以外的所有特徵。這是一種獨熱編碼 (one-hot encoding):每個類別會用一個二進位欄位表示,如果南瓜屬於該品種,則編碼變數為 1,否則為 0。\n",
"\n",
"Tidymodels 提供了另一個非常方便的套件:[recipes](https://recipes.tidymodels.org/),這是一個用於資料前處理的套件。我們將定義一個 `recipe`,指定所有的預測欄位應被編碼為一組整數,然後使用 `prep` 來估算任何操作所需的數量和統計數據,最後使用 `bake` 將計算應用到新數據上。\n",
"\n",
"> 通常,recipes 會被用作建模的前處理器,用於定義應該對數據集應用哪些步驟,以使其準備好進行建模。在這種情況下,**強烈建議**使用 `workflow()`,而不是手動使用 prep 和 bake 來估算 recipe。我們稍後會看到這些內容。\n",
">\n",
"> 不過,目前我們使用 recipes + prep + bake 來指定應該對數據集應用哪些步驟,使其準備好進行數據分析,然後提取應用步驟後的前處理數據。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Preprocess and extract data to allow some data analysis\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n",
" # Define ordering for item_size column\n",
" step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
" # Convert factors to numbers using the order defined above (Ordinal encoding)\n",
" step_integer(item_size, zero_based = F) %>%\n",
" # Encode all other predictors using one hot encoding\n",
" step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n",
" prep(data = pumpkins_select) %>%\n",
" bake(new_data = NULL)\n",
"\n",
"# Display the first few rows of preprocessed data\n",
"baked_pumpkins %>% \n",
" slice_head(n = 5)\n"
]
},
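{
"cell_type": "markdown",
"metadata": {},
"source": [
"作為補充示意(非課程原文),下面把原始的 `item_size` 類別與編碼後的整數並排對照;可以看到序數編碼保留了尺寸的邏輯順序:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 補充示意:對照原始的 item_size 類別與序數編碼後的整數\n",
"pumpkins_select %>% \n",
" select(original_item_size = item_size) %>% \n",
" bind_cols(baked_pumpkins %>% select(encoded_item_size = item_size)) %>% \n",
" distinct() %>% \n",
" arrange(encoded_item_size)\n"
]
},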
{
"cell_type": "markdown",
"metadata": {},
"source": [
"✅ 使用序數編碼器處理 Item Size 欄位有什麼優勢?\n",
"\n",
"### 分析變數之間的關係\n",
"\n",
"現在我們已經完成了資料的前處理,可以開始分析特徵與標籤之間的關係,以了解模型在給定特徵的情況下能多好地預測標籤。進行這類分析的最佳方法是繪製數據圖表。 \n",
"我們將再次使用 ggplot 的 `geom_boxplot` 函數,以分類圖的形式視覺化 Item Size、Variety 和 Color 之間的關係。為了更好地繪製數據,我們將使用編碼後的 Item Size 欄位以及未編碼的 Variety 欄位。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Define the color palette\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# We need the encoded Item Size column to use it as the x-axis values in the plot\n",
"pumpkins_select_plot<-pumpkins_select\n",
"pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n",
"\n",
"# Create the grouped box plot\n",
"ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +\n",
" geom_boxplot() +\n",
" facet_grid(variety ~ ., scales = \"free_x\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(x = \"Item Size\", y = \"\") +\n",
" theme_minimal() +\n",
" theme(strip.text = element_text(size = 12)) +\n",
" theme(axis.text.x = element_text(size = 10)) +\n",
" theme(axis.title.x = element_text(size = 12)) +\n",
" theme(axis.title.y = element_blank()) +\n",
" theme(legend.position = \"bottom\") +\n",
" guides(fill = guide_legend(title = \"Color\")) +\n",
" theme(panel.spacing = unit(0.5, \"lines\"))+\n",
" theme(strip.text.y = element_text(size = 4, hjust = 0)) \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 使用群集圖\n",
"\n",
"由於 Color 是一個二元分類(白色或非白色),因此需要「[專門的方法](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)」來進行視覺化。\n",
"\n",
"嘗試使用 `群集圖` 來展示顏色在 item_size 上的分佈。\n",
"\n",
"我們將使用 [ggbeeswarm 套件](https://github.com/eclarke/ggbeeswarm),該套件提供使用 ggplot2 創建蜜蜂群樣式圖的方法。蜜蜂群圖是一種將原本會重疊的點排列在彼此旁邊的繪圖方式。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create beeswarm plots of color and item_size\n",
"baked_pumpkins %>% \n",
" mutate(color = factor(color)) %>% \n",
" ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n",
" geom_quasirandom() +\n",
" scale_color_brewer(palette = \"Dark2\", direction = -1) +\n",
" theme(legend.position = \"none\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"現在我們已經了解顏色的二元分類與更大尺寸組別之間的關係,接下來讓我們探討如何使用邏輯迴歸來判斷南瓜可能的顏色。\n",
"\n",
"## 建立模型\n",
"\n",
"選擇您想用於分類模型的變數,並將數據分成訓練集和測試集。[rsample](https://rsample.tidymodels.org/) 是 Tidymodels 中的一個套件,提供了高效的數據分割和重抽樣基礎設施:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Split data into 80% for training and 20% for testing\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>% \n",
" slice_head(n = 5)\n"
]
},
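{
"cell_type": "markdown",
"metadata": {},
"source": [
"補充說明(僅為示意,課程後續仍沿用上面隨機切分得到的 `pumpkins_train` / `pumpkins_test`):由於顏色類別並不平衡,切分時也可以加上 `strata = color` 做分層抽樣,讓訓練集與測試集保留相近的顏色比例。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 補充示意:以 color 做分層抽樣的 80/20 切分\n",
"set.seed(2056)\n",
"pumpkins_split_strat <- pumpkins_select %>% \n",
" initial_split(prop = 0.8, strata = color)\n",
"\n",
"# 檢視分層後訓練集中各顏色的樣本數\n",
"training(pumpkins_split_strat) %>% \n",
" count(color)\n"
]
},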
{
"cell_type": "markdown",
"metadata": {},
"source": [
"🙌 我們現在準備透過將訓練特徵與訓練標籤(顏色)配對來訓練模型。\n",
"\n",
"我們將從建立一個配方開始,該配方指定了對數據進行建模前需要執行的預處理步驟,例如:將分類變數編碼為一組整數。就像建立 `baked_pumpkins` 時一樣,我們會建立一個 `pumpkins_recipe`,但不對它進行 `prep` 和 `bake`,因為這些步驟將被整合到工作流程中,稍後您將看到。\n",
"\n",
"在 Tidymodels 中有許多方法可以指定邏輯回歸模型。請參考 `?logistic_reg()`。目前,我們將通過預設的 `stats::glm()` 引擎來指定邏輯回歸模型。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n",
" step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
" step_integer(item_size, zero_based = F) %>% \n",
" step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n",
"\n",
"# Create a logistic model specification\n",
"log_reg <- logistic_reg() %>% \n",
" set_engine(\"glm\") %>% \n",
" set_mode(\"classification\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"現在我們已經有了配方和模型規範,接下來需要找到一種方法將它們整合成一個物件。這個物件首先會對資料進行前處理(在幕後執行 prep 和 bake),然後基於前處理後的資料擬合模型,並且還能支持潛在的後處理活動。\n",
"\n",
"在 Tidymodels 中,這個方便的物件稱為 [`workflow`](https://workflows.tidymodels.org/),它能輕鬆地整合你的建模組件。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Bundle modelling components in a workflow\n",
"log_reg_wf <- workflow() %>% \n",
" add_recipe(pumpkins_recipe) %>% \n",
" add_model(log_reg)\n",
"\n",
"# Print out the workflow\n",
"log_reg_wf\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在*指定*工作流程後,可以使用[`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html)函數來`訓練`模型。工作流程會估算配方並在訓練前預處理數據,因此我們不需要手動使用 prep 和 bake 來完成這些步驟。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Train the model\n",
"wf_fit <- log_reg_wf %>% \n",
" fit(data = pumpkins_train)\n",
"\n",
"# Print the trained workflow\n",
"wf_fit\n"
]
},
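{
"cell_type": "markdown",
"metadata": {},
"source": [
"補充示意(非課程原文):若想把訓練得到的係數整理成資料框來檢視,可以先從工作流程中取出擬合好的模型,再用 `tidy()` 轉換。這裡假設所安裝的 workflows 版本提供 `extract_fit_parsnip()`(較舊版本可改用 `pull_workflow_fit()`)。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 補充示意:取出擬合好的 parsnip 模型並以資料框檢視係數\n",
"wf_fit %>% \n",
" extract_fit_parsnip() %>% \n",
" tidy()\n"
]
},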
{
"cell_type": "markdown",
"metadata": {},
"source": [
"模型訓練完成後,會顯示訓練過程中學到的係數。\n",
"\n",
"現在我們已使用訓練數據完成模型訓練,可以使用 [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html) 對測試數據進行預測。讓我們先使用模型為測試集預測標籤以及每個標籤的概率。當概率大於 0.5 時,預測類別為 `WHITE`,否則為 `ORANGE`。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make predictions for color and corresponding probabilities\n",
"results <- pumpkins_test %>% select(color) %>% \n",
" bind_cols(wf_fit %>% \n",
" predict(new_data = pumpkins_test)) %>%\n",
" bind_cols(wf_fit %>%\n",
" predict(new_data = pumpkins_test, type = \"prob\"))\n",
"\n",
"# Compare predictions\n",
"results %>% \n",
" slice_head(n = 10)\n"
]
},
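{
"cell_type": "markdown",
"metadata": {},
"source": [
"為了印證前面提到的 0.5 門檻,下面是一段補充示意(非課程原文):手動依 `.pred_WHITE` 是否大於 0.5 推得類別,再與 parsnip 回傳的 `.pred_class` 交叉比對;在二元分類中,兩者應該一致。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 補充示意:以 0.5 為門檻手動把機率轉成類別,並與 .pred_class 比對\n",
"results %>% \n",
" mutate(manual_class = if_else(.pred_WHITE > 0.5, \"WHITE\", \"ORANGE\")) %>% \n",
" count(.pred_class, manual_class)\n"
]
},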
{
"cell_type": "markdown",
"metadata": {},
"source": [
"非常棒!這提供了更多關於邏輯迴歸如何運作的深入見解。\n",
"\n",
"### 透過混淆矩陣更好地理解\n",
"\n",
"將每個預測與其對應的「真實值」進行比較並不是判斷模型預測效果的高效方法。幸運的是Tidymodels 還有一些其他的技巧:[`yardstick`](https://yardstick.tidymodels.org/)——一個用於通過性能指標衡量模型效果的套件。\n",
"\n",
"與分類問題相關的一個性能指標是[`混淆矩陣`](https://wikipedia.org/wiki/Confusion_matrix)。混淆矩陣描述了分類模型的表現如何。混淆矩陣統計了模型正確分類每個類別的例子數量。在我們的例子中,它會顯示有多少橙色南瓜被分類為橙色,有多少白色南瓜被分類為白色;混淆矩陣還會顯示有多少被分類到了**錯誤**的類別。\n",
"\n",
"yardstick 的 [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) 函數可以計算觀察值和預測類別的交叉統計表。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Confusion matrix for prediction results\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"讓我們來解讀混淆矩陣。模型需要將南瓜分成兩個類別:`white` 類別和 `not-white` 類別。\n",
"\n",
"- 如果模型預測南瓜為白色,且實際屬於「白色」類別,我們稱之為 `true positive`,顯示在左上角的數字。\n",
"\n",
"- 如果模型預測南瓜為非白色,但實際屬於「白色」類別,我們稱之為 `false negative`,顯示在左下角的數字。\n",
"\n",
"- 如果模型預測南瓜為白色,但實際屬於「非白色」類別,我們稱之為 `false positive`,顯示在右上角的數字。\n",
"\n",
"- 如果模型預測南瓜為非白色,且實際屬於「非白色」類別,我們稱之為 `true negative`,顯示在右下角的數字。\n",
"\n",
"|                    | 真實值:WHITE | 真實值:ORANGE |\n",
"|--------------------|:------------:|:--------------:|\n",
"| **預測值:WHITE**  | TP           | FP             |\n",
"| **預測值:ORANGE** | FN           | TN             |\n",
"\n",
"如你所料,理想情況是擁有更多的 `true positive` 和 `true negative`,以及更少的 `false positive` 和 `false negative`,這意味著模型表現更好。\n",
"\n",
"混淆矩陣非常有用,因為它衍生出其他指標,幫助我們更好地評估分類模型的性能。讓我們來看看其中一些指標:\n",
"\n",
"🎓 精確度 (Precision): `TP/(TP + FP)`,定義為被預測為陽性的樣本中,實際為陽性的比例。也稱為[陽性預測值](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\")。\n",
"\n",
"🎓 召回率 (Recall): `TP/(TP + FN)`,定義為實際為陽性的樣本中,被正確預測為陽性的比例。也稱為`敏感度`。\n",
"\n",
"🎓 特異性 (Specificity): `TN/(TN + FP)`,定義為實際為陰性的樣本中,被正確預測為陰性的比例。\n",
"\n",
"🎓 準確率 (Accuracy): `(TP + TN)/(TP + TN + FP + FN)`,樣本中被正確預測的標籤所佔的百分比。\n",
"\n",
"🎓 F 值 (F Measure): 精確度和召回率的加權平均值,最佳值為 1最差值為 0。\n",
"\n",
"讓我們來計算這些指標吧!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Combine metric functions and calculate them all at once\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)\n"
]
},
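{
"cell_type": "markdown",
"metadata": {},
"source": [
"補充示意(非課程原文):也可以直接從混淆矩陣的計數套用上面的公式,驗證 yardstick 算出的結果。注意 yardstick 預設把因子的第一個水準(在這裡是 ORANGE)當作「陽性」類別。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 補充示意:從混淆矩陣的計數手動計算各項指標\n",
"cm <- conf_mat(data = results, truth = color, estimate = .pred_class)$table\n",
"TP <- cm[1, 1]; FP <- cm[1, 2]\n",
"FN <- cm[2, 1]; TN <- cm[2, 2]\n",
"\n",
"precision <- TP / (TP + FP)\n",
"recall <- TP / (TP + FN)\n",
"\n",
"tibble(\n",
" precision = precision,\n",
" recall = recall,\n",
" specificity = TN / (TN + FP),\n",
" accuracy = (TP + TN) / (TP + TN + FP + FN),\n",
" f_meas = 2 * precision * recall / (precision + recall)\n",
")\n"
]
},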
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 視覺化此模型的 ROC 曲線\n",
"\n",
"讓我們再進行一次視覺化,來看看所謂的 [`ROC 曲線`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make a roc_curve\n",
"results %>% \n",
" roc_curve(color, .pred_ORANGE) %>% \n",
" autoplot()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC 曲線通常用來檢視分類器在真陽性與假陽性方面的表現。ROC 曲線通常在 Y 軸上顯示「真陽性率」/敏感度,在 X 軸上顯示「假陽性率」/1-特異性。因此,曲線的陡峭程度以及曲線與中線之間的空間非常重要:理想的曲線應該迅速向上並超越中線。在我們的例子中,起初會有一些假陽性,然後曲線正確地向上並超越。\n",
"\n",
"最後,我們使用 `yardstick::roc_auc()` 來計算實際的曲線下面積AUC。AUC 的一種解釋方式是模型將隨機正例排序高於隨機負例的概率。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Calculate area under curve\n",
"results %>% \n",
" roc_auc(color, .pred_ORANGE)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"結果約為 `0.975`。由於 AUC 的範圍是從 0 到 1你希望分數越高越好因為一個模型如果能 100% 正確地進行預測,其 AUC 將為 1在這種情況下這個模型表現*相當不錯*。\n",
"\n",
"在未來的分類課程中,你將學習如何提升模型的分數(例如在這種情況下處理數據不平衡問題)。\n",
"\n",
"## 🚀挑戰\n",
"\n",
"關於邏輯迴歸還有很多內容可以深入探討!但最好的學習方式是進行實驗。找一個適合這類分析的數據集,並用它建立一個模型。你學到了什麼?提示:可以試試 [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) 上有趣的數據集。\n",
"\n",
"## 回顧與自學\n",
"\n",
"閱讀 [斯坦福這篇論文](https://web.stanford.edu/~jurafsky/slp3/5.pdf) 的前幾頁,了解邏輯迴歸的一些實際應用。思考哪些任務更適合我們到目前為止所學的迴歸類型。哪一種方法效果最好?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**免責聲明** \n本文件已使用 AI 翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。儘管我們努力確保翻譯的準確性,但請注意,自動翻譯可能包含錯誤或不準確之處。原始文件的母語版本應被視為權威來源。對於關鍵資訊,建議尋求專業人工翻譯。我們對因使用此翻譯而引起的任何誤解或錯誤解釋不承擔責任。\n"
]
}
],
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"coopTranslator": {
"original_hash": "feaf125f481a89c468fa115bf2aed580",
"translation_date": "2025-09-03T19:38:50+00:00",
"source_file": "2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb",
"language_code": "tw"
}
},
"nbformat": 4,
"nbformat_minor": 1
}