{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 建立邏輯迴歸模型 - 第四課\n",
"\n",
"![邏輯迴歸與線性迴歸資訊圖表](../../../../../../translated_images/linear-vs-logistic.ba180bf95e7ee66721ba10ebf2dac2666acbd64a88b003c83928712433a13c7d.hk.png)\n",
"\n",
"#### **[課前測驗](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n",
"\n",
"#### 介紹\n",
"\n",
"在迴歸系列的最後一課中,我們將探討邏輯迴歸,這是基本的*經典*機器學習技術之一。你可以使用這種技術來發現模式並預測二元分類。例如:這顆糖果是巧克力還是不是?這種疾病是否具有傳染性?這位顧客是否會選擇這個產品?\n",
"\n",
"在這課中,你將學到:\n",
"\n",
"- 邏輯迴歸的技術\n",
"\n",
"✅ 在這個 [學習模組](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott) 中深入了解如何使用這種類型的迴歸。\n",
"\n",
"## 先決條件\n",
"\n",
"在使用南瓜數據後,我們已經足夠熟悉它,並意識到有一個二元分類可以使用:`Color`。\n",
"\n",
"讓我們建立一個邏輯迴歸模型,根據一些變數來預測*某個南瓜可能的顏色*(橙色 🎃 或白色 👻)。\n",
"\n",
"> 為什麼我們在迴歸課程中討論二元分類?僅僅是出於語言上的方便,因為邏輯迴歸[實際上是一種分類方法](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression),儘管它是基於線性的。在下一組課程中,了解其他分類數據的方法。\n",
"\n",
"在這課中,我們需要以下套件:\n",
"\n",
"- `tidyverse` [tidyverse](https://www.tidyverse.org/) 是一個[由 R 套件組成的集合](https://www.tidyverse.org/packages),旨在讓數據科學更快速、更簡單、更有趣!\n",
"\n",
"- `tidymodels` [tidymodels](https://www.tidymodels.org/) 框架是一個[由套件組成的集合](https://www.tidymodels.org/packages/),用於建模和機器學習。\n",
"\n",
"- `janitor` [janitor 套件](https://github.com/sfirke/janitor) 提供了一些簡單的工具,用於檢查和清理髒數據。\n",
"\n",
"- `ggbeeswarm` [ggbeeswarm 套件](https://github.com/eclarke/ggbeeswarm) 提供了使用 ggplot2 創建蜜蜂群圖的方式。\n",
"\n",
"你可以通過以下方式安裝它們:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n",
"\n",
"或者,下面的腳本會檢查你是否已安裝完成此模組所需的套件,並在缺少時為你安裝。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **定義問題**\n",
"\n",
"在我們的情境中,我們會將問題表達為二元分類:「白色」或「非白色」。在我們的數據集中還有一個「條紋」類別,但由於該類別的樣本數量很少,我們不會使用它。無論如何,當我們從數據集中移除空值後,它就會消失。\n",
"\n",
"> 🎃 有趣的小知識,我們有時會稱白色南瓜為「幽靈」南瓜。它們不太容易雕刻,因此不像橙色南瓜那麼受歡迎,但它們看起來很酷!所以我們也可以將問題重新表述為:「幽靈」或「非幽靈」。👻\n",
"\n",
"## **關於邏輯回歸**\n",
"\n",
"邏輯回歸與之前學過的線性回歸有幾個重要的不同之處。\n",
"\n",
"#### **二元分類**\n",
"\n",
"邏輯回歸不提供與線性回歸相同的功能。前者提供對「二元類別」(例如「橙色或非橙色」)的預測,而後者則能預測「連續值」,例如根據南瓜的產地和收穫時間,*價格會上漲多少*。\n",
"\n",
"![Dasani Madipalli 的資訊圖表](../../../../../../translated_images/pumpkin-classifier.562771f104ad5436b87d1c67bca02a42a17841133556559325c0a0e348e5b774.hk.png)\n",
"\n",
"### 其他分類方式\n",
"\n",
"邏輯回歸還有其他類型,包括多項式和序數式:\n",
"\n",
"- **多項式**,涉及多於一個類別,例如「橙色、白色和條紋」。\n",
"\n",
"- **序數式**,涉及有序的類別,適合我們希望按邏輯順序排列結果的情況,例如南瓜按有限的大小(迷你、小、中、大、特大、超大)排序。\n",
"\n",
"![多項式 vs 序數式回歸](../../../../../../translated_images/multinomial-vs-ordinal.36701b4850e37d86c9dd49f7bef93a2f94dbdb8fe03443eb68f0542f97f28f29.hk.png)\n",
"\n",
"#### **變數不需要相關**\n",
"\n",
"還記得線性回歸在變數相關性更高時效果更好嗎?邏輯回歸則相反——變數不需要相關性。這適用於我們的數據,因為它的相關性相對較弱。\n",
"\n",
"#### **需要大量乾淨的數據**\n",
"\n",
"如果使用更多數據,邏輯回歸會提供更準確的結果;我們的小型數據集並不適合這項任務,因此請記住這一點。\n",
"\n",
"✅ 思考哪些類型的數據適合邏輯回歸\n",
"\n",
"## 練習 - 整理數據\n",
"\n",
"首先,稍微清理一下數據,刪除空值並選擇部分欄位:\n",
"\n",
"1. 添加以下程式碼:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Load the core tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the data and clean column names\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n",
" clean_names()\n",
"\n",
"# Select desired columns\n",
"pumpkins_select <- pumpkins %>% \n",
" select(c(city_name, package, variety, origin, item_size, color)) \n",
"\n",
"# Drop rows containing missing values and encode color as factor (category)\n",
"pumpkins_select <- pumpkins_select %>% \n",
" drop_na() %>% \n",
" mutate(color = factor(color))\n",
"\n",
"# View the first few rows\n",
"pumpkins_select %>% \n",
" slice_head(n = 5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"你可以隨時使用 [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) 函數來快速查看你的新數據框,如下所示:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"pumpkins_select %>% \n",
" glimpse()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"讓我們確認一下,我們確實是在處理一個二元分類問題:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Subset distinct observations in outcome column\n",
"pumpkins_select %>% \n",
" distinct(color)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 視覺化 - 類別圖\n",
"現在你已經再次載入南瓜數據並進行清理,保留了一個包含幾個變數(包括顏色)的數據集。讓我們使用 ggplot 庫在筆記本中視覺化這個數據框。\n",
"\n",
"ggplot 庫提供了一些很棒的方法來視覺化你的數據。例如,你可以在類別圖中比較每個品種和顏色的數據分佈。\n",
"\n",
"1. 使用 `geom_bar` 函數創建這樣的圖表,使用我們的南瓜數據,並為每個南瓜類別(橙色或白色)指定顏色映射:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Specify colors for each value of the hue variable\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# Create the bar plot\n",
"ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n",
" geom_bar(position = \"dodge\") +\n",
" scale_fill_manual(values = palette) +\n",
" labs(y = \"Variety\", fill = \"Color\") +\n",
" theme_minimal()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"透過觀察數據,你可以看到顏色數據與品種之間的關係。\n",
"\n",
"✅ 根據這個分類圖,你能想到哪些有趣的探索方向?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 數據預處理:特徵編碼\n",
"\n",
"我們的南瓜數據集包含所有欄位的字串值。對人類來說,處理分類數據是直觀的,但對機器來說並非如此。機器學習算法更擅長處理數字型數據。因此,編碼是數據預處理階段中非常重要的一步,因為它能夠將分類數據轉換為數字數據,同時不丟失任何信息。良好的編碼有助於建立良好的模型。\n",
"\n",
"特徵編碼主要有兩種編碼器:\n",
"\n",
"1. **序數編碼器**:適合處理序數變量,即具有邏輯順序的分類變量,例如我們數據集中的 `item_size` 欄位。它會創建一個映射,使每個類別用一個數字表示,這個數字是該類別在欄位中的順序。\n",
"\n",
"2. **分類編碼器**:適合處理名義變量,即沒有邏輯順序的分類變量,例如我們數據集中除了 `item_size` 以外的所有特徵。這是一種獨熱編碼(one-hot encoding),意思是每個類別都用一個二進制欄位表示:如果南瓜屬於該品種,編碼變量等於 1,否則為 0。\n",
"\n",
"Tidymodels 提供了一個非常方便的套件:[recipes](https://recipes.tidymodels.org/)——一個用於數據預處理的套件。我們將定義一個 `recipe`,指定所有預測欄位應編碼為一組整數,使用 `prep` 來估算所需的數量和統計數據,最後使用 `bake` 將計算應用到新數據上。\n",
"\n",
"> 通常情況下,recipes 用作建模的預處理器,它定義了應對數據集應用哪些步驟以使其準備好進行建模。在這種情況下,**強烈建議**使用 `workflow()`,而不是手動使用 prep 和 bake 來估算 recipe。我們稍後會看到這些內容。\n",
">\n",
"> 不過,目前我們使用 recipes + prep + bake 來指定應對數據集應用哪些步驟以使其準備好進行數據分析,然後提取應用步驟後的預處理數據。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Preprocess and extract data to allow some data analysis\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n",
" # Define ordering for item_size column\n",
" step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
" # Convert factors to numbers using the order defined above (Ordinal encoding)\n",
" step_integer(item_size, zero_based = F) %>%\n",
" # Encode all other predictors using one hot encoding\n",
" step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n",
" prep() %>%\n",
" bake(new_data = NULL)\n",
"\n",
"# Display the first few rows of preprocessed data\n",
"baked_pumpkins %>% \n",
" slice_head(n = 5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"✅ 使用序數編碼器處理 Item Size 欄位有什麼優勢?\n",
"\n",
"### 分析變數之間的關係\n",
"\n",
"現在我們已經完成了數據的預處理,可以分析特徵與標籤之間的關係,以了解模型在給定特徵的情況下預測標籤的能力。進行這類分析的最佳方法是繪製數據圖表。 \n",
"我們將再次使用 ggplot 的 `geom_boxplot` 函數,以分類圖的形式可視化 Item Size、Variety 和 Color 之間的關係。為了更好地繪製數據,我們將使用已編碼的 Item Size 欄位以及未編碼的 Variety 欄位。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Define the color palette\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# We need the encoded Item Size column to use it as the x-axis values in the plot\n",
"pumpkins_select_plot <- pumpkins_select\n",
"pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n",
"\n",
"# Create the grouped box plot\n",
"ggplot(pumpkins_select_plot, aes(x = item_size, y = color, fill = color)) +\n",
"  geom_boxplot() +\n",
"  facet_grid(variety ~ ., scales = \"free_x\") +\n",
"  scale_fill_manual(values = palette) +\n",
"  labs(x = \"Item Size\", y = \"\") +\n",
"  guides(fill = guide_legend(title = \"Color\")) +\n",
"  theme_minimal() +\n",
"  theme(axis.text.x = element_text(size = 10),\n",
"        axis.title.x = element_text(size = 12),\n",
"        axis.title.y = element_blank(),\n",
"        legend.position = \"bottom\",\n",
"        panel.spacing = unit(0.5, \"lines\"),\n",
"        strip.text = element_text(size = 12),\n",
"        strip.text.y = element_text(size = 4, hjust = 0))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 使用蜂群圖\n",
"\n",
"由於 Color 是一個二元分類(白色或非白色),因此需要「[專門的方法](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)」來進行可視化。\n",
"\n",
"嘗試使用`蜂群圖`來展示顏色在 item_size 上的分佈。\n",
"\n",
"我們將使用 [ggbeeswarm 套件](https://github.com/eclarke/ggbeeswarm),該套件提供使用 ggplot2 創建蜜蜂群樣式圖的方法。蜜蜂群圖是一種將原本會重疊的點排列在彼此旁邊的繪圖方式。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create beeswarm plots of color and item_size\n",
"baked_pumpkins %>% \n",
" mutate(color = factor(color)) %>% \n",
" ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n",
" geom_quasirandom() +\n",
" scale_color_brewer(palette = \"Dark2\", direction = -1) +\n",
" theme(legend.position = \"none\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"現在我們已經了解顏色的二元分類與更大尺寸組別之間的關係,接下來讓我們探討如何使用邏輯回歸來判斷南瓜可能的顏色。\n",
"\n",
"## 建立模型\n",
"\n",
"選擇您想用於分類模型的變數,並將數據分成訓練集和測試集。[rsample](https://rsample.tidymodels.org/) 是 Tidymodels 中的一個套件,提供高效的數據分割和重抽樣基礎設施:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Split data into 80% for training and 20% for testing\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>% \n",
" initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>% \n",
" slice_head(n = 5)\n"
]
},
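{
"cell_type": "markdown",
"metadata": {},
"source": [
"> 附帶一提:我們的顏色類別並不平衡(橙色遠多於白色)。下面是一個示意(並非本課的固定步驟):在分割時傳入 `initial_split()` 的 `strata` 參數做分層抽樣,讓訓練集與測試集保持相近的類別比例。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 分層抽樣示意:以 color 作為分層變數(strata),\n",
"# 使訓練集與測試集中兩種顏色的比例大致一致\n",
"set.seed(2056)\n",
"pumpkins_split_strat <- pumpkins_select %>% \n",
"  initial_split(prop = 0.8, strata = color)\n",
"\n",
"# 檢查各分割中每種顏色的數量\n",
"training(pumpkins_split_strat) %>% count(color)\n",
"testing(pumpkins_split_strat) %>% count(color)\n"
]
},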
{
"cell_type": "markdown",
"metadata": {},
"source": [
"🙌 我們現在準備透過將訓練特徵與訓練標籤(顏色)配對來訓練模型。\n",
"\n",
"我們會先建立一個配方,指定在模型化之前需要對數據進行的預處理步驟,例如:將分類變數編碼為一組整數。就像 `baked_pumpkins` 一樣,我們會建立一個 `pumpkins_recipe`,但不會進行 `prep` 和 `bake`,因為這些步驟會被整合到工作流程中,稍後幾個步驟你就會看到。\n",
"\n",
"在 Tidymodels 中有很多方法可以指定邏輯回歸模型。請參考 `?logistic_reg()`。目前,我們會透過預設的 `stats::glm()` 引擎來指定邏輯回歸模型。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n",
" step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
" step_integer(item_size, zero_based = F) %>% \n",
" step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n",
"\n",
"# Create a logistic model specification\n",
"log_reg <- logistic_reg() %>% \n",
" set_engine(\"glm\") %>% \n",
" set_mode(\"classification\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"現在我們已經有了一個配方和模型規範,我們需要找到一種方法將它們結合成一個物件,這個物件首先會對數據進行預處理(在幕後進行準備和烘焙),然後在預處理後的數據上擬合模型,並且還能支持潛在的後處理活動。\n",
"\n",
"在 Tidymodels 中,這個方便的物件叫做 [`workflow`](https://workflows.tidymodels.org/),它能方便地保存你的建模組件。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Bundle modelling components in a workflow\n",
"log_reg_wf <- workflow() %>% \n",
" add_recipe(pumpkins_recipe) %>% \n",
" add_model(log_reg)\n",
"\n",
"# Print out the workflow\n",
"log_reg_wf\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在工作流程*指定*後,可以使用 [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) 函數來`訓練`模型。工作流程會估算配方並在訓練前預處理數據,因此我們不需要手動使用 prep 和 bake 來完成這些步驟。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Train the model\n",
"wf_fit <- log_reg_wf %>% \n",
" fit(data = pumpkins_train)\n",
"\n",
"# Print the trained workflow\n",
"wf_fit\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"模型訓練完成後,會顯示訓練過程中學到的係數。\n",
"\n",
"現在我們已經使用訓練數據訓練了模型,可以利用 [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html) 對測試數據進行預測。讓我們先用模型來預測測試集的標籤以及每個標籤的概率。當概率大於 0.5 時,預測類別為 `WHITE`,否則為 `ORANGE`。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make predictions for color and corresponding probabilities\n",
"results <- pumpkins_test %>% select(color) %>% \n",
" bind_cols(wf_fit %>% \n",
" predict(new_data = pumpkins_test)) %>%\n",
" bind_cols(wf_fit %>%\n",
" predict(new_data = pumpkins_test, type = \"prob\"))\n",
"\n",
"# Compare predictions\n",
"results %>% \n",
" slice_head(n = 10)\n"
]
},
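{
"cell_type": "markdown",
"metadata": {},
"source": [
"> 為了驗證前面提到的 0.5 閾值,下面是一個示意(假設概率欄位名為 `.pred_WHITE`,這是 parsnip 依類別層級自動命名的欄位):手動依概率推導預測類別,並與 `.pred_class` 比較,兩者應該一致。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 手動套用 0.5 閾值(示意):.pred_WHITE 為模型輸出的白色概率\n",
"results %>% \n",
"  mutate(manual_class = if_else(.pred_WHITE > 0.5, \"WHITE\", \"ORANGE\")) %>% \n",
"  count(agree = manual_class == as.character(.pred_class))\n"
]
},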
{
"cell_type": "markdown",
"metadata": {},
"source": [
"非常好!這提供了一些關於邏輯迴歸如何運作的深入理解。\n",
"\n",
"### 透過混淆矩陣更好地理解\n",
"\n",
"將每個預測值與其對應的「真實值」進行比較並不是判斷模型預測效果的高效方法。幸運的是Tidymodels 還有一些其他的技巧:[`yardstick`](https://yardstick.tidymodels.org/)——一個用於通過性能指標衡量模型效果的套件。\n",
"\n",
"與分類問題相關的一個性能指標是[`混淆矩陣`](https://wikipedia.org/wiki/Confusion_matrix)。混淆矩陣描述了一個分類模型的表現如何。混淆矩陣統計了模型在每個類別中正確分類的例子數量。在我們的例子中,它會顯示有多少橙色南瓜被正確分類為橙色,以及有多少白色南瓜被正確分類為白色;混淆矩陣還會顯示有多少被分類到了**錯誤**的類別。\n",
"\n",
"[`conf_mat()`](https://tidymodels.github.io/yardstick/reference/conf_mat.html) 是 yardstick 中的一個函數,用於計算觀察值和預測值類別的交叉統計表。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Confusion matrix for prediction results\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"讓我們來解讀混淆矩陣。我們的模型需要將南瓜分類為兩個二元類別,類別 `white` 和類別 `not-white`。\n",
"\n",
"- 如果你的模型預測南瓜是白色而它在現實中屬於類別「white」我們稱之為 `true positive`,顯示在左上角的數字。\n",
"\n",
"- 如果你的模型預測南瓜不是白色而它在現實中屬於類別「white」我們稱之為 `false negative`,顯示在左下角的數字。\n",
"\n",
"- 如果你的模型預測南瓜是白色而它在現實中屬於類別「not-white」我們稱之為 `false positive`,顯示在右上角的數字。\n",
"\n",
"- 如果你的模型預測南瓜不是白色而它在現實中屬於類別「not-white」我們稱之為 `true negative`,顯示在右下角的數字。\n",
"\n",
"| | 真實值:WHITE | 真實值:ORANGE |\n",
"|:----------------|:------:|:------:|\n",
"| **預測:WHITE** | TP | FP |\n",
"| **預測:ORANGE** | FN | TN |\n",
"\n",
"正如你可能猜到的,理想情況是擁有更多的 `true positive` 和 `true negative`,以及更少的 `false positive` 和 `false negative`,這意味著模型的表現更好。\n",
"\n",
"混淆矩陣非常有用,因為它衍生出其他指標,幫助我們更好地評估分類模型的性能。讓我們來看看其中一些指標:\n",
"\n",
"🎓 精確度 (Precision): `TP/(TP + FP)` 定義為被預測為正的樣本中實際為正的比例。也稱為[陽性預測值](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\")。\n",
"\n",
"🎓 召回率 (Recall): `TP/(TP + FN)` 定義為實際正確的樣本中被正確預測的比例。也稱為 `敏感度`。\n",
"\n",
"🎓 特異性 (Specificity): `TN/(TN + FP)` 定義為實際為負樣本中被正確預測為負的比例。\n",
"\n",
"🎓 準確率 (Accuracy): `(TP + TN)/(TP + TN + FP + FN)` 樣本中被正確預測的標籤所佔的百分比。\n",
"\n",
"🎓 F 值 (F Measure): 精確度和召回率的加權平均值,最佳值為 1最差值為 0。\n",
"\n",
"讓我們來計算這些指標吧!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Combine metric functions and calculate them all at once\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)\n"
]
},
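{
"cell_type": "markdown",
"metadata": {},
"source": [
"> 作為對照(僅為示意),我們也可以直接從混淆矩陣的四個計數手動套用上述公式,並與 yardstick 的輸出比對。此處沿用 yardstick 的預設行為,以第一個類別層級作為「正類」。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# 從混淆矩陣取出四個計數(列 = 預測,欄 = 真實)\n",
"cm <- conf_mat(data = results, truth = color, estimate = .pred_class)$table\n",
"TP <- cm[1, 1]; FP <- cm[1, 2]; FN <- cm[2, 1]; TN <- cm[2, 2]\n",
"\n",
"# 依公式手動計算,應與 eval_metrics 的輸出一致\n",
"tibble(\n",
"  precision   = TP / (TP + FP),\n",
"  recall      = TP / (TP + FN),\n",
"  specificity = TN / (TN + FP),\n",
"  accuracy    = (TP + TN) / (TP + TN + FP + FN),\n",
"  f_meas      = 2 * TP / (2 * TP + FP + FN)\n",
")\n"
]
},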
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 視覺化此模型的 ROC 曲線\n",
"\n",
"讓我們進行另一個視覺化,來看看所謂的 [`ROC 曲線`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make a roc_curve\n",
"results %>% \n",
" roc_curve(color, .pred_ORANGE) %>% \n",
" autoplot()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC 曲線通常用來觀察分類器的輸出分析其真陽性與假陽性的表現。ROC 曲線通常在 Y 軸顯示「真陽性率」/敏感度,而在 X 軸顯示「假陽性率」/1-特異性。因此,曲線的陡峭程度以及曲線與中線之間的距離非常重要:理想的曲線應該快速向上並超越中線。在我們的例子中,起初會有一些假陽性,然後曲線正確地向上並超越中線。\n",
"\n",
"最後,我們使用 `yardstick::roc_auc()` 來計算實際的曲線下面積 (AUC)。AUC 的一種解釋方式是,模型將隨機正例排在隨機負例之前的概率。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Calculate area under curve\n",
"results %>% \n",
" roc_auc(color, .pred_ORANGE)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"結果約為 `0.975`。由於 AUC 的範圍是 0 到 1你希望分數越高越好因為一個模型如果能 100% 正確地進行預測,其 AUC 將為 1在這個例子中模型表現*相當不錯*。\n",
"\n",
"在未來的分類課程中,你將學習如何提升模型的分數(例如在這個案例中處理不平衡數據)。\n",
"\n",
"## 🚀挑戰\n",
"\n",
"關於邏輯回歸還有很多內容可以探索!但最好的學習方式是親自嘗試。找一個適合這類分析的數據集,並用它建立一個模型。你學到了什麼?提示:可以試試 [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) 上有趣的數據集。\n",
"\n",
"## 回顧與自學\n",
"\n",
"閱讀 [這篇來自 Stanford 的論文](https://web.stanford.edu/~jurafsky/slp3/5.pdf) 的前幾頁,了解邏輯回歸的一些實際應用。思考哪些任務更適合我們到目前為止所學的不同回歸類型。哪一種方法效果最好?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**免責聲明** \n本文件已使用人工智能翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。儘管我們致力於提供準確的翻譯,但請注意,自動翻譯可能包含錯誤或不準確之處。原始語言的文件應被視為權威來源。對於重要資訊,建議使用專業的人類翻譯。我們對因使用此翻譯而引起的任何誤解或錯誤解釋概不負責。\n"
]
}
],
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"coopTranslator": {
"original_hash": "feaf125f481a89c468fa115bf2aed580",
"translation_date": "2025-09-03T19:37:37+00:00",
"source_file": "2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb",
"language_code": "hk"
}
},
"nbformat": 4,
"nbformat_minor": 1
}