{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 建立邏輯迴歸模型 - 第四課\n", "\n", "![邏輯迴歸與線性迴歸資訊圖表](../../../../../../translated_images/linear-vs-logistic.ba180bf95e7ee66721ba10ebf2dac2666acbd64a88b003c83928712433a13c7d.mo.png)\n", "\n", "#### **[課前測驗](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n", "\n", "#### 介紹\n", "\n", "在這最後一課關於迴歸的內容中,我們將探討邏輯迴歸,這是基本的*經典*機器學習技術之一。你可以使用這種技術來發現模式並預測二元分類。例如:這顆糖果是巧克力還是不是?這種疾病是否具有傳染性?這位顧客是否會選擇這個產品?\n", "\n", "在這一課中,你將學到:\n", "\n", "- 邏輯迴歸的技術\n", "\n", "✅ 在這個 [學習模組](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott) 中深入了解如何使用這種類型的迴歸\n", "\n", "## 前置條件\n", "\n", "在使用南瓜數據後,我們已經足夠熟悉它,並意識到有一個二元分類可以使用:`Color`。\n", "\n", "讓我們建立一個邏輯迴歸模型,根據一些變數來預測*某個南瓜可能的顏色*(橙色 🎃 或白色 👻)。\n", "\n", "> 為什麼我們在迴歸課程中討論二元分類?僅僅是出於語言上的方便,因為邏輯迴歸[實際上是一種分類方法](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression),儘管它是基於線性的。在下一組課程中,了解其他分類數據的方法。\n", "\n", "在這一課中,我們需要以下套件:\n", "\n", "- `tidyverse`: [tidyverse](https://www.tidyverse.org/) 是一個 [R 套件集合](https://www.tidyverse.org/packages),旨在讓數據科學更快速、更簡單、更有趣!\n", "\n", "- `tidymodels`: [tidymodels](https://www.tidymodels.org/) 框架是一個 [套件集合](https://www.tidymodels.org/packages),用於建模和機器學習。\n", "\n", "- `janitor`: [janitor 套件](https://github.com/sfirke/janitor) 提供了一些簡單的工具,用於檢查和清理髒數據。\n", "\n", "- `ggbeeswarm`: [ggbeeswarm 套件](https://github.com/eclarke/ggbeeswarm) 提供了使用 ggplot2 創建蜜蜂群圖的方式。\n", "\n", "你可以通過以下方式安裝它們:\n", "\n", "`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n", "\n", "或者,下面的腳本會檢查你是否已安裝完成此模組所需的套件,並在缺少時為你安裝。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n", "\n", "pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **定義問題**\n", "\n", "在我們的情境中,我們將問題表述為二元分類:「白色」或「非白色」。在我們的數據集中還有一個「條紋」類別,但由於該類別的樣本數量很少,我們不會使用它。事實上,一旦我們從數據集中移除空值,這個類別就會消失。\n", "\n", "> 🎃 有趣的小知識,我們有時會稱白色南瓜為「幽靈南瓜」。它們不太容易雕刻,因此不像橙色南瓜那麼受歡迎,但它們看起來很酷!所以我們也可以將問題重新表述為:「幽靈」或「非幽靈」。👻\n", "\n", "## **關於邏輯回歸**\n", "\n", "邏輯回歸與之前學過的線性回歸有幾個重要的不同之處。\n", "\n", "#### **二元分類**\n", "\n", "邏輯回歸不提供與線性回歸相同的功能。前者提供對「二元類別」(例如「橙色或非橙色」)的預測,而後者則能夠預測「連續值」,例如根據南瓜的來源和收穫時間,*價格會上漲多少*。\n", "\n", "![Dasani Madipalli 的資訊圖表](../../../../../../translated_images/pumpkin-classifier.562771f104ad5436b87d1c67bca02a42a17841133556559325c0a0e348e5b774.mo.png)\n", "\n", "### 其他分類方式\n", "\n", "邏輯回歸還有其他類型,包括多項式和序列型:\n", "\n", "- **多項式**,涉及多於一個類別,例如「橙色、白色和條紋」。\n", "\n", "- **序列型**,涉及有序的類別,適合我們希望按邏輯順序排列結果的情境,例如南瓜按有限的尺寸(迷你、小、中、大、特大、超大)排序。\n", "\n", "![多項式 vs 序列型回歸](../../../../../../translated_images/multinomial-vs-ordinal.36701b4850e37d86c9dd49f7bef93a2f94dbdb8fe03443eb68f0542f97f28f29.mo.png)\n", "\n", "#### **變數不需要相關**\n", "\n", "還記得線性回歸在變數相關性更高時效果更好嗎?邏輯回歸正好相反——變數不需要相關性。這對於我們的數據很有幫助,因為它的相關性相對較弱。\n", "\n", "#### **需要大量乾淨的數據**\n", "\n", "如果使用更多數據,邏輯回歸會給出更準確的結果;我們的小型數據集並不是完成此任務的最佳選擇,因此請記住這一點。\n", "\n", "✅ 思考哪些類型的數據適合邏輯回歸\n", "\n", "## 練習 - 整理數據\n", "\n", "首先,稍微清理一下數據,刪除空值並選擇部分欄位:\n", "\n", "1. 
添加以下程式碼:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Load the core tidyverse packages\n", "library(tidyverse)\n", "\n", "# Import the data and clean column names\n", "pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n", " clean_names()\n", "\n", "# Select desired columns\n", "pumpkins_select <- pumpkins %>% \n", " select(c(city_name, package, variety, origin, item_size, color)) \n", "\n", "# Drop rows containing missing values and encode color as factor (category)\n", "pumpkins_select <- pumpkins_select %>% \n", " drop_na() %>% \n", " mutate(color = factor(color))\n", "\n", "# View the first few rows\n", "pumpkins_select %>% \n", " slice_head(n = 5)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "您可以隨時使用 [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) 函數來快速查看您的新資料框,如下所示:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "pumpkins_select %>% \n", " glimpse()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "讓我們確認一下,我們確實是在進行一個二元分類問題:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Subset distinct observations in outcome column\n", "pumpkins_select %>% \n", " distinct(color)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 視覺化 - 類別圖表\n", "\n", "到目前為止,你已經再次載入了南瓜數據,並清理了數據以保留包含一些變數(例如顏色)的數據集。現在,讓我們使用 ggplot 函式庫在筆記本中視覺化這個資料框。\n", "\n", "ggplot 函式庫提供了一些很棒的方法來視覺化你的數據。例如,你可以在類別圖表中比較每種品種和顏色的數據分佈。\n", "\n", "1. 使用 `geom_bar` 函數,利用我們的南瓜數據,並為每個南瓜類別(橙色或白色)指定顏色映射,來創建這樣的圖表:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Specify colors for each value of the hue variable\n", "palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n", "\n", "# Create the bar plot\n", "ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n", " geom_bar(position = \"dodge\") +\n", " scale_fill_manual(values = palette) +\n", " labs(y = \"Variety\", fill = \"Color\") +\n", " theme_minimal()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "透過觀察這些數據,可以看出顏色數據與品種之間的關聯。\n", "\n", "✅ 根據這個分類圖,你能想到哪些有趣的探索方向?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 資料預處理:特徵編碼\n", "\n", "我們的南瓜資料集中的所有欄位都包含字串值。對人類而言,處理分類資料是直觀的,但對機器而言則不然。機器學習演算法更擅長處理數值資料。因此,編碼是資料預處理階段中非常重要的一步,因為它能將分類資料轉換為數值資料,同時不丟失任何資訊。良好的編碼能夠幫助建立良好的模型。\n", "\n", "在特徵編碼中,主要有兩種編碼器:\n", "\n", "1. **序數編碼器**:適合用於順序變數,這些變數是具有邏輯順序的分類變數,例如我們資料集中的 `item_size` 欄位。它會建立一個映射,將每個類別用一個數字表示,這個數字代表該類別在欄位中的順序。\n", "\n", 
"2. **分類編碼器**:適合用於名義變數,這些變數是沒有邏輯順序的分類變數,例如我們資料集中除了 `item_size` 以外的所有特徵。這是一種獨熱編碼方式,意味著每個類別都用一個二元欄位表示:如果南瓜屬於該品種,編碼變數的值為 1,否則為 0。\n", "\n", "Tidymodels 提供了一個非常方便的套件:[recipes](https://recipes.tidymodels.org/)——一個用於資料預處理的套件。我們將定義一個 `recipe`,指定所有的預測欄位應該被編碼為一組整數,使用 `prep` 來估算任何操作所需的數量和統計數據,最後使用 `bake` 將計算結果應用到新資料上。\n", "\n", "> 通常,recipes 用作建模的預處理工具,它定義了應該對資料集應用哪些步驟以使其準備好進行建模。在這種情況下,**強烈建議**使用 `workflow()`,而不是手動使用 prep 和 bake 來估算 recipe。我們稍後會看到這些內容。\n", ">\n", "> 不過,目前我們使用 recipes + prep + bake 來指定應該對資料集應用哪些步驟,以使其準備好進行資料分析,然後提取已應用步驟的預處理資料。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Preprocess and extract data to allow some data analysis\n", "baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n", " # Define ordering for item_size column\n", " step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n", " # Convert factors to numbers using the order defined above (Ordinal encoding)\n", " step_integer(item_size, zero_based = F) %>%\n", " # Encode all other predictors using one hot encoding\n", " step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n", " prep() %>%\n", " bake(new_data = NULL)\n", "\n", "# Display the first few rows of preprocessed data\n", "baked_pumpkins %>% \n", " slice_head(n = 5)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "✅ 使用序數編碼器處理 Item Size 欄位有什麼優勢?\n", "\n", "### 分析變數之間的關係\n", "\n", "現在我們已經完成了數據的預處理,可以分析特徵與標籤之間的關係,以了解模型在給定特徵的情況下預測標籤的能力。進行這類分析的最佳方法是繪製數據圖表。\n", "\n", "我們將再次使用 ggplot 的 `geom_boxplot` 函數,以分類圖的形式可視化 Item Size、Variety 和 Color 之間的關係。為了更好地繪製數據,我們將使用已編碼的 Item Size 欄位以及未編碼的 Variety 欄位。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Define the color palette\n", "palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n", "\n", "# We need the encoded Item Size column to use it as the x-axis values in the plot\n", "pumpkins_select_plot <- pumpkins_select\n", "pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n", "\n", "# Create the grouped box plot\n", "ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +\n", " geom_boxplot() +\n", " facet_grid(variety ~ ., scales = \"free_x\") +\n", " scale_fill_manual(values = palette) +\n", " labs(x = \"Item Size\", y = \"\") +\n", " theme_minimal() +\n", " theme(strip.text = element_text(size = 12)) +\n", " theme(axis.text.x = element_text(size = 10)) +\n", " theme(axis.title.x = element_text(size = 12)) +\n", " theme(axis.title.y = element_blank()) +\n", " theme(legend.position = \"bottom\") +\n", " guides(fill = guide_legend(title = \"Color\")) +\n", " theme(panel.spacing = unit(0.5, \"lines\")) +\n", " theme(strip.text.y = element_text(size = 4, hjust = 0)) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 使用群集圖\n", "\n", "由於 Color 是一個二元分類(白色或非白色),因此需要「一種[專門的方法](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)來進行視覺化」。\n", "\n", "嘗試使用 `群集圖` 來展示 item_size 與顏色分佈的關係。\n", "\n", "我們將使用 [ggbeeswarm 套件](https://github.com/eclarke/ggbeeswarm),該套件提供使用 ggplot2 創建蜜蜂群集風格圖的方法。蜜蜂群集圖是一種將原本會重疊的點排列在彼此旁邊的繪圖方式。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Create beeswarm plots of color and item_size\n", "baked_pumpkins %>% \n", " mutate(color = factor(color)) %>% \n", " ggplot(mapping = aes(x = color, y 
= item_size, color = color)) +\n", " geom_quasirandom() +\n", " scale_color_brewer(palette = \"Dark2\", direction = -1) +\n", " theme(legend.position = \"none\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "現在我們已經了解顏色的二元分類與更大尺寸組別之間的關係,接下來讓我們探討如何使用邏輯迴歸來判斷某個南瓜的可能顏色。\n", "\n", "## 建立您的模型\n", "\n", "選擇您想用於分類模型的變數,並將數據分為訓練集和測試集。[rsample](https://rsample.tidymodels.org/) 是 Tidymodels 中的一個套件,提供了高效的數據分割和重抽樣的基礎設施:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Split data into 80% for training and 20% for testing\n", "set.seed(2056)\n", "pumpkins_split <- pumpkins_select %>% \n", " initial_split(prop = 0.8)\n", "\n", "# Extract the data in each split\n", "pumpkins_train <- training(pumpkins_split)\n", "pumpkins_test <- testing(pumpkins_split)\n", "\n", "# Print out the first 5 rows of the training set\n", "pumpkins_train %>% \n", " slice_head(n = 5)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "🙌 我們現在準備透過將訓練特徵與訓練標籤(顏色)進行擬合來訓練模型。\n", "\n", "我們將從建立一個食譜開始,該食譜會指定對數據進行預處理的步驟,以便為建模做好準備。例如:將類別型變數編碼為一組整數。就像 `baked_pumpkins` 一樣,我們會建立一個 `pumpkins_recipe`,但不會執行 `prep` 和 `bake`,因為這些步驟會被打包到一個工作流程中,稍後幾個步驟你就會看到。\n", "\n", "在 Tidymodels 中,有很多種方法可以指定邏輯回歸模型。請參考 `?logistic_reg()`。目前,我們將透過預設的 `stats::glm()` 引擎來指定一個邏輯回歸模型。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Create a recipe that specifies preprocessing steps for modelling\n", "pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n", " step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n", " step_integer(item_size, zero_based = F) %>% \n", " step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n", "\n", "# Create a logistic model specification\n", "log_reg <- logistic_reg() %>% \n", " set_engine(\"glm\") %>% \n", " set_mode(\"classification\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "現在我們已經有了一個配方和模型規範,我們需要找到一種方法將它們整合成一個物件,該物件首先會對資料進行預處理(在幕後進行準備和烘焙),然後在預處理後的資料上擬合模型,並且還能支持潛在的後處理活動。\n", "\n", "在 Tidymodels 中,這個方便的物件被稱為 [`workflow`](https://workflows.tidymodels.org/),它能輕鬆地保存你的建模組件。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Bundle modelling components in a workflow\n", "log_reg_wf <- workflow() %>% \n", " add_recipe(pumpkins_recipe) %>% \n", " add_model(log_reg)\n", "\n", "# Print out the workflow\n", "log_reg_wf\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在*指定*工作流程後,可以使用[`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html)函數來`訓練`模型。工作流程會在訓練之前估算配方並預處理數據,因此我們不需要手動使用 prep 和 bake 來完成這些步驟。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Train the model\n", "wf_fit <- log_reg_wf %>% \n", " fit(data = pumpkins_train)\n", "\n", "# Print the trained workflow\n", "wf_fit\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "模型訓練完成後,會顯示訓練過程中學到的係數。\n", "\n", "現在我們已經使用訓練數據訓練了模型,可以利用 [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html) 對測試數據進行預測。我們先用模型來預測測試集的標籤以及每個標籤的概率。當概率大於 0.5 時,預測類別為 `WHITE`,否則為 `ORANGE`。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Make predictions for color and corresponding probabilities\n", "results 
<- pumpkins_test %>% select(color) %>% \n", " bind_cols(wf_fit %>% \n", " predict(new_data = pumpkins_test)) %>%\n", " bind_cols(wf_fit %>%\n", " predict(new_data = pumpkins_test, type = \"prob\"))\n", "\n", "# Compare predictions\n", "results %>% \n", " slice_head(n = 10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "非常棒!這提供了一些關於邏輯迴歸運作方式的深入見解。\n", "\n", "### 透過混淆矩陣更好地理解\n", "\n", "將每個預測值與其對應的「真實值」進行比較並不是評估模型預測效果的高效方法。幸運的是,Tidymodels 還有一些其他的技巧:[`yardstick`](https://yardstick.tidymodels.org/)——一個用於透過性能指標衡量模型效果的套件。\n", "\n", "與分類問題相關的一個性能指標是[`混淆矩陣`](https://wikipedia.org/wiki/Confusion_matrix)。混淆矩陣描述了分類模型的表現如何。混淆矩陣統計了模型正確分類每個類別的例子數量。在我們的例子中,它會顯示有多少橙色南瓜被正確分類為橙色,以及有多少白色南瓜被正確分類為白色;混淆矩陣還會顯示有多少被分類到了**錯誤**的類別。\n", "\n", "來自 yardstick 的 [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) 函數可以計算觀察值和預測值類別的交叉統計表。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Confusion matrix for prediction results\n", "conf_mat(data = results, truth = color, estimate = .pred_class)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "讓我們來解讀混淆矩陣。我們的模型需要將南瓜分類為兩個二元類別:`白色` 和 `非白色`。\n", "\n", "- 如果模型預測南瓜是白色,且實際上屬於「白色」類別,我們稱之為 `真正例`,顯示在左上角的數字。\n", "\n", "- 如果模型預測南瓜是非白色,但實際上屬於「白色」類別,我們稱之為 `假負例`,顯示在左下角的數字。\n", "\n", "- 如果模型預測南瓜是白色,但實際上屬於「非白色」類別,我們稱之為 `假正例`,顯示在右上角的數字。\n", "\n", "- 如果模型預測南瓜是非白色,且實際上屬於「非白色」類別,我們稱之為 `真負例`,顯示在右下角的數字。\n", "\n", "| | 真實值:白色 | 真實值:橙色 |\n", "|------------------|:------------:|:------------:|\n", "| **預測值:白色** | TP | FP |\n", "| **預測值:橙色** | FN | TN |\n", "\n", "正如你可能猜到的,我們希望有更多的真正例和真負例,以及更少的假正例和假負例,這意味著模型的表現更佳。\n", "\n", "混淆矩陣非常有用,因為它衍生出其他指標,幫助我們更好地評估分類模型的性能。讓我們來看看其中一些指標:\n", "\n", "🎓 精確率(Precision):`TP/(TP + FP)`,定義為被預測為正例中實際為正例的比例。也稱為[陽性預測值](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\")。\n", "\n", "🎓 召回率(Recall):`TP/(TP + FN)`,定義為實際為正例的樣本中被正確預測為正例的比例。也稱為 `敏感性`。\n", "\n", "🎓 特異性(Specificity):`TN/(TN + FP)`,定義為實際為負例的樣本中被正確預測為負例的比例。\n", "\n", "🎓 準確率(Accuracy):`(TP + TN)/(TP + TN + FP + FN)`,即樣本中被正確預測的標籤所佔的百分比。\n", "\n", "🎓 F 值(F Measure):精確率和召回率的加權調和平均數,最佳值為 1,最差值為 0。\n", "\n", "讓我們來計算這些指標吧!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Combine metric functions and calculate them all at once\n", "eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n", "eval_metrics(data = results, truth = color, estimate = .pred_class)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 視覺化此模型的 ROC 曲線\n", "\n", "讓我們再進行一次視覺化,來看看所謂的 [`ROC 曲線`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Make a roc_curve\n", "results %>% \n", " roc_curve(color, .pred_ORANGE) %>% \n", " autoplot()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ROC 曲線通常用來檢視分類器的輸出,分析其真陽性與假陽性之間的關係。ROC 曲線通常在 Y 軸上顯示「真陽性率」/敏感度,在 X 軸上顯示「假陽性率」/1-特異性。因此,曲線的陡峭程度以及曲線與中線之間的距離非常重要:理想的曲線應該快速向上並超越中線。在我們的例子中,起初會有一些假陽性,接著曲線正確地向上並超越中線。\n", "\n", "最後,我們可以使用 `yardstick::roc_auc()` 來計算實際的曲線下面積(AUC)。AUC 的一種解釋方式是,模型將隨機正例排序在隨機負例之上的概率。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Calculate area under curve\n", "results %>% \n", " roc_auc(color, .pred_ORANGE)\n" ] },
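{ "cell_type": "markdown", "metadata": {}, "source": [ "> 補充(非原課程內容的小示範):為了讓上面關於 AUC 的機率解釋更具體,下面的程式碼假設 `results` 中已有前面步驟產生的 `color` 與 `.pred_ORANGE` 欄位,透過逐一比較每一對(橙色、白色)樣本的預測機率來近似 AUC;平均得分應與 `roc_auc()` 的結果相近。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# A rough sketch (not part of the original lesson): estimate AUC by pairwise comparison\n", "# Assumes `results` holds the truth column `color` and the probability column `.pred_ORANGE`\n", "orange_probs <- results %>% filter(color == \"ORANGE\") %>% pull(.pred_ORANGE)\n", "white_probs <- results %>% filter(color == \"WHITE\") %>% pull(.pred_ORANGE)\n", "\n", "# Score each (orange, white) pair: 1 if the orange pumpkin gets the higher probability, 0.5 for ties\n", "pair_scores <- outer(orange_probs, white_probs, FUN = function(o, w) (o > w) + 0.5 * (o == w))\n", "\n", "# The mean pairwise score approximates the area under the ROC curve\n", "mean(pair_scores)\n" ] },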
"source": [ "結果約為 `0.975`。由於 AUC 的範圍是從 0 到 1,你會希望分數越高越好,因為一個能夠 100% 正確預測的模型會有 AUC 值為 1;在這個例子中,這個模型表現*相當不錯*。\n", "\n", "在未來的分類課程中,你將學習如何提升模型的分數(例如在這種情況下處理不平衡數據)。\n", "\n", "## 🚀挑戰\n", "\n", "關於邏輯迴歸還有很多值得深入探討的地方!但學習的最佳方式是親自實驗。找一個適合這類分析的數據集,並用它建立一個模型。你學到了什麼?提示:可以試試 [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) 上有趣的數據集。\n", "\n", "## 回顧與自學\n", "\n", "閱讀 [這篇來自 Stanford 的論文](https://web.stanford.edu/~jurafsky/slp3/5.pdf) 的前幾頁,了解邏輯迴歸的一些實際應用。思考哪些任務更適合我們到目前為止學過的不同迴歸類型。哪一種方法會是最佳選擇?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**免責聲明**: \n本文件已使用 AI 翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。雖然我們致力於提供準確的翻譯,但請注意,自動翻譯可能包含錯誤或不準確之處。原始文件的母語版本應被視為權威來源。對於關鍵信息,建議尋求專業人工翻譯。我們對因使用此翻譯而引起的任何誤解或誤釋不承擔責任。\n" ] } ], "metadata": { "anaconda-cloud": "", "kernelspec": { "display_name": "R", "langauge": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.1" }, "coopTranslator": { "original_hash": "feaf125f481a89c468fa115bf2aed580", "translation_date": "2025-08-29T23:02:49+00:00", "source_file": "2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb", "language_code": "mo" } }, "nbformat": 4, "nbformat_minor": 1 }