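{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_2-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "f3c335f9940cfd76528b3ef918b9b342", "translation_date": "2025-08-29T23:14:57+00:00", "source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb", "language_code": "mo" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Build a regression model: prepare and visualize data\n", "\n", "## **Linear Regression for Pumpkins - Lesson 2**\n", "#### Introduction\n", "\n", "Now that you are set up to start building machine learning models with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it is very important to know how to ask the right question so that you can unlock the full potential of your dataset.\n", "\n", "In this lesson, you will learn:\n", "\n", "- How to prepare your data for model building.\n", "\n", "- How to use `ggplot2` for data visualization.\n", "\n", "The question you need answered will determine which type of ML algorithm you use. And the quality of the answer you get back will depend heavily on the nature of your data.\n", "\n", "Let's see this by working through a practical exercise.\n", "\n", "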

\n", " \n", "

Artwork by @allison_horst
\n", "\n", "\n", "\n" ], "metadata": { "id": "Pg5aexcOPqAZ" } }, { "cell_type": "markdown", "source": [ "## 1. Importing pumpkins data and summoning the Tidyverse\n", "\n", "We'll need the following packages to slice and dice the data in this lesson:\n", "\n", "- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n", "\n", "You can install it with:\n", "\n", "`install.packages(c(\"tidyverse\"))`\n", "\n", "The script below checks whether you have the packages required to complete this module and installs any that are missing.\n" ], "metadata": { "id": "dc5WhyVdXAjR" } }, { "cell_type": "code", "execution_count": null, "source": [ "suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n", "pacman::p_load(tidyverse)" ], "outputs": [], "metadata": { "id": "GqPYUZgfXOBt" } }, { "cell_type": "markdown", "source": [ "Now, let's fire up some packages and load the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!\n" ], "metadata": { "id": "kvjDTPDSXRr2" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Load the core Tidyverse packages\n", "library(tidyverse)\n", "\n", "# Import the pumpkins data\n", "pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n", "\n", "\n", "# Get a glimpse and dimensions of the data\n", "glimpse(pumpkins)\n", "\n", "\n", "# Print the first 50 rows of the data set\n", "pumpkins %>% \n", "  slice_head(n = 50)" ], "outputs": [], "metadata": { "id": "VMri-t2zXqgD" } }, { "cell_type": "markdown", "source": [ "A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). `Date` is of type character, and there is also a strange column called `Package` whose values are a mix of `sacks`, `bins` and other entries. The data, in short, is a bit of a mess 😤.\n", "\n", "It is rare to be handed a dataset that is completely ready for building an ML model out of the box. But worry not: in this lesson, you will learn how to prepare a raw dataset using standard R libraries 🧑‍🔧. You will also learn various techniques for visualizing the data.📈📊\n", "
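\n", "\n", "> A refresher: the pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n" ], "metadata": { "id": "REWcIv9yX29v" } }, { "cell_type": "markdown", "source": [ "## 2. Check for missing data\n", "\n", "One of the most common issues data scientists deal with is incomplete or missing data. R represents missing or unknown values with a special sentinel value: `NA` (Not Available).\n", "\n", "So how would we know whether the data frame contains missing values?\n", "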
\n", "- One straightforward way is to use the base R function `anyNA()`, which returns a single logical value, `TRUE` or `FALSE`.\n" ], "metadata": { "id": "Zxfb3AM5YbUe" } }, { "cell_type": "code", "execution_count": null, "source": [ "pumpkins %>% \n", "  anyNA()" ], "outputs": [], "metadata": { "id": "G--DQutAYltj" } }, { "cell_type": "markdown", "source": [ "Great, there seems to be some missing data! That's a good place to start.\n", "\n", "- Another way is to use the function `is.na()`, which marks each missing element with a logical `TRUE`.\n" ], "metadata": { "id": "mU-7-SB6YokF" } }, { "cell_type": "code", "execution_count": null, "source": [ "pumpkins %>% \n", "  is.na() %>% \n", "  head(n = 7)" ], "outputs": [], "metadata": { "id": "W-DxDOR4YxSW" } }, { "cell_type": "markdown", "source": [ "Okay, that gets the job done, but with a data frame this large, reviewing every row and column individually would be inefficient and practically impossible 😴.\n", "\n", "- A more intuitive way is to count the missing values in each column:\n" ], "metadata": { "id": "xUWxipKYY0o7" } }, { "cell_type": "code", "execution_count": null, "source": [ "pumpkins %>% \n", "  is.na() %>% \n", "  colSums()" ], "outputs": [], "metadata": { "id": "ZRBWV6P9ZArL" } }, { "cell_type": "markdown", "source": [ "Much better! Some data is missing, but it may not matter for the task at hand. Let's see what further analysis brings forth.\n", "\n", "> Along with its rich set of packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function.\n" ], "metadata": { "id": "9gv-crB6ZD1Y" } }, { "cell_type": "markdown", "source": [ "## 3. Dplyr: A Grammar of Data Manipulation\n", "\n", "

\n", " \n", "

Artwork by @allison_horst
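\n", "\n", "\n", "\n" ], "metadata": { "id": "o4jLY5-VZO2C" } }, { "cell_type": "markdown", "source": [ "[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs to help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!\n", "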
\n" ], "metadata": { "id": "i5o33MQBZWWw" } }, { "cell_type": "markdown", "source": [ "#### dplyr::select()\n", "\n", "`select()` is a function in the `dplyr` package that helps you pick which columns to keep or exclude.\n", "\n", "To make your data frame easier to work with, use `select()` to drop several of its columns and keep only the ones you need.\n", "\n", "For instance, in this exercise, our analysis will involve the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns.\n" ], "metadata": { "id": "x3VGMAGBZiUr" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Select desired columns\n", "pumpkins <- pumpkins %>% \n", "  select(Package, `Low Price`, `High Price`, Date)\n", "\n", "\n", "# Print data set\n", "pumpkins %>% \n", "  slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "F_FgxQnVZnM0" } }, { "cell_type": "markdown", "source": [ "#### dplyr::mutate()\n", "\n", "`mutate()` is a function in the `dplyr` package that creates or modifies columns while keeping the existing ones.\n", "\n", "The general structure of `mutate` is:\n", "\n", "`data %>% mutate(new_column_name = what_it_contains)`\n", "\n", "Let's take `mutate` out for a spin using the `Date` column by doing the following operations:\n", "\n", "1. Convert the dates (currently of type character) to date objects (these are US dates, so the format is `MM/DD/YYYY`).\n", "\n", "2. Extract the month from the dates into a new column.\n", "\n", "In R, the [lubridate](https://lubridate.tidyverse.org/) package makes it easier to work with date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()` and `lubridate::month()` to achieve the objectives above. We can then drop the Date column, since we won't need it in subsequent operations.\n" ], "metadata": { "id": "2KKo0Ed9Z1VB" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Load lubridate\n", "library(lubridate)\n", "\n", "pumpkins <- pumpkins %>% \n", "  # Convert the Date column to a date object\n", "  mutate(Date = mdy(Date)) %>% \n", "  # Extract month from Date\n", "  mutate(Month = month(Date)) %>% \n", "  # Drop Date column\n", "  select(-Date)\n", "\n", "# View the first few rows\n", "pumpkins %>% \n", "  slice_head(n = 7)" ], "outputs": [], "metadata": { "id": "5joszIVSZ6xe" } }, { "cell_type": "markdown", "source": [ "Woohoo! 🤩\n", "\n", "Next, let's create a new column `Price`, which represents the average price of a pumpkin. To populate it, we take the average of the `Low Price` and `High Price` columns.\n" ], "metadata": { "id": "nIgLjNMCZ-6Y" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Create a new column Price\n", "pumpkins <- pumpkins %>% \n", "  mutate(Price = (`Low Price` + `High Price`)/2)\n", "\n", "# View the first few rows of the data\n", "pumpkins %>% \n", "  slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "Zo0BsqqtaJw2" } }, { "cell_type": "markdown", "source": [ "Yeees!💪\n", "\n", "\"But wait!\", you'll say after skimming through the whole data set with `View(pumpkins)`, \"There's something odd here!\"🤔\n", "\n", "If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes of varying widths.\n", "\n", "Let's verify this:\n" ], "metadata": { "id": "p77WZr-9aQAR" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Verify the distinct observations in Package column\n", "pumpkins %>% \n", "  distinct(Package)" ], "outputs": [], "metadata": { "id": "XISGfh0IaUy6" } }, { "cell_type": "markdown", "source": [ "Amazing!👏\n", "\n", "Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string *bushel* in the `Package` column and put the result in a new data frame, `new_pumpkins`.\n", "
\n" ], "metadata": { "id": "7sMjiVujaZxY" } }, { "cell_type": "markdown", "source": [ "#### dplyr::filter() and stringr::str_detect()\n", "\n", "[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data containing only the **rows** that satisfy your conditions, in this case pumpkins with the string *bushel* in the `Package` column.\n", "\n", "[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n", "\n", "The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations.\n" ], "metadata": { "id": "L8Qfcs92ageF" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Retain only pumpkins with \"bushel\"\n", "new_pumpkins <- pumpkins %>% \n", "  filter(str_detect(Package, \"bushel\"))\n", "\n", "# Get the dimensions of the new data\n", "dim(new_pumpkins)\n", "\n", "# View a few rows of the new data\n", "new_pumpkins %>% \n", "  slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "hy_SGYREampd" } }, { "cell_type": "markdown", "source": [ "You can see that we have narrowed the data down to 415 or so rows containing pumpkins sold by the bushel.🤩\n", "
\n" ], "metadata": { "id": "VrDwF031avlR" } }, { "cell_type": "markdown", "source": [ "#### dplyr::case_when()\n", "\n", "**But wait! There's one more thing to do**\n", "\n", "Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the price per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n", "\n", "We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` allows you to vectorise multiple `if_else()` statements.\n" ], "metadata": { "id": "mLpw2jH4a0tx" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Convert the price if the Package contains fractional bushel values\n", "new_pumpkins <- new_pumpkins %>% \n", "  mutate(Price = case_when(\n", "    str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n", "    str_detect(Package, \"1/2\") ~ Price/(1/2),\n", "    TRUE ~ Price))\n", "\n", "# View the first 30 rows of the data\n", "new_pumpkins %>% \n", "  slice_head(n = 30)" ], "outputs": [], "metadata": { "id": "P68kLVQmbM6I" } }, { "cell_type": "markdown", "source": [ "Now we can analyze the pricing per unit based on the bushel measurement. All this study of bushels of pumpkins, however, goes to show how very `important` it is to `understand the nature of your data`!\n", "\n", "> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it is a volume measurement. \"A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with a bushel-to-pound conversion, and instead price by the bushel.\n", ">\n", "> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken up by one big hollow pie pumpkin.\n" ], "metadata": { "id": "pS2GNPagbSdb" } }, { "cell_type": "markdown", "source": [ "Now lastly, for the sheer sake of adventure 💁‍♀️, let's also move the `Month` column to the first position, i.e. before the `Package` column.\n", "\n", "`dplyr::relocate()` is used to change column positions.\n" ], "metadata": { "id": "qql1SowfbdnP" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Move the Month column so that it comes before Package\n", "new_pumpkins <- new_pumpkins %>% \n", "  relocate(Month, .before = Package)\n", "\n", "new_pumpkins %>% \n", "  slice_head(n = 7)" ], "outputs": [], "metadata": { "id": "JJ1x6kw8bixF" } }, { "cell_type": "markdown", "source": [ "Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!\n", "
\n" ], "metadata": { "id": "y8TJ0Za_bn5Y" } }, { "cell_type": "markdown", "source": [ "## 4. Data visualization with ggplot2\n", "\n", "

\n", " \n", "

Infographic by Dasani Madipalli
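\n", "\n", "\n", "\n", "\n", "There is a *wise* saying that goes like this:\n", "\n", "> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n", "\n", "Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, such as scatter plots, line charts and bar charts, that show different aspects of the data. In this way, they can visually reveal relationships and gaps that are otherwise hard to uncover.\n", "\n", "Visualizations can also help determine the machine learning technique most appropriate for the data. A scatter plot that seems to follow a line, for example, suggests that the data is a good candidate for linear regression.\n", "\n", "R offers several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\n", "\n", "Let's start with a simple scatter plot of the Price and Month columns.\n", "\n", "In this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and an aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), then add layers (such as [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html) for scatter plots).\n" ], "metadata": { "id": "mYSH6-EtbvNa" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Set a theme for the plots\n", "theme_set(theme_light())\n", "\n", "# Create a scatter plot\n", "p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n", "p + geom_point()" ], "outputs": [], "metadata": { "id": "g2YjnGeOcLo4" } }, { "cell_type": "markdown", "source": [ "Is this a useful plot 🤷? Does anything about it surprise you?\n", "\n", "It's not particularly useful, as all it does is display your data as a spread of points in a given month.\n" ], "metadata": { "id": "Ml7SDCLQcPvE" } }, { "cell_type": "markdown", "source": [ "### **How do we make it useful?**\n", "\n", "To get charts to display useful data, you usually need to group the data somehow. In our case, finding the average price of pumpkins for each month would give more insight into the underlying patterns in the data. This leads us to one more **dplyr** flyby:\n", "\n", "#### `dplyr::group_by() %>% summarize()`\n", "\n", "Grouped aggregation in R can be easily computed using:\n", "\n", "`dplyr::group_by() %>% summarize()`\n", "\n", "- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups, such as per month.\n", "\n", "- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each summary statistic that you specify.\n", "\n", "For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins by the **Month** column and then find the **mean price** for each month.\n" ], "metadata": { "id": "jMakvJZIcVkh" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Find the average price of pumpkins per month\r\n", "new_pumpkins %>%\r\n", " group_by(Month) %>% \r\n", " 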
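summarise(mean_price = mean(Price))" ], "outputs": [], "metadata": { "id": "6kVSUa2Bcilf" } }, { "cell_type": "markdown", "source": [ "Succinct!✨\n", "\n", "Categorical features such as months are better represented with a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n", "\n", "Let's whip one up!\n" ], "metadata": { "id": "Kds48GUBcj3W" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Find the average price of pumpkins per month then plot a bar chart\r\n", "new_pumpkins %>%\r\n", " group_by(Month) %>% \r\n", " summarise(mean_price = mean(Price)) %>% \r\n", " ggplot(aes(x = Month, y = mean_price)) +\r\n", " geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n", " ylab(\"Pumpkin Price\")" ], "outputs": [], "metadata": { "id": "VNbU1S3BcrxO" } }, { "cell_type": "markdown", "source": [ "🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?\n", "\n", "Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!\n" ], "metadata": { "id": "zDm0VOzzcuzR" } }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n" ] } ] }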