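{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_2-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "f3c335f9940cfd76528b3ef918b9b342", "translation_date": "2025-08-29T23:14:57+00:00", "source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb", "language_code": "mo" } }, "cells": [ { "cell_type": "markdown", "source": [
"# Build a regression model: prepare and visualize data\n",
"\n",
"## **Linear Regression for Pumpkins - Lesson 2**\n",
"#### Introduction\n",
"\n",
"Now that you are set up to build machine learning models with Tidymodels and the Tidyverse, it is time to start asking questions of your data. When you work with data and apply ML solutions, knowing how to ask the right question is essential to unlocking the potential of your dataset.\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- How to prepare your data for model building.\n",
"\n",
"- How to use `ggplot2` for data visualization.\n",
"\n",
"The question you need answered determines which type of ML algorithm you will use. The quality of the answer you get back depends heavily on the nature of your data.\n",
"\n",
"Let's see this in action by working through a practical exercise.\n",
"\n",
"> A refresher: the pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n"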
],
"metadata": {
"id": "REWcIv9yX29v"
}
},
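{
"cell_type": "markdown",
"source": [
"As a small illustration of the pipe, the sketch below computes the same value twice: once as a nested call and once as a piped chain. The numbers are made up for the example, and it assumes `dplyr` is installed (it re-exports `%>%` from `magrittr`).\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load dplyr, which re-exports the pipe operator\n",
"library(dplyr)\n",
"\n",
"# Nested call: read from the inside out\n",
"sum(sqrt(c(1, 4, 9)))\n",
"\n",
"# Same computation with the pipe: take the vector, and then\n",
"# take square roots, and then add them up\n",
"c(1, 4, 9) %>% \n",
"  sqrt() %>% \n",
"  sum()"
],
"outputs": [],
"metadata": {}
},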
{
"cell_type": "markdown",
"source": [
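"## 2. Check for missing data\n",
"\n",
"One of the most common issues data scientists deal with is incomplete or missing data. R represents missing or unknown values with the special sentinel value `NA` (Not Available).\n",
"\n",
"So how would we know whether the data frame contains missing values?\n",
"\n",
"- One straightforward way is to use the base R function `anyNA()`, which returns a single logical value, `TRUE` or `FALSE`.\n"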
],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" anyNA()"
],
"outputs": [],
"metadata": {
"id": "G--DQutAYltj"
}
},
{
"cell_type": "markdown",
"source": [
"Great, there seems to be some missing data! That's a good place to start.\n",
"\n",
"- Another way is to use the function `is.na()`, which marks each missing element with a logical `TRUE`.\n"
],
"metadata": {
"id": "mU-7-SB6YokF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "W-DxDOR4YxSW"
}
},
{
"cell_type": "markdown",
"source": [
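"Okay, that got the job done, but with a data frame this large, reviewing every row and column individually would be inefficient and practically impossible 😴.\n",
"\n",
"- A more intuitive way is to calculate the number of missing values in each column:\n"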
],
"metadata": {
"id": "xUWxipKYY0o7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"outputs": [],
"metadata": {
"id": "ZRBWV6P9ZArL"
}
},
{
"cell_type": "markdown",
"source": [
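"Much better! Some data is missing, but it may not matter for the task at hand. Let's see what further analysis brings.\n",
"\n",
"> Along with its rich set of packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function.\n"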
],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n"
],
"metadata": {
"id": "i5o33MQBZWWw"
}
},
{
"cell_type": "markdown",
"source": [
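"#### dplyr::select()\n",
"\n",
"`select()` is a function in the `dplyr` package that lets you pick which columns to keep or exclude.\n",
"\n",
"To make your data frame easier to work with, use `select()` to drop the columns you don't need and keep only the ones you do.\n",
"\n",
"In this exercise, our analysis involves the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select them.\n"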
],
"metadata": {
"id": "x3VGMAGBZiUr"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(Package, `Low Price`, `High Price`, Date)\n",
"\n",
"\n",
"# Print data set\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "F_FgxQnVZnM0"
}
},
{
"cell_type": "markdown",
"source": [
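"#### dplyr::mutate()\n",
"\n",
"`mutate()` is a function in the `dplyr` package that creates or modifies columns while keeping the existing ones.\n",
"\n",
"The general structure of `mutate` is:\n",
"\n",
"`data %>% mutate(new_column_name = what_it_contains)`\n",
"\n",
"Let's take `mutate` for a spin on the `Date` column with the following operations:\n",
"\n",
"1. Convert the dates (currently of type character) to date objects (these are US dates, so the format is `MM/DD/YYYY`).\n",
"\n",
"2. Extract the month from each date into a new column.\n",
"\n",
"In R, the [lubridate](https://lubridate.tidyverse.org/) package makes it easier to work with date-time data. So let's use `dplyr::mutate()`, `lubridate::mdy()` and `lubridate::month()` to achieve both objectives. We can then drop the `Date` column, since it is not needed in later steps.\n"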
],
"metadata": {
"id": "2KKo0Ed9Z1VB"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load lubridate\n",
"library(lubridate)\n",
"\n",
"pumpkins <- pumpkins %>% \n",
" # Convert the Date column to a date object\n",
" mutate(Date = mdy(Date)) %>% \n",
" # Extract month from Date\n",
" mutate(Month = month(Date)) %>% \n",
" # Drop Date column\n",
" select(-Date)\n",
"\n",
"# View the first few rows\n",
"pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "5joszIVSZ6xe"
}
},
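{
"cell_type": "markdown",
"source": [
"To see what `mdy()` and `month()` do in isolation, here is a minimal sketch on a few made-up date strings. The `toy_dates` tibble and the `Parsed` column are hypothetical names used only for this illustration; they are not part of the pumpkins data.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"library(dplyr)\n",
"library(lubridate)\n",
"\n",
"# Hypothetical date strings in US MM/DD/YYYY order\n",
"toy_dates <- tibble(Date = c(\"9/24/2016\", \"10/1/2016\", \"12/31/2016\"))\n",
"\n",
"toy_dates %>% \n",
"  # Parse the character strings into Date objects\n",
"  mutate(Parsed = mdy(Date),\n",
"         # Extract the month number from each parsed date\n",
"         Month = month(Parsed))"
],
"outputs": [],
"metadata": {}
},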
{
"cell_type": "markdown",
"source": [
"Woohoo! 🤩\n",
"\n",
"Next, let's create a new column `Price`, which represents the average price of a pumpkin. We fill it with the mean of the `Low Price` and `High Price` columns.\n"
],
"metadata": {
"id": "nIgLjNMCZ-6Y"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a new column Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
"\n",
"# View the first few rows of the data\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "Zo0BsqqtaJw2"
}
},
{
"cell_type": "markdown",
"source": [
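"Yeees! 💪\n",
"\n",
"\"But wait!\", you may say after skimming through the whole data set with `View(pumpkins)`, \"There's something odd here!\" 🤔\n",
"\n",
"If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes of varying widths.\n",
"\n",
"Let's verify this:\n"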
],
"metadata": {
"id": "p77WZr-9aQAR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Verify the distinct observations in Package column\n",
"pumpkins %>% \n",
" distinct(Package)"
],
"outputs": [],
"metadata": {
"id": "XISGfh0IaUy6"
}
},
{
"cell_type": "markdown",
"source": [
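"Amazing! 👏\n",
"\n",
"Pumpkins seem to be very hard to weigh consistently, so let's filter them by keeping only the rows whose `Package` column contains the string *bushel*, and put the result in a new data frame, `new_pumpkins`.\n"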
],
"metadata": {
"id": "7sMjiVujaZxY"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::filter() and stringr::str_detect()\n",
"\n",
"[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data containing only the **rows** that satisfy your conditions, in this case pumpkins with the string *bushel* in the `Package` column.\n",
"\n",
"[`stringr::str_detect()`](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n",
"\n",
"The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations.\n"
],
"metadata": {
"id": "L8Qfcs92ageF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Retain only pumpkins with \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(Package, \"bushel\"))\n",
"\n",
"# Get the dimensions of the new data\n",
"dim(new_pumpkins)\n",
"\n",
"# View a few rows of the new data\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "hy_SGYREampd"
}
},
{
"cell_type": "markdown",
"source": [
"You can see that we have narrowed the data down to around 415 rows of pumpkins sold by the bushel. 🤩\n"
],
"metadata": {
"id": "VrDwF031avlR"
}
},
{
"cell_type": "markdown",
"source": [
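"#### dplyr::case_when()\n",
"\n",
"**But wait! There's one more thing to do**\n",
"\n",
"Did you notice that the bushel amount varies per row? You need to normalize the pricing so that it shows the price per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
"\n",
"We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the `Price` column depending on some conditions. `case_when` lets you vectorize multiple `if_else()` statements.\n"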
],
"metadata": {
"id": "mLpw2jH4a0tx"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
" slice_head(n = 30)"
],
"outputs": [],
"metadata": {
"id": "P68kLVQmbM6I"
}
},
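{
"cell_type": "markdown",
"source": [
"To check the arithmetic, here is a minimal sketch that applies the same `case_when()` rules to three made-up rows. The `toy` tibble, its prices and the `Price_per_bushel` column are hypothetical and only illustrate the conversion: 20 for a 1 1/9 bushel carton becomes 20 / (10/9) = 18 per bushel, 15 for a 1/2 bushel carton becomes 15 / 0.5 = 30 per bushel, and a plain bushel price is left unchanged.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"library(dplyr)\n",
"library(stringr)\n",
"\n",
"# Hypothetical rows for illustration only, not taken from the pumpkins data\n",
"toy <- tibble(\n",
"  Package = c(\"1 1/9 bushel cartons\", \"1/2 bushel cartons\", \"bushel baskets\"),\n",
"  Price = c(20, 15, 18)\n",
")\n",
"\n",
"# Conditions are checked in order; the first match wins,\n",
"# and TRUE acts as the fallback for whole-bushel packages\n",
"toy %>% \n",
"  mutate(Price_per_bushel = case_when(\n",
"    str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
"    str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
"    TRUE ~ Price))"
],
"outputs": [],
"metadata": {}
},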
{
"cell_type": "markdown",
"source": [
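"Now we can analyze the pricing per unit based on the bushel measurement. All this study of bushels of pumpkins, however, goes to show how `important` it is to `understand the nature of your data`!\n",
"\n",
"> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, because a bushel is a measure of volume. \"A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with a bushel-to-pound conversion, and instead price by the bushel.\n",
">\n",
"> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are much pricier than big ones, probably because many more of them fit in a bushel, given the unused space taken up by one big hollow pie pumpkin.\n"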
],
"metadata": {
"id": "pS2GNPagbSdb"
}
},
{
"cell_type": "markdown",
"source": [
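"Now lastly, for the sheer sake of adventure 💁, let's also move the `Month` column to the first position, that is, before the `Package` column.\n",
"\n",
"`dplyr::relocate()` is used to change column positions.\n"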
],
"metadata": {
"id": "qql1SowfbdnP"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Move the Month column so that it comes before Package\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"new_pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "JJ1x6kw8bixF"
}
},
{
"cell_type": "markdown",
"source": [
"Good job! 👌 You now have a clean, tidy data set on which you can build your new regression model!\n"
],
"metadata": {
"id": "y8TJ0Za_bn5Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 4. Data visualization with ggplot2\n",
"\n",
"