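{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_2-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "f3c335f9940cfd76528b3ef918b9b342", "translation_date": "2025-09-03T19:52:28+00:00", "source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb", "language_code": "tw" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Build a regression model: prepare and visualize data\n", "\n", "## **Linear Regression for Pumpkins - Lesson 2**\n", "#### Introduction\n", "\n", "Now that you are set up to build machine learning models with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. When you work with data and apply ML solutions, it is very important to know how to ask the right question so that you can unlock the full potential of your dataset.\n", "\n", "In this lesson, you will learn:\n", "\n", "- How to prepare your data for model building.\n", "\n", "- How to use `ggplot2` for data visualization.\n", "\n", "The question you need answered determines which type of ML algorithm you will use. The quality of the answer you get back depends heavily on the nature of your data.\n", "\n", "Let's see this by working through a practical exercise.\n",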
"\n",
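"> A refresher: the pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can read the pipe operator as \"and then\" in your code.\n"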
],
"metadata": {
"id": "REWcIv9yX29v"
}
},
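{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the \"and then\" reading concrete, here is a minimal sketch that compares a nested call with its piped equivalent. The vector `x` is a made-up example rather than part of the lesson's data, and the sketch assumes the magrittr package (the source of `%>%`) is installed.\n",
"\n",
"```r\n",
"library(magrittr)  # provides the %>% operator\n",
"\n",
"# Made-up input, for illustration only\n",
"x <- c(1, 4, 9)\n",
"\n",
"# Nested form: read from the inside out\n",
"nested <- sum(sqrt(x))\n",
"\n",
"# Piped form: take x, and then take square roots, and then sum them\n",
"piped <- x %>% sqrt() %>% sum()\n",
"\n",
"# Check that both forms agree\n",
"identical(nested, piped)\n",
"```\n",
"\n",
"Both forms compute the same value; the piped form simply lists the steps in the order they run.\n"
]
},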
{
"cell_type": "markdown",
"source": [
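"## 2. Check for missing data\n",
"\n",
"One of the most common issues data scientists deal with is incomplete or missing data. R represents missing or unknown values with a special sentinel value: `NA` (Not Available).\n",
"\n",
"So how would we know whether the data frame contains missing values?\n",
"\n",
"- One straightforward way is to use the base R function `anyNA()`, which returns the logical value `TRUE` or `FALSE`.\n"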
],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" anyNA()"
],
"outputs": [],
"metadata": {
"id": "G--DQutAYltj"
}
},
{
"cell_type": "markdown",
"source": [
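"Great, there seems to be some missing data! That's a good place to start.\n",
"\n",
"- Another way is to use the function `is.na()`, which marks each individual missing element with a logical `TRUE`.\n"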
],
"metadata": {
"id": "mU-7-SB6YokF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "W-DxDOR4YxSW"
}
},
{
"cell_type": "markdown",
"source": [
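"Okay, that got the job done, but with a data frame this large, reviewing every row and column individually is inefficient and practically impossible 😴.\n",
"\n",
"- A more intuitive way is to calculate the number of missing values in each column:\n"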
],
"metadata": {
"id": "xUWxipKYY0o7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"outputs": [],
"metadata": {
"id": "ZRBWV6P9ZArL"
}
},
{
"cell_type": "markdown",
"source": [
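"Much better! Some data is missing, but it may not matter for the task at hand. Let's see what further analysis brings.\n",
"\n",
"> Along with its rich set of packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function.\n"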
],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
},
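{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three checks in this section are easier to compare on a data frame small enough to count by eye. The sketch below uses only base R; `toy_df` and its columns are made up for illustration and are not the lesson's `pumpkins` data.\n",
"\n",
"```r\n",
"# Hypothetical toy data: one NA in city, two NAs in price\n",
"toy_df <- data.frame(\n",
"  city  = c(\"A\", \"B\", NA),\n",
"  price = c(10, NA, NA)\n",
")\n",
"\n",
"# 1. Is there at least one missing value anywhere?\n",
"has_na <- anyNA(toy_df)\n",
"\n",
"# 2. Which individual elements are missing? (logical matrix)\n",
"na_matrix <- is.na(toy_df)\n",
"\n",
"# 3. How many values are missing in each column?\n",
"na_counts <- colSums(na_matrix)\n",
"\n",
"has_na\n",
"na_matrix\n",
"na_counts\n",
"```\n",
"\n",
"Because `TRUE` counts as 1 when summed, `colSums()` over the logical matrix gives the per-column number of missing values.\n"
]
},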
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n"
],
"metadata": {
"id": "nIgLjNMCZ-6Y"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a new column Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
"\n",
"# View the first few rows of the data\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "Zo0BsqqtaJw2"
}
},
{
"cell_type": "markdown",
"source": [
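"Yes! 💪\n",
"\n",
"\"But wait!\" you might say after skimming the whole data set with `View(pumpkins)`, \"there's something odd here!\" 🤔\n",
"\n",
"If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes of varying widths.\n",
"\n",
"Let's verify this:\n"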
],
"metadata": {
"id": "p77WZr-9aQAR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Verify the distinct observations in Package column\n",
"pumpkins %>% \n",
" distinct(Package)"
],
"outputs": [],
"metadata": {
"id": "XISGfh0IaUy6"
}
},
{
"cell_type": "markdown",
"source": [
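"Amazing! 👏\n",
"\n",
"Pumpkins seem to be very hard to weigh consistently, so let's keep only the pumpkins whose `Package` column contains the string *bushel* and put them in a new data frame, `new_pumpkins`.\n"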
],
"metadata": {
"id": "7sMjiVujaZxY"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::filter() and stringr::str_detect()\n",
"\n",
"[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data containing only the **rows** that satisfy your conditions, in this case pumpkins with the string *bushel* in the `Package` column.\n",
"\n",
"[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n",
"\n",
"The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations.\n"
],
"metadata": {
"id": "L8Qfcs92ageF"
}
},
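{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how these two functions combine, the example below runs them on a made-up table. `toy_packages` and its values are hypothetical and only mimic the shape of the `Package` column; the sketch assumes dplyr and stringr are installed.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"library(stringr)\n",
"\n",
"# Hypothetical toy data, not the lesson's pumpkins data\n",
"toy_packages <- tibble(\n",
"  Package = c(\"1 1/9 bushel cartons\", \"36 inch bins\", \"1/2 bushel cartons\", \"each\"),\n",
"  Price   = c(15, 150, 16, 3)\n",
")\n",
"\n",
"# str_detect() returns one logical value per element of Package\n",
"is_bushel <- str_detect(toy_packages$Package, \"bushel\")\n",
"\n",
"# filter() keeps only the rows for which the condition is TRUE\n",
"toy_bushels <- toy_packages %>%\n",
"  filter(str_detect(Package, \"bushel\"))\n",
"\n",
"is_bushel\n",
"toy_bushels\n",
"```\n",
"\n",
"Only the two rows whose `Package` text contains *bushel* survive the filter; the bin and per-item rows are dropped.\n"
]
},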
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Retain only pumpkins with \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(Package, \"bushel\"))\n",
"\n",
"# Get the dimensions of the new data\n",
"dim(new_pumpkins)\n",
"\n",
"# View a few rows of the new data\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "hy_SGYREampd"
}
},
{
"cell_type": "markdown",
"source": [
"You can see that we have narrowed the data down to roughly 415 rows of pumpkins sold by the bushel. 🤩\n"
],
"metadata": {
"id": "VrDwF031avlR"
}
},
{
"cell_type": "markdown",
"source": [
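"#### dplyr::case_when()\n",
"\n",
"**But wait! There's one more thing to do**\n",
"\n",
"Did you notice that the bushel amount varies per row? You need to normalize the pricing so that it shows the price per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
"\n",
"We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` lets you vectorise multiple `if_else()` statements.\n"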
],
"metadata": {
"id": "mLpw2jH4a0tx"
}
},
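{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before applying this to the pumpkins data, it may help to check the arithmetic on numbers that are easy to verify by hand. The sketch below is illustrative only: `toy_prices`, its values, and the column name `price_per_bushel` are made up. It assumes dplyr and stringr are installed.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"library(stringr)\n",
"\n",
"# Hypothetical toy data with one row per package size\n",
"toy_prices <- tibble(\n",
"  Package = c(\"1 1/9 bushel cartons\", \"1/2 bushel cartons\", \"bushel cartons\"),\n",
"  Price   = c(20, 10, 18)\n",
")\n",
"\n",
"# Conditions are checked in order; the first one that is TRUE decides the value\n",
"toy_prices <- toy_prices %>%\n",
"  mutate(price_per_bushel = case_when(\n",
"    str_detect(Package, \"1 1/9\") ~ Price / (1 + 1/9),  # 20 / (10/9) = 18\n",
"    str_detect(Package, \"1/2\")   ~ Price / (1/2),      # 10 / 0.5 = 20\n",
"    TRUE                         ~ Price               # already priced per bushel\n",
"  ))\n",
"\n",
"toy_prices\n",
"```\n",
"\n",
"Dividing by the package size converts a price per package into a price per bushel, so a 1/2 bushel carton at 10 works out to 20 per bushel. The final `TRUE ~ Price` branch is the fallback for rows that match neither pattern.\n"
]
},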
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
" slice_head(n = 30)"
],
"outputs": [],
"metadata": {
"id": "P68kLVQmbM6I"
}
},
{
"cell_type": "markdown",
"source": [
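"Now we can analyze the pricing per unit based on the bushel measurement. All this study of bushels of pumpkins, however, goes to show how very `important` it is to `understand the nature of your data`!\n",
"\n",
"> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, because a bushel is a volume measurement. For example, \"a bushel of tomatoes is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with a bushel-to-pound conversion, and price by the bushel instead.\n",
"\n",
"> ✅ Did you notice that pumpkins sold by the half bushel are very expensive? Can you figure out why? Hint: little pumpkins are much pricier than big ones, probably because so many more of them fit in a bushel, given the unused space taken up by one big hollow pie pumpkin.\n"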
],
"metadata": {
"id": "pS2GNPagbSdb"
}
},
{
"cell_type": "markdown",
"source": [
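"Lastly, for the sheer sake of adventure 💁♀️, let's also move the `Month` column to the first position, that is, before the `Package` column.\n",
"\n",
"`dplyr::relocate()` changes the position of columns.\n"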
],
"metadata": {
"id": "qql1SowfbdnP"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Move the Month column so that it sits before Package\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"new_pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "JJ1x6kw8bixF"
}
},
{
"cell_type": "markdown",
"source": [
"Good job! 👌 You now have a clean, tidy dataset on which you can build your new regression model!\n"
],
"metadata": {
"id": "y8TJ0Za_bn5Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 4. Data visualization with ggplot2\n"
],
"metadata": {
"id": "Ml7SDCLQcPvE"
}
},
{
"cell_type": "markdown",
"source": [
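"### **How do we make it useful?**\n",
"\n",
"To get charts to display useful data, you usually need to group the data somehow. In our case, finding the average price of pumpkins for each month would give more insight into the underlying patterns in the data. This leads us to one more **dplyr** feature:\n",
"\n",
"#### `dplyr::group_by() %>% summarize()`\n",
"\n",
"Grouped aggregation in R can be computed easily using\n",
"\n",
"`dplyr::group_by() %>% summarize()`\n",
"\n",
"- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups, such as per month.\n",
"\n",
"- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each summary statistic that you specify.\n",
"\n",
"For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins by the **Month** column and then find the **mean price** for each month.\n"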
],
"metadata": {
"id": "jMakvJZIcVkh"
}
},
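{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grouped aggregation described above can be sketched on a handful of made-up rows, where the group means are easy to confirm by hand. `toy_months` and its values are hypothetical; the sketch assumes dplyr is installed.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"\n",
"# Hypothetical toy data: two sales in month 9, three in month 10\n",
"toy_months <- tibble(\n",
"  Month = c(9, 9, 10, 10, 10),\n",
"  Price = c(10, 20, 30, 40, 50)\n",
")\n",
"\n",
"# One output row per Month, with the mean price and the group size\n",
"toy_summary <- toy_months %>%\n",
"  group_by(Month) %>%\n",
"  summarise(\n",
"    mean_price = mean(Price),  # month 9: (10 + 20) / 2; month 10: (30 + 40 + 50) / 3\n",
"    n_rows     = n()\n",
"  )\n",
"\n",
"toy_summary\n",
"```\n",
"\n",
"If the price column contained `NA` values, `mean(Price)` would return `NA` for that group; `mean(Price, na.rm = TRUE)` is the usual way to ignore them.\n"
]
},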
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month\n",
"new_pumpkins %>%\n",
"  group_by(Month) %>% \n",
"  summarise(mean_price = mean(Price))"
],
"outputs": [],
"metadata": {
"id": "6kVSUa2Bcilf"
}
},
{
"cell_type": "markdown",
"source": [
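"Succinct! ✨\n",
"\n",
"Categorical features such as months are better represented with a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n",
"\n",
"Let's whip one up!\n"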
],
"metadata": {
"id": "Kds48GUBcj3W"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\n",
"new_pumpkins %>%\n",
"  group_by(Month) %>% \n",
"  summarise(mean_price = mean(Price)) %>% \n",
"  ggplot(aes(x = Month, y = mean_price)) +\n",
"  geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
"  ylab(\"Pumpkin Price\")"
],
"outputs": [],
"metadata": {
"id": "VNbU1S3BcrxO"
}
},
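{
"cell_type": "markdown",
"metadata": {},
"source": [
"The chart above uses `geom_col()` rather than `geom_bar()`. A minimal sketch of the difference, on made-up data: by default `geom_bar()` counts the rows in each category, while `geom_col()` draws bars whose heights are the `y` values you supply. `toy_sales` and `toy_means` are hypothetical names, and the sketch assumes ggplot2 is installed.\n",
"\n",
"```r\n",
"library(ggplot2)\n",
"\n",
"# Hypothetical toy data: two sales in month 9, three in month 10\n",
"toy_sales <- data.frame(\n",
"  Month = factor(c(9, 9, 10, 10, 10)),\n",
"  Price = c(10, 20, 30, 40, 50)\n",
")\n",
"\n",
"# geom_bar(): bar height is the number of rows per Month (2 and 3)\n",
"p_count <- ggplot(toy_sales, aes(x = Month)) +\n",
"  geom_bar()\n",
"\n",
"# geom_col(): bar height is the value mapped to y, here a precomputed mean\n",
"toy_means <- aggregate(Price ~ Month, data = toy_sales, FUN = mean)\n",
"p_mean <- ggplot(toy_means, aes(x = Month, y = Price)) +\n",
"  geom_col()\n",
"\n",
"p_count\n",
"p_mean\n",
"```\n",
"\n",
"Since the lesson plots a summary statistic that has already been computed, `geom_col()` is the natural fit there.\n"
]
},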
{
"cell_type": "markdown",
"source": [
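"🤩🤩 This is a more useful data visualization! It seems to indicate that the highest prices for pumpkins occur in September and October. Does that meet your expectation? Why or why not?\n",
"\n",
"Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights through visualization!\n"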
],
"metadata": {
"id": "zDm0VOzzcuzR"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
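"\n---\n\n**Disclaimer**:  \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"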
]
}
]
}