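{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_2-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "f3c335f9940cfd76528b3ef918b9b342", "translation_date": "2025-09-03T19:50:05+00:00", "source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb", "language_code": "zh" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Build a regression model: prepare and visualize data\n", "\n", "## **Linear Regression for Pumpkins - Lesson 2**\n", "#### Introduction\n", "\n", "Now that you are set up with the tools you need to build machine learning models with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. When you work with data and apply ML solutions, asking the right question is what unlocks the potential of your dataset.\n", "\n", "In this lesson, you will learn:\n", "\n", "- How to prepare your data for model building.\n", "\n", "- How to use `ggplot2` for data visualization.\n", "\n", "The question you need answered determines which type of ML algorithm you will use, and the quality of the answer you get back depends heavily on the nature of your data.\n", "\n", "Let's see this by working through a practical exercise.\n", "\n", "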

\n", " \n", "

Artwork by @allison_horst
\n", "\n", "\n", "\n" ], "metadata": { "id": "Pg5aexcOPqAZ" } }, { "cell_type": "markdown", "source": [ "## 1. Importing pumpkins data and summoning the Tidyverse\n", "\n", "We'll need the following package to slice and dice this lesson:\n", "\n", "- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n", "\n", "You can install it with:\n", "\n", "`install.packages(c(\"tidyverse\"))`\n", "\n", "The script below checks whether you have the packages required to complete this module and installs any that are missing.\n" ], "metadata": { "id": "dc5WhyVdXAjR" } }, { "cell_type": "code", "execution_count": null, "source": [ "suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n", "pacman::p_load(tidyverse)" ], "outputs": [], "metadata": { "id": "GqPYUZgfXOBt" } }, { "cell_type": "markdown", "source": [ "Now, let's load the packages and read in the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!\n" ], "metadata": { "id": "kvjDTPDSXRr2" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Load the core Tidyverse packages\n", "library(tidyverse)\n", "\n", "# Import the pumpkins data\n", "pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n", "\n", "\n", "# Get a glimpse and dimensions of the data\n", "glimpse(pumpkins)\n", "\n", "\n", "# Print the first 50 rows of the data set\n", "pumpkins %>% \n", "  slice_head(n = 50)" ], "outputs": [], "metadata": { "id": "VMri-t2zXqgD" } }, { "cell_type": "markdown", "source": [ "A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). The `Date` column is of type character, and there is also a strange column called `Package` whose values are a mix of `sacks`, `bins` and other units. The data, in fact, is a bit of a mess 😤.\n", "\n", "In practice, it is rare to be handed a dataset that is completely ready for building an ML model. But worry not: in this lesson, you will learn how to prepare a raw dataset using standard R libraries 🧑‍🔧. You will also learn various techniques to visualize the data. 📈📊\n", "
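\n", "\n", "> A refresher: The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n" ], "metadata": { "id": "REWcIv9yX29v" } }, { "cell_type": "markdown", "source": [ "## 2. Check for missing data\n", "\n", "One of the most common issues data scientists need to deal with is incomplete or missing data. R represents missing or unknown values with a special sentinel value: `NA` (Not Available).\n", "\n", "So how would we know that the data frame contains missing values?\n", "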
\n", "- One straightforward way is to use the base R function `anyNA()`, which returns a single logical value, `TRUE` or `FALSE`.\n" ], "metadata": { "id": "Zxfb3AM5YbUe" } }, { "cell_type": "code", "execution_count": null, "source": [ "pumpkins %>% \n", "  anyNA()" ], "outputs": [], "metadata": { "id": "G--DQutAYltj" } }, { "cell_type": "markdown", "source": [ "Great, there seems to be some missing data! That's a good place to start.\n", "\n", "- Another way is to use the function `is.na()`, which indicates which individual elements are missing with a logical `TRUE`.\n" ], "metadata": { "id": "mU-7-SB6YokF" } }, { "cell_type": "code", "execution_count": null, "source": [ "pumpkins %>% \n", "  is.na() %>% \n", "  head(n = 7)" ], "outputs": [], "metadata": { "id": "W-DxDOR4YxSW" } }, { "cell_type": "markdown", "source": [ "With a data frame this large, reviewing every row and column individually would be inefficient and practically impossible 😴.\n", "\n", "- A more intuitive way is to calculate the number of missing values in each column:\n" ], "metadata": { "id": "xUWxipKYY0o7" } }, { "cell_type": "code", "execution_count": null, "source": [ "pumpkins %>% \n", "  is.na() %>% \n", "  colSums()" ], "outputs": [], "metadata": { "id": "ZRBWV6P9ZArL" } }, { "cell_type": "markdown", "source": [ "Much better! Some data is missing, but maybe it won't matter for the task at hand. Let's see what further analysis brings.\n", "\n", "> Along with its rich set of packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function.\n" ], "metadata": { "id": "9gv-crB6ZD1Y" } }, { "cell_type": "markdown", "source": [ "## 3. Dplyr: A Grammar of Data Manipulation\n", "\n", "

\n", " \n", "

Illustration by @allison_horst
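\n", "\n", "\n", "\n" ], "metadata": { "id": "o4jLY5-VZO2C" } }, { "cell_type": "markdown", "source": [ "[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs to help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs! \n", "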
\n" ], "metadata": { "id": "i5o33MQBZWWw" } }, { "cell_type": "markdown", "source": [ "#### dplyr::select()\n", "\n", "`select()` is a function in the `dplyr` package that picks which columns to keep or exclude.\n", "\n", "To make your data frame easier to work with, use `select()` to drop several of its columns and keep only the ones you need.\n", "\n", "For instance, in this exercise, our analysis involves the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns.\n" ], "metadata": { "id": "x3VGMAGBZiUr" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Select desired columns\n", "pumpkins <- pumpkins %>% \n", "  select(Package, `Low Price`, `High Price`, Date)\n", "\n", "\n", "# Print data set\n", "pumpkins %>% \n", "  slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "F_FgxQnVZnM0" } }, { "cell_type": "markdown", "source": [ "#### dplyr::mutate()\n", "\n", "`mutate()` is a function in the `dplyr` package that creates or modifies columns while keeping the existing ones.\n", "\n", "The general structure of `mutate` is:\n", "\n", "`data %>% mutate(new_column_name = what_it_contains)`\n", "\n", "Let's take `mutate` out for a spin on the `Date` column with the following operations:\n", "\n", "1. Convert the dates (currently of type character) to date objects (these are US dates, so the format is `MM/DD/YYYY`).\n", "\n", "2. Extract the month from the dates into a new column.\n", "\n", "In R, the [lubridate](https://lubridate.tidyverse.org/) package makes it easier to work with date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()` and `lubridate::month()` to achieve the objectives above. We can then drop the `Date` column, since we won't need it in subsequent operations.\n" ], "metadata": { "id": "2KKo0Ed9Z1VB" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Load lubridate\n", "library(lubridate)\n", "\n", "pumpkins <- pumpkins %>% \n", "  # Convert the Date column to a date object\n", "  mutate(Date = mdy(Date)) %>% \n", "  # Extract month from Date\n", "  mutate(Month = month(Date)) %>% \n", "  # Drop Date column\n", "  select(-Date)\n", "\n", "# View the first few rows\n", "pumpkins %>% \n", "  slice_head(n = 7)" ], "outputs": [], "metadata": { "id": "5joszIVSZ6xe" } }, { "cell_type": "markdown", "source": [ "Woohoo! 🤩\n", "\n", "Next, let's create a new column, `Price`, that represents the average price of a pumpkin: the mean of the `Low Price` and `High Price` columns.\n", "
\n" ], "metadata": { "id": "nIgLjNMCZ-6Y" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Create a new column Price\n", "pumpkins <- pumpkins %>% \n", "  mutate(Price = (`Low Price` + `High Price`)/2)\n", "\n", "# View the first few rows of the data\n", "pumpkins %>% \n", "  slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "Zo0BsqqtaJw2" } }, { "cell_type": "markdown", "source": [ "Yes! 💪\n", "\n", "\"But wait!\", you might say after skimming through the whole data set with `View(pumpkins)`, \"there's something odd here!\" 🤔\n", "\n", "If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes of varying widths.\n", "\n", "Let's verify this:\n" ], "metadata": { "id": "p77WZr-9aQAR" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Verify the distinct observations in Package column\n", "pumpkins %>% \n", "  distinct(Package)" ], "outputs": [], "metadata": { "id": "XISGfh0IaUy6" } }, { "cell_type": "markdown", "source": [ "Amazing! 👏\n", "\n", "Pumpkins seem to be very hard to weigh consistently, so let's filter them by keeping only the rows whose `Package` column contains the string *bushel*, and put the result in a new data frame, `new_pumpkins`.\n" ], "metadata": { "id": "7sMjiVujaZxY" } }, { "cell_type": "markdown", "source": [ "#### dplyr::filter() and stringr::str_detect()\n", "\n", "[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data containing only the **rows** that satisfy your conditions, in this case pumpkins with the string *bushel* in the `Package` column.\n", "\n", "[`stringr::str_detect()`](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n", "\n", "The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations.\n" ], "metadata": { "id": "L8Qfcs92ageF" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Retain only pumpkins with \"bushel\"\n", "new_pumpkins <- pumpkins %>% \n", "  filter(str_detect(Package, \"bushel\"))\n", "\n", "# Get the dimensions of the new data\n", "dim(new_pumpkins)\n", "\n", "# View a few rows of the new data\n", "new_pumpkins %>% \n", "  slice_head(n = 5)" ], "outputs": [], "metadata": { "id": "hy_SGYREampd" } }, { "cell_type": "markdown", "source": [ 
"You can see that we have narrowed the data down to 415 or so rows containing pumpkins sold by the bushel. 🤩 \n" ], "metadata": { "id": "VrDwF031avlR" } }, { "cell_type": "markdown", "source": [ "#### dplyr::case_when()\n", "\n", "**But wait! There's one more thing to do**\n", "\n", "Did you notice that the bushel amount varies per row? You need to normalize the pricing so that it shows the price per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n", "\n", "We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the `Price` column depending on some conditions. `case_when()` lets you vectorise multiple `if_else()` statements.\n" ], "metadata": { "id": "mLpw2jH4a0tx" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Convert the price if the Package contains fractional bushel values\n", "new_pumpkins <- new_pumpkins %>% \n", "  mutate(Price = case_when(\n", "    str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n", "    str_detect(Package, \"1/2\") ~ Price/(1/2),\n", "    TRUE ~ Price))\n", "\n", "# View the first few rows of the data\n", "new_pumpkins %>% \n", "  slice_head(n = 30)" ], "outputs": [], "metadata": { "id": "P68kLVQmbM6I" } }, { "cell_type": "markdown", "source": [ "Now, we can analyze the price per unit based on the bushel measurement. All this study of bushels of pumpkins, however, goes to show how very *important* it is to *understand the nature of your data*!\n", "\n", "> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. \"A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with a bushel-to-pound conversion, and instead price by the bushel.\n", "\n", "> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are far pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken up by one big hollow pie pumpkin.\n" ], "metadata": { "id": "pS2GNPagbSdb" } }, { "cell_type": "markdown", "source": [ "Finally, for the sheer sake of adventure 💁‍♀️, let's also move the `Month` column to the first position, i.e. before the `Package` column.\n", "\n", "`dplyr::relocate()` is used to change column positions.\n" ], "metadata": { "id": "qql1SowfbdnP" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Move the Month column to the first position\n", "new_pumpkins <- new_pumpkins %>% \n", "  relocate(Month, .before = Package)\n", "\n", "new_pumpkins %>% \n", "  slice_head(n = 7)" ], "outputs": [], "metadata": { "id": "JJ1x6kw8bixF" } }, { "cell_type": "markdown", "source": [ "Good job! 👌 You now have a clean, tidy dataset on which you can build your new regression model! 
\n" ], "metadata": { "id": "y8TJ0Za_bn5Y" } }, { "cell_type": "markdown", "source": [ "## 4. Data visualization with ggplot2\n", "\n", "

\n", " \n", "

Infographic by Dasani Madipalli
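\n", "\n", "\n", "\n", "\n", "There is a *wise* saying that goes like this:\n", "\n", "> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n", "\n", "Part of a data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, such as plots, graphs and charts, that show different aspects of the data. In this way, they can visually reveal relationships and gaps that are otherwise hard to uncover.\n", "\n", "Visualizations can also help determine the machine learning technique most appropriate for the data. A scatter plot whose points seem to follow a line, for example, indicates that the data is a good candidate for linear regression.\n", "\n", "R offers several systems for making graphs, and [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and versatile. `ggplot2` lets you compose graphs by **combining independent components**.\n", "\n", "Let's start with a simple scatter plot of the `Price` and `Month` columns.\n", "\n", "In this case, we start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and an aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), then add layers (such as [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html) for scatter plots).\n" ], "metadata": { "id": "mYSH6-EtbvNa" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Set a theme for the plots\n", "theme_set(theme_light())\n", "\n", "# Create a scatter plot\n", "p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n", "p + geom_point()" ], "outputs": [], "metadata": { "id": "g2YjnGeOcLo4" } }, { "cell_type": "markdown", "source": [ "Is this a useful plot 🤷? Does anything about it surprise you?\n", "\n", "It's not particularly useful, as all it does is display your data as a spread of points in a given month.\n", "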
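\n" ], "metadata": { "id": "Ml7SDCLQcPvE" } }, { "cell_type": "markdown", "source": [ "### **How do we make it useful?**\n", "\n", "To get charts to display useful data, you usually need to group the data somehow. In our case, for instance, finding the average price of pumpkins for each month would give more insight into the underlying patterns in the data. This leads us to one more **dplyr** flyby:\n", "\n", "#### `dplyr::group_by() %>% summarize()`\n", "\n", "Grouped aggregation in R is easily computed with:\n", "\n", "`dplyr::group_by() %>% summarize()`\n", "\n", "- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups, such as one group per month.\n", "\n", "- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each summary statistic you specify.\n", "\n", "For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins by the **Month** column and then find the **mean price** for each month.\n" ], "metadata": { "id": "jMakvJZIcVkh" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Find the average price of pumpkins per month\n", "new_pumpkins %>%\n", "  group_by(Month) %>% \n", "  summarize(mean_price = mean(Price))" ], "outputs": [], "metadata": { "id": "6kVSUa2Bcilf" } }, { "cell_type": "markdown", "source": [ "Succinct! ✨\n", "\n", "Categorical features such as months are better represented with a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n", "\n", "Let's whip one up!\n" ], "metadata": { "id": "Kds48GUBcj3W" } }, { "cell_type": "code", "execution_count": null, "source": [ "# Find the average price of pumpkins per month then plot a bar chart\n", "new_pumpkins %>%\n", "  group_by(Month) %>% \n", "  summarize(mean_price = mean(Price)) %>% \n", "  ggplot(aes(x = Month, y = mean_price)) +\n", "  geom_col(fill = \"midnightblue\", alpha = 0.7) +\n", "  ylab(\"Pumpkin Price\")" ], "outputs": [], "metadata": { "id": "VNbU1S3BcrxO" } }, { "cell_type": "markdown", "source": [ "🤩🤩 This is a more useful data visualization! It seems to indicate that the highest prices for pumpkins occur in September and October. Does that meet your expectation? Why or why not?\n", "\n", "Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!\n" ], "metadata": { "id": "zDm0VOzzcuzR" } }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**Disclaimer**: \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The document in its original language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n" ] } ] }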