{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_2-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "f3c335f9940cfd76528b3ef918b9b342", "translation_date": "2025-09-03T19:50:05+00:00", "source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb", "language_code": "zh" } }, "cells": [ { "cell_type": "markdown", "source": [ "# 构建回归模型:准备和可视化数据\n", "\n", "## **南瓜线性回归 - 第二课**\n", "#### 介绍\n", "\n", "现在你已经准备好了使用Tidymodels和Tidyverse来构建机器学习模型的工具,可以开始对数据提出问题了。在处理数据并应用机器学习解决方案时,正确提出问题以充分挖掘数据的潜力是非常重要的。\n", "\n", "在本课中,你将学习:\n", "\n", "- 如何为模型构建准备数据。\n", "\n", "- 如何使用`ggplot2`进行数据可视化。\n", "\n", "你需要回答的问题将决定你使用哪种类型的机器学习算法。而你得到的答案质量将很大程度上取决于数据的性质。\n", "\n", "让我们通过一个实际练习来看看这一点。\n", "\n", "
\n",
" \n",
"
\n",
"\n",
"> 温故知新:管道操作符 (`%>%`) 按逻辑顺序执行操作,将一个对象向前传递到函数或调用表达式中。你可以将管道操作符理解为代码中的“然后”。\n"
],
"metadata": {
"id": "REWcIv9yX29v"
}
},
{
"cell_type": "markdown",
"source": [
"## 2. 检查缺失数据\n",
"\n",
"数据科学家经常需要处理的一个常见问题是数据不完整或缺失。R使用特殊的哨兵值`NA`(Not Available)来表示缺失或未知的值。\n",
"\n",
"那么我们如何知道数据框中是否包含缺失值呢?\n",
"
\n",
"- 一个直接的方法是使用R的基础函数`anyNA`,它会返回逻辑对象`TRUE`或`FALSE`\n"
],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" anyNA()"
],
"outputs": [],
"metadata": {
"id": "G--DQutAYltj"
}
},
{
"cell_type": "markdown",
"source": [
"太好了,看来有一些数据缺失!这是一个不错的起点。\n",
"\n",
"- 另一种方法是使用函数 `is.na()`,它通过逻辑值 `TRUE` 来指示哪些单个列元素是缺失的。\n"
],
"metadata": {
"id": "mU-7-SB6YokF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "W-DxDOR4YxSW"
}
},
{
"cell_type": "markdown",
"source": [
"对于如此大的数据框,逐行逐列地检查显然效率低下,几乎不可能完成😴。\n",
"\n",
"- 更直观的方法是计算每列中缺失值的总和:\n"
],
"metadata": {
"id": "xUWxipKYY0o7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"outputs": [],
"metadata": {
"id": "ZRBWV6P9ZArL"
}
},
{
"cell_type": "markdown",
"source": [
"更棒了!虽然有些数据缺失,但可能对当前任务影响不大。让我们看看进一步的分析会带来什么结果。\n",
"\n",
"> 除了强大的包和函数集合,R 还拥有非常优秀的文档支持。例如,可以使用 `help(colSums)` 或 `?colSums` 来了解更多关于该函数的信息。\n"
],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr:数据操作的语法\n",
"\n",
"
\n",
" \n",
"
\n"
],
"metadata": {
"id": "i5o33MQBZWWw"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::select()\n",
"\n",
"`select()` 是 `dplyr` 包中的一个函数,用于选择保留或排除特定的列。\n",
"\n",
"为了让数据框更易于操作,可以使用 `select()` 删除一些列,仅保留你需要的列。\n",
"\n",
"例如,在这个练习中,我们的分析将涉及 `Package`、`Low Price`、`High Price` 和 `Date` 这些列。让我们选择这些列吧。\n"
],
"metadata": {
"id": "x3VGMAGBZiUr"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(Package, `Low Price`, `High Price`, Date)\n",
"\n",
"\n",
"# Print data set\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "F_FgxQnVZnM0"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::mutate()\n",
"\n",
"`mutate()` 是 `dplyr` 包中的一个函数,用于创建或修改列,同时保留现有的列。\n",
"\n",
"`mutate` 的一般结构是:\n",
"\n",
"`data %>% mutate(new_column_name = what_it_contains)`\n",
"\n",
"让我们通过以下操作来尝试使用 `mutate` 对 `Date` 列进行处理:\n",
"\n",
"1. 将日期(目前是字符类型)转换为月份格式(这些是美国日期格式,因此格式为 `MM/DD/YYYY`)。\n",
"\n",
"2. 从日期中提取月份到一个新列。\n",
"\n",
"在 R 中,[lubridate](https://lubridate.tidyverse.org/) 包可以更轻松地处理日期时间数据。因此,让我们使用 `dplyr::mutate()`、`lubridate::mdy()` 和 `lubridate::month()` 来实现上述目标。我们可以删除 `Date` 列,因为在后续操作中不再需要它。\n"
],
"metadata": {
"id": "2KKo0Ed9Z1VB"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load lubridate\n",
"library(lubridate)\n",
"\n",
"pumpkins <- pumpkins %>% \n",
" # Convert the Date column to a date object\n",
" mutate(Date = mdy(Date)) %>% \n",
" # Extract month from Date\n",
" mutate(Month = month(Date)) %>% \n",
" # Drop Date column\n",
" select(-Date)\n",
"\n",
"# View the first few rows\n",
"pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "5joszIVSZ6xe"
}
},
{
"cell_type": "markdown",
"source": [
"哇哦!🤩\n",
"\n",
"接下来,让我们创建一个新的列 `Price`,表示南瓜的平均价格。现在,我们将 `Low Price` 和 `High Price` 列的平均值计算出来,用来填充新的 Price 列。\n",
"
\n"
],
"metadata": {
"id": "nIgLjNMCZ-6Y"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a new column Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
"\n",
"# View the first few rows of the data\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "Zo0BsqqtaJw2"
}
},
{
"cell_type": "markdown",
"source": [
"耶!💪\n",
"\n",
"“等等!”你可能会在用 `View(pumpkins)` 浏览整个数据集后说,“这里有点奇怪!”🤔\n",
"\n",
"如果你查看 `Package` 列,会发现南瓜是以多种不同的方式出售的。有些是按 `1 1/9 蒲式耳` 计量出售的,有些是按 `1/2 蒲式耳` 计量出售的,有些是按个数出售的,有些是按重量(磅)出售的,还有一些是装在宽度各异的大箱子里出售的。\n",
"\n",
"让我们来验证一下:\n"
],
"metadata": {
"id": "p77WZr-9aQAR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Verify the distinct observations in Package column\n",
"pumpkins %>% \n",
" distinct(Package)"
],
"outputs": [],
"metadata": {
"id": "XISGfh0IaUy6"
}
},
{
"cell_type": "markdown",
"source": [
"太棒了!👏\n",
"\n",
"南瓜似乎很难保持一致的称重,因此我们可以通过筛选 `Package` 列中包含字符串 *bushel* 的南瓜来过滤它们,并将结果放入一个新的数据框 `new_pumpkins` 中。\n"
],
"metadata": {
"id": "7sMjiVujaZxY"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::filter() 和 stringr::str_detect()\n",
"\n",
"[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html):创建一个数据子集,仅包含满足条件的**行**,在本例中是 `Package` 列中包含字符串 *bushel* 的南瓜。\n",
"\n",
"[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html):检测字符串中是否存在某个模式。\n",
"\n",
"[`stringr`](https://github.com/tidyverse/stringr) 包提供了用于常见字符串操作的简单函数。\n"
],
"metadata": {
"id": "L8Qfcs92ageF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Retain only pumpkins with \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(Package, \"bushel\"))\n",
"\n",
"# Get the dimensions of the new data\n",
"dim(new_pumpkins)\n",
"\n",
"# View a few rows of the new data\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "hy_SGYREampd"
}
},
{
"cell_type": "markdown",
"source": [
"你可以看到我们已经缩小到大约415行左右的数据,这些数据包含了按蒲式耳计算的南瓜。🤩 \n"
],
"metadata": {
"id": "VrDwF031avlR"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::case_when()\n",
"\n",
"**但等等!还有一件事要做**\n",
"\n",
"你是否注意到每行的蒲式耳数量是不同的?你需要将价格标准化,以显示每蒲式耳的价格,而不是每1 1/9或1/2蒲式耳的价格。是时候做一些数学运算来进行标准化了。\n",
"\n",
"我们将使用函数[`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html)根据一些条件来*变更*价格列的值。`case_when`允许你将多个`if_else()`语句向量化处理。\n"
],
"metadata": {
"id": "mLpw2jH4a0tx"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
" slice_head(n = 30)"
],
"outputs": [],
"metadata": {
"id": "P68kLVQmbM6I"
}
},
{
"cell_type": "markdown",
"source": [
"现在,我们可以根据蒲式耳的测量来分析每单位的定价。然而,所有这些关于南瓜蒲式耳的研究都表明,`了解数据的本质`是多么`重要`!\n",
"\n",
"> ✅ 根据 [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308),蒲式耳的重量取决于农产品的类型,因为它是一种体积测量单位。“例如,一个番茄的蒲式耳应该重56磅……叶类和绿叶蔬菜占据更多空间但重量较轻,所以一个菠菜的蒲式耳只有20磅。”这真的很复杂!我们不必费心将蒲式耳转换为磅,而是直接按蒲式耳定价。然而,所有这些关于南瓜蒲式耳的研究都表明,了解数据的本质是多么重要!\n",
"\n",
"> ✅ 你注意到按半蒲式耳出售的南瓜非常贵吗?你能找出原因吗?提示:小南瓜比大南瓜贵得多,可能是因为每蒲式耳的小南瓜数量更多,而一个大的空心派南瓜占据了更多未使用的空间。\n"
],
"metadata": {
"id": "pS2GNPagbSdb"
}
},
{
"cell_type": "markdown",
"source": [
"现在最后,为了冒险的乐趣 💁♀️,我们还将“Month”列移动到第一个位置,也就是在“Package”列之前。\n",
"\n",
"`dplyr::relocate()` 用于更改列的位置。\n"
],
"metadata": {
"id": "qql1SowfbdnP"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a new data frame new_pumpkins\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"new_pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "JJ1x6kw8bixF"
}
},
{
"cell_type": "markdown",
"source": [
"干得好!👌 现在你有一个干净整洁的数据集,可以用来构建新的回归模型! \n"
],
"metadata": {
"id": "y8TJ0Za_bn5Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 4. 使用 ggplot2 进行数据可视化\n",
"\n",
"
\n",
" \n",
"
\n"
],
"metadata": {
"id": "Ml7SDCLQcPvE"
}
},
{
"cell_type": "markdown",
"source": [
"### **如何让它更有用?**\n",
"\n",
"为了让图表显示有用的数据,通常需要以某种方式对数据进行分组。例如,在我们的案例中,计算每个月南瓜的平均价格可以为数据中的潜在模式提供更多洞察。这引导我们了解另一个 **dplyr** 的功能:\n",
"\n",
"#### `dplyr::group_by() %>% summarize()`\n",
"\n",
"在 R 中可以轻松计算分组聚合:\n",
"\n",
"`dplyr::group_by() %>% summarize()`\n",
"\n",
"- `dplyr::group_by()` 将分析单位从整个数据集更改为单个组,例如按月分组。\n",
"\n",
"- `dplyr::summarize()` 创建一个新的数据框,其中每个分组变量有一列,以及每个指定的汇总统计量有一列。\n",
"\n",
"例如,我们可以使用 `dplyr::group_by() %>% summarize()` 将南瓜按 **Month** 列分组,然后计算每个月的 **平均价格**。\n"
],
"metadata": {
"id": "jMakvJZIcVkh"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month\r\n",
"new_pumpkins %>%\r\n",
" group_by(Month) %>% \r\n",
" summarise(mean_price = mean(Price))"
],
"outputs": [],
"metadata": {
"id": "6kVSUa2Bcilf"
}
},
{
"cell_type": "markdown",
"source": [
"简洁明了!✨\n",
"\n",
"像月份这样的分类特征更适合用柱状图来表示 📊。负责绘制柱状图的图层是 `geom_bar()` 和 `geom_col()`。查看 `?geom_bar` 以了解更多信息。\n",
"\n",
"让我们来试试吧!\n"
],
"metadata": {
"id": "Kds48GUBcj3W"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\r\n",
"new_pumpkins %>%\r\n",
" group_by(Month) %>% \r\n",
" summarise(mean_price = mean(Price)) %>% \r\n",
" ggplot(aes(x = Month, y = mean_price)) +\r\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
" ylab(\"Pumpkin Price\")"
],
"outputs": [],
"metadata": {
"id": "VNbU1S3BcrxO"
}
},
{
"cell_type": "markdown",
"source": [
"🤩🤩这是一个更有用的数据可视化!它似乎表明南瓜的最高价格出现在九月和十月。这符合你的预期吗?为什么符合或不符合?\n",
"\n",
"恭喜你完成了第二课 👏!你已经为模型构建准备好了数据,并通过可视化发现了更多的洞察!\n"
],
"metadata": {
"id": "zDm0VOzzcuzR"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**免责声明**: \n本文档使用AI翻译服务 [Co-op Translator](https://github.com/Azure/co-op-translator) 进行翻译。尽管我们努力确保翻译的准确性,但请注意,自动翻译可能包含错误或不准确之处。原始语言的文档应被视为权威来源。对于关键信息,建议使用专业人工翻译。我们不对因使用此翻译而产生的任何误解或误读承担责任。\n"
]
}
]
}