{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_2-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "f3c335f9940cfd76528b3ef918b9b342", "translation_date": "2025-09-06T15:32:51+00:00", "source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb", "language_code": "en" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Build a regression model: prepare and visualize data\n", "\n", "## **Linear Regression for Pumpkins - Lesson 2**\n", "#### Introduction\n", "\n", "Now that you have the tools needed to start building machine learning models using Tidymodels and the Tidyverse, you're ready to begin asking questions about your data. When working with data and applying ML solutions, it's crucial to know how to ask the right questions to fully unlock the potential of your dataset.\n", "\n", "In this lesson, you will learn:\n", "\n", "- How to prepare your data for building models.\n", "\n", "- How to use `ggplot2` for visualizing data.\n", "\n", "The type of question you want answered will determine which ML algorithms you use. Additionally, the quality of the answer you receive will largely depend on the characteristics of your data.\n", "\n", "Let's explore this through a practical exercise.\n", "\n",
"> Quick reminder: The pipe operator (`%>%`) allows you to perform operations in a logical sequence by passing an object forward into a function or expression. You can think of the pipe operator as saying \"and then\" in your code.\n"
],
"metadata": {
"id": "REWcIv9yX29v"
}
},
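{
"cell_type": "markdown",
"source": [
"As a minimal sketch of how the pipe reads, the next cell writes the same computation twice: once as a nested call and once as a pipeline. The vector `x` and the names `nested` and `piped` are made up for illustration, and the cell assumes the `magrittr` package (installed alongside the Tidyverse) is available.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Illustrative only: `x` is a toy vector, not part of the pumpkins data\n",
"library(magrittr)\n",
"\n",
"x <- c(1, 4, 9)\n",
"\n",
"# Nested form: read from the inside out\n",
"nested <- sum(sqrt(x))\n",
"\n",
"# Piped form: take x, and then take square roots, and then sum them\n",
"piped <- x %>%\n",
"  sqrt() %>%\n",
"  sum()\n",
"\n",
"# Check that both forms give the same result\n",
"identical(nested, piped)"
],
"outputs": [],
"metadata": {}
},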
{
"cell_type": "markdown",
"source": [
"## 2. Check for missing data\n",
"\n",
"One of the most common challenges data scientists face is handling incomplete or missing data. R represents missing or unknown values with the special sentinel value `NA` (Not Available).\n",
"\n",
"How can we determine whether the data frame contains missing values?\n",
"\n",
"- A simple approach is to use the base R function `anyNA()`, which returns a single logical value: `TRUE` if the object contains at least one `NA`, and `FALSE` otherwise.\n"
],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" anyNA()"
],
"outputs": [],
"metadata": {
"id": "G--DQutAYltj"
}
},
{
"cell_type": "markdown",
"source": [
"Great, there seems to be some missing data! That's a good place to start.\n",
"\n",
"- Another approach is to use the function `is.na()`, which returns a logical matrix of the same shape as the data frame, marking each missing element with `TRUE`.\n"
],
"metadata": {
"id": "mU-7-SB6YokF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "W-DxDOR4YxSW"
}
},
{
"cell_type": "markdown",
"source": [
"That gets the job done, but with a data frame this large, checking every row and column by eye is impractical 😴.\n",
"\n",
"- A more practical approach is to calculate the total number of missing values in each column:\n"
],
"metadata": {
"id": "xUWxipKYY0o7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"outputs": [],
"metadata": {
"id": "ZRBWV6P9ZArL"
}
},
{
"cell_type": "markdown",
"source": [
"Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.\n",
"\n",
"> In addition to its impressive collection of packages and functions, R also offers excellent documentation. For example, you can use `help(colSums)` or `?colSums` to learn more about the function.\n"
],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
},
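{
"cell_type": "markdown",
"source": [
"As a rough sketch of the counting approach on a tiny, made-up data frame, the next cell counts missing values per column with `colSums()` and also computes the share of missing values with `colMeans()`. This works because the mean of a logical vector is the proportion of `TRUE` values. The data frame `toy` and its columns are hypothetical, and `dplyr` is loaded here only to make the pipe available.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Illustrative only: `toy` is a made-up data frame, not the pumpkins data\n",
"library(dplyr)\n",
"\n",
"toy <- data.frame(\n",
"  variety = c('PIE TYPE', NA, 'MINIATURE', 'PIE TYPE'),\n",
"  low_price = c(15, 18, NA, NA)\n",
")\n",
"\n",
"# Number of missing values in each column\n",
"missing_counts <- toy %>%\n",
"  is.na() %>%\n",
"  colSums()\n",
"\n",
"# Proportion of missing values in each column\n",
"missing_share <- toy %>%\n",
"  is.na() %>%\n",
"  colMeans()\n",
"\n",
"missing_counts\n",
"missing_share"
],
"outputs": [],
"metadata": {}
},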
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n"
],
"metadata": {
"id": "Ml7SDCLQcPvE"
}
},
{
"cell_type": "markdown",
"source": [
"### **How do we make it useful?**\n",
"\n",
"To display meaningful data in charts, you often need to organize the data in some way. In our case, calculating the average price of pumpkins for each month would reveal more about the patterns in our data. This brings us to another quick look at **dplyr**:\n",
"\n",
"#### `dplyr::group_by() %>% summarize()`\n",
"\n",
"Grouped aggregation in R can be performed with:\n",
"\n",
"`dplyr::group_by() %>% summarize()`\n",
"\n",
"- `dplyr::group_by()` shifts the unit of analysis from the entire dataset to individual groups, such as one group per month.\n",
"\n",
"- `dplyr::summarize()` creates a new data frame with one row per group, one column for each grouping variable, and one column for each summary statistic you specify.\n",
"\n",
"For instance, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins by the **Month** column and then calculate the **average price** for each month.\n"
],
"metadata": {
"id": "jMakvJZIcVkh"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month\n",
"new_pumpkins %>%\n",
"  group_by(Month) %>%\n",
"  summarise(mean_price = mean(Price))"
],
"outputs": [],
"metadata": {
"id": "6kVSUa2Bcilf"
}
},
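{
"cell_type": "markdown",
"source": [
"The same pattern extends to several summaries at once. The next cell is a small sketch on made-up data: `toy_prices`, `toy_summary`, and `n_rows` are hypothetical names, not part of the lesson's dataset. It also assumes you want to skip missing prices, which is what `na.rm = TRUE` does; without it, a single `NA` in a group makes that group's mean `NA`.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Illustrative only: `toy_prices` is a made-up data frame, not `new_pumpkins`\n",
"library(dplyr)\n",
"\n",
"toy_prices <- data.frame(\n",
"  Month = c(9, 9, 10, 10, 10),\n",
"  Price = c(10, 14, 20, NA, 24)\n",
")\n",
"\n",
"# One row per month, with the group size and the mean of the non-missing prices\n",
"toy_summary <- toy_prices %>%\n",
"  group_by(Month) %>%\n",
"  summarise(\n",
"    n_rows = n(),\n",
"    mean_price = mean(Price, na.rm = TRUE)\n",
"  )\n",
"\n",
"toy_summary"
],
"outputs": [],
"metadata": {}
},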
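{
"cell_type": "markdown",
"source": [
"Succinct!✨\n",
"\n",
"Categorical features like months are best visualized with a bar plot 📊. The layers used for creating bar charts are `geom_bar()` and `geom_col()`: `geom_bar()` makes bar heights proportional to the number of rows in each group, while `geom_col()` uses values already present in the data, such as our mean prices. Check `?geom_bar` for more details.\n",
"\n",
"Let’s create one!\n"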
],
"metadata": {
"id": "Kds48GUBcj3W"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\n",
"new_pumpkins %>%\n",
"  group_by(Month) %>%\n",
"  summarise(mean_price = mean(Price)) %>%\n",
"  ggplot(aes(x = Month, y = mean_price)) +\n",
"  geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
"  ylab(\"Pumpkin Price\")"
],
"outputs": [],
"metadata": {
"id": "VNbU1S3BcrxO"
}
},
{
"cell_type": "markdown",
"source": [
"🤩🤩 This is a much more useful data visualization! It appears to show that pumpkin prices peak in September and October. Does that align with your expectations? Why or why not?\n",
"\n",
"Well done on completing the second lesson 👏! You prepared your data for building a model and discovered additional insights through visualizations!\n"
],
"metadata": {
"id": "zDm0VOzzcuzR"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.\n"
]
}
]
}