{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_2-R.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
},
"coopTranslator": {
"original_hash": "f3c335f9940cfd76528b3ef918b9b342",
"translation_date": "2025-11-18T19:21:08+00:00",
"source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb",
"language_code": "pcm"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Build a Regression Model: Prepare and Visualize Data\n",
"\n",
"## **Linear Regression for Pumpkins - Lesson 2**\n",
"#### Introduction\n",
"\n",
"Now that you have the tools you need to start building machine learning models with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it is very important to know how to ask the right question so that you can unlock the full potential of your dataset.\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- How to prepare your data for model-building.\n",
"\n",
"- How to use `ggplot2` for data visualization.\n",
"\n",
"The question you want answered determines the type of ML algorithm you will use. And the quality of the answer you get back depends heavily on the nature of your data.\n",
"\n",
"Let's see this by working through a practical exercise.\n",
"\n",
"<p >\n",
" <img src=\"../../../../../../translated_images/unruly_data.0eedc7ced92d2d91.pcm.jpg\"\n",
" width=\"700\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
],
"metadata": {
"id": "Pg5aexcOPqAZ"
}
},
{
"cell_type": "markdown",
"source": [
"## 1. Import pumpkin data and load the Tidyverse\n",
"\n",
"We'll need the following package to work through this lesson:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more fun!\n",
"\n",
"You can install it with:\n",
"\n",
"`install.packages(c(\"tidyverse\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs any that are missing.\n"
],
"metadata": {
"id": "dc5WhyVdXAjR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
"pacman::p_load(tidyverse)"
],
"outputs": [],
"metadata": {
"id": "GqPYUZgfXOBt"
}
},
{
"cell_type": "markdown",
"source": [
"Now, let's load the package and the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!\n"
],
"metadata": {
"id": "kvjDTPDSXRr2"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core Tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the pumpkins data\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(pumpkins)\n",
"\n",
"\n",
"# Print the first 50 rows of the data set\n",
"pumpkins %>% \n",
" slice_head(n = 50)"
],
"outputs": [],
"metadata": {
"id": "VMri-t2zXqgD"
}
},
{
"cell_type": "markdown",
"source": [
"A quick `glimpse()` shows that there are blanks and a mix of string (`chr`) and numeric (`dbl`) data. The `Date` column is of type character, and there is an odd column called `Package` whose values mix `sacks`, `bins` and other units. The data, in short, is a bit of a mess 😤.\n",
"\n",
"It is not very common to be given a dataset that is ready to use for an ML model straight out of the box. But worry not: in this lesson, you will learn how to prepare a raw dataset using standard R libraries 🧑‍🔧. You will also learn several techniques for visualizing the data.📈📊\n",
"<br>\n",
"\n",
"> A refresher: The pipe operator (`%>%`) performs operations in sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n"
],
"metadata": {
"id": "REWcIv9yX29v"
}
},
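{
"cell_type": "markdown",
"source": [
"A minimal sketch of the pipe in action, using a small made-up vector rather than the pumpkin data. It assumes the `magrittr` package (which provides `%>%` and is installed along with the Tidyverse) is available; the names `x`, `nested` and `piped` are purely illustrative.\n",
"\n",
"```r\n",
"library(magrittr)  # provides %>%; dplyr re-exports the same operator\n",
"\n",
"# Illustrative input, not part of the lesson data\n",
"x <- c(1, 4, 9)\n",
"\n",
"# Nested call: has to be read from the inside out\n",
"nested <- sum(sqrt(x))\n",
"\n",
"# Piped call: take x, and then take square roots, and then sum\n",
"piped <- x %>% sqrt() %>% sum()\n",
"\n",
"# Both forms give the same result\n",
"identical(nested, piped)\n",
"```\n",
"\n",
"Both forms compute the same value; the piped version simply reads in the order in which the steps are performed.\n"
],
"metadata": {}
},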
{
"cell_type": "markdown",
"source": [
"## 2. Check for missing data\n",
"\n",
"One of the most common issues data scientists face is incomplete or missing data. R represents missing or unknown values with the special sentinel value `NA` (Not Available).\n",
"\n",
"So how do we find out whether the data frame contains missing values? \n",
"<br>\n",
"- One straightforward way is to use the base R function `anyNA`, which returns a single logical value, `TRUE` or `FALSE`.\n"
],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" anyNA()"
],
"outputs": [],
"metadata": {
"id": "G--DQutAYltj"
}
},
{
"cell_type": "markdown",
"source": [
"Great, there seems to be some missing data! That's a good place to start.\n",
"\n",
"- Another way is to use the function `is.na()`, which marks each missing element with a logical `TRUE`.\n"
],
"metadata": {
"id": "mU-7-SB6YokF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "W-DxDOR4YxSW"
}
},
{
"cell_type": "markdown",
"source": [
"Okay, that got the job done, but with a data frame this large it would be inefficient, and practically impossible, to review every row and column one by one😴.\n",
"\n",
"- A more practical way is to count the missing values in each column:\n"
],
"metadata": {
"id": "xUWxipKYY0o7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"outputs": [],
"metadata": {
"id": "ZRBWV6P9ZArL"
}
},
{
"cell_type": "markdown",
"source": [
"Much better! There is some missing data, but it may not matter for the task at hand. Let's see what further analysis brings up.\n",
"\n",
"> Along with its many great packages and functions, R has very good documentation. For example, use `help(colSums)` or `?colSums` to find out more about the function.\n"
],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
},
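{
"cell_type": "markdown",
"source": [
"To see the three checks from this section side by side, here is a small sketch on a hypothetical data frame (`toy`, invented for illustration and unrelated to the pumpkin data). It uses only base R, so no extra packages are assumed.\n",
"\n",
"```r\n",
"# Hypothetical toy data with one NA in column a and one in column b\n",
"toy <- data.frame(\n",
"  a = c(1, NA, 3),\n",
"  b = c('x', 'y', NA),\n",
"  c = c(TRUE, FALSE, TRUE),\n",
"  stringsAsFactors = FALSE\n",
")\n",
"\n",
"# One logical value for the whole data frame\n",
"has_missing <- anyNA(toy)\n",
"\n",
"# Logical matrix with TRUE wherever an element is missing\n",
"missing_mask <- is.na(toy)\n",
"\n",
"# Number of missing values per column (a = 1, b = 1, c = 0)\n",
"missing_per_col <- colSums(missing_mask)\n",
"missing_per_col\n",
"```\n",
"\n",
"`anyNA()` answers whether anything is missing, while `colSums(is.na())` shows where the gaps are concentrated.\n"
],
"metadata": {}
},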
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n",
"\n",
"<p >\n",
" <img src=\"../../../../../../translated_images/dplyr_wrangling.f5f99c64fd4580f1.pcm.png\"\n",
" width=\"569\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../../../../../translated_images/dplyr_wrangling.f5f99c64fd4580f1.pcm.png)<br/>Artwork by \\@allison_horst-->\n"
],
"metadata": {
"id": "o4jLY5-VZO2C"
}
},
{
"cell_type": "markdown",
"source": [
"[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs to help you solve the most common data manipulation challenges. In this section, we'll explore some of the dplyr verbs!\n"
],
"metadata": {
"id": "i5o33MQBZWWw"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::select()\n",
"\n",
"`select()` is a function in the `dplyr` package that helps you pick columns to keep or exclude.\n",
"\n",
"To make your data frame easier to work with, drop the columns you don't need by using `select()` to keep only the ones you do.\n",
"\n",
"For this exercise, our analysis will use the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns.\n"
],
"metadata": {
"id": "x3VGMAGBZiUr"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(Package, `Low Price`, `High Price`, Date)\n",
"\n",
"\n",
"# Print data set\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "F_FgxQnVZnM0"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::mutate()\n",
"\n",
"`mutate()` is a function in the `dplyr` package that helps you create or modify columns while keeping the existing columns.\n",
"\n",
"The general structure of `mutate` is:\n",
"\n",
"`data %>% mutate(new_column_name = what_it_contains)`\n",
"\n",
"Let's take `mutate` for a spin on the `Date` column with the following steps:\n",
"\n",
"1. Convert the dates (currently of type character) to date objects (these are US dates, so the format is `MM/DD/YYYY`).\n",
"\n",
"2. Extract the month from the dates into a new column.\n",
"\n",
"In R, the package [lubridate](https://lubridate.tidyverse.org/) makes it easier to work with date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()` and `lubridate::month()` to achieve the objectives above. We can then drop the Date column, since we won't need it again in later steps.\n"
],
"metadata": {
"id": "2KKo0Ed9Z1VB"
}
},
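{
"cell_type": "markdown",
"source": [
"As a quick illustration of the two lubridate calls on their own, here is a sketch that parses a single made-up date string. The value `9/24/2016` and the names `raw_date`, `parsed` and `month_number` are assumptions for illustration only.\n",
"\n",
"```r\n",
"library(lubridate)\n",
"\n",
"# Hypothetical US-style date string (MM/DD/YYYY)\n",
"raw_date <- '9/24/2016'\n",
"\n",
"# mdy() parses month-day-year text into a Date object\n",
"parsed <- mdy(raw_date)\n",
"\n",
"# month() extracts the month as a number (September is 9)\n",
"month_number <- month(parsed)\n",
"month_number\n",
"```\n",
"\n",
"The same two calls are applied to the whole `Date` column inside `mutate()` in the next cell.\n"
],
"metadata": {}
},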
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load lubridate\n",
"library(lubridate)\n",
"\n",
"pumpkins <- pumpkins %>% \n",
" # Convert the Date column to a date object\n",
" mutate(Date = mdy(Date)) %>% \n",
" # Extract month from Date\n",
" mutate(Month = month(Date)) %>% \n",
" # Drop Date column\n",
" select(-Date)\n",
"\n",
"# View the first few rows\n",
"pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "5joszIVSZ6xe"
}
},
{
"cell_type": "markdown",
"source": [
"Woohoo! 🤩\n",
"\n",
"Next, let's create a new column called `Price` that holds the average price of a pumpkin. To fill it, we take the average of the `Low Price` and `High Price` columns.\n"
],
"metadata": {
"id": "nIgLjNMCZ-6Y"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a new column Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
"\n",
"# View the first few rows of the data\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "Zo0BsqqtaJw2"
}
},
{
"cell_type": "markdown",
"source": [
"Yesss!💪\n",
"\n",
"\"But wait!\", you'll say after skimming through the whole data set with `View(pumpkins)`, \"There's something odd here!\"🤔\n",
"\n",
"If you look at the `Package` column, you'll see that pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes of varying sizes.\n",
"\n",
"Let's verify this:\n"
],
"metadata": {
"id": "p77WZr-9aQAR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Verify the distinct observations in Package column\n",
"pumpkins %>% \n",
" distinct(Package)"
],
"outputs": [],
"metadata": {
"id": "XISGfh0IaUy6"
}
},
{
"cell_type": "markdown",
"source": [
"Amazing!👏\n",
"\n",
"Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only the pumpkins with the string *bushel* in the `Package` column and put the result in a new data frame, `new_pumpkins`.\n"
],
"metadata": {
"id": "7sMjiVujaZxY"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::filter() and stringr::str_detect()\n",
"\n",
"[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data containing only the **rows** that satisfy your conditions, in this case pumpkins with the string *bushel* in the `Package` column.\n",
"\n",
"[`stringr::str_detect()`](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n",
"\n",
"The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations.\n"
],
"metadata": {
"id": "L8Qfcs92ageF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Retain only pumpkins with \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(Package, \"bushel\"))\n",
"\n",
"# Get the dimensions of the new data\n",
"dim(new_pumpkins)\n",
"\n",
"# View a few rows of the new data\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "hy_SGYREampd"
}
},
{
"cell_type": "markdown",
"source": [
"You can see that we have narrowed the data down to 415 or so rows containing pumpkins sold by the bushel.🤩 \n",
"<br>\n"
],
"metadata": {
"id": "VrDwF031avlR"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::case_when()\n",
"\n",
"**But wait! There's one more thing to do**\n",
"\n",
"Did you notice that the bushel amount varies per row? You need to normalize the pricing so that it shows the price per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
"\n",
"We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` lets you vectorise multiple `if_else()` statements.\n"
],
"metadata": {
"id": "mLpw2jH4a0tx"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
" slice_head(n = 30)"
],
"outputs": [],
"metadata": {
"id": "P68kLVQmbM6I"
}
},
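{
"cell_type": "markdown",
"source": [
"As a rough check of the arithmetic, here is a sketch on a hypothetical three-row data frame (`toy`, with a made-up price of 15 in every row), assuming `dplyr` and `stringr` are installed. The result is written to a new, illustrative column `Price_per_bushel` so the input price stays visible.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"library(stringr)\n",
"\n",
"# Hypothetical package labels and prices, not taken from the real data\n",
"toy <- data.frame(\n",
"  Package = c('1 1/9 bushel cartons', '1/2 bushel cartons', 'bushel cartons'),\n",
"  Price = c(15, 15, 15),\n",
"  stringsAsFactors = FALSE\n",
")\n",
"\n",
"toy <- toy %>%\n",
"  mutate(Price_per_bushel = case_when(\n",
"    # 1 1/9 bushel holds more than one bushel, so the per-bushel price is lower\n",
"    str_detect(Package, '1 1/9') ~ Price / (1 + 1/9),\n",
"    # 1/2 bushel holds half a bushel, so the per-bushel price doubles\n",
"    str_detect(Package, '1/2') ~ Price / (1/2),\n",
"    # Anything else is treated as already priced per bushel\n",
"    TRUE ~ Price\n",
"  ))\n",
"\n",
"# Per-bushel prices: 13.5, 30 and 15 (up to floating-point rounding)\n",
"toy$Price_per_bushel\n",
"```\n",
"\n",
"Conditions are evaluated in order and the first match wins, which is why the catch-all `TRUE ~ Price` comes last.\n"
],
"metadata": {}
},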
{
"cell_type": "markdown",
"source": [
"Now we can analyze the pricing per unit based on the bushel measurement. All this study of bushels of pumpkins goes to show how `important` it is to `understand the nature of your data`!\n",
"\n",
"> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, because it is a volume measurement. \"A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with a bushel-to-pound conversion, and instead price by the bushel.\n",
">\n",
"> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: small pumpkins cost much more than big ones, probably because so many more of them fit in a bushel, given the space taken up by one big hollow pie pumpkin.\n"
],
"metadata": {
"id": "pS2GNPagbSdb"
}
},
{
"cell_type": "markdown",
"source": [
"Finally, for the sheer sake of adventure 💁‍♀️, let's also move the Month column to the first position, that is, `before` the column `Package`.\n",
"\n",
"`dplyr::relocate()` is used to change column positions.\n"
],
"metadata": {
"id": "qql1SowfbdnP"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Move the Month column so that it comes before Package\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"new_pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "JJ1x6kw8bixF"
}
},
{
"cell_type": "markdown",
"source": [
"Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model! \n",
"<br>\n"
],
"metadata": {
"id": "y8TJ0Za_bn5Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 4. Data visualization with ggplot2\n",
"\n",
"<p >\n",
" <img src=\"../../../../../../translated_images/data-visualization.54e56dded7c1a804.pcm.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--![Infographic by Dasani Madipalli](../../../../../../translated_images/data-visualization.54e56dded7c1a804.pcm.png){width=\"600\"}-->\n",
"\n",
"There is a *wise* saying that goes like this:\n",
"\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n",
"\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, such as plots, graphs, and charts, that show different aspects of the data. In this way, they can visually reveal relationships and gaps that are otherwise hard to uncover.\n",
"\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\n",
"\n",
"R offers several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and versatile. `ggplot2` lets you compose graphs by **combining independent components**.\n",
"\n",
"Let's start with a simple scatter plot of the Price and Month columns.\n",
"\n",
"In this case, we start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and an aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), then add layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\n"
],
"metadata": {
"id": "mYSH6-EtbvNa"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Set a theme for the plots\n",
"theme_set(theme_light())\n",
"\n",
"# Create a scatter plot\n",
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n",
"p + geom_point()"
],
"outputs": [],
"metadata": {
"id": "g2YjnGeOcLo4"
}
},
{
"cell_type": "markdown",
"source": [
"Is this a useful plot 🤷? Not really. It does nothing special beyond displaying your data as a spread of points for each month. \n",
"<br>\n"
],
"metadata": {
"id": "Ml7SDCLQcPvE"
}
},
{
"cell_type": "markdown",
"source": [
"### **How do we make it useful?**\n",
"\n",
"To get charts to display useful data, you usually need to group the data somehow. In our case, finding the average price of pumpkins for each month would give more insight into the underlying patterns in the data. This leads us to one more **dplyr** fly-by:\n",
"\n",
"#### `dplyr::group_by() %>% summarize()`\n",
"\n",
"Grouped aggregation in R can be computed easily using\n",
"\n",
"`dplyr::group_by() %>% summarize()`\n",
"\n",
"- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups, such as per month.\n",
"\n",
"- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.\n",
"\n",
"For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins based on the **Month** column and then find the **mean price** for each month.\n"
],
"metadata": {
"id": "jMakvJZIcVkh"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month\n",
"new_pumpkins %>%\n",
" group_by(Month) %>% \n",
" summarise(mean_price = mean(Price))"
],
"outputs": [],
"metadata": {
"id": "6kVSUa2Bcilf"
}
},
{
"cell_type": "markdown",
"source": [
"Succinct!✨\n",
"\n",
"Categorical features such as months are better represented with a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n",
"\n",
"Let's whip one up!\n"
],
"metadata": {
"id": "Kds48GUBcj3W"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\n",
"new_pumpkins %>%\n",
" group_by(Month) %>% \n",
" summarise(mean_price = mean(Price)) %>% \n",
" ggplot(aes(x = Month, y = mean_price)) +\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
" ylab(\"Pumpkin Price\")"
],
"outputs": [],
"metadata": {
"id": "VNbU1S3BcrxO"
}
},
{
"cell_type": "markdown",
"source": [
"🤩🤩This is a more useful data visualization! It seems to indicate that the highest prices for pumpkins occur in September and October. Does that meet your expectation? Why or why not?\n",
"\n",
"Congratulations on finishing the second lesson 👏! You prepared your data for model building, and then uncovered more insights through visualizations!\n"
],
"metadata": {
"id": "zDm0VOzzcuzR"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n\n<!-- CO-OP TRANSLATOR DISCLAIMER START -->\n**Disclaimer**: \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that machine translation may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For important information, review by a professional human translator is recommended. We are not liable for any misunderstanding or misinterpretation arising from the use of this translation.\n<!-- CO-OP TRANSLATOR DISCLAIMER END -->\n"
]
}
]
}