{ "nbformat": 4, "nbformat_minor": 2, "metadata": { "colab": { "name": "lesson_2-R.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "ir", "display_name": "R" }, "language_info": { "name": "R" }, "coopTranslator": { "original_hash": "f3c335f9940cfd76528b3ef918b9b342", "translation_date": "2025-11-18T19:21:08+00:00", "source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb", "language_code": "pcm" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Build a Regression Model: Prepare and Visualize Data\n", "\n", "## **Linear Regression for Pumpkins - Lesson 2**\n", "#### Introduction\n", "\n", "Now that you have the tools you need to start building machine learning models with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it is very important to know how to ask the right question so that you can unlock the full potential of your dataset.\n", "\n", "In this lesson, you will learn:\n", "\n", "- How to prepare your data for model-building.\n", "\n", "- How to use `ggplot2` for data visualization.\n", "\n", "The question you want answered determines the type of ML algorithm you will use. And the quality of the answer you get back depends heavily on the nature of your data.\n", "\n", "Let's see this by working through a practical exercise.\n", "\n",
"> A refresher: The pipe operator (`%>%`) performs operations in sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n"
],
"metadata": {
"id": "REWcIv9yX29v"
}
},
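{
"cell_type": "markdown",
"source": [
"As a small sketch of the pipe in action (the numbers are made up for illustration and are not part of the lesson's dataset), the two expressions below compute the same thing. One nests the calls; the other reads left to right as \"take the vector, and then take square roots, and then sum\".\n",
"\n",
"```r\n",
"library(dplyr)  # dplyr re-exports the %>% pipe from magrittr\n",
"\n",
"# Hypothetical toy vector, used only to illustrate the pipe\n",
"x <- c(1, 4, 9)\n",
"\n",
"# Nested call: read from the inside out\n",
"nested <- sum(sqrt(x))\n",
"\n",
"# Piped call: read left to right\n",
"piped <- x %>% sqrt() %>% sum()\n",
"\n",
"# TRUE when both forms give the same result\n",
"identical(nested, piped)\n",
"```\n",
"\n",
"Both forms are equivalent; the piped version tends to be easier to read once several steps are chained.\n"
],
"metadata": {}
},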
{
"cell_type": "markdown",
"source": [
"## 2. Check for missing data\n",
"\n",
"One of the most common issues data scientists have to deal with is incomplete or missing data. R represents missing or unknown values with a special sentinel value: `NA` (Not Available).\n",
"\n",
"So how would we know that the data frame contains missing values?\n",
"\n",
"- One straightforward way is to use the base R function `anyNA`, which returns a single logical value, `TRUE` or `FALSE`.\n"
],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" anyNA()"
],
"outputs": [],
"metadata": {
"id": "G--DQutAYltj"
}
},
{
"cell_type": "markdown",
"source": [
"Great, there seems to be some missing data! That's a good place to start.\n",
"\n",
"- Another way is to use the function `is.na()`, which marks each missing element in every column with a logical `TRUE`.\n"
],
"metadata": {
"id": "mU-7-SB6YokF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "W-DxDOR4YxSW"
}
},
{
"cell_type": "markdown",
"source": [
"Okay, that gets the job done, but with a data frame this large it would be inefficient, and practically impossible, to review every row and column individually.\n",
"\n",
"- A more practical way is to calculate the total number of missing values in each column:\n"
],
"metadata": {
"id": "xUWxipKYY0o7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"outputs": [],
"metadata": {
"id": "ZRBWV6P9ZArL"
}
},
{
"cell_type": "markdown",
"source": [
"Much better! There is some missing data, but it may not matter for the task at hand. Let's see what further analysis brings.\n",
"\n",
"> Along with its many useful packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function.\n"
],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
},
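{
"cell_type": "markdown",
"source": [
"The three checks above can be tried on a tiny data frame where the answer is easy to verify by eye. This is only a sketch: the `toy` data frame and its values are hypothetical and are not taken from the pumpkin dataset. The last two lines show optional variations, a tidyverse-style count with `across()` and the share of missing values per column with `colMeans()`.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"\n",
"# Hypothetical toy data (not the lesson's dataset)\n",
"toy <- tibble(\n",
"  Package = c(\"1/2 bushel cartons\", NA, \"1 1/9 bushel cartons\"),\n",
"  Price   = c(15, 18, NA),\n",
"  Type    = c(NA, NA, \"PIE TYPE\")\n",
")\n",
"\n",
"# Is anything missing at all? Returns a single TRUE or FALSE\n",
"any_missing <- anyNA(toy)\n",
"\n",
"# Number of missing values per column\n",
"na_counts <- toy %>% is.na() %>% colSums()\n",
"\n",
"# The same counts, returned as a one-row tibble\n",
"na_counts_tbl <- toy %>% summarise(across(everything(), ~ sum(is.na(.x))))\n",
"\n",
"# Fraction of missing values per column\n",
"na_share <- toy %>% is.na() %>% colMeans()\n",
"```\n",
"\n",
"Looking at the share of missing values, rather than the raw count, can make it easier to judge whether a column is still usable.\n"
],
"metadata": {}
},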
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n"
],
"metadata": {
"id": "VrDwF031avlR"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::case_when()\n",
"\n",
"**But wait! There is one more thing to do**\n",
"\n",
"Did you notice that the bushel amount varies from row to row? You need to normalize the pricing so that it shows the price per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
"\n",
"We will use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` lets you vectorise multiple `if_else()` statements.\n"
],
"metadata": {
"id": "mLpw2jH4a0tx"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
" slice_head(n = 30)"
],
"outputs": [],
"metadata": {
"id": "P68kLVQmbM6I"
}
},
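{
"cell_type": "markdown",
"source": [
"To see the arithmetic behind the conversion above, here is a minimal sketch on three made-up rows. The `toy` data frame and the `Price_per_bushel` column name are hypothetical and used only for illustration. `case_when()` checks the conditions in order and uses the first one that matches, so the `TRUE ~ Price` line acts as the fallback for packages that are already priced per whole bushel.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"library(stringr)\n",
"\n",
"# Hypothetical toy data (not the lesson's dataset)\n",
"toy <- tibble(\n",
"  Package = c(\"1 1/9 bushel cartons\", \"1/2 bushel cartons\", \"bushel cartons\"),\n",
"  Price   = c(20, 10, 15)\n",
")\n",
"\n",
"toy_normalized <- toy %>%\n",
"  mutate(Price_per_bushel = case_when(\n",
"    str_detect(Package, \"1 1/9\") ~ Price / (1 + 1/9),  # 20 / (10/9) is about 18\n",
"    str_detect(Package, \"1/2\")   ~ Price / (1/2),      # 10 / 0.5 = 20\n",
"    TRUE                         ~ Price                # already per bushel\n",
"  ))\n",
"\n",
"toy_normalized\n",
"```\n",
"\n",
"Dividing by the package size is what turns \"price per package\" into \"price per bushel\": a half-bushel package that costs 10 corresponds to 20 per full bushel.\n"
],
"metadata": {}
},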
{
"cell_type": "markdown",
"source": [
"Now we can analyze the pricing per unit based on the bushel measurement. All this study of bushels of pumpkins goes to show how `important` it is to `understand the nature of your data`!\n",
"\n",
"> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, because it is a volume measurement. \"A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with a bushel-to-pound conversion, and instead price by the bushel.\n",
">\n",
"> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: small pumpkins are pricier than big ones, probably because many more of them fit in a bushel, given the unused space taken up by one big hollow pie pumpkin.\n"
],
"metadata": {
"id": "pS2GNPagbSdb"
}
},
{
"cell_type": "markdown",
"source": [
"Now lastly, for the sheer sake of adventure, let's also move the Month column to the first position, i.e. `before` the column `Package`.\n",
"\n",
"`dplyr::relocate()` is used to change column positions.\n"
],
"metadata": {
"id": "qql1SowfbdnP"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Move the Month column so that it comes before Package\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"new_pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "JJ1x6kw8bixF"
}
},
{
"cell_type": "markdown",
"source": [
"Good job! You now have a clean, tidy dataset on which you can build your new regression model!\n"
],
"metadata": {
"id": "y8TJ0Za_bn5Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 4. Data visualization with ggplot2\n"
],
"metadata": {
"id": "Ml7SDCLQcPvE"
}
},
{
"cell_type": "markdown",
"source": [
"### **How do we make it useful?**\n",
"\n",
"To get charts to display useful data, you usually need to group the data somehow. In our case, for instance, finding the average price of pumpkins for each month would give more insight into the underlying patterns in our data. This leads us to one more **dplyr** flyby:\n",
"\n",
"#### `dplyr::group_by() %>% summarize()`\n",
"\n",
"Grouped aggregation in R can be computed easily using\n",
"\n",
"`dplyr::group_by() %>% summarize()`\n",
"\n",
"- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups, such as per month.\n",
"\n",
"- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.\n",
"\n",
"For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins based on the **Month** column and then find the **mean price** for each month.\n"
],
"metadata": {
"id": "jMakvJZIcVkh"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month\r\n",
"new_pumpkins %>%\r\n",
" group_by(Month) %>% \r\n",
" summarise(mean_price = mean(Price))"
],
"outputs": [],
"metadata": {
"id": "6kVSUa2Bcilf"
}
},
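{
"cell_type": "markdown",
"source": [
"A minimal sketch of the same pattern on made-up numbers, so the group means can be checked by hand. The `toy` data frame is hypothetical. It also shows two optional additions that are not part of the lesson's code: `na.rm = TRUE`, which tells `mean()` to skip missing prices instead of returning `NA` for the whole group, and `n()`, which counts the rows in each group.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"\n",
"# Hypothetical toy data (not the lesson's dataset)\n",
"toy <- tibble(\n",
"  Month = c(9, 9, 10, 10, 10),\n",
"  Price = c(15, 17, 20, NA, 22)\n",
")\n",
"\n",
"monthly <- toy %>%\n",
"  group_by(Month) %>%\n",
"  summarise(\n",
"    mean_price = mean(Price, na.rm = TRUE),  # ignore missing prices\n",
"    n_rows     = n()                         # rows per month, including NA prices\n",
"  )\n",
"\n",
"monthly\n",
"```\n",
"\n",
"Here month 9 averages (15 + 17) / 2 = 16 and month 10 averages (20 + 22) / 2 = 21, with 2 and 3 rows respectively.\n"
],
"metadata": {}
},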
{
"cell_type": "markdown",
"source": [
"Succinct! ✨\n",
"\n",
"Categorical features such as months are better represented using a bar plot. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n",
"\n",
"Let's whip one up!\n"
],
"metadata": {
"id": "Kds48GUBcj3W"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\r\n",
"new_pumpkins %>%\r\n",
" group_by(Month) %>% \r\n",
" summarise(mean_price = mean(Price)) %>% \r\n",
" ggplot(aes(x = Month, y = mean_price)) +\r\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
" ylab(\"Pumpkin Price\")"
],
"outputs": [],
"metadata": {
"id": "VNbU1S3BcrxO"
}
},
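{
"cell_type": "markdown",
"source": [
"The chart above uses `geom_col()` rather than `geom_bar()`. As a rough sketch of the difference, on a hypothetical `toy` data frame that is not part of the lesson: `geom_bar()` by default counts the rows in each category, while `geom_col()` draws bars whose heights come from a `y` value you supply, which is why it pairs naturally with a summarised table.\n",
"\n",
"```r\n",
"library(ggplot2)\n",
"\n",
"# Hypothetical toy data (not the lesson's dataset)\n",
"toy <- data.frame(\n",
"  Month = factor(c(9, 9, 10)),\n",
"  Price = c(15, 17, 20)\n",
")\n",
"\n",
"# geom_bar(): bar height is the number of rows in each Month\n",
"count_plot <- ggplot(toy, aes(x = Month)) +\n",
"  geom_bar()\n",
"\n",
"# geom_col(): bar height is taken from the y aesthetic,\n",
"# here the mean price per month computed beforehand\n",
"toy_means <- aggregate(Price ~ Month, data = toy, FUN = mean)\n",
"mean_plot <- ggplot(toy_means, aes(x = Month, y = Price)) +\n",
"  geom_col()\n",
"```\n",
"\n",
"Printing `count_plot` or `mean_plot` renders the corresponding chart.\n"
],
"metadata": {}
},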
{
"cell_type": "markdown",
"source": [
"🤩🤩 This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?\n",
"\n",
"Congratulations on finishing the second lesson! You prepared your data for model building, and then uncovered more insights using visualizations!\n"
],
"metadata": {
"id": "zDm0VOzzcuzR"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n\n\n**Disclaimer**: \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n\n"
]
}
]
}