# Build a regression model: prepare and visualize data

## **Linear Regression for Pumpkins - Lesson 2**
#### Introduction

Now that you have the tools needed to start building machine learning models using Tidymodels and the Tidyverse, you're ready to begin asking questions about your data. When working with data and applying ML solutions, it's crucial to know how to ask the right questions to fully unlock the potential of your dataset.

In this lesson, you will learn:

-   How to prepare your data for building models.

-   How to use `ggplot2` for visualizing data.

The type of question you want answered will determine which ML algorithms you use. Additionally, the quality of the answer you receive will largely depend on the characteristics of your data.

Let's explore this through a practical exercise.

<p >
   <img src="../../images/unruly_data.jpg"
   width="700"/>
   <figcaption>Artwork by @allison_horst</figcaption>




## 1. Importing pumpkins data and loading the Tidyverse

We'll need the following packages to work through this lesson:

-   `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [set of R packages](https://www.tidyverse.org/packages) created to make data science quicker, simpler, and more enjoyable!

You can install them using:

`install.packages(c("tidyverse"))`

The script below verifies if you already have the necessary packages for this module and installs any missing ones for you.


In [None]:
suppressWarnings(if(!require("pacman")) install.packages("pacman"))
pacman::p_load(tidyverse)

Now, let's fire up some packages and load the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!


In [None]:
# Load the core Tidyverse packages
library(tidyverse)

# Import the pumpkins data
pumpkins <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv")


# Get a glimpse and dimensions of the data
glimpse(pumpkins)


# Print the first 50 rows of the data set
pumpkins %>% 
  slice_head(n = 50)

A quick `glimpse()` immediately reveals that there are missing values and a mix of strings (`chr`) and numeric data (`dbl`). The `Date` column is stored as a character type, and there's an unusual column called `Package` where the data includes a mix of `sacks`, `bins`, and other values. In short, the dataset is a bit messy 😤.

It's actually quite rare to receive a dataset that's perfectly ready to use for building a machine learning model right away. But don't worry: this lesson will teach you how to clean and prepare a raw dataset using standard R libraries 🧑‍🔧. You'll also learn different techniques to visualize the data. 📈📊
<br>

> Quick reminder: The pipe operator (`%>%`) allows you to perform operations in a logical sequence by passing an object forward into a function or expression. You can think of the pipe operator as saying "and then" in your code.
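
To make the "and then" reading concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The `prices` vector is made up for illustration and is not part of the pumpkins data.

```r
library(dplyr)  # attaching dplyr makes the %>% pipe available

# Made-up numbers, used only to illustrate the pipe
prices <- c(15, 18, 24)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(prices)), 2)

# Piped call: take prices, and then sqrt, and then mean, and then round
piped <- prices %>%
  sqrt() %>%
  mean() %>%
  round(2)

# Both forms compute the same value
identical(nested, piped)
```

Both versions do the same work; the piped one simply lists the steps in the order they happen.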


## 2. Check for missing data

One of the most common challenges data scientists face is handling incomplete or missing data. R uses a special sentinel value, `NA` (Not Available), to represent missing or unknown values.

How can we determine if the data frame contains missing values?
<br>
-   A simple approach is to use the base R function `anyNA`, which returns the logical values `TRUE` or `FALSE`.


In [None]:
pumpkins %>% 
  anyNA()

Great, there seems to be some missing data! That's a good place to start.

-   Another approach would be to use the function `is.na()` which identifies the missing elements in each column with a logical `TRUE`.


In [None]:
pumpkins %>% 
  is.na() %>% 
  head(n = 7)

Okay, got the job done, but with a large data frame like this, reviewing all the rows and columns individually would be inefficient and practically impossible 😴.

-   A more practical approach would be to calculate the total number of missing values for each column:


In [None]:
pumpkins %>% 
  is.na() %>% 
  colSums()

Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.
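The same per-column count can also be written in a tidyverse style. The sketch below uses a small toy data frame with made-up values (not the real pumpkins data) and assumes a dplyr version that provides `across()` (1.0.0 or later).

```r
library(dplyr)

# Toy data frame with made-up values, standing in for the pumpkins data
toy <- tibble(
  Package = c("24 inch bins", NA, "1/2 bushel cartons"),
  `Low Price` = c(270, 160, NA),
  Grade = c(NA, NA, NA)
)

# Count the missing values in every column, returning a one-row tibble
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts

# Share of missing values per column; a value of 1 means the column is empty
na_share <- colMeans(is.na(toy))
na_share
```

Looking at the share rather than the raw count can make it easier to spot columns that are almost entirely empty.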

> In addition to its impressive collection of packages and functions, R also offers excellent documentation. For example, you can use `help(colSums)` or `?colSums` to learn more about the function.


## 3. Dplyr: A Grammar of Data Manipulation

<p>
   <img src="../../images/dplyr_wrangling.png"
   width="569"/>
   <figcaption>Illustration by @allison_horst</figcaption>




[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a framework for data manipulation that offers a consistent set of functions to address the most common challenges in handling data. In this section, we will dive into some of the key functions provided by dplyr!


#### dplyr::select()

`select()` is a function in the `dplyr` package that allows you to choose which columns to keep or remove.

To simplify working with your data frame, you can use `select()` to drop multiple columns and retain only the ones you need.

For example, in this exercise, our analysis will focus on the columns `Package`, `Low Price`, `High Price`, and `Date`. Let's select these columns.


In [None]:
# Select desired columns
pumpkins <- pumpkins %>% 
  select(Package, `Low Price`, `High Price`, Date)


# Print data set
pumpkins %>% 
  slice_head(n = 5)
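As a small illustration of the different ways `select()` can keep or drop columns, here is a sketch on a toy data frame. The values are made up; only the column names mirror the lesson's data.

```r
library(dplyr)

# Toy data frame with made-up values
toy <- tibble(
  `City Name` = c("BALTIMORE", "BOSTON"),
  Package = c("24 inch bins", "1/2 bushel cartons"),
  `Low Price` = c(270, 15),
  `High Price` = c(280, 18),
  Date = c("4/29/17", "9/24/16")
)

# Keep columns by name; backticks are needed for names containing spaces
kept <- toy %>% select(Package, `Low Price`, `High Price`, Date)

# Drop a column by prefixing it with a minus sign
dropped <- toy %>% select(-`City Name`)

# Select by pattern using a tidyselect helper
price_cols <- toy %>% select(ends_with("Price"))

names(price_cols)
```

Here `kept` and `dropped` end up with the same columns; which form reads better usually depends on whether you are keeping or removing the smaller set.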

#### dplyr::mutate()

`mutate()` is a function in the `dplyr` package that allows you to create or modify columns while keeping the existing ones intact.

The general structure of `mutate` is:

`data %>% mutate(new_column_name = what_it_contains)`

Let's explore `mutate` using the `Date` column by performing the following operations:

1. Convert the dates (currently stored as character type) into proper date objects (these are US dates, so the format is `MM/DD/YYYY`).

2. Extract the month from the dates into a new column.

In R, the [lubridate](https://lubridate.tidyverse.org/) package simplifies working with Date-time data. So, we'll use `dplyr::mutate()`, `lubridate::mdy()`, and `lubridate::month()` to accomplish the above tasks. We can drop the `Date` column since it won't be needed for further operations.


In [None]:
# Load lubridate
library(lubridate)

pumpkins <- pumpkins %>% 
  # Convert the Date column to a date object
  mutate(Date = mdy(Date)) %>% 
  # Extract month from Date
  mutate(Month = month(Date)) %>% 
  # Drop Date column
  select(-Date)

# View the first few rows
pumpkins %>% 
  slice_head(n = 7)

Woohoo! 🤩
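To see what the two lubridate helpers do in isolation, here is a minimal sketch on a few standalone strings. The strings are made up, written in a US month/day/year style with two-digit years.

```r
library(lubridate)

# Made-up US-style month/day/year strings
raw_dates <- c("4/29/17", "9/24/16", "10/1/16")

# mdy() parses the strings into a Date vector
parsed <- mdy(raw_dates)
class(parsed)

# month() extracts the month number from each date
month_number <- month(parsed)
month_number

# With label = TRUE, month() returns abbreviated month names as an ordered factor
month(parsed, label = TRUE)
```

The labeled form can be handy later for plot axes, while the numeric form is what the lesson stores in the `Month` column.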

Next, let's create a new column `Price`, which represents the average price of a pumpkin. To fill it, calculate the average of the `Low Price` and `High Price` columns.


In [None]:
# Create a new column Price
pumpkins <- pumpkins %>% 
  mutate(Price = (`Low Price` + `High Price`)/2)

# View the first few rows of the data
pumpkins %>% 
  slice_head(n = 5)

Yeees!💪

"But wait!", you'll say after quickly browsing through the entire dataset with `View(pumpkins)`, "Something seems off here!"🤔

If you examine the `Package` column, you'll notice that pumpkins are sold in various formats. Some are sold in `1 1/9 bushel` units, others in `1/2 bushel` units, some by the pumpkin, some by the pound, and some in large boxes of different sizes.

Let's check this:


In [None]:
# Verify the distinct observations in Package column
pumpkins %>% 
  distinct(Package)

Amazing!👏

Pumpkins appear to be quite difficult to weigh consistently, so let's narrow the data down by keeping only the rows whose `Package` column contains the string *bushel*, and store the result in a new data frame, `new_pumpkins`.


#### dplyr::filter() and stringr::str_detect()

[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data that includes only the **rows** meeting your specified conditions. In this case, it filters for pumpkins with the string *bushel* in the `Package` column.

[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): checks whether a specific pattern exists in a string.

The [`stringr`](https://github.com/tidyverse/stringr) package offers straightforward functions for common string manipulation tasks.


In [None]:
# Retain only pumpkins with "bushel"
new_pumpkins <- pumpkins %>% 
  filter(str_detect(Package, "bushel"))

# Get the dimensions of the new data
dim(new_pumpkins)

# View a few rows of the new data
new_pumpkins %>% 
  slice_head(n = 5)

You can see that we have approximately 415 rows of data focused on pumpkins by the bushel. 🤩
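The way `str_detect()` drives `filter()` can be sketched on a handful of strings. The package labels and prices below are made up for illustration; in particular, the capitalized `"Bushel baskets"` entry is a hypothetical value added to show that matching is case-sensitive by default.

```r
library(dplyr)
library(stringr)

# Made-up package labels; "Bushel baskets" is hypothetical
packages <- c("24 inch bins", "1 1/9 bushel cartons", "1/2 bushel cartons",
              "each", "Bushel baskets")

# One logical value per string; the capitalized entry does not match
matches <- str_detect(packages, "bushel")
matches

# regex(ignore_case = TRUE) makes the match case-insensitive
matches_any_case <- str_detect(packages, regex("bushel", ignore_case = TRUE))
matches_any_case

# filter() keeps only the rows where the condition is TRUE
toy <- tibble(Package = packages, Price = c(275, 16.5, 15, 3, 20))
bushel_only <- toy %>%
  filter(str_detect(Package, "bushel"))

bushel_only
```

Note that `filter()` also drops rows where the condition evaluates to `NA`, so rows with a missing `Package` would not survive this step.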


#### dplyr::case_when()

**But wait! There's one more thing to do**

Did you notice that the bushel amount varies from row to row? You need to adjust the pricing so that it reflects the price per bushel, rather than per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.

We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *modify* the Price column based on certain conditions. `case_when` lets you apply multiple `if_else()` statements in a vectorized way.


In [None]:
# Convert the price if the Package contains fractional bushel values
new_pumpkins <- new_pumpkins %>% 
  mutate(Price = case_when(
    str_detect(Package, "1 1/9") ~ Price/(1 + 1/9),
    str_detect(Package, "1/2") ~ Price/(1/2),
    TRUE ~ Price))

# View the first few rows of the data
new_pumpkins %>% 
  slice_head(n = 30)

Now we can compare prices on a common per-bushel basis. All this study of bushels of pumpkins, however, highlights how crucial it is to truly understand your data!
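To check the arithmetic behind the conversion, here is a small worked sketch with made-up prices. It writes the result to a separate, hypothetical column named `Price_per_bushel` so the original and converted values can be compared side by side.

```r
library(dplyr)
library(stringr)

# Made-up prices for three package sizes
toy <- tibble(
  Package = c("1 1/9 bushel cartons", "1/2 bushel cartons", "bushel baskets"),
  Price = c(20, 15, 18)
)

per_bushel <- toy %>%
  mutate(Price_per_bushel = case_when(
    # 20 for 1 1/9 bushels -> 20 / (10/9) = 18 per bushel
    str_detect(Package, "1 1/9") ~ Price / (1 + 1/9),
    # 15 for half a bushel -> 15 / 0.5 = 30 per bushel
    str_detect(Package, "1/2") ~ Price / (1/2),
    # Anything else is treated as already priced per bushel
    TRUE ~ Price
  ))

per_bushel
```

`case_when()` uses the first condition that matches, so the more specific patterns are listed before the catch-all `TRUE` branch.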

> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), the weight of a bushel depends on the type of produce, as it's a measurement of volume. "For instance, a bushel of tomatoes is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complex! Instead of converting bushels to pounds, let's stick to pricing by the bushel. This whole exploration of pumpkin bushels, however, underscores how essential it is to understand the nature of your data!
>
> ✅ Did you notice that pumpkins sold by the half-bushel are quite expensive? Can you figure out why? Hint: smaller pumpkins tend to be much pricier than larger ones, likely because there are far more of them per bushel, given the extra space taken up by one large hollow pie pumpkin.


Now finally, just for fun 💁‍♀️, let's move the Month column to the first position, i.e., before the Package column.

You can use `dplyr::relocate()` to adjust column positions.
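As a quick sketch of the positioning options, the toy example below (a one-row data frame with made-up values) shows the default behavior of `relocate()` alongside its `.before` and `.after` arguments.

```r
library(dplyr)

# One-row toy data frame with made-up values
toy <- tibble(Package = "1/2 bushel cartons", Price = 30, Month = 9)

# With no position given, relocate() moves the column to the front
front <- toy %>% relocate(Month)

# .before and .after place the column relative to another column
before_package <- toy %>% relocate(Month, .before = Package)
after_package <- toy %>% relocate(Month, .after = Package)

names(front)
names(before_package)
names(after_package)
```

Because `Package` is the first column here, the default and `.before = Package` give the same order.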


In [None]:
# Move the Month column to the front, before Package
new_pumpkins <- new_pumpkins %>% 
  relocate(Month, .before = Package)

new_pumpkins %>% 
  slice_head(n = 7)

Good job!👌 You now have a clean, organized dataset ready to build your new regression model!


## 4. Data visualization with ggplot2

<p >
   <img src="../../images/data-visualization.png"
   width="600"/>
   <figcaption>Infographic by Dasani Madipalli</figcaption>



There's a *wise* saying that goes like this:

> "The simple graph has brought more information to the data analyst's mind than any other device." --- John Tukey

A key part of a data scientist's job is to showcase the quality and characteristics of the data they are working with. To achieve this, they often create engaging visualizations (such as plots, graphs, and charts) that highlight various aspects of the data. These visualizations help uncover relationships and gaps that might otherwise remain hidden.

Visualizations can also guide the selection of the most suitable machine learning technique for the data. For instance, a scatterplot that appears to follow a linear pattern suggests that the data might be well-suited for a linear regression model.

R provides several systems for creating graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) stands out as one of the most elegant and versatile options. `ggplot2` enables you to build graphs by **combining independent components**.

Let's begin with a simple scatterplot for the Price and Month columns.

In this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), provide a dataset and aesthetic mapping (using [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), and then add layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) to create scatterplots.


In [None]:
# Set a theme for the plots
theme_set(theme_light())

# Create a scatter plot
p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))
p + geom_point()

Is this a useful plot 🤷? Does anything about it surprise you?

It's not particularly useful, since all it does is display your data as a spread of points within each month.
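The idea of building a plot from independent components can be sketched step by step. The example below uses a small toy data frame with made-up values, and treating `Month` as a factor on the x-axis is an illustrative choice, not the lesson's own mapping.

```r
library(ggplot2)

# Toy data frame with made-up values
toy <- data.frame(
  Month = c(8, 9, 9, 10, 10, 11),
  Price = c(14, 16, 18, 17, 20, 15)
)

# Step 1: data plus aesthetic mapping; no geometry yet, so no points are drawn
base_plot <- ggplot(data = toy, aes(x = factor(Month), y = Price))

# Step 2: add a point layer and axis labels with `+`
scatter <- base_plot +
  geom_point(size = 3, alpha = 0.6) +
  labs(x = "Month", y = "Price per bushel")

scatter
```

Since each component is added with `+`, you can keep `base_plot` around and try different layers on top of it without repeating the data and mapping.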
<br>


### **How do we make it useful?**

To display meaningful data in charts, you often need to organize the data in some way. For example, in our case, calculating the average price of pumpkins for each month would reveal more insights into the patterns within our data. This brings us to another quick look at **dplyr**:

#### `dplyr::group_by() %>% summarize()`

Grouped aggregation in R can be easily performed using

`dplyr::group_by() %>% summarize()`

-   `dplyr::group_by()` shifts the focus of analysis from the entire dataset to specific groups, such as by month.

-   `dplyr::summarize()` generates a new data frame with one column for each grouping variable and one column for each summary statistic you specify.

For instance, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins based on the **Month** column and then calculate the **average price** for each month.


In [None]:
# Find the average price of pumpkins per month
new_pumpkins %>%
  group_by(Month) %>% 
  summarise(mean_price = mean(Price))

Succinct!✨
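Here is a small worked sketch of the same grouped aggregation on toy data, so the result can be checked by hand. The values are made up, and the `na.rm = TRUE` argument and row count are additions for illustration: without `na.rm = TRUE`, a single missing price would make that month's mean `NA`.

```r
library(dplyr)

# Toy data frame with made-up values, including one missing price
toy <- tibble(
  Month = c(9, 9, 10, 10, 10),
  Price = c(16, 18, 17, 20, NA)
)

monthly <- toy %>%
  group_by(Month) %>%
  summarise(
    n_rows = n(),                            # rows per month
    mean_price = mean(Price, na.rm = TRUE),  # mean of the non-missing prices
    .groups = "drop"                         # return an ungrouped tibble
  )

monthly
```

Month 9 averages (16 + 18) / 2 = 17, and month 10 averages (17 + 20) / 2 = 18.5 once the missing value is ignored.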

Categorical features like months are best visualized with a bar plot 📊. The layers used for creating bar charts are `geom_bar()` and `geom_col()`. Check `?geom_bar` for more details.
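The difference between the two layers can be sketched on toy data (made-up values): `geom_bar()` counts the rows in each category by default, while `geom_col()` draws bars at the heights you supply. The base R `aggregate()` call is used here only to keep the sketch self-contained.

```r
library(ggplot2)

# Toy data frame with made-up values
toy <- data.frame(
  Month = factor(c(9, 9, 10, 10, 10)),
  Price = c(16, 18, 17, 20, 23)
)

# geom_bar(): bar height is the number of rows per Month
count_plot <- ggplot(toy, aes(x = Month)) +
  geom_bar()

# geom_col(): bar height is the y value you provide, here the mean price
monthly <- aggregate(Price ~ Month, data = toy, FUN = mean)
mean_plot <- ggplot(monthly, aes(x = Month, y = Price)) +
  geom_col()

count_plot
mean_plot
```

Since the goal in this lesson is to plot a precomputed average, `geom_col()` is the natural fit.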

Let's create one!


In [None]:
# Find the average price of pumpkins per month then plot a bar chart
new_pumpkins %>%
  group_by(Month) %>% 
  summarise(mean_price = mean(Price)) %>% 
  ggplot(aes(x = Month, y = mean_price)) +
  geom_col(fill = "midnightblue", alpha = 0.7) +
  ylab("Pumpkin Price")

ü§©ü§© This is a much more useful data visualization! It appears to show that pumpkin prices peak in September and October. Does that align with your expectations? Why or why not?

Well done on completing the second lesson üëè! You prepared your data for building a model and discovered additional insights through visualizations!



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
