# Build a regression model using Scikit-learn: prepare and visualize data

> ![Data visualization infographic](./images/data-visualization.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)

## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/11/)

## Introduction

Now that you are set up with the tools you need to start building machine learning models with Scikit-learn, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potential of your dataset.

In this lesson, you will learn:

- How to prepare your data for model-building.
- How to use Matplotlib for data visualization.

## Asking the right question of your data

The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.

Take a look at the [data](../data/US-pumpkins.csv) provided for this lesson. You can open this .csv file in VS Code. A quick skim immediately shows that there are blanks and a mix of strings and numeric data. There's also a strange column called 'Package' where the data is a mix between 'sacks', 'bins' and other values. The data, in fact, is a bit of a mess.

It is uncommon to be handed a dataset that is completely ready to use to create an ML model out of the box. In this lesson, you will learn how to prepare a raw dataset using standard Python libraries. You will also learn various techniques to visualize the data.

## Case study: 'the pumpkin market'

In the root `data` folder you will find a .csv file called [US-pumpkins.csv](../data/US-pumpkins.csv), which includes 1757 lines of data about the market for pumpkins, sorted into groupings by city. This is raw data extracted from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture.

### Preparing data

This data is in the public domain. It can be downloaded in many separate files, per city, from the USDA web site. To avoid too many separate files, we have concatenated all the city data into one spreadsheet, thus we have already _prepared_ the data a bit. Next, let's take a closer look at the data.

### The pumpkin data - early conclusions

What do you notice about this data? You already saw that there is a mix of strings, numbers, blanks and strange values that you need to make sense of.

What question can you ask of this data, using a Regression technique? What about "Predict the price of a pumpkin for sale during a given month"? Looking again at the data, there are some changes you need to make to create the data structure necessary for the task.

## Exercise - analyze the pumpkin data

Let's use [Pandas](https://pandas.pydata.org/) (the name stands for `Python Data Analysis`), a tool very useful for shaping data, to analyze and prepare this pumpkin data.

### First, check for missing dates

You will first need to take steps to check for missing dates:

1. Convert the dates to a month format (these are US dates, so the format is `MM/DD/YYYY`).
2. Extract the month to a new column.

Open the _notebook.ipynb_ file in Visual Studio Code and import the spreadsheet into a new Pandas dataframe.

1. Use the `head()` function to view the first five rows.

    ```python
    import pandas as pd
    pumpkins = pd.read_csv('../../data/US-pumpkins.csv')
    pumpkins.head()
    ```

    ✅ What function would you use to view the last five rows?

1. Check if there is missing data in the current dataframe:

    ```python
    pumpkins.isnull().sum()
    ```

    There is missing data, but maybe it won't matter for the task at hand.

1. To make your dataframe easier to work with, drop several of its columns, using `drop()`, keeping only the columns you need: 

    ```python
    new_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
    pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
    ```
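
The two steps above (counting missing values, then keeping only the needed columns) can be tried on a small stand-in table first. This is a minimal sketch: the `toy` dataframe and its values are made up for illustration and are not taken from US-pumpkins.csv.

```python
import pandas as pd

# Toy stand-in for the pumpkin data; all values here are invented.
toy = pd.DataFrame({
    'City Name': ['BALTIMORE', 'BOSTON', 'ATLANTA'],
    'Package': ['24 inch bins', '1 1/9 bushel cartons', '1/2 bushel cartons'],
    'Date': ['04/29/2017', '09/24/2016', '10/01/2016'],
    'Low Price': [270.0, 15.0, 18.0],
    'High Price': [280.0, 18.0, 18.0],
    'Grade': [None, None, None],
})

# Count the missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Keep only the columns needed for the task, preserving their original order.
columns_to_keep = ['Package', 'Low Price', 'High Price', 'Date']
toy = toy[[c for c in toy.columns if c in columns_to_keep]]
print(list(toy.columns))
```

Selecting the columns to keep, rather than listing the ones to drop, gives the same result as the `drop()` call above and can be easier to read when most columns are discarded.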

### Second, determine average price of pumpkin

Think about how to determine the average price of a pumpkin in a given month. What columns would you pick for this task? Hint: you'll need 3 columns.

Solution: take the average of the `Low Price` and `High Price` columns to populate the new Price column, and convert the Date column to only show the month. Fortunately, according to the check above, there is no missing data for dates or prices.

1. To calculate the average, add the following code:

    ```python
    price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2
    
    month = pd.DatetimeIndex(pumpkins['Date']).month
    
    ```

   ✅ Feel free to print any data you'd like to check using `print(month)`.

2. Now, copy your converted data into a fresh Pandas dataframe:

    ```python
    new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'],'High Price': pumpkins['High Price'], 'Price': price})
    ```

    Printing out your dataframe will show you a clean, tidy dataset on which you can build your new regression model.
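
To see what the averaging and month extraction produce, here is a small sketch on invented rows. It assumes dates written as `MM/DD/YYYY` and swaps in `pd.to_datetime` with an explicit format in place of `pd.DatetimeIndex`; the `toy` and `toy_prices` names are hypothetical.

```python
import pandas as pd

# Invented rows; dates are assumed to be in MM/DD/YYYY form.
toy = pd.DataFrame({
    'Date': ['09/24/2016', '10/01/2016', '10/15/2016'],
    'Low Price': [15.0, 18.0, 20.0],
    'High Price': [18.0, 18.0, 24.0],
})

# Average of the low and high price for each row.
price = (toy['Low Price'] + toy['High Price']) / 2

# Parse the date strings with an explicit format, then keep only the month number.
month = pd.to_datetime(toy['Date'], format='%m/%d/%Y').dt.month

toy_prices = pd.DataFrame({'Month': month, 'Price': price})
print(toy_prices)
```

Passing an explicit `format` makes the parse fail loudly if a date does not match the expected layout, which can help catch day/month mix-ups early.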

### But wait! There's something odd here

If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in '1 1/9 bushel' measures, and some in '1/2 bushel' measures, some per pumpkin, some per pound, and some in big boxes with varying widths.

> Pumpkins seem very hard to weigh consistently

Digging into the original data, it's interesting that anything with `Unit of Sale` equalling 'EACH' or 'PER BIN' also has a `Package` type of per inch, per bin, or 'each'. Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string 'bushel' in their `Package` column.

1. Add a filter at the top of the file, under the initial .csv import:

    ```python
    pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]
    ```

    If you print the data now, you can see that you are only getting the 415 or so rows of data containing pumpkins by the bushel.
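
The filter works by building a boolean mask and using it to select rows. The sketch below shows that mask on a few invented package labels. The `na=False` argument is an addition not used in the lesson: it treats missing labels as non-matches, which is an assumption about how you would want blanks handled.

```python
import pandas as pd

# Invented package labels, including one missing value.
packages = pd.Series(
    ['24 inch bins', '1 1/9 bushel cartons', '1/2 bushel cartons', 'each', None]
)

# True where the label contains the literal text 'bushel'.
# na=False counts missing labels as "does not match".
is_bushel = packages.str.contains('bushel', case=True, regex=False, na=False)

# Use the mask to keep only the matching rows.
bushel_only = packages[is_bushel]
print(bushel_only)
```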

### But wait! There's one more thing to do

Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, so do some math to standardize it.

1. Add these lines after the block creating the new_pumpkins dataframe:

    ```python
    new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9)
    
    new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2)
    ```

✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!

Now, you can analyze the pricing per unit based on their bushel measurement. If you print out the data one more time, you can see how it's standardized.
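
As a worked example of the arithmetic, a price of 20 for a 1 1/9 bushel carton comes to about 18 per bushel (20 divided by 10/9), and a price of 15 for a 1/2 bushel carton comes to 30 per bushel. The sketch below applies the same conversion to invented rows; the `Price per bushel` column name is hypothetical and is used only to keep the original price visible alongside the result.

```python
import pandas as pd

# Invented rows covering the three package sizes discussed above.
toy = pd.DataFrame({
    'Package': ['1 1/9 bushel cartons', '1/2 bushel cartons', 'bushel cartons'],
    'Price': [20.0, 15.0, 30.0],
})

# Start from the listed price, then rescale the rows that are not a whole bushel.
toy['Price per bushel'] = toy['Price']

is_one_and_ninth = toy['Package'].str.contains('1 1/9', regex=False)
is_half = toy['Package'].str.contains('1/2', regex=False)

# Dividing by the package size in bushels gives the price of one bushel.
toy.loc[is_one_and_ninth, 'Price per bushel'] = toy['Price'] / (1 + 1/9)
toy.loc[is_half, 'Price per bushel'] = toy['Price'] / (1/2)

print(toy.round(2))
```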

✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.

## Visualization Strategies

Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover. 

Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.

One data visualization library that works well in Jupyter notebooks is [Matplotlib](https://matplotlib.org/) (which you also saw in the previous lesson).

> Get more experience with data visualization in [these tutorials](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-15963-cxa).

## Exercise - experiment with Matplotlib

Try to create some basic plots to display the new dataframe you just created. What would a basic line plot show?

1. Import Matplotlib at the top of the file, under the Pandas import:

    ```python
    import matplotlib.pyplot as plt
    ```

1. Rerun the entire notebook to refresh.
1. At the bottom of the notebook, add a cell to plot the data as a scatterplot:

    ```python
    price = new_pumpkins.Price
    month = new_pumpkins.Month
    plt.scatter(price, month)
    plt.show()
    ```

    ![A scatterplot showing price to month relationship](./images/scatterplot.png)

    Is this a useful plot? Does anything about it surprise you?

    It's not particularly useful, as all it does is display your data as a spread of points in a given month.

### Make it useful

To get charts to display useful data, you usually need to group the data somehow. Let's try creating a plot where the x axis shows the months and the y axis shows the average price for each month.

1. Add a cell to create a grouped bar chart:

    ```python
    new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
    plt.ylabel("Pumpkin Price")
    ```

    ![A bar chart showing price to month relationship](./images/barchart.png)

    This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
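
It can help to look at the numbers behind the bars before plotting them. The following sketch groups invented prices by month and prints the mean for each group; the `toy` data and the `monthly_mean` name are made up for illustration.

```python
import pandas as pd

# Invented month/price pairs.
toy = pd.DataFrame({
    'Month': [9, 9, 10, 10, 11],
    'Price': [16.0, 20.0, 18.0, 22.0, 12.0],
})

# One mean price per month; the result is a Series indexed by month.
monthly_mean = toy.groupby('Month')['Price'].mean()
print(monthly_mean)
```

Each value in `monthly_mean` becomes the height of one bar when the series is passed to `.plot(kind='bar')`, as in the cell above.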

---

## 🚀Challenge

Explore the different types of visualization that Matplotlib offers. Which types are most appropriate for regression problems?

## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/12/)

## Review & Self Study

Take a look at the many ways to visualize data. Make a list of the various libraries available and note which are best for given types of tasks, for example 2D visualizations vs. 3D visualizations. What do you discover?

## Assignment

[Exploring visualization](assignment.md)