# Build a regression model using Scikit-learn: regression two ways

![Linear vs polynomial regression infographic](./images/linear-polynomial.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/13/)

> ### [This lesson is available in R!](./solution/R/lesson_3-R.ipynb)
### Introduction 

So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using Matplotlib. 

Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: _basic linear regression_ and _polynomial regression_, along with some of the math underlying these techniques.

> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.

### Prerequisite

You should be familiar by now with the structure of the pumpkin data that we are examining. You can find it preloaded and pre-cleaned in this lesson's _notebook.ipynb_ file. In the file, the pumpkin price is displayed per bushel in a new dataframe. Make sure you can run these notebooks with a kernel in Visual Studio Code.

### Preparation

As a reminder, you are loading this data so as to ask questions of it. 

- When is the best time to buy pumpkins? 
- What price can I expect of a case of miniature pumpkins?
- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box?

Let's keep digging into this data.

In the previous lesson, you created a Pandas dataframe and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 datapoints and only for the fall months. 

Take a look at the data that we preloaded in this lesson's accompanying notebook. The data is preloaded and an initial scatterplot is charted to show month data. Maybe we can get a little more detail about the nature of the data by cleaning it more.

## A linear regression line

As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a line to:

- **Show variable relationships**. Show the relationship between variables.
- **Make predictions**. Make accurate predictions on where a new datapoint would fall in relationship to that line.

It is typical of **Least-Squares Regression** to draw this type of line. The term 'least-squares' means that the vertical distance from each datapoint to the regression line (its error) is squared, and the squares are then added up. Ideally, that final sum is as small as possible, because we want the smallest total error, or `least-squares`.

We do so because we want to model the line that has the least cumulative distance from all of our data points. We square each distance before adding because we are concerned with its magnitude rather than its direction: without squaring, points above and below the line would cancel each other out.
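
To make the idea concrete, here is a minimal sketch that adds up the squared errors for two candidate lines. The `points` data and the two lines are made up for illustration and are not taken from the pumpkin dataset; the line with the smaller sum is the better fit.

```python
# Hypothetical toy data: five (x, y) points that lie roughly on a line.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]


def sum_of_squared_errors(intercept, slope, data):
    """Add up the squared vertical distances between each point and the line."""
    total = 0.0
    for x, y in data:
        predicted = intercept + slope * x  # the line's value at x
        error = y - predicted              # positive above the line, negative below
        total += error ** 2                # squaring removes the sign
    return total


# Two candidate lines: one close to the data, one clearly off.
good_fit = sum_of_squared_errors(0.0, 2.0, points)
poor_fit = sum_of_squared_errors(1.0, 1.0, points)

print(f"Sum of squares for Y = 0 + 2X: {good_fit:.2f}")
print(f"Sum of squares for Y = 1 + 1X: {poor_fit:.2f}")
```

Least-squares regression amounts to searching for the intercept and slope that make this sum as small as it can be.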

> **🧮 Show me the math** 
> 
> This line, called the _line of best fit_, can be expressed by [an equation](https://en.wikipedia.org/wiki/Simple_linear_regression): 
> 
> ```
> Y = a + bX
> ```
>
> `X` is the 'explanatory variable'. `Y` is the 'dependent variable'. The slope of the line is `b` and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`. 
>
>![calculate the slope](images/slope.png)
>
> First, calculate the slope `b`. Infographic by [Jen Looper](https://twitter.com/jenlooper)
>
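> In other words, and referring to our pumpkin data's original question: "predict the price of a pumpkin per bushel by month", the infographics here use `X` for the price and `Y` for the month of sale. Note that this works the question in reverse: to predict price from month, the month would be the explanatory variable `X` and the price the dependent variable `Y`.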
>
>![complete the equation](images/calculation.png)
>
> Calculate the value of Y. If you're paying around $4, it must be April! Infographic by [Jen Looper](https://twitter.com/jenlooper)
>
> The math that calculates the line must determine the slope of the line and also the intercept, or where `Y` is situated when `X = 0`. The intercept is worked out from the slope, so the slope is calculated first.
>
> You can observe the method of calculation for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) web site. Also visit [this Least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to watch how the numbers' values impact the line.
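
For readers who want to see the arithmetic, here is a minimal sketch of the standard least-squares formulas for the slope `b` and intercept `a`, worked on a small made-up dataset. The `points` values are hypothetical and are not pumpkin data.

```python
# Hypothetical toy data, not taken from the pumpkin dataset
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Slope b: how X and Y vary together, divided by how much X varies on its own
numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
denominator = sum((x - mean_x) ** 2 for x, _ in points)
slope = numerator / denominator

# Intercept a: chosen so the line passes through the point (mean_x, mean_y)
intercept = mean_y - slope * mean_x

print(f"Y = {intercept:.2f} + {slope:.2f}X")
```

Scikit-learn performs an equivalent calculation for you when you call `fit()` on a linear regression model.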

## Correlation

One more term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatterplot, you can quickly visualize this coefficient. A plot with datapoints scattered in a neat line has high correlation, but a plot with datapoints scattered everywhere between X and Y has low correlation.

The coefficient ranges from -1 to 1. A good linear regression model will be one that has a high (nearer to 1 than 0) Correlation Coefficient using the Least-Squares Regression method with a line of regression. A value near -1 is just as strong a relationship, only with the line sloping downward.
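
As a rough illustration, the sketch below computes the Pearson correlation coefficient by hand for two made-up datasets: one where the points sit close to a line and one where they are scattered. The function name `correlation` and all data values are hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [1, 2, 3, 4, 5]
neat_line = [2.1, 3.9, 6.2, 7.8, 10.1]  # points close to a straight line
scattered = [5, 1, 4, 2, 3]             # no obvious pattern

r_neat = correlation(xs, neat_line)
r_scattered = correlation(xs, scattered)

print(f"Neat line: {r_neat:.3f}")
print(f"Scattered: {r_scattered:.3f}")
```

In the lesson itself, Pandas does this work through the `corr()` method.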

✅ Run the notebook accompanying this lesson and look at the City to Price scatterplot. Does the data associating City to Price for pumpkin sales seem to have high or low correlation, according to your visual interpretation of the scatterplot? 


## Prepare your data for regression

Now that you have an understanding of the math behind this exercise, create a Regression model to see if you can predict which package of pumpkins will have the best pumpkin prices. Someone buying pumpkins for a holiday pumpkin patch might want this information to be able to optimize their purchases of pumpkin packages for the patch.

Since you'll use Scikit-learn, there's no reason to do this by hand (although you could!). In the main data-processing block of your lesson notebook, import the `LabelEncoder` class from Scikit-learn to automatically convert all string data to numbers:

```python
from sklearn.preprocessing import LabelEncoder

# Encode every column except the last as integer labels, one column at a time
new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:, 0:-1].apply(LabelEncoder().fit_transform)
```

If you look at the `new_pumpkins` dataframe now, you see that all the strings are now numeric. This makes it harder for you to read but much more intelligible for Scikit-learn!
Now you can make more educated decisions (not just based on eyeballing a scatterplot) about the data that is best suited to regression.
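
To see what this kind of encoding does to a single column, here is a plain-Python sketch that mimics the idea: each distinct string gets an integer code, assigned in sorted order. The helper `label_encode` and the package names are hypothetical; this is a stand-in for illustration, not Scikit-learn's own implementation.

```python
def label_encode(values):
    """Map each distinct string to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code_for = {label: code for code, label in enumerate(classes)}
    return [code_for[value] for value in values], classes


# Hypothetical package descriptions
packages = [
    "bushel cartons",
    "1/2 bushel cartons",
    "1 1/9 bushel cartons",
    "1/2 bushel cartons",
]

codes, classes = label_encode(packages)

for code, label in enumerate(classes):
    print(code, label)
print(codes)
```

Keep in mind that the codes follow the sort order of the strings, not any real-world size or ranking, so the numeric order of an encoded column does not necessarily carry meaning.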

Try to find a good correlation between two columns of your data to potentially build a good predictive model. As it turns out, there's only weak correlation between the City and Price:

```python
print(new_pumpkins['City'].corr(new_pumpkins['Price']))
# 0.32363971816089226
```

However, there's a somewhat better correlation between the Package and its Price. That makes sense, right? Normally, the bigger the produce box, the higher the price.

```python
print(new_pumpkins['Package'].corr(new_pumpkins['Price']))
# 0.6061712937226021
```

A good question to ask of this data will be: 'What price can I expect of a given pumpkin package?'

Let's build this regression model.

## Building a linear model

Before building your model, do one more tidy-up of your data. Drop any null data and check once more what the data looks like.

```python
new_pumpkins.dropna(inplace=True)
new_pumpkins.info()
```

Then, create a new dataframe from this minimal set and print it out:

```python
new_columns = ['Package', 'Price']
lin_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')

lin_pumpkins
```

```output
	Package	Price
70	0	13.636364
71	0	16.363636
72	0	16.363636
73	0	15.454545
74	0	13.636364
...	...	...
1738	2	30.000000
1739	2	28.750000
1740	2	25.750000
1741	2	24.000000
1742	2	24.000000
415 rows × 2 columns
```

1. Now you can assign your X and y coordinate data:

   ```python
   X = lin_pumpkins.values[:, :1]
   y = lin_pumpkins.values[:, 1:2]
   ```
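
✅ What's going on here? You're using [Python slice notation](https://stackoverflow.com/questions/509211/understanding-slice-notation/509295#509295) to create arrays to populate `X` and `y`. `[:, :1]` takes every row of the first column (Package) and `[:, 1:2]` takes every row of the second (Price). Slicing with a range rather than a single index keeps both arrays two-dimensional, which is the shape Scikit-learn expects for `X`.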

2. Next, start the regression model-building routines:

   ```python
   from sklearn.linear_model import LinearRegression
   from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
   from sklearn.model_selection import train_test_split

   # Hold out 20% of the rows for testing; random_state makes the split repeatable
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
   lin_reg = LinearRegression()
   lin_reg.fit(X_train, y_train)

   # Predict prices for the held-out rows
   pred = lin_reg.predict(X_test)

   # score() returns R² (the coefficient of determination), here on the training data
   accuracy_score = lin_reg.score(X_train, y_train)
   print('Model Accuracy: ', accuracy_score)
   ```

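   Because the correlation isn't particularly good, the model produced isn't terribly accurate. Note that the 'accuracy' printed here is the R² value returned by `score()`, measured on the training data.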

   ```output
   Model Accuracy:  0.3315342327998987
   ```

3. You can visualize the line that's drawn in the process:

   ```python
   import matplotlib.pyplot as plt

   # Test points in black, the fitted regression line in blue
   plt.scatter(X_test, y_test,  color='black')
   plt.plot(X_test, pred, color='blue', linewidth=3)

   plt.xlabel('Package')
   plt.ylabel('Price')

   plt.show()
   ```

   ![A scatterplot showing package to price relationship](./images/linear.png)

4. Test the model against a hypothetical package:

   ```python
   import numpy as np

   # 2.75 is a made-up value on the encoded Package scale
   lin_reg.predict( np.array([ [2.75] ]) )
   ```
   
   The returned price for this mythological package is:

   ```output
   array([[33.15655975]])
   ```

That number makes sense, if the logic of the regression line holds true.
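
The score printed in step 2 is R², the share of the variation in price that the line explains. As a sketch of what that number means, the plain-Python example below fits a least-squares line to made-up data and computes R² by hand. All values are hypothetical. It also shows that, for a single input variable scored on the data the line was fitted to, R² equals the square of the correlation coefficient.

```python
import math

# Hypothetical toy data
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.5, 4.0, 7.5, 7.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)  # total variation in y

# Least-squares line and its predictions
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
predictions = [intercept + slope * x for x in xs]

# R² = 1 - (variation the line fails to explain) / (total variation)
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
r_squared = 1 - ss_res / s_yy

correlation = s_xy / math.sqrt(s_xx * s_yy)

print(f"R squared:           {r_squared:.4f}")
print(f"Correlation squared: {correlation ** 2:.4f}")
```

This is why a modest correlation leads to a modest score: squaring the Package to Price correlation of roughly 0.61 gives about 0.37, in the same range as the 0.33 the model scored on its training split.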

🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!

## Polynomial regression

Another type of linear regression is polynomial regression. While sometimes there's a linear relationship between variables - the bigger the pumpkin in volume, the higher the price - sometimes these relationships can't be plotted as a plane or straight line. 

✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use polynomial regression.

Take another look at the relationship between Package and Price in the previous plot. Does this scatterplot seem like it should necessarily be analyzed by a straight line? Perhaps not. In this case, you can try polynomial regression.

✅ Polynomials are mathematical expressions that might consist of one or more variables and coefficients.

Polynomial regression creates a curved line to better fit nonlinear data. 
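
As a quick illustration of the difference, the sketch below fits both a straight line and a quadratic curve to made-up data that follows a curve, then compares how much of the variation each one explains. It uses NumPy's `polyfit` and `polyval` rather than the Scikit-learn pipeline built later in this lesson, and the data and the `r_squared` helper are hypothetical.

```python
import numpy as np

# Hypothetical data that roughly follows y = x squared
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.2, 0.9, 4.1, 9.2, 15.8, 25.1, 36.0])


def r_squared(actual, predicted):
    """Share of the variation in `actual` that `predicted` explains."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Degree 1 fits a straight line; degree 2 fits a parabola
line_coeffs = np.polyfit(x, y, deg=1)
curve_coeffs = np.polyfit(x, y, deg=2)

line_r2 = r_squared(y, np.polyval(line_coeffs, x))
curve_r2 = r_squared(y, np.polyval(curve_coeffs, x))

print(f"Straight line R squared:   {line_r2:.3f}")
print(f"Quadratic curve R squared: {curve_r2:.3f}")
```

On curved data like this, the straight line leaves a visible share of the variation unexplained, while the curve follows the points closely.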

1. Let's recreate a dataframe populated with a segment of the original pumpkin data:

   ```python
   new_columns = ['Variety', 'Package', 'City', 'Month', 'Price']
   poly_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')

   poly_pumpkins
   ```

A good way to visualize the correlations between data in dataframes is to display it in a 'coolwarm' chart:

2. Use the `background_gradient()` method with `coolwarm` as its `cmap` argument value:

   ```python
   corr = poly_pumpkins.corr()
   corr.style.background_gradient(cmap='coolwarm')
   ```
   This code creates a heatmap:
   ![A heatmap showing data correlation](./images/heatmap.png)

Looking at this chart, you can visualize the good correlation between Package and Price. So you should be able to create a somewhat better model than the last one.

### Create a pipeline

Scikit-learn includes a helpful API for building polynomial regression models - the `make_pipeline` [API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=pipeline#sklearn.pipeline.make_pipeline). A 'pipeline' is a chain of estimators. In this case, the pipeline first generates polynomial features (powers of the input, such as x², x³ and x⁴) and then fits a linear regression to them, so the resulting predictions can follow a nonlinear path.
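
To show what 'polynomial features' means for a single input column, here is a plain-Python sketch. The helper `polynomial_features` and the coefficient values are hypothetical; the helper mimics the kind of expansion `PolynomialFeatures(4)` performs on one value, including the constant term.

```python
def polynomial_features(x, degree):
    """Return [x**0, x**1, ..., x**degree] for a single input value."""
    return [x ** power for power in range(degree + 1)]


# One input value expanded into five features: 1, x, x², x³, x⁴
row = polynomial_features(2.0, 4)
print(row)

# A linear model on these features is a weighted sum, one weight per power.
# These weights are made up; a real model learns them from the training data.
coefficients = [1.0, 0.5, 0.0, 0.0, 0.25]
prediction = sum(weight * feature for weight, feature in zip(coefficients, row))
print(prediction)
```

The model is still linear in its weights, which is why polynomial regression can reuse `LinearRegression` as the final step of the pipeline.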

1. Build out the X and y columns:

   ```python
   X=poly_pumpkins.iloc[:,3:4].values
   y=poly_pumpkins.iloc[:,4:5].values
   ```

2. Create the pipeline by calling the `make_pipeline()` function:

   ```python
   from sklearn.preprocessing import PolynomialFeatures
   from sklearn.pipeline import make_pipeline

   # Expand X into polynomial terms up to degree 4, then fit a linear regression on them
   pipeline = make_pipeline(PolynomialFeatures(4), LinearRegression())

   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

   pipeline.fit(np.array(X_train), y_train)

   y_pred=pipeline.predict(X_test)
   ```

### Create a sequence

At this point, you need to create a new dataframe with _sorted_ data so that the plotted line passes through the predictions in sequence, from the smallest x value to the largest.

Add the following code:

   ```python
   # Pair each test input with its prediction, then order the pairs by x
   df = pd.DataFrame({'x': X_test[:,0], 'y': y_pred[:,0]})
   df.sort_values(by='x',inplace = True)
   points = df.to_numpy()

   # Fitted curve in blue, drawn over all the original datapoints in black
   plt.plot(points[:, 0], points[:, 1],color="blue", linewidth=3)
   plt.xlabel('Package')
   plt.ylabel('Price')
   plt.scatter(X,y, color="black")
   plt.show()
   ```

You created a new dataframe by calling `pd.DataFrame`. Then you sorted the values by calling `sort_values()`. Finally you created a polynomial plot:

![A polynomial plot showing package to price relationship](./images/polynomial.png)

You can see a curved line that fits your data better. 
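
The sorting step matters because a line plot connects points in the order they are given. The plain-Python sketch below, using made-up values, shows the idea behind it: the x values and their predictions are sorted together, so each prediction stays attached to its own x.

```python
# Hypothetical unsorted test inputs and the predictions that go with them
x_values = [3, 1, 2, 0]
predictions = [9.5, 1.2, 4.1, 0.3]

# Sort the (x, prediction) pairs together so each prediction stays with its x
ordered_pairs = sorted(zip(x_values, predictions))
sorted_x = [x for x, _ in ordered_pairs]
sorted_predictions = [p for _, p in ordered_pairs]

print(sorted_x)
print(sorted_predictions)
```

Plotting the unsorted pairs would make the line jump back and forth across the chart instead of tracing a single curve.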

Let's check the model's accuracy:

   ```python
   accuracy_score = pipeline.score(X_train,y_train)
   print('Model Accuracy: ', accuracy_score)
   ```

   And voilà!

   ```output
   Model Accuracy:  0.8537946517073784
   ```

That's better! Try to predict a price:

### Do a prediction

Can we input a new value and get a prediction?

Call `predict()` to make a prediction:
 
   ```python
   pipeline.predict( np.array([ [2.75] ]) )
   ```
   You are given this prediction:

   ```output
   array([[46.34509342]])
   ```

It does make sense, given the plot! And, if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!

🏆 Well done! You created two regression models in one lesson. In the final section on regression, you will learn about logistic regression to determine categories. 

---
## 🚀Challenge

Test several different variables in this notebook to see how correlation corresponds to model accuracy.

## [Post-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/14/)

## Review & Self Study

In this lesson we learned about Linear Regression. There are other important types of Regression. Read about Stepwise, Ridge, Lasso and Elastic Net techniques. A good course to study to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).

## Assignment 

[Build a Model](assignment.md)
-												Regression 3 audit

											
										
										
											4 years ago
+								# Build a regression model using Scikit-learn: regression two ways
-												README standardization

											
										
										
											4 years ago
-												Regression 3 audit

											
										
										
											4 years ago
+								![Linear vs polynomial regression infographic](./images/linear-polynomial.png)
-												Update README.md

Adding linear vs polynomial sketchnote
											
										
										
											4 years ago
+								> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
-												path to quizzes needs to change

											
										
										
											4 years ago
+								## [Pre-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/13/)
-												linking R lessons (new folder structure)

											
										
										
											4 years ago
 								> ### [This lesson is available in R!](./solution/R/lesson_3-R.ipynb)
-												image linking tidy-up, temporary fix for docsify

											
										
										
											4 years ago
+								### Introduction
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using Matplotlib.
-												editorial

											
										
										
											4 years ago
 								Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: _basic linear regression_ and _polynomial regression_, along with some of the math underlying these techniques.
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								### Prerequisite
-												editorial

											
										
										
											4 years ago
-												Merge branch 'main' into regressio-linear
											
										
										
											4 years ago
+								You should be familiar by now with the structure of the pumpkin data that we are examining. You can find it preloaded and pre-cleaned in this lesson's _notebook.ipynb_ file. In the file, the pumpkin price is displayed per bushel in a new dataframe.  Make sure you can run these notebooks in kernels in Visual Studio Code.
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								### Preparation
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								As a reminder, you are loading this data so as to ask questions of it.
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								- When is the best time to buy pumpkins?
-												editorial

											
										
										
											4 years ago
+								- What price can I expect of a case of miniature pumpkins?
 								- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box?
 								Let's keep digging into this data.
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								In the previous lesson, you created a Pandas dataframe and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 datapoints and only for the fall months.
 								Take a look at the data that we preloaded in this lesson's accompanying notebook. The data is preloaded and an initial scatterplot is charted to show month data. Maybe we can get a little more detail about the nature of the data by cleaning it more.
-												lessons

											
										
										
											4 years ago
-												Merge branch 'main' into regressio-linear
											
										
										
											4 years ago
+								## A linear regression line
-												lesson 3

											
										
										
											4 years ago
-												editorial

											
										
										
											4 years ago
+								As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a line to:
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								- **Show variable relationships**. Show the relationship between variables
 								- **Make predictions**. Make accurate predictions on where a new datapoint would fall in relationship to that line.
 								It is typical of **Least-Squares Regression** to draw this type of line. The term 'least-squares' means that all the datapoints surrounding the regression line are squared and then added up. Ideally, that final sum is as small as possible, because we want a low number of errors, or `least-squares`.
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								We do so since we want to model a line that has the least cumulative distance from all of our data points. We also square the terms before adding them since we are concerned with its magnitude rather than its direction.
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								> **🧮 Show me the math**
 								>
-												some infographics

											
										
										
											4 years ago
+								> This line, called the _line of best fit_ can be expressed by [an equation](https://en.wikipedia.org/wiki/Simple_linear_regression):
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								>
 								> ```
 								> Y = a + bX
 								> ```
 								>
 								> `X` is the 'explanatory variable'. `Y` is the 'dependent variable'. The slope of the line is `b` and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.
 								>
-												some infographics

											
										
										
											4 years ago
+								>![calculate the slope](images/slope.png)
-												infographic spacing issue

											
										
										
											4 years ago
+								>
-												infographics and credits

											
										
										
											4 years ago
+								> First, calculate the slope `b`. Infographic by [Jen Looper](https://twitter.com/jenlooper)
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								>
 								> In other words, and referring to our pumpkin data's original question: "predict the price of a pumpkin per bushel by month", `X` would refer to the price and `Y` would refer to the month of sale.
-												some infographics

											
										
										
											4 years ago
+								>
 								>![complete the equation](images/calculation.png)
-												infographic spacing issue

											
										
										
											4 years ago
+								>
-												infographics and credits

											
										
										
											4 years ago
+								> Calculate the value of Y. If you're paying around $4, it must be April! Infographic by [Jen Looper](https://twitter.com/jenlooper)
-												some infographics

											
										
										
											4 years ago
+								>
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								> The math that calculates the line must demonstrate the slope of the line, which is also dependent on the intercept, or where `Y` is situated when `X = 0`.
 								>
 								> You can observe the method of calculation for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) web site. Also visit [this Least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to watch how the numbers' values impact the line.
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								## Correlation
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								One more term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatterplot, you can quickly visualize this coefficient. A plot with datapoints scattered in a neat line have high correlation, but a plot with datapoints scattered everywhere between X and Y have a low correlation.
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								A good linear regression model will be one that has a high (nearer to 1 than 0) Correlation Coefficient using the Least-Squares Regression method with a line of regression.
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								✅ Run the notebook accompanying this lesson and look at the City to Price scatterplot. Does the data associating City to Price for pumpkin sales seem to have high or low correlation, according to your visual interpretation of the scatterplot?
-												lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								## Prepare your data for regression
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								Now that you have an understanding of the math behind this exercise, create a Regression model to see if you can predict which package of pumpkins will have the best pumpkin prices. Someone buying pumpkins for a holiday pumpkin patch might want this information to be able to optimize their purchases of pumpkin packages for the patch.
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								Since you'll use Scikit-learn, there's no reason to do this by hand (although you could!). In the main data-processing block of your lesson notebook, add a library from Scikit-learn to automatically convert all string data to numbers:
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								```python
 								from sklearn.preprocessing import LabelEncoder
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:, 0:-1].apply(LabelEncoder().fit_transform)
 								```
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								If you look at the new_pumpkins dataframe now, you see that all the strings are now numeric. This makes it harder for you to read but much more intelligible for Scikit-learn!
-												editorial

											
										
										
											4 years ago
+								Now you can make more educated decisions (not just based on eyeballing a scatterplot) about the data that is best suited to regression.
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								Try to find a good correlation between two points of your data to potentially build a good predictive model. As it turns out, there's only weak correlation between the City and Price:
-												images for 3 lessons

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								```python
 								print(new_pumpkins['City'].corr(new_pumpkins['Price']))
 .32363971816089226
 								```
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								However there's a bit better correlation between the Package and its Price. That makes sense, right? Normally, the bigger the produce box, the higher the price.
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								```python
 								print(new_pumpkins['Package'].corr(new_pumpkins['Price']))
 .6061712937226021
 								```
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								A good question to ask of this data will be: 'What price can I expect of a given pumpkin package?'
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								Let's build this regression model
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								## Building a linear model
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								Before building your model, do one more tidy-up of your data. Drop any null data and check once more what the data looks like.
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								```python
 								new_pumpkins.dropna(inplace=True)
 								new_pumpkins.info()
 								```
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								Then, create a new dataframe from this minimal set and print it out:
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								```python
 								new_columns = ['Package', 'Price']
 								lin_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								lin_pumpkins
 								```
-												editorial

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								```output
 									Package	Price
 	0	13.636364
 	0	16.363636
 	0	16.363636
 	0	15.454545
 	0	13.636364
 								...	...	...
 	2	30.000000
 	2	28.750000
 	2	25.750000
 	2	24.000000
 	2	24.000000
 rows × 2 columns
 								```
-												editorial

											
										
										
											4 years ago
 . Now you can assign your X and y coordinate data:
-												lesson 3

											
										
										
											4 years ago
-												tidy-up   for regression 3

											
										
										
											4 years ago
+								   ```python
 								   X = lin_pumpkins.values[:, :1]
 								   y = lin_pumpkins.values[:, 1:2]
 								   ```
 								✅ What's going on here? You're using [Python slice notation](https://stackoverflow.com/questions/509211/understanding-slice-notation/509295#509295) to create arrays to populate `X` and `y`.

1. Next, start the regression model-building routines:

   ```python
   from sklearn.linear_model import LinearRegression
   from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
   from sklearn.model_selection import train_test_split

   # Hold out 20% of the rows for testing
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
   lin_reg = LinearRegression()
   lin_reg.fit(X_train, y_train)

   pred = lin_reg.predict(X_test)

   # For a regressor, score() returns R² (the coefficient of determination); here it is computed on the training data
   accuracy_score = lin_reg.score(X_train, y_train)
   print('Model Accuracy: ', accuracy_score)
   ```

   Because the correlation isn't particularly good, the model produced isn't terribly accurate.

   ```output
   Model Accuracy:  0.3315342327998987
   ```
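
The value printed as 'Model Accuracy' is the coefficient of determination (R²) that `score()` returns for regressors, and the step above computes it on the training split. As a hedged sketch on synthetic data (the numbers and the `X_demo`/`y_demo` names below are invented, not the pumpkin dataset), you could also evaluate on the held-out test split with the metrics from `sklearn.metrics`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy straight line (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 3, size=(200, 1))
y_demo = 10 * X_demo + 15 + rng.normal(0, 5, size=(200, 1))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
pred_te = model.predict(X_te)

# r2_score() on the predictions is the same quantity that score() reports
test_r2 = r2_score(y_te, pred_te)
print('Train R²:', model.score(X_tr, y_tr))
print('Test R²: ', test_r2)
print('Test MSE:', mean_squared_error(y_te, pred_te))
print('Test MAE:', mean_absolute_error(y_te, pred_te))
```

A large gap between the training and test values would suggest the model does not generalize well.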

1. You can visualize the line that's drawn in the process:

   ```python
   plt.scatter(X_test, y_test, color='black')
   plt.plot(X_test, pred, color='blue', linewidth=3)

   plt.xlabel('Package')
   plt.ylabel('Price')

   plt.show()
   ```

   ![A scatterplot showing package to price relationship](./images/linear.png)

1. Test the model against a hypothetical package value:

   ```python
   lin_reg.predict(np.array([[2.75]]))
   ```

   The returned price for this hypothetical package is:

   ```output
   array([[33.15655975]])
   ```

That number makes sense, if the logic of the regression line holds true.

🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!

## Polynomial regression

Another type of linear regression is polynomial regression. While sometimes there's a linear relationship between variables (the bigger the pumpkin in volume, the higher the price), sometimes these relationships can't be plotted as a plane or straight line.

✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use polynomial regression.

Take another look at the relationship between Package and Price in the previous plot. Does this scatterplot seem like it should necessarily be analyzed by a straight line? Perhaps not. In this case, you can try polynomial regression.

✅ Polynomials are mathematical expressions that might consist of one or more variables and coefficients.

Polynomial regression creates a curved line to better fit nonlinear data.
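
One way to see why polynomial regression still counts as a type of linear regression: the input is first expanded into powers of itself, and an ordinary linear model is then fitted to those expanded columns. Here is a small sketch of the expansion step using Scikit-learn's `PolynomialFeatures` on made-up values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up inputs with a single feature each
x = np.array([[2.0], [3.0]])

# degree=2 expands each x into the columns [1, x, x^2]
expanded = PolynomialFeatures(degree=2).fit_transform(x)
print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```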

1. Let's recreate a dataframe populated with a segment of the original pumpkin data:

   ```python
   new_columns = ['Variety', 'Package', 'City', 'Month', 'Price']
   poly_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')

   poly_pumpkins
   ```

A good way to visualize the correlations between data in dataframes is to display it in a 'coolwarm' chart:

1. Use the `background_gradient()` method with `coolwarm` as its argument value:

   ```python
   corr = poly_pumpkins.corr()
   corr.style.background_gradient(cmap='coolwarm')
   ```

   This code creates a heatmap:

   ![A heatmap showing data correlation](./images/heatmap.png)

Looking at this chart, you can visualize the good correlation between Package and Price. So you should be able to create a somewhat better model than the last one.

### Create a pipeline

Scikit-learn includes a helpful API for building polynomial regression models: the `make_pipeline` [API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=pipeline#sklearn.pipeline.make_pipeline). It creates a 'pipeline', which is a chain of estimators. In this case, the pipeline includes polynomial features, or predictions that form a nonlinear path.

1. Build out the X and y columns:

   ```python
   X = poly_pumpkins.iloc[:, 3:4].values
   y = poly_pumpkins.iloc[:, 4:5].values
   ```

1. Create the pipeline by calling the `make_pipeline()` function:

   ```python
   from sklearn.preprocessing import PolynomialFeatures
   from sklearn.pipeline import make_pipeline

   # Chain a degree-4 polynomial feature expansion with a linear regression
   pipeline = make_pipeline(PolynomialFeatures(4), LinearRegression())

   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

   pipeline.fit(np.array(X_train), y_train)

   y_pred = pipeline.predict(X_test)
   ```
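
To make the 'chain of estimators' idea concrete, here is a rough sketch on synthetic data (invented for illustration; `X_demo`, `y_demo` and `demo_pipeline` are hypothetical names) suggesting that a pipeline gives the same predictions as running the two steps by hand: transform with `PolynomialFeatures`, then fit `LinearRegression` on the result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 3, size=(100, 1))
y_demo = 5 * X_demo**2 - 3 * X_demo + 20 + rng.normal(0, 1, size=(100, 1))

# Option 1: let the pipeline chain the steps
demo_pipeline = make_pipeline(PolynomialFeatures(2), LinearRegression())
demo_pipeline.fit(X_demo, y_demo)
pipeline_pred = demo_pipeline.predict(X_demo)

# Option 2: run the same two steps manually
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X_demo)
manual_model = LinearRegression().fit(X_poly, y_demo)
manual_pred = manual_model.predict(X_poly)

# Both routes should agree up to floating-point error
print(np.allclose(pipeline_pred, manual_pred))
```

The pipeline is mostly a convenience: it remembers to apply the same transformation at prediction time, so you pass raw `X` values to `predict()`.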

### Create a sequence

At this point, you need to create a new dataframe with _sorted_ data so that the predicted points form a sequence that can be plotted as a continuous line from left to right.

Add the following code:

```python
df = pd.DataFrame({'x': X_test[:,0], 'y': y_pred[:,0]})
df.sort_values(by='x', inplace=True)
points = df.to_numpy()

plt.plot(points[:, 0], points[:, 1], color="blue", linewidth=3)
plt.xlabel('Package')
plt.ylabel('Price')
plt.scatter(X, y, color="black")
plt.show()
```

You created a new dataframe by calling `pd.DataFrame`. Then you sorted the values by calling `sort_values()`. Finally you created a polynomial plot:

![A polynomial plot showing package to price relationship](./images/polynomial.png)

You can see a curved line that fits your data better.
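
The sorting step matters because `plt.plot()` connects points in the order they are given, so unsorted x values would make the line zigzag back and forth. Here is a small sketch with made-up numbers, using NumPy's `argsort` as an alternative to building a dataframe:

```python
import numpy as np

# Made-up inputs in arbitrary order, with stand-in predictions from y = x^2
x_unsorted = np.array([2.0, 0.0, 3.0, 1.0])
y_unsorted = x_unsorted ** 2

# argsort returns the indices that would sort x; reuse them to keep x and y paired
order = np.argsort(x_unsorted)
x_sorted = x_unsorted[order]
y_sorted = y_unsorted[order]

print(x_sorted)  # [0. 1. 2. 3.]
print(y_sorted)  # [0. 1. 4. 9.]
```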

Let's check the model's accuracy:

```python
# As before, score() returns R², computed here on the training data
accuracy_score = pipeline.score(X_train, y_train)
print('Model Accuracy: ', accuracy_score)
```

And voila!

```output
Model Accuracy:  0.8537946517073784
```

That's better! Try to predict a price:

### Do a prediction

Can we input a new value and get a prediction?

Call `predict()` to make a prediction:

```python
pipeline.predict(np.array([[2.75]]))
```

You are given this prediction:

```output
array([[46.34509342]])
```

It does make sense, given the plot! And, if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!

🏆 Well done! You created two regression models in one lesson. In the final section on regression, you will learn about logistic regression to determine categories.

---
## 🚀Challenge

Test several different variables in this notebook to see how correlation corresponds to model accuracy.
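
As a starting point for this challenge, the sketch below uses synthetic data (not the pumpkin dataset; the `feature` and `target` names are made up) to show one concrete link: for a linear regression with a single feature, the R² on the data it was fitted to equals the square of the Pearson correlation between that feature and the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic feature and a noisy linear target (illustrative only)
rng = np.random.default_rng(42)
feature = rng.normal(size=300)
target = 2.0 * feature + rng.normal(scale=1.5, size=300)

# Pearson correlation between the feature and the target
r = np.corrcoef(feature, target)[0, 1]

# R² of a one-feature linear regression fitted on the same data
model = LinearRegression().fit(feature.reshape(-1, 1), target)
r2 = model.score(feature.reshape(-1, 1), target)

print('correlation squared:', r ** 2)
print('R²:                 ', r2)
```

With more features, a polynomial model, or a held-out test set, this identity no longer holds exactly, so it is worth comparing the heatmap values against each model's score.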

## [Post-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/14/)

## Review & Self Study

In this lesson we learned about Linear Regression. There are other important types of Regression. Read about Stepwise, Ridge, Lasso and Elastic Net techniques. A good course to study to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).
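
If you want to experiment while reading, Ridge, Lasso and Elastic Net are available in Scikit-learn's `linear_model` module. This is only a sketch on synthetic data with arbitrarily chosen `alpha` values, meant to show that they share the same `fit`/`score` interface as `LinearRegression`:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic data: only the first of five features actually drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# The alpha and l1_ratio values are arbitrary choices for illustration
models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'lasso': Lasso(alpha=0.1),
    'elastic net': ElasticNet(alpha=0.1, l1_ratio=0.5),
}

scores = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    scores[name] = model.score(X_demo, y_demo)  # R² on the training data
    print(name, round(scores[name], 3), np.round(model.coef_, 2))
```

Comparing the printed coefficients gives a feel for how each penalty shrinks the weights of the features that do not matter.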

## Assignment

[Build a Model](assignment.md)