So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using Matplotlib.
Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: _basic linear regression_ and _polynomial regression_, along with some of the math underlying these techniques.
> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.
You should be familiar by now with the structure of the pumpkin data that we are examining. You can find it preloaded and pre-cleaned in this lesson's _notebook.ipynb_ file. In the file, the pumpkin price is displayed per bushel in a new dataframe. Make sure you can run these notebooks in kernels in Visual Studio Code.
### Preparation
As a reminder, you are loading this data so as to ask questions of it.
- When is the best time to buy pumpkins?
- What price can I expect for a case of miniature pumpkins?
- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box?
Let's keep digging into this data.
In the previous lesson, you created a Pandas dataframe and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 datapoints and only for the fall months.
Take a look at the data that we preloaded in this lesson's accompanying notebook. The data is preloaded and an initial scatterplot is charted to show month data. Maybe we can get a little more detail about the nature of the data by cleaning it more.
As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a line to:
- **Show variable relationships**. Show the relationship between variables.
- **Make predictions**. Make accurate predictions on where a new datapoint would fall in relationship to that line.
It is typical of **Least-Squares Regression** to draw this type of line. The term 'least-squares' means that the distances from all the datapoints to the regression line are squared and then added up. Ideally, that final sum is as small as possible, because we want a low number of errors, or `least-squares`.
We do so since we want to model a line that has the least cumulative distance from all of our data points. We also square the distances before adding them since we are concerned with their magnitude rather than their direction.
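To make that concrete, here's a tiny sketch using a few made-up points (not the pumpkin data): pick a candidate line, then add up the squared vertical distances from each point to it. Least-squares regression chooses the line that makes this sum as small as possible.

```python
# Illustrative only: a few made-up (x, y) points and a candidate line y = a + b*x
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
a, b = 0.2, 1.9  # candidate intercept and slope

# Sum of squared vertical distances (residuals) from each point to the line
sse = sum((y - (a + b * x)) ** 2 for x, y in points)
print(sse)
```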
> The line can be written, per the usual convention, as `Y = a + bX`. `X` is the 'explanatory variable' and `Y` is the 'dependent variable'. The slope of the line is `b`, and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.
> In other words, and referring to our pumpkin data's original question: "predict the price of a pumpkin per bushel by month", `X` would refer to the month of sale and `Y` would refer to the price.
> The math that calculates the line must demonstrate the slope of the line, which is also dependent on the intercept, or where `Y` is situated when `X = 0`.
>
> You can observe the method of calculation for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) website. Also visit [this Least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to watch how the numbers' values impact the line.
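If you'd like to see that calculation written out in code, here is a small sketch using the classic least-squares formulas that those pages walk through (the numbers are illustrative, not the pumpkin data):

```python
# Least-squares slope (b) and intercept (a) for the line y = a + b*x
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum((x - x_mean) ** 2 for x in xs)
a = y_mean - b * x_mean
print(f"slope b = {b:.3f}, intercept a = {a:.3f}")
```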
## Correlation
One more term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatterplot, you can quickly visualize this coefficient. A plot with datapoints scattered in a neat line has high correlation, but a plot with datapoints scattered everywhere between X and Y has low correlation.
A good linear regression model will be one that has a high (nearer to 1 than 0) Correlation Coefficient using the Least-Squares Regression method with a line of regression.
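Here's a quick illustration of that idea with synthetic numbers (not the pumpkin data): points that hug a line have a correlation coefficient near 1, while points scattered at random have one near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)

neat = 3 * x + rng.normal(0, 2, 50)   # points hugging a straight line
noisy = rng.normal(0, 50, 50)         # points unrelated to x

print(np.corrcoef(x, neat)[0, 1])     # close to 1: high correlation
print(np.corrcoef(x, noisy)[0, 1])    # close to 0: low correlation
```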
✅ Run the notebook accompanying this lesson and look at the City to Price scatterplot. Does the data associating City to Price for pumpkin sales seem to have high or low correlation, according to your visual interpretation of the scatterplot?
Now that you have an understanding of the math behind this exercise, create a Regression model to see if you can predict which package of pumpkins will have the best pumpkin prices. Someone buying pumpkins for a holiday pumpkin patch might want this information to be able to optimize their purchases of pumpkin packages for the patch.
Since you'll use Scikit-learn, there's no reason to do this by hand (although you could!). In the main data-processing block of your lesson notebook, add an import from Scikit-learn to automatically convert all string data to numbers:
```python
from sklearn.preprocessing import LabelEncoder

# Encode every string column as numbers; this applies a LabelEncoder to each
# column except the numeric Price column (assumed here to be the last column)
new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:, 0:-1].apply(LabelEncoder().fit_transform)
```
If you look at the new_pumpkins dataframe now, you see that all the strings are now numeric. This makes it harder for you to read but much more intelligible for Scikit-learn!
Now you can make more educated decisions (not just based on eyeballing a scatterplot) about the data that is best suited to regression.
Try to find a good correlation between two columns of your data to potentially build a good predictive model. As it turns out, there's only a weak correlation between the City and Price.
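One way to check this, assuming the encoded data is in `new_pumpkins` and the city column is named `City` as in this lesson's notebook, is Pandas' built-in `corr()`:

```python
# Pearson correlation between the encoded City column and Price
print(new_pumpkins['City'].corr(new_pumpkins['Price']))
```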
However, there's a bit better correlation between the Package and its Price. That makes sense, right? Normally, the bigger the produce box, the higher the price.
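The same check for Package, under the same assumptions:

```python
# Pearson correlation between the encoded Package column and Price
print(new_pumpkins['Package'].corr(new_pumpkins['Price']))
```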
A good question to ask of this data will be: 'What price can I expect for a given pumpkin package?'
Let's build this regression model.
## Building a linear model
Before building your model, do one more tidy-up of your data. Drop any null data and check once more what the data looks like.
```python
new_pumpkins.dropna(inplace=True)
new_pumpkins.info()
```
Then, create a new dataframe from this minimal set and print it out:
```python
new_columns = ['Package', 'Price']
# Keep just these two columns in a new dataframe (one way to do this)
lin_pumpkins = new_pumpkins[new_columns].copy()

lin_pumpkins
```

```output
415 rows × 2 columns
```
1. Now you can assign your X and y coordinate data:
```python
X = lin_pumpkins.values[:, :1]
y = lin_pumpkins.values[:, 1:2]
```
✅ What's going on here? You're using [Python slice notation](https://stackoverflow.com/questions/509211/understanding-slice-notation/509295#509295) to create arrays to populate `X` and `y`.
2. Next, start the regression model-building routines:
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the data into training and test sets (an 80/20 split is a common choice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a straight line to the training data
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predict prices for the test set and check how well the line fits the training data
pred = lin_reg.predict(X_test)

accuracy_score = lin_reg.score(X_train, y_train)
print('Model Accuracy: ', accuracy_score)
```
Because the correlation isn't particularly good, the model produced isn't terribly accurate.
```output
Model Accuracy: 0.3315342327998987
```
3. You can visualize the line that's drawn in the process:
```python
# Plot the test points and the fitted regression line
# (assumes matplotlib.pyplot was imported as plt earlier in the notebook)
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, pred, color='blue', linewidth=3)

plt.xlabel('Package')
plt.ylabel('Price')

plt.show()
```
![A scatterplot showing package to price relationship](./images/linear.png)
4. Test the model against a hypothetical variety:
```python
import numpy as np  # if not already imported earlier in the notebook

lin_reg.predict( np.array([ [2.75] ]) )
```
The returned price for this hypothetical package is:
```output
array([[33.15655975]])
```
That number makes sense, if the logic of the regression line holds true.
🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!
## Polynomial regression

Another type of linear regression is polynomial regression. While sometimes there's a linear relationship between variables - the bigger the pumpkin in volume, the higher the price - sometimes these relationships can't be plotted as a plane or straight line.
✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use polynomial regression.
Take another look at the relationship between Variety and Price in the previous plot. Does this scatterplot seem like it should necessarily be analyzed by a straight line? Perhaps not. In this case, you can try polynomial regression.
Looking at this chart, you can visualize the good correlation between Package and Price. So you should be able to create a somewhat better model than the last one.
### Create a pipeline
Scikit-learn includes a helpful API for building polynomial regression models - the `make_pipeline` [API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=pipeline#sklearn.pipeline.make_pipeline). It creates a 'pipeline', which is a chain of estimators. In this case, the pipeline includes polynomial features, or predictions that form a nonlinear path.
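Here's a sketch of how that pipeline might be assembled, reusing the `X_train`/`y_train` split from the linear model above. The polynomial degree of 4 is an illustrative choice, not something fixed by the lesson, so feel free to experiment with it:

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Chain a polynomial feature expansion with an ordinary linear regression
pipeline = make_pipeline(PolynomialFeatures(4), LinearRegression())

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
```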
![A polynomial plot showing package to price relationship](./images/polynomial.png)
You can see a curved line that fits your data better.
Let's check the model's accuracy:
```python
accuracy_score = pipeline.score(X_train, y_train)
print('Model Accuracy: ', accuracy_score)
```
And voila!
```output
Model Accuracy: 0.8537946517073784
```
That's better! Try to predict a price:
### Do a prediction
Can we input a new value and get a prediction?
Call `predict()` to make a prediction:
```python
pipeline.predict( np.array([ [2.75] ]) )
```
You are given this prediction:
```output
array([[46.34509342]])
```
It does make sense, given the plot! And, if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!
🏆 Well done! You created two regression models in one lesson. In the final section on regression, you will learn about logistic regression to determine categories.
---
## 🚀Challenge
Test several different variables in this notebook to see how correlation corresponds to model accuracy.
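One possible starting point, sketched under the assumption that the encoded data is still in `new_pumpkins` (the column names below are guesses; adjust them to whatever your dataframe actually contains):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# For each candidate column, compare its correlation with Price to the score of a
# simple linear model trained on that column alone
for column in ['City', 'Package', 'Variety', 'Month']:
    if column not in new_pumpkins.columns:
        continue
    corr = new_pumpkins[column].corr(new_pumpkins['Price'])
    X = new_pumpkins[[column]].values
    y = new_pumpkins['Price'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(f"{column}: correlation={corr:.2f}, training R^2={model.score(X_train, y_train):.2f}")
```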
In this lesson we learned about Linear Regression. There are other important types of Regression. Read about Stepwise, Ridge, Lasso and Elasticnet techniques. A good course to study to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).