In this final lesson on Regression, one of the basic _classic_ ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?
✅ Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.
## 前提
Let's build a logistic regression model to predict that, given some variables, _what color a given pumpkin is likely to be_ (orange 🎃 or white 👻).
使用南瓜数据后,我们现在对它已经足够熟悉了,可以意识到我们可以使用一个二元类别:`Color`。
> Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.
For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.
## 定义问题
> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!
Logistic regression differs from linear regression, which you learned about previously, in a few important ways.
## 关于逻辑回归
### Binary classification
逻辑回归在一些重要方面与您之前了解的线性回归不同。
Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category ("orange or not orange") whereas the latter is capable of predicting continual values, for example given the origin of a pumpkin and the time of harvest, _how much its price will rise_.
- **Multinomial**, which involves having more than one category - "Orange, White, and Striped".
还有其他类型的逻辑回归,包括多项和有序:
- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).
![Multinomial vs ordinal regression](./images/multinomial-ordinal.png)
- **多项**,涉及多个类别 - “橙色、白色和条纹”。
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.
By now you have loaded up the [starter notebook](./notebook.ipynb) with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including `Color`. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on Matplotlib which we used earlier.
@ -101,56 +102,56 @@ Seaborn offers some neat ways to visualize your data. For example, you can compa
g.map(sns.scatterplot)
g.map(sns.scatterplot)
```
```
![A grid of visualized data](images/grid.png)
![可视化数据网格](images/grid.png)
By observing data side-by-side, you can see how the Color data relates to the other columns.
通过并列观察数据,您可以看到颜色数据与其他列的关系。
✅ Given this scatterplot grid, what are some interesting explorations you can envision?
✅ 鉴于此散点图网格,您可以设想哪些有趣的探索?
### Use a swarm plot
### 使用分类散点图
Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables.
A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'.
1. As parameters `x=Color`, `kind="violin"` and call `catplot()`:
1. 作为参数`x=Color`、`kind="violin"`并调用`catplot()`:
```python
```python
sns.catplot(x="Color", y="Item Size",
sns.catplot(x="Color", y="Item Size",
kind="violin", data=new_pumpkins)
kind="violin", data=new_pumpkins)
```
```
![a violin type chart](images/violin.png)
![小提琴图](images/violin.png)
✅ Try creating this plot, and other Seaborn plots, using other variables.
✅ 尝试使用其他变量创建此图和其他Seaborn图。
Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
> Remember how linear regression often used ordinary least squares to arrive at a value? Logistic regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like thus:
> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class '1' of the binary choice. If not, it will be classified as '0'.
Take a look at your model's scoreboard. It's not too bad, considering you have only about 1000 rows of data:
看看你的模型的记分板。考虑到您只有大约1000行数据,这还不错:
```output
```output
precision recall f1-score support
precision recall f1-score support
@ -200,63 +201,63 @@ Building a model to find these binary classification is surprisingly straightfor
0 0 0 1 0 1 0 0 1 0 0 0 1 0]
0 0 0 1 0 1 0 0 1 0 0 0 1 0]
```
```
## Better comprehension via a confusion matrix
## 通过混淆矩阵更好地理解
While you can get a scoreboard report [terms](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report) by printing out the items above, you might be able to understand your model more easily by using a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) to help us understand how the model is performing.
> 🎓 A '[confusion matrix](https://wikipedia.org/wiki/Confusion_matrix)' (or 'error matrix') is a table that expresses your model's true vs. false positives and negatives, thus gauging the accuracy of predictions.
- If your model predicts something as a pumpkin and it belongs to category 'pumpkin' in reality we call it a true positive, shown by the top left number.
- If your model predicts something as not a pumpkin and it belongs to category 'pumpkin' in reality we call it a false positive, shown by the top right number.
- If your model predicts something as a pumpkin and it belongs to category 'not-a-pumpkin' in reality we call it a false negative, shown by the bottom left number.
- If your model predicts something as not a pumpkin and it belongs to category 'not-a-pumpkin' in reality we call it a true negative, shown by the bottom right number.
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
> 作者[Jen Looper](https://twitter.com/jenlooper)
As you might have guessed it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.
✅ Q: According to the confusion matrix, how did the model do? A: Not too bad; there are a good number of true positives but also several false negatives.
✅ Q:根据混淆矩阵,模型怎么样? A:还不错;有很多真阳性,但也有一些假阴性。
Let's revisit the terms we saw earlier with the help of the confusion matrix's mapping of TP/TN and FP/FN:
让我们借助混淆矩阵对TP/TN和FP/FN的映射,重新审视一下我们之前看到的术语:
🎓 Precision: TP/(TP + FN) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled)
🎓 准确率:TP/(TP+FN)检索实例中相关实例的分数(例如,哪些标签标记得很好)
🎓 Recall: TP/(TP + FP) The fraction of relevant instances that were retrieved, whether well-labeled or not
🎓 召回率: TP/(TP + FP) 检索到的相关实例的比例,无论是否标记良好
🎓 f1-score: (2 * precision * recall)/(precision + recall) A weighted average of the precision and recall, with best being 1 and worst being 0
🎓 Macro Avg: The calculation of the unweighted mean metrics for each label, not taking label imbalance into account.
🎓 宏平均值: 计算每个标签的未加权平均指标,不考虑标签不平衡。
🎓 Weighted Avg: The calculation of the mean metrics for each label, taking label imbalance into account by weighting them by their support (the number of true instances for each label).
🎓 加权平均值:计算每个标签的平均指标,通过按支持度(每个标签的真实实例数)加权来考虑标签不平衡。
✅ Can you think which metric you should watch if you want your model to reduce the number of false negatives?
✅ 如果你想让你的模型减少假阴性的数量,你能想出应该关注哪个指标吗?
## Visualize the ROC curve of this model
## 可视化该模型的ROC曲线
This is not a bad model; its accuracy is in the 80% range so ideally you could use it to predict the color of a pumpkin given a set of variables.
Using Seaborn again, plot the model's [Receiving Operating Characteristic](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html?highlight=roc) or ROC. ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. "ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis." Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly:
Finally, use Scikit-learn's [`roc_auc_score` API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html?highlight=roc_auc#sklearn.metrics.roc_auc_score) to compute the actual 'Area Under the Curve' (AUC):
The result is `0.6976998904709748`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is _pretty good_.
In future lessons on classifications, you will learn how to iterate to improve your model's scores. But for now, congratulations! You've completed these regression lessons!
There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? tip: try [Kaggle](https://kaggle.com) for interesting datasets.
Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about tasks that are better suited for one or the other type of regression tasks that we have studied up to this point. What would work best?