Merge pull request #39 from softchris/regression-logistic

editorial
5 years ago · 1a68cb0c01
parent 96dfcd7598 9f7910bcd2
commit 1a68cb0c01
1 changed files with 132 additions and 100 deletions
--- a/2-Regression/4-Logistic/README.md
+++ b/2-Regression/4-Logistic/README.md
@@ -4,107 +4,137 @@
 > Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
 ## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/15/)

-### Introduction
+## Introduction

-In this final lesson on Regression, one of the basic 'classic' ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not? 
+In this final lesson on Regression, one of the basic _classic_ ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not? 

 In this lesson, you will learn:
+
 - A new library for data visualization
 - Techniques for logistic regression

-Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
+✅ Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
 ## Prerequisite

-Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: Color. Let's build a logistic regression model to predict that, given some variables, what color a given pumpkin is likely to be (orange 🎃 or white 👻). 
+Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.
+
+Let's build a logistic regression model to predict, given some variables, _what color a given pumpkin is likely to be_ (orange 🎃 or white 👻).

 > Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.

+## Define the question
+
 For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.

 > 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!
+
 ## About logistic regression

 Logistic regression differs from linear regression, which you learned about previously, in a few important ways.
+
 ### Binary classification

-Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category ("orange or not orange") whereas the latter is capable of predicting continual values, for example given the origin of a pumpkin and the time of harvest, how much its price will rise.
+Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category ("orange or not orange") whereas the latter is capable of predicting continuous values, for example, given the origin of a pumpkin and the time of harvest, _how much its price will rise_.

 ![Pumpkin classification Model](./images/pumpkin-classifier.png)
 > Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
 ### Other classifications

-There are other types of logistic regression, including multinomial and ordinal. Multinomial involves having more than one categories - "Orange, White, and Striped". Ordinal involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).
+There are other types of logistic regression, including multinomial and ordinal:
+
+- **Multinomial**, which involves having more than two categories - "Orange, White, and Striped".
+- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).

 ![Multinomial vs ordinal regression](./images/multinomial-ordinal.png)
 > Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)

 ### It's still linear

-Even though this type of Regression is all about category predictions, it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
+Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.

 ### Variables DO NOT have to correlate

 Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.
+
 ### You need a lot of clean data

 Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.

 ✅ Think about the types of data that would lend themselves well to logistic regression
-## Tidy the data
+
+## Exercise - tidy the data

 First, clean the data a bit, dropping null values and selecting only some of the columns:

-```python
-from sklearn.preprocessing import LabelEncoder
+1. Add the following code:

-new_columns = ['Color','Origin','Item Size','Variety','City Name','Package']
+    ```python
+    from sklearn.preprocessing import LabelEncoder
+    
+    new_columns = ['Color','Origin','Item Size','Variety','City Name','Package']
+    
+    new_pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
+    
+    new_pumpkins.dropna(inplace=True)
+    
+    new_pumpkins = new_pumpkins.apply(LabelEncoder().fit_transform)
+    ```

-new_pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
+    You can always take a peek at your new dataframe:

-new_pumpkins.dropna(inplace=True)
+    ```python
+    new_pumpkins.info()
+    ```

-new_pumpkins = new_pumpkins.apply(LabelEncoder().fit_transform)
-```
+### Visualization - side-by-side grid

-You can always take a peek at your new dataframe:
+By now you have loaded up the [starter notebook](./notebook.ipynb) with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including `Color`. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on Matplotlib, which we used earlier.

-```python
-new_pumpkins.info
-```
-### Visualization
+Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for each point in a side-by-side grid.

-By now you have loaded up the [starter notebook](./notebook.ipynb) with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including Color. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on Matplotlib which we used earlier. Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for each point in a side-by side grid.
+1. Create such a grid by instantiating a `PairGrid`, using our pumpkin data `new_pumpkins`, followed by calling `map()`:

-```python
-import seaborn as sns
+    ```python
+    import seaborn as sns
+    
+    g = sns.PairGrid(new_pumpkins)
+    g.map(sns.scatterplot)
+    ```

-g = sns.PairGrid(new_pumpkins)
-g.map(sns.scatterplot)
-```
+    ![A grid of visualized data](images/grid.png)

-![A grid of visualized data](images/grid.png)
+    By observing data side-by-side, you can see how the Color data relates to the other columns.

-By observing data side-by-side, you can see how the Color data relates to the other columns. 
+    ✅ Given this scatterplot grid, what are some interesting explorations you can envision?

-✅ Given this scatterplot grid, what are some interesting explorations you can envision?
+### Use a swarm plot

-Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables. You can visualize variables side-by-side with Seaborn plots. Try a 'swarm' plot to show the distribution of values:
+Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables. 

-```python
-sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
-```
+You can visualize variables side-by-side with Seaborn plots.
+
+1. Try a 'swarm' plot to show the distribution of values:
+
+    ```python
+    sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
+    ```

-![A swarm of visualized data](images/swarm.png)
+    ![A swarm of visualized data](images/swarm.png)
+
+### Violin plot

 A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'.

-```python
-sns.catplot(x="Color", y="Item Size",
-            kind="violin", data=new_pumpkins)
-```
-![a violin type chart](images/violin.png)
+1. Call `catplot()` with the parameters `x="Color"`, `y="Item Size"`, and `kind="violin"`:

-✅ Try creating this plot, and other Seaborn plots, using other variables.
+    ```python
+    sns.catplot(x="Color", y="Item Size",
+                kind="violin", data=new_pumpkins)
+    ```
+
+    ![a violin type chart](images/violin.png)
+
+    ✅ Try creating this plot, and other Seaborn plots, using other variables.

 Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
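
The tidy-up step above passes every column through `LabelEncoder`, so the plots show integers rather than the original strings. As a minimal sketch of what that encoding does, the snippet below uses a made-up `toy` dataframe (hypothetical values, not rows from the pumpkin dataset). The encoder assigns integers to the sorted unique values of each column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows standing in for the pumpkin data (assumption, not real data)
toy = pd.DataFrame({
    "Color": ["ORANGE", "WHITE", "ORANGE", None],
    "Item Size": ["med", "lg", "sml", "med"],
})

# Drop rows with missing values, as the lesson does with dropna()
toy = toy.dropna()

# Encode each column separately; the encoder is refit on every column
encoded = toy.apply(LabelEncoder().fit_transform)

print(encoded)
# Color: ORANGE -> 0, WHITE -> 1
# Item Size: lg -> 0, med -> 1, sml -> 2
```

Because the codes follow alphabetical order rather than any natural ordering, a size such as `lg` can receive a smaller code than `med`, which is worth keeping in mind when reading the plots.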

@@ -120,75 +150,77 @@ Now that we have an idea of the relationship between the binary categories of co

 Building a model to find these binary classifications is surprisingly straightforward in Scikit-learn.

-Select the variables you want to use in your classification model and split the training and test sets:
-
-```python
-from sklearn.model_selection import train_test_split
-
-Selected_features = ['Origin','Item Size','Variety','City Name','Package']
-
-X = new_pumpkins[Selected_features]
-y = new_pumpkins['Color']
-
-X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
-
-```
-
-Now you can train your model and print out its result:
-
-```python
-from sklearn.model_selection import train_test_split
-from sklearn.metrics import accuracy_score, classification_report 
-from sklearn.linear_model import LogisticRegression
-
-model = LogisticRegression()
-model.fit(X_train, y_train)
-predictions = model.predict(X_test)
-
-print(classification_report(y_test, predictions))
-print('Predicted labels: ', predictions)
-print('Accuracy: ', accuracy_score(y_test, predictions))
-```
-
-Take a look at your model's scoreboard. It's not too bad, considering you have only about 1000 rows of data:
-
-```
-                   precision    recall  f1-score   support
-
-           0       0.85      0.95      0.90       166
-           1       0.38      0.15      0.22        33
-
-    accuracy                           0.82       199
-   macro avg       0.62      0.55      0.56       199
-weighted avg       0.77      0.82      0.78       199
-
-Predicted labels:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
- 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 1 0 1 0 0 1 0 0 0 1 0]
-```
+1. Select the variables you want to use in your classification model and split the data into training and test sets by calling `train_test_split()`:
+
+    ```python
+    from sklearn.model_selection import train_test_split
+    
+    Selected_features = ['Origin','Item Size','Variety','City Name','Package']
+    
+    X = new_pumpkins[Selected_features]
+    y = new_pumpkins['Color']
+    
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
+    
+    ```
+
+1. Now you can train your model, by calling `fit()` with your training data, and print out its result:
+
+    ```python
+    from sklearn.model_selection import train_test_split
+    from sklearn.metrics import accuracy_score, classification_report 
+    from sklearn.linear_model import LogisticRegression
+    
+    model = LogisticRegression()
+    model.fit(X_train, y_train)
+    predictions = model.predict(X_test)
+    
+    print(classification_report(y_test, predictions))
+    print('Predicted labels: ', predictions)
+    print('Accuracy: ', accuracy_score(y_test, predictions))
+    ```
+
+    Take a look at your model's scoreboard. It's not too bad, considering you have only about 1000 rows of data:
+
+    ```output
+                       precision    recall  f1-score   support
+    
+               0       0.85      0.95      0.90       166
+               1       0.38      0.15      0.22        33
+    
+        accuracy                           0.82       199
+       macro avg       0.62      0.55      0.56       199
+    weighted avg       0.77      0.82      0.78       199
+    
+    Predicted labels:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
+     0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+     1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
+     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
+     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+     0 0 0 1 0 1 0 0 1 0 0 0 1 0]
+    ```

 ## Better comprehension via a confusion matrix

-While you can get a scoreboard report [terms](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report) by printing out the items above, you might be able to understand your model more easily by using  a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) to help us understand how the model is performing.
+While you can get a scoreboard report of [terms](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report) by printing out the items above, you might understand your model more easily by using a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) to see how it is performing.

 > 🎓 A '[confusion matrix](https://wikipedia.org/wiki/Confusion_matrix)' (or 'error matrix') is a table that expresses your model's true vs. false positives and negatives, thus gauging the accuracy of predictions.

-```python
-from sklearn.metrics import confusion_matrix
-confusion_matrix(y_test, predictions)
-```
+1. To use a confusion matrix, call `confusion_matrix()`:

-Take a look at your model's confusion matrix:
+    ```python
+    from sklearn.metrics import confusion_matrix
+    confusion_matrix(y_test, predictions)
+    ```

-```
-array([[162,   4],
-       [ 33,   0]])
-```
+    Take a look at your model's confusion matrix:
+
+    ```output
+    array([[162,   4],
+           [ 33,   0]])
+    ```

-What's going on here? Let's say our model is asked to classify items between two binary categories, category 'pumpkin' and category 'not-a-pumpkin'. 
+What's going on here? Let's say our model is asked to classify items between two binary categories, category 'pumpkin' and category 'not-a-pumpkin'.

 - If your model predicts something as a pumpkin and it belongs to category 'pumpkin' in reality we call it a true positive, shown by the top left number. 
 - If your model predicts something as not a pumpkin and it belongs to category 'pumpkin' in reality we call it a false negative, shown by the top right number.