Merge pull request #646 from microsoft/ml_for_beginners_review_2

Logistic regression README refactoring
2 years ago · 9c925bf772
parent ecd4469f5e ae407eb771
commit 9c925bf772
7 changed files with 418 additions and 218 deletions
--- a/2-Regression/4-Logistic/README.md
+++ b/2-Regression/4-Logistic/README.md
@ -26,9 +26,9 @@ Let's build a logistic regression model to predict that, given some variables, _

 ## Define the question

-For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.
+For our purposes, we will express this as a binary: 'White' or 'Not White'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.

-> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!
+> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking! So we could also reformulate our question as: 'Ghost' or 'Not Ghost'. 👻

 ## About logistic regression

@ -49,10 +49,6 @@ There are other types of logistic regression, including multinomial and ordinal:

 ![Multinomial vs ordinal regression](./images/multinomial-vs-ordinal.png)

-### It's still linear
-
-Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
-
 ### Variables DO NOT have to correlate

 Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.
@ -70,78 +66,144 @@ First, clean the data a bit, dropping null values and selecting only some of the
 1. Add the following code:

    ```python
-    from sklearn.preprocessing import LabelEncoder
-    
-    new_columns = ['Color','Origin','Item Size','Variety','City Name','Package']
-    
-    new_pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
-    
-    new_pumpkins.dropna(inplace=True)
-    
-    new_pumpkins = new_pumpkins.apply(LabelEncoder().fit_transform)
+  
+    columns_to_select = ['City Name','Package','Variety', 'Origin','Item Size', 'Color']
+    pumpkins = full_pumpkins.loc[:, columns_to_select].copy()
+
+    pumpkins.dropna(inplace=True)
    ```

    You can always take a peek at your new dataframe:

    ```python
-    new_pumpkins.info
+    pumpkins.info()
    ```

-### Visualization - side-by-side grid
+### Visualization - categorical plot

 By now you have loaded up the [starter notebook](./notebook.ipynb) with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including `Color`. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on Matplotlib which we used earlier. 

-Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for each point in a side-by-side grid.
+Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for each `Variety` and `Color` in a categorical plot.

-1. Create such a grid by instantiating a `PairGrid`, using our pumpkin data `new_pumpkins`, followed by calling `map()`:
+1. Create such a plot by calling the `catplot` function with our pumpkin data `pumpkins` and a color mapping for each pumpkin category (orange or white):

    ```python
    import seaborn as sns
    
-    g = sns.PairGrid(new_pumpkins)
-    g.map(sns.scatterplot)
+    palette = {
+        'ORANGE': 'orange',
+        'WHITE': 'wheat',
+    }
+
+    sns.catplot(
+        data=pumpkins, y="Variety", hue="Color", kind="count",
+        palette=palette,
+    )
    ```

-    ![A grid of visualized data](images/grid.png)
+    ![A categorical plot of visualized data](images/pumpkins_catplot_1.png)

-    By observing data side-by-side, you can see how the Color data relates to the other columns.
+    By observing the data, you can see how the Color data relates to Variety.

-    ✅ Given this scatterplot grid, what are some interesting explorations you can envision?
+    ✅ Given this categorical plot, what are some interesting explorations you can envision?

-### Use a swarm plot
+### Data pre-processing: feature and label encoding
+
+Our pumpkins dataset contains string values in all its columns. Working with categorical data is intuitive for humans but not for machines: machine learning algorithms work with numbers. That's why encoding is a very important step in the data pre-processing phase: it turns categorical data into numerical data without losing any information. Good encoding helps build a good model.

-Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables. 
+For feature encoding there are two main types of encoders:

-You can visualize variables side-by-side with Seaborn plots.
+1. Ordinal encoder: it is well suited to ordinal variables, which are categorical variables whose values follow a logical ordering, like the `Item Size` column in our dataset. It creates a mapping in which each category is represented by a number: its position in the ordered list of categories.

-1. Try a 'swarm' plot to show the distribution of values:
+    ```python
+    from sklearn.preprocessing import OrdinalEncoder
+
+    item_size_categories = [['sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo']]
+    ordinal_features = ['Item Size']
+    ordinal_encoder = OrdinalEncoder(categories=item_size_categories)
+    ```
+
+2. Categorical encoder: it is well suited to nominal variables, which are categorical variables whose values do not follow a logical ordering, like all the features other than `Item Size` in our dataset. It uses one-hot encoding, which means that each category is represented by a binary column: for example, the encoded variable is equal to 1 if the pumpkin belongs to that `Variety` and 0 otherwise.

    ```python
-    sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
+    from sklearn.preprocessing import OneHotEncoder
+
+    categorical_features = ['City Name', 'Package', 'Variety', 'Origin']
+    categorical_encoder = OneHotEncoder(sparse_output=False)
    ```
+Then, `ColumnTransformer` is used to combine multiple encoders into a single step and apply them to the appropriate columns.
+
+```python
+    from sklearn.compose import ColumnTransformer
+    
+    ct = ColumnTransformer(transformers=[
+        ('ord', ordinal_encoder, ordinal_features),
+        ('cat', categorical_encoder, categorical_features)
+        ])
+    
+    ct.set_output(transform='pandas')
+    encoded_features = ct.fit_transform(pumpkins)
+```
+On the other hand, to encode the label, we use the scikit-learn `LabelEncoder` class, which is a utility class to help normalize labels such that they contain only values between 0 and n_classes-1 (here, 0 and 1).
+
+```python
+    from sklearn.preprocessing import LabelEncoder

-    ![A swarm of visualized data](images/swarm.png)
+    label_encoder = LabelEncoder()
+    encoded_label = label_encoder.fit_transform(pumpkins['Color'])
+```
+Once we have encoded the features and the label, we can merge them into a new dataframe `encoded_pumpkins`.
+
+```python
+    encoded_pumpkins = encoded_features.assign(Color=encoded_label)
+```
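
To make the two encodings concrete, here is a minimal pure-Python sketch of the mappings they produce. It is not scikit-learn's implementation: the helper names `ordinal_encode` and `one_hot_encode` and the sample values are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: hypothetical helpers, not scikit-learn's code.

# Ordered size categories, smallest to largest (same order as in the lesson).
SIZE_ORDER = ['sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo']


def ordinal_encode(values, ordered_categories):
    """Map each value to its position in the ordered category list."""
    index = {category: i for i, category in enumerate(ordered_categories)}
    return [index[v] for v in values]


def one_hot_encode(values):
    """Return (sorted categories, one binary row per input value)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


sizes = ['med', 'xlge', 'sml']                      # sample values for illustration
varieties = ['PIE TYPE', 'MINIATURE', 'PIE TYPE']   # sample values for illustration

encoded_sizes = ordinal_encode(sizes, SIZE_ORDER)
variety_columns, encoded_varieties = one_hot_encode(varieties)

print(encoded_sizes)       # [1, 4, 0]
print(variety_columns)     # ['MINIATURE', 'PIE TYPE']
print(encoded_varieties)   # [[0, 1], [1, 0], [0, 1]]
```
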
+✅ What are the advantages of using an ordinal encoder for the `Item Size` column?
+
+### Analyse relationships between variables
+
+Now that we have pre-processed our data, we can analyse the relationships between the features and the label to get an idea of how well the model will be able to predict the label given the features.
+A good way to perform this kind of analysis is to plot the data. We'll again use the Seaborn `catplot` function to visualize the relationships between `Item Size`, `Variety` and `Color` in a categorical plot. To plot the data more clearly, we'll use the encoded `Item Size` column and the unencoded `Variety` column.

-### Violin plot
+```python
+    palette = {
+        'ORANGE': 'orange',
+        'WHITE': 'wheat',
+    }
+    pumpkins['Item Size'] = encoded_pumpkins['ord__Item Size']
+
+    g = sns.catplot(
+        data=pumpkins,
+        x="Item Size", y="Color", row='Variety',
+        kind="box", orient="h",
+        sharex=False, margin_titles=True,
+        height=1.8, aspect=4, palette=palette,
+    )
+    g.set(xlabel="Item Size", ylabel="").set(xlim=(0,6))
+    g.set_titles(row_template="{row_name}")
+```
+![A catplot of visualized data](images/pumpkins_catplot_2.png)
+
+### Use a swarm plot
+
+Since Color is a binary category (White or Not), it needs 'a [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables. 

-A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'.
+You can visualize variables side-by-side with Seaborn plots.

-1. As parameters `x=Color`, `kind="violin"` and call `catplot()`:
+1. Try a 'swarm' plot to show the distribution of values:

    ```python
-    sns.catplot(x="Color", y="Item Size",
-                kind="violin", data=new_pumpkins)
+    palette = {
+        0: 'orange',
+        1: 'wheat'
+    }
+    sns.swarmplot(x="Color", y="ord__Item Size", data=encoded_pumpkins, palette=palette)
    ```

-    ![a violin type chart](images/violin.png)
-
-    ✅ Try creating this plot, and other Seaborn plots, using other variables.
+    ![A swarm of visualized data](images/swarm_2.png)

-Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.

 > **🧮 Show Me The Math**
 >
-> Remember how linear regression often used ordinary least squares to arrive at a value? Logistic regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
+> Logistic regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
 >
 > ![logistic function](images/sigmoid.png)
 >
@ -156,49 +218,47 @@ Building a model to find these binary classification is surprisingly straightfor
    ```python
    from sklearn.model_selection import train_test_split
    
-    Selected_features = ['Origin','Item Size','Variety','City Name','Package']
-    
-    X = new_pumpkins[Selected_features]
-    y = new_pumpkins['Color']
-    
+    X = encoded_pumpkins[encoded_pumpkins.columns.difference(['Color'])]
+    y = encoded_pumpkins['Color']
+
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    ```

-1. Now you can train your model, by calling `fit()` with your training data, and print out its result:
+2. Now you can train your model, by calling `fit()` with your training data, and print out its result:

    ```python
-    from sklearn.model_selection import train_test_split
-    from sklearn.metrics import accuracy_score, classification_report 
+    from sklearn.metrics import f1_score, classification_report 
    from sklearn.linear_model import LogisticRegression
-    
+
    model = LogisticRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
-    
+
    print(classification_report(y_test, predictions))
    print('Predicted labels: ', predictions)
-    print('Accuracy: ', accuracy_score(y_test, predictions))
+    print('F1-score: ', f1_score(y_test, predictions))
    ```

-    Take a look at your model's scoreboard. It's not too bad, considering you have only about 1000 rows of data:
+    Take a look at your model's scoreboard. It's not bad, considering you have only about 1000 rows of data:

    ```output
                       precision    recall  f1-score   support
    
-               0       0.85      0.95      0.90       166
-               1       0.38      0.15      0.22        33
+                    0       0.94      0.98      0.96       166
+                    1       0.85      0.67      0.75        33
    
-        accuracy                           0.82       199
-       macro avg       0.62      0.55      0.56       199
-    weighted avg       0.77      0.82      0.78       199
+        accuracy                                0.92       199
+        macro avg           0.89      0.82      0.85       199
+        weighted avg        0.92      0.92      0.92       199
    
-    Predicted labels:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-     0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-     1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
-     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
-     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-     0 0 0 1 0 1 0 0 1 0 0 0 1 0]
+        Predicted labels:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
+        0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+        1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0
+        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
+        0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
+        0 0 0 1 0 0 0 0 0 0 0 0 1 1]
+        F1-score:  0.7457627118644068
    ```

 ## Better comprehension via a confusion matrix
@ -218,7 +278,7 @@ While you can get a scoreboard report [terms](https://scikit-learn.org/stable/mo

    ```output
    array([[162,   4],
-           [ 33,   0]])
+           [ 11,  22]])
    ```

 In Scikit-learn, confusion matrices Rows (axis 0) are actual labels and columns (axis 1) are predicted labels.
@ -228,22 +288,22 @@ In Scikit-learn, confusion matrices Rows (axis 0) are actual labels and columns
 |   0   |  TN   |  FP   |
 |   1   |  FN   |  TP   |

-What's going on here? Let's say our model is asked to classify pumpkins between two binary categories, category 'orange' and category 'not-orange'.
+What's going on here? Let's say our model is asked to classify pumpkins between two binary categories, category 'white' and category 'not-white'.

- If your model predicts a pumpkin as not orange and it belongs to category 'not-orange' in reality we call it a true negative, shown by the top left number.
- If your model predicts a pumpkin as orange and it belongs to category 'not-orange' in reality we call it a false negative, shown by the bottom left number. 
- If your model predicts a pumpkin as not orange and it belongs to category 'orange' in reality we call it a false positive, shown by the top right number. 
- If your model predicts a pumpkin as orange and it belongs to category 'orange' in reality we call it a true positive, shown by the bottom right number.
+- If your model predicts a pumpkin as not white and it belongs to category 'not-white' in reality we call it a true negative, shown by the top left number.
+- If your model predicts a pumpkin as white and it belongs to category 'not-white' in reality we call it a false positive, shown by the top right number. 
+- If your model predicts a pumpkin as not white and it belongs to category 'white' in reality we call it a false negative, shown by the bottom left number. 
+- If your model predicts a pumpkin as white and it belongs to category 'white' in reality we call it a true positive, shown by the bottom right number.
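
The four outcomes can also be counted by hand. The following is a minimal sketch with made-up labels (1 = white, 0 = not white) and a hypothetical helper `confusion_counts`; it arranges the counts in the same layout as the table, with rows as actual labels and columns as predicted labels.

```python
def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1   # row = actual label, column = predicted label
    return matrix


# Made-up labels for illustration: 1 = white, 0 = not white.
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 1, 0, 0, 1, 0, 1, 0]

matrix = confusion_counts(actual, predicted)
(tn, fp), (fn, tp) = matrix

print(matrix)            # [[4, 1], [1, 2]]
print(tn, fp, fn, tp)    # 4 1 1 2
```
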

 As you might have guessed it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.

-How does the confusion matrix relate to precision and recall? Remember, the classification report printed above showed precision (0.83) and recall (0.98).
+How does the confusion matrix relate to precision and recall? Remember, the classification report printed above showed precision (0.85) and recall (0.67).

-Precision = tp / (tp + fp) = 162 / (162 + 33) = 0.8307692307692308
+Precision = tp / (tp + fp) = 22 / (22 + 4) = 0.8461538461538461

-Recall = tp / (tp + fn) = 162 / (162 + 4) = 0.9759036144578314
+Recall = tp / (tp + fn) = 22 / (22 + 11) = 0.6666666666666666
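
As a quick check, the same arithmetic can be reproduced from the confusion matrix shown earlier. This sketch also derives the F1-score as the harmonic mean of precision and recall; the variable names are illustrative.

```python
# Values taken from the confusion matrix above: [[162, 4], [11, 22]]
tn, fp = 162, 4
fn, tp = 11, 22

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 2))   # 0.85
print(round(recall, 2))      # 0.67
print(round(f1, 2))          # 0.75
```

The F1 value agrees with the F1-score printed after the classification report.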

-✅ Q: According to the confusion matrix, how did the model do? A: Not too bad; there are a good number of true negatives but also several false negatives. 
+✅ Q: According to the confusion matrix, how did the model do? A: Not bad; there are a good number of true negatives but also a few false negatives. 

 Let's revisit the terms we saw earlier with the help of the confusion matrix's mapping of TP/TN and FP/FN:

@ -265,22 +325,28 @@ Let's revisit the terms we saw earlier with the help of the confusion matrix's m

 ## Visualize the ROC curve of this model

-This is not a bad model; its accuracy is in the 80% range so ideally you could use it to predict the color of a pumpkin given a set of variables.
-
-Let's do one more visualization to see the so-called 'ROC' score:
+Let's do one more visualization to see the so-called 'ROC' curve:

 ```python
 from sklearn.metrics import roc_curve, roc_auc_score
+import matplotlib
+import matplotlib.pyplot as plt
+%matplotlib inline

 y_scores = model.predict_proba(X_test)
-# calculate ROC curve
 fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
-sns.lineplot([0, 1], [0, 1])
-sns.lineplot(fpr, tpr)
+
+fig = plt.figure(figsize=(6, 6))
+plt.plot([0, 1], [0, 1], 'k--')
+plt.plot(fpr, tpr)
+plt.xlabel('False Positive Rate')
+plt.ylabel('True Positive Rate')
+plt.title('ROC Curve')
+plt.show()
 ```
-Using Seaborn again, plot the model's [Receiving Operating Characteristic](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html?highlight=roc) or ROC. ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. "ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis." Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly:
+Using Matplotlib, plot the model's [Receiver Operating Characteristic](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html?highlight=roc) or ROC. ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. "ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis." Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly:

-![ROC](./images/ROC.png)
+![ROC](./images/ROC_2.png)

 Finally, use Scikit-learn's [`roc_auc_score` API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html?highlight=roc_auc#sklearn.metrics.roc_auc_score) to compute the actual 'Area Under the Curve' (AUC):

@ -288,7 +354,7 @@ Finally, use Scikit-learn's [`roc_auc_score` API](https://scikit-learn.org/stabl
 auc = roc_auc_score(y_test,y_scores[:,1])
 print(auc)
 ```
-The result is `0.6976998904709748`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is _pretty good_. 
+The result is `0.9749908725812341`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is _pretty good_. 
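
One way to build intuition for AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that directly in plain Python on made-up scores; `pairwise_auc` is a hypothetical helper, not part of Scikit-learn.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(labels, scores)
print(auc)   # 0.75
```

Under this reading, a score of 0.5 corresponds to a random ranking and 1.0 to a perfect one.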

 In future lessons on classifications, you will learn how to iterate to improve your model's scores. But for now, congratulations! You've completed these regression lessons!

--- a/2-Regression/4-Logistic/images/ROC_2.png
+++ b/2-Regression/4-Logistic/images/ROC_2.png
--- a/2-Regression/4-Logistic/images/pumpkins_catplot_1.png
+++ b/2-Regression/4-Logistic/images/pumpkins_catplot_1.png
--- a/2-Regression/4-Logistic/images/pumpkins_catplot_2.png
+++ b/2-Regression/4-Logistic/images/pumpkins_catplot_2.png
--- a/2-Regression/4-Logistic/images/swarm_2.png
+++ b/2-Regression/4-Logistic/images/swarm_2.png
--- a/2-Regression/4-Logistic/notebook.ipynb
+++ b/2-Regression/4-Logistic/notebook.ipynb
@ -1,41 +1,15 @@
 {
- "metadata": {
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.7.0"
-  },
-  "orig_nbformat": 2,
-  "kernelspec": {
-   "name": "python37364bit8d3b438fb5fc4430a93ac2cb74d693a7",
-   "display_name": "Python 3.7.0 64-bit ('3.7')"
-  },
-  "metadata": {
-   "interpreter": {
-    "hash": "70b38d7a306a849643e446cd70466270a13445e5987dfa1344ef2b127438fa4d"
-   }
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2,
 "cells": [
  {
+   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "## Pumpkin Varieties and Color\n",
    "\n",
    "Load up required libraries and dataset. Convert the data to a dataframe containing a subset of the data: \n",
    "\n",
    "Let's look at the relationship between color and variety"
-   ],
-   "cell_type": "markdown",
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
@ -43,8 +17,175 @@
   "metadata": {},
   "outputs": [
    {
-     "output_type": "execute_result",
     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>City Name</th>\n",
+       "      <th>Type</th>\n",
+       "      <th>Package</th>\n",
+       "      <th>Variety</th>\n",
+       "      <th>Sub Variety</th>\n",
+       "      <th>Grade</th>\n",
+       "      <th>Date</th>\n",
+       "      <th>Low Price</th>\n",
+       "      <th>High Price</th>\n",
+       "      <th>Mostly Low</th>\n",
+       "      <th>...</th>\n",
+       "      <th>Unit of Sale</th>\n",
+       "      <th>Quality</th>\n",
+       "      <th>Condition</th>\n",
+       "      <th>Appearance</th>\n",
+       "      <th>Storage</th>\n",
+       "      <th>Crop</th>\n",
+       "      <th>Repack</th>\n",
+       "      <th>Trans Mode</th>\n",
+       "      <th>Unnamed: 24</th>\n",
+       "      <th>Unnamed: 25</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>BALTIMORE</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>24 inch bins</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>4/29/17</td>\n",
+       "      <td>270.0</td>\n",
+       "      <td>280.0</td>\n",
+       "      <td>270.0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>E</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>BALTIMORE</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>24 inch bins</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>5/6/17</td>\n",
+       "      <td>270.0</td>\n",
+       "      <td>280.0</td>\n",
+       "      <td>270.0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>E</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>BALTIMORE</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>24 inch bins</td>\n",
+       "      <td>HOWDEN TYPE</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>9/24/16</td>\n",
+       "      <td>160.0</td>\n",
+       "      <td>160.0</td>\n",
+       "      <td>160.0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>N</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>BALTIMORE</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>24 inch bins</td>\n",
+       "      <td>HOWDEN TYPE</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>9/24/16</td>\n",
+       "      <td>160.0</td>\n",
+       "      <td>160.0</td>\n",
+       "      <td>160.0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>N</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>BALTIMORE</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>24 inch bins</td>\n",
+       "      <td>HOWDEN TYPE</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>11/5/16</td>\n",
+       "      <td>90.0</td>\n",
+       "      <td>100.0</td>\n",
+       "      <td>90.0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>N</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>5 rows × 26 columns</p>\n",
+       "</div>"
+      ],
      "text/plain": [
       "   City Name Type       Package      Variety Sub Variety  Grade     Date  \\\n",
       "0  BALTIMORE  NaN  24 inch bins          NaN         NaN    NaN  4/29/17   \n",
@ -68,28 +209,48 @@
       "4        NaN     NaN   NaN      N         NaN          NaN          NaN  \n",
       "\n",
       "[5 rows x 26 columns]"
-      ],
-      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>City Name</th>\n      <th>Type</th>\n      <th>Package</th>\n      <th>Variety</th>\n      <th>Sub Variety</th>\n      <th>Grade</th>\n      <th>Date</th>\n      <th>Low Price</th>\n      <th>High Price</th>\n      <th>Mostly Low</th>\n      <th>...</th>\n      <th>Unit of Sale</th>\n      <th>Quality</th>\n      <th>Condition</th>\n      <th>Appearance</th>\n      <th>Storage</th>\n      <th>Crop</th>\n      <th>Repack</th>\n      <th>Trans Mode</th>\n      <th>Unnamed: 24</th>\n      <th>Unnamed: 25</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>BALTIMORE</td>\n      <td>NaN</td>\n      <td>24 inch bins</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>4/29/17</td>\n      <td>270.0</td>\n      <td>280.0</td>\n      <td>270.0</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>E</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>BALTIMORE</td>\n      <td>NaN</td>\n      <td>24 inch bins</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>5/6/17</td>\n      <td>270.0</td>\n      <td>280.0</td>\n      <td>270.0</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>E</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>BALTIMORE</td>\n      <td>NaN</td>\n      <td>24 inch 
bins</td>\n      <td>HOWDEN TYPE</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>9/24/16</td>\n      <td>160.0</td>\n      <td>160.0</td>\n      <td>160.0</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>N</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>BALTIMORE</td>\n      <td>NaN</td>\n      <td>24 inch bins</td>\n      <td>HOWDEN TYPE</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>9/24/16</td>\n      <td>160.0</td>\n      <td>160.0</td>\n      <td>160.0</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>N</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>BALTIMORE</td>\n      <td>NaN</td>\n      <td>24 inch bins</td>\n      <td>HOWDEN TYPE</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>11/5/16</td>\n      <td>90.0</td>\n      <td>100.0</td>\n      <td>90.0</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>N</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n  </tbody>\n</table>\n<p>5 rows × 26 columns</p>\n</div>"
+      ]
     },
+     "execution_count": 1,
     "metadata": {},
-     "execution_count": 1
+     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
-    "pumpkins = pd.read_csv('../data/US-pumpkins.csv')\n",
+    "full_pumpkins = pd.read_csv('../data/US-pumpkins.csv')\n",
    "\n",
-    "pumpkins.head()\n"
+    "full_pumpkins.head()\n"
   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
  }
- ]
-}
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.1"
+  },
+  "metadata": {
+   "interpreter": {
+    "hash": "70b38d7a306a849643e446cd70466270a13445e5987dfa1344ef2b127438fa4d"
+   }
+  },
+  "orig_nbformat": 2
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/2-Regression/4-Logistic/solution/notebook.ipynb
+++ b/2-Regression/4-Logistic/solution/notebook.ipynb