editorial

3 years ago · 01f80e7cdb
parent 7b0c52d643
commit 01f80e7cdb
1 changed files with 157 additions and 126 deletions
--- a/4-Classification/2-Classifiers-1/README.md
+++ b/4-Classification/2-Classifiers-1/README.md
@@ -1,75 +1,83 @@
 # Cuisine classifiers 1

-In this lesson, you will use the dataset you saved from the last lesson full of balanced, clean data all about cuisines. You will use this dataset with a variety of classifiers to predict a given national cuisine based on a group of ingredients. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
+In this lesson, you will use the dataset you saved from the last lesson full of balanced, clean data all about cuisines.
+
+You will use this dataset with a variety of classifiers to _predict a given national cuisine based on a group of ingredients_. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.

 ## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/21/)
 # Preparation

 Assuming you completed [Lesson 1](../1-Introduction/README.md), make sure that a _cleaned_cuisines.csv_ file exists in the root `/data` folder for these four lessons.

-Working in this lesson's _notebook.ipynb_ folder, import that file along with the Pandas library:
-
-```python
-import pandas as pd
-cuisines_df = pd.read_csv("../../data/cleaned_cuisine.csv")
-cuisines_df.head()
-```
-The data looks like this:
-
-|     | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
-| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
-| 0   | 0          | indian  | 0      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 0      | 0        |
-| 1   | 1          | indian  | 1      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 0      | 0        |
-| 2   | 2          | indian  | 0      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 0      | 0        |
-| 3   | 3          | indian  | 0      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 0      | 0        |
-| 4   | 4          | indian  | 0      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 1      | 0        |
-
-Now, import several more libraries:
-
-```python
-from sklearn.linear_model import LogisticRegression
-from sklearn.model_selection import train_test_split, cross_val_score
-from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
-from sklearn.svm import SVC
-import numpy as np
-```
-
-Divide the X and y coordinates into two dataframes for training. `cuisine` can be the labels dataframe:
-
-```python
-cuisines_label_df = cuisines_df['cuisine']
-cuisines_label_df.head()
-```
-
-It will look like this:
-
-```
-0    indian
-1    indian
-2    indian
-3    indian
-4    indian
-Name: cuisine, dtype: object
-```
-
-Drop that `Unnamed: 0` column and the `cuisine` column and save the rest of the data as trainable features:
-
-```python
-cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
-cuisines_feature_df.head()
-```
-
-Your features look like this:
-
-| almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke |  ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood |  yam | yeast | yogurt | zucchini |     |
-| -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | ---: | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | ---: | ----: | -----: | -------: | --- |
-|      0 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 |    0 |     ... |           0 |          0 |                       0 |    0 |    0 |    0 |     0 |      0 |        0 | 0   |
-|      1 |        1 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 |    0 |     ... |           0 |          0 |                       0 |    0 |    0 |    0 |     0 |      0 |        0 | 0   |
-|      2 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 |    0 |     ... |           0 |          0 |                       0 |    0 |    0 |    0 |     0 |      0 |        0 | 0   |
-|      3 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 |    0 |     ... |           0 |          0 |                       0 |    0 |    0 |    0 |     0 |      0 |        0 | 0   |
-|      4 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 |    0 |     ... |           0 |          0 |                       0 |    0 |    0 |    0 |     0 |      0 |        1 | 0   |
+## Exercise - predict a national cuisine
+
+1. Working in this lesson's _notebook.ipynb_ file, import that file along with the Pandas library:
+
+    ```python
+    import pandas as pd
+    cuisines_df = pd.read_csv("../../data/cleaned_cuisines.csv")
+    cuisines_df.head()
+    ```
+
+    The data looks like this:
+
+    ```output
+    |     | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
+    | --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
+    | 0   | 0          | indian  | 0      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 0      | 0        |
+    | 1   | 1          | indian  | 1      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 0      | 0        |
+    | 2   | 2          | indian  | 0      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 0      | 0        |
+    | 3   | 3          | indian  | 0      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 0      | 0        |
+    | 4   | 4          | indian  | 0      | 0        | 0     | 0          | 0     | 0            | 0       | 0        | ... | 0       | 0           | 0          | 0                       | 0    | 0    | 0   | 0     | 1      | 0        |
+    ```
+
+1. Now, import several more libraries:
+
+    ```python
+    from sklearn.linear_model import LogisticRegression
+    from sklearn.model_selection import train_test_split, cross_val_score
+    from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
+    from sklearn.svm import SVC
+    import numpy as np
+    ```
+
+1. Divide the data into features (X) and labels (y), stored in two dataframes for training. `cuisine` can be the labels dataframe:
+
+    ```python
+    cuisines_label_df = cuisines_df['cuisine']
+    cuisines_label_df.head()
+    ```
+
+    It will look like this:
+
+    ```output
+    0    indian
+    1    indian
+    2    indian
+    3    indian
+    4    indian
+    Name: cuisine, dtype: object
+    ```
+
+1. Drop that `Unnamed: 0` column and the `cuisine` column, calling `drop()`. Save the rest of the data as trainable features:
+
+    ```python
+    cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
+    cuisines_feature_df.head()
+    ```
+
+    Your features look like this:
+
+    |     | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
+    | --: | -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | --: | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | --: | ----: | -----: | -------: |
+    |   0 |      0 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 | ... |       0 |           0 |          0 |                       0 |    0 |    0 |   0 |     0 |      0 |        0 |
+    |   1 |      1 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 | ... |       0 |           0 |          0 |                       0 |    0 |    0 |   0 |     0 |      0 |        0 |
+    |   2 |      0 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 | ... |       0 |           0 |          0 |                       0 |    0 |    0 |   0 |     0 |      0 |        0 |
+    |   3 |      0 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 | ... |       0 |           0 |          0 |                       0 |    0 |    0 |   0 |     0 |      0 |        0 |
+    |   4 |      0 |        0 |     0 |          0 |     0 |            0 |       0 |        0 |         0 |         0 | ... |       0 |           0 |          0 |                       0 |    0 |    0 |   0 |     0 |      1 |        0 |

 Now you are ready to train your model!
+
 ## Choosing your classifier

 Now that your data is clean and ready for training, you have to decide which algorithm to use for the job. 
@@ -87,6 +95,8 @@ Scikit-learn groups classification under Supervised Learning, and in that catego

 > You can also use [neural networks to classify data](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), but that is outside the scope of this lesson.

+### What classifier to go with?
+
 So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, showing the results visualized:

 ![comparison of classifiers](images/comparison.png)
@@ -94,6 +104,8 @@ So, which classifier should you choose? Often, running through several and looki

 > AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-15963-cxa)

+### A better approach
+
 A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-15963-cxa). Here, we discover that, for our multiclass problem, we have some choices:

 ![cheatsheet for multiclass problems](images/cheatsheet.png)
@@ -101,24 +113,25 @@ A better way than wildly guessing, however, is to follow the ideas on this downl

 ✅ Download this cheat sheet, print it out, and hang it on your wall!

-Given our clean, but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task. We do not use a two-class classifier, so that rules out one-vs-all. A decision tree might work, or logistic regression for multiclass data. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us. 
+### Reasoning

-We can focus on logistic regression for our first training trial since you recently learned about the latter in a previous lesson.
-## Train your model
+Let's see if we can reason our way through different approaches given the constraints we have:

-Let's train a model. Split your data into training and testing groups:
+- **Neural networks are too heavy**. Given our clean, but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
+- **No two-class classifier**. We do not use a two-class classifier, so that rules out one-vs-all. 
+- **Decision tree or logistic regression could work**. A decision tree might work, or logistic regression for multiclass data. 
+- **Multiclass boosted decision tree is the wrong fit**. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.

-```python
-X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
-```
+### Using Scikit-learn

-There are many ways to use the LogisticRegression library in Scikit-learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).  
+We will be using Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).

-According to the docs, "In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)"
+Essentially, there are two important parameters, `multi_class` and `solver`, that we need to specify when we ask Scikit-learn to perform a logistic regression. The `multi_class` value selects the multiclass scheme to apply. The `solver` value selects the optimization algorithm to use. Not all solvers can be paired with all `multi_class` values.

-Since you are using the multiclass case, you need to choose what scheme to use and what 'solver' to set. 
+According to the docs, in the multiclass case, the training algorithm:

-Use LogisticRegression with a multiclass setting and the liblinear solver to train.
+- **Uses the one-vs-rest (OvR) scheme**, if the `multi_class` option is set to `ovr`.
+- **Uses the cross-entropy loss**, if the `multi_class` option is set to `multinomial`. (Currently the `multinomial` option is supported only by the `lbfgs`, `sag`, `saga` and `newton-cg` solvers.)

 > 🎓 The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. [source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)
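
To make the difference between the two schemes concrete, here is a minimal pure-Python sketch of how per-class scores could be turned into probabilities under each one. It is an illustration under stated assumptions, not Scikit-learn's actual code: the function names and the three scores are hypothetical, and the one-vs-rest version simply squashes each score on its own and rescales the results.

```python
import math

def sigmoid(z):
    """Map a raw score to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def ovr_probabilities(scores):
    """One-vs-rest: squash each class score independently, then rescale to sum to 1."""
    squashed = [sigmoid(s) for s in scores]
    total = sum(squashed)
    return [p / total for p in squashed]

def multinomial_probabilities(scores):
    """Multinomial: a single softmax over all class scores at once."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical decision scores for three classes (not taken from the lesson's model)
scores = [2.0, 0.5, -1.0]
ovr = ovr_probabilities(scores)
multinomial = multinomial_probabilities(scores)

print("ovr:        ", [round(p, 3) for p in ovr])
print("multinomial:", [round(p, 3) for p in multinomial])
```

Both schemes rank the classes in the same order here, but the multinomial scheme spreads the probabilities differently because all classes compete within a single softmax.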

@@ -128,74 +141,92 @@ Scikit-learn offers this table to explain how solvers handle different challenge

 ![solvers](images/solvers.png)

-```python
-lr = LogisticRegression(multi_class='ovr',solver='liblinear')
-model = lr.fit(X_train, np.ravel(y_train))
+## Exercise - split the data
+
+We can focus on logistic regression for our first training trial since you recently learned about it in a previous lesson.
+Split your data into training and testing groups by calling `train_test_split()`:

-accuracy = model.score(X_test, y_test)
-print ("Accuracy is {}".format(accuracy))
+```python
+X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
 ```
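
As a rough illustration of what a 70/30 split does, the sketch below shuffles row indices and reserves a fraction for testing, using only the standard library. The helper `split_rows`, its `seed` argument, and the `recipe_…` rows are hypothetical names for this example. It is a simplification and makes no claim to match `train_test_split()` internals.

```python
import random

def split_rows(rows, test_size=0.3, seed=0):
    """Shuffle row indices, then reserve a `test_size` fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Hypothetical stand-ins for rows of the cuisines dataframe
rows = [f"recipe_{i}" for i in range(10)]
train_rows, test_rows = split_rows(rows)

print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

In Scikit-learn itself, passing `random_state` to `train_test_split()` plays a similar role to the `seed` here, making the split reproducible between runs.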

-✅ Try a different solver like `lbfgs`, which is often set as default
+## Exercise - apply logistic regression

-> Note, use Pandas [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) function to flatten your data when needed.
+Since you are using the multiclass case, you need to choose what _scheme_ to use and what _solver_ to set. Use LogisticRegression with a multiclass setting and the **liblinear** solver to train.

-The accuracy is good at over 80%!
+1. Create a logistic regression with `multi_class` set to `ovr` and the solver set to `liblinear`:

-You can see this model in action by testing one row of data (#50):
+    ```python
+    lr = LogisticRegression(multi_class='ovr',solver='liblinear')
+    model = lr.fit(X_train, np.ravel(y_train))
    
-```python
-print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
-print(f'cuisine: {y_test.iloc[50]}')
-```
-The result is printed:
-```
-ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
-cuisine: indian
-```
+    accuracy = model.score(X_test, y_test)
+    print ("Accuracy is {}".format(accuracy))
+    ```

-✅ Try a different row number and check the results
+    ✅ Try a different solver like `lbfgs`, which is often set as default

-Digging deeper, you can check for the accuracy of this prediction:
+    > Note, use Pandas [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) function to flatten your data when needed.

-```python
-test= X_test.iloc[50].values.reshape(-1, 1).T
-proba = model.predict_proba(test)
-classes = model.classes_
-resultdf = pd.DataFrame(data=proba, columns=classes)
+    The accuracy is good at over **80%**!

-topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
-topPrediction.head()
-```
-The result is printed - Indian cuisine is its best guess, with good probability:
+1. You can see this model in action by testing one row of data (#50):

-|          |        0 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| -------: | -------: | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-|   indian | 0.715851 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-|  chinese | 0.229475 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| japanese | 0.029763 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-|   korean | 0.017277 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-|     thai | 0.007634 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
+    ```python
+    print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
+    print(f'cuisine: {y_test.iloc[50]}')
+    ```

-✅ Can you explain why the model is pretty sure this is an Indian cuisine?
+    The result is printed:

-Get more detail by printing a classification report, as you did in the regression lessons:
+    ```output
+    ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
+    cuisine: indian
+    ```

-```python
-y_pred = model.predict(X_test)
-print(classification_report(y_test,y_pred))
-```
+    ✅ Try a different row number and check the results
+
+1. Digging deeper, you can check for the accuracy of this prediction:
+
+    ```python
+    test= X_test.iloc[50].values.reshape(-1, 1).T
+    proba = model.predict_proba(test)
+    classes = model.classes_
+    resultdf = pd.DataFrame(data=proba, columns=classes)
+    
+    topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
+    topPrediction.head()
+    ```
+
+    The result is printed - Indian cuisine is its best guess, with good probability:
+
+    |          |        0 |
+    | -------: | -------: |
+    |   indian | 0.715851 |
+    |  chinese | 0.229475 |
+    | japanese | 0.029763 |
+    |   korean | 0.017277 |
+    |     thai | 0.007634 |
+
+    ✅ Can you explain why the model is pretty sure this is an Indian cuisine?
+
+1. Get more detail by printing a classification report, as you did in the regression lessons:
+
+    ```python
+    y_pred = model.predict(X_test)
+    print(classification_report(y_test,y_pred))
+    ```

-| precision    | recall | f1-score | support |      |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| ------------ | ------ | -------- | ------- | ---- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| chinese      | 0.73   | 0.71     | 0.72    | 229  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| indian       | 0.91   | 0.93     | 0.92    | 254  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| japanese     | 0.70   | 0.75     | 0.72    | 220  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| korean       | 0.86   | 0.76     | 0.81    | 242  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| thai         | 0.79   | 0.85     | 0.82    | 254  |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| accuracy     | 0.80   | 1199     |         |      |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| macro avg    | 0.80   | 0.80     | 0.80    | 1199 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
-| weighted avg | 0.80   | 0.80     | 0.80    | 1199 |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
+    |              | precision | recall | f1-score | support |
+    | ------------ | --------- | ------ | -------- | ------- |
+    | chinese      | 0.73      | 0.71   | 0.72     | 229     |
+    | indian       | 0.91      | 0.93   | 0.92     | 254     |
+    | japanese     | 0.70      | 0.75   | 0.72     | 220     |
+    | korean       | 0.86      | 0.76   | 0.81     | 242     |
+    | thai         | 0.79      | 0.85   | 0.82     | 254     |
+    | accuracy     |           |        | 0.80     | 1199    |
+    | macro avg    | 0.80      | 0.80   | 0.80     | 1199    |
+    | weighted avg | 0.80      | 0.80   | 0.80     | 1199    |
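
To see how the columns of the classification report relate to each other, this short sketch recomputes the f1-score for each cuisine from the precision and recall values printed above, along with the macro-averaged precision. The numbers are copied from the report. The helper name `f1_from` is hypothetical, and because the inputs are already rounded to two decimals, this is a consistency check rather than an exact recomputation.

```python
# (precision, recall) per cuisine, copied from the classification report above
report = {
    "chinese": (0.73, 0.71),
    "indian": (0.91, 0.93),
    "japanese": (0.70, 0.75),
    "korean": (0.86, 0.76),
    "thai": (0.79, 0.85),
}

def f1_from(precision, recall):
    """The f1-score is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_by_cuisine = {
    cuisine: round(f1_from(p, r), 2) for cuisine, (p, r) in report.items()
}

# The macro average is the unweighted mean over the classes
macro_precision = round(sum(p for p, _ in report.values()) / len(report), 2)

print(f1_by_cuisine)
print("macro avg precision:", macro_precision)
```

The recomputed values agree with the f1-score column and the macro avg row of the report.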

 ## 🚀Challenge