classification 2

pull/34/head
Jen Looper 4 years ago
parent 29719bef72
commit 056f9a73b1

@ -15,7 +15,9 @@ In this lesson, you will learn:
Deepen your understanding of working with this type of Regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
## Prerequisite
Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: Color. Let's build a Logistic Regression model to predict, given some variables, what color a given pumpkin will be (orange 🎃 or white 👻).
> Why are we talking about binary classification in a group of lessons about regression? Only for convenience, as Logistic Regression is [really a Classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.
For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.
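To make the binary setup concrete, here is a minimal sketch of the kind of model we will build. The tiny DataFrame below is a synthetic stand-in, not the real pumpkin data, and the feature columns are illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned pumpkin data: numeric-encoded
# features and a binary color label.
pumpkins = pd.DataFrame({
    'variety':   [0, 0, 1, 1, 0, 1, 0, 1],
    'item_size': [1, 2, 3, 3, 1, 2, 2, 3],
    'color':     ['ORANGE', 'ORANGE', 'WHITE', 'WHITE',
                  'ORANGE', 'WHITE', 'ORANGE', 'WHITE'],
})

X_train, X_test, y_train, y_test = train_test_split(
    pumpkins[['variety', 'item_size']], pumpkins['color'],
    test_size=0.25, stratify=pumpkins['color'], random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(model.predict(X_test))  # each prediction is 'ORANGE' or 'WHITE'
```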

@ -91,11 +91,23 @@ You can also use [neural networks to classify](https://scikit-learn.org/stable/m
So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-Learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, showing the results visualized:
![comparison of classifiers](images/comparison.png)
> Plots generated from Scikit-Learn's documentation
> AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-15963-cxa)
✅ Todo: knowledge check
A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat sheet](https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-15963-cxa). Here, we discover that, for our multiclass problem, we have some choices:
![cheatsheet for multiclass problems](images/cheatsheet.png)
> A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options
✅ Download this cheat sheet, print it out, and hang it on your wall!
Given our clean but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task. We do not use a two-class classifier, so that rules out One-vs-All. A Decision Tree might work, and so might Logistic Regression for multiclass data. The Multiclass Boosted Decision Tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us. We can focus on Decision Trees and Logistic Regression.
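Both finalists can be sanity-checked with the same scoring call; a minimal sketch, assuming the `X_train`/`X_test`/`y_train`/`y_test` split you will create in the next section:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Score both candidates on the same held-out data (mean accuracy).
for clf in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    fitted = clf.fit(X_train, np.ravel(y_train))
    print(type(clf).__name__, round(fitted.score(X_test, y_test), 3))
```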
Let's focus on Logistic Regression for our first training trial since you recently learned about it in a previous lesson.
## Train your model
Let's train that model. Split your data into training and testing groups:
@ -104,18 +116,34 @@ Let's train that model. Split your data into training and testing groups:
```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(recipes_feature_df, recipes_label_df, test_size=0.3)
```
There are many ways to configure the LogisticRegression class in Scikit-Learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
According to the docs, "In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the multi_class option is set to ovr, and uses the cross-entropy loss if the multi_class option is set to multinomial. (Currently the multinomial option is supported only by the lbfgs, sag, saga and newton-cg solvers.)"
Since you are using the multiclass case, you need to choose what scheme to use and what 'solver' to set.
Use LogisticRegression with a multiclass setting and the liblinear solver to train.
> 🎓 The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since Logistic Regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. [source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)
> 🎓 You will see `np.ravel` in the training call below: it flattens the one-column label data into the plain 1-D array that Scikit-Learn's `fit()` expects.
> 🎓 The 'solver' is defined as "the algorithm to use in the optimization problem". [source](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
Scikit-Learn offers this table to explain how solvers handle different challenges presented by different kinds of data structures:
![solvers](images/solvers.png)
```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# One-vs-rest scheme with the liblinear solver
lr = LogisticRegression(multi_class='ovr', solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))

# Mean accuracy on the held-out test set
accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))
```
✅ Try a different solver like `lbfgs`, which is the default in recent versions of Scikit-Learn
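For example, a sketch of the same training call with `lbfgs` and the multinomial scheme, on the same split as above (`max_iter` is raised here only because lbfgs can need more iterations to converge on this data):

```python
# Multinomial scheme with the lbfgs solver; same X_train/y_train as above.
lr_lbfgs = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model_lbfgs = lr_lbfgs.fit(X_train, np.ravel(y_train))
print("Accuracy is {}".format(model_lbfgs.score(X_test, y_test)))
```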
> Note: use the [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) function to flatten your labels into a 1-D array when needed.
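A quick illustration of what `ravel` does to a one-column label frame:

```python
import numpy as np
import pandas as pd

labels = pd.DataFrame({'cuisine': ['thai', 'indian', 'korean']})
print(labels.shape)            # (3, 1): a 2-D, one-column structure
print(np.ravel(labels))        # ['thai' 'indian' 'korean']
print(np.ravel(labels).shape)  # (3,): the flat 1-D array fit() expects
```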
The accuracy is good at around 80%!
You can see this model in action by testing one row of data (#50):
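One way to run that spot check (a sketch that assumes the split above; which row lands at #50 depends on your random split):

```python
# Print the non-zero ingredient columns of test row 50, then its true label.
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50] != 0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')
```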
@ -130,8 +158,7 @@
```
ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
cuisine: indian
```
✅ Try a different row number and check the results
Digging deeper, you can check for the accuracy of this prediction:
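A sketch of that check, assuming `model` and `X_test` from above: reshape the row into the 2-D shape `predict_proba()` expects, then rank the per-class probabilities:

```python
import pandas as pd

# One sample, n features: the shape predict_proba() expects.
test = X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)

# Rank the classes by predicted probability, highest first.
resultdf = pd.DataFrame(data=proba, columns=model.classes_)
print(resultdf.T.sort_values(by=[0], ascending=False))
```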
@ -173,15 +200,15 @@ print(classification_report(y_test,y_pred))
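The report comes from Scikit-Learn's `classification_report`, along these lines:

```python
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```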
|              | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| accuracy     |           |        | 0.80     | 1199    |
| macro avg    | 0.80      | 0.80   | 0.80     | 1199    |
| weighted avg | 0.80      | 0.80   | 0.80     | 1199    |
## 🚀Challenge
Optional: add a screenshot of the completed lesson's UI if appropriate
In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-Learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/20/)
## Review & Self Study
Dig a little more into the math behind Logistic Regression in [this lesson](https://people.eecs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2006.pdf)
## Assignment
[Assignment Name](assignment.md)

Binary image file added (87 KiB); not shown.

Binary image file added (132 KiB); not shown.

@ -151,12 +151,12 @@
"output_type": "stream",
"name": "stdout",
"text": [
"Accuracy is 0.8023352793994996\n"
"Accuracy is 0.7906588824020017\n"
]
}
],
"source": [
"lr = LogisticRegression(multi_class='ovr',solver='lbfgs')\n",
"lr = LogisticRegression(multi_class='ovr',solver='liblinear')\n",
"model = lr.fit(X_train, np.ravel(y_train))\n",
"\n",
"accuracy = model.score(X_test, y_test)\n",
@ -165,14 +165,14 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 18,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')\ncuisine: indian\n"
"ingredients: Index(['basil', 'coconut', 'coriander', 'cumin', 'fenugreek', 'pepper',\n 'turmeric'],\n dtype='object')\ncuisine: thai\n"
]
}
],
@ -184,7 +184,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 19,
"metadata": {},
"outputs": [
{
@ -192,16 +192,16 @@
"data": {
"text/plain": [
" 0\n",
"indian 0.715851\n",
"chinese 0.229475\n",
"japanese 0.029763\n",
"korean 0.017277\n",
"thai 0.007634"
"thai 0.857884\n",
"indian 0.105667\n",
"japanese 0.033860\n",
"chinese 0.002365\n",
"korean 0.000224"
],
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>indian</th>\n <td>0.715851</td>\n </tr>\n <tr>\n <th>chinese</th>\n <td>0.229475</td>\n </tr>\n <tr>\n <th>japanese</th>\n <td>0.029763</td>\n </tr>\n <tr>\n <th>korean</th>\n <td>0.017277</td>\n </tr>\n <tr>\n <th>thai</th>\n <td>0.007634</td>\n </tr>\n </tbody>\n</table>\n</div>"
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>thai</th>\n <td>0.857884</td>\n </tr>\n <tr>\n <th>indian</th>\n <td>0.105667</td>\n </tr>\n <tr>\n <th>japanese</th>\n <td>0.033860</td>\n </tr>\n <tr>\n <th>chinese</th>\n <td>0.002365</td>\n </tr>\n <tr>\n <th>korean</th>\n <td>0.000224</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 24
"execution_count": 19
}
],
"source": [
@ -227,7 +227,7 @@
"output_type": "stream",
"name": "stdout",
"text": [
" precision recall f1-score support\n\n chinese 0.73 0.71 0.72 229\n indian 0.91 0.93 0.92 254\n japanese 0.70 0.75 0.72 220\n korean 0.86 0.76 0.81 242\n thai 0.79 0.85 0.82 254\n\n accuracy 0.80 1199\n macro avg 0.80 0.80 0.80 1199\nweighted avg 0.80 0.80 0.80 1199\n\n"
" precision recall f1-score support\n\n chinese 0.72 0.69 0.71 239\n indian 0.90 0.89 0.89 244\n japanese 0.74 0.75 0.75 232\n korean 0.82 0.77 0.79 229\n thai 0.78 0.85 0.81 255\n\n accuracy 0.79 1199\n macro avg 0.79 0.79 0.79 1199\nweighted avg 0.79 0.79 0.79 1199\n\n"
]
}
],
@ -279,10 +279,10 @@
"output_type": "stream",
"name": "stdout",
"text": [
"Accuracy (train) for L1 logistic: 79.5% \n",
"Accuracy (train) for L2 logistic (Multinomial): 80.1% \n",
"Accuracy (train) for L2 logistic (OvR): 80.7% \n",
"Accuracy (train) for Linear SVC: 78.5% \n"
"Accuracy (train) for L1 logistic: 78.0% \n",
"Accuracy (train) for L2 logistic (Multinomial): 78.5% \n",
"Accuracy (train) for L2 logistic (OvR): 78.7% \n",
"Accuracy (train) for Linear SVC: 78.4% \n"
]
}
],
@ -311,7 +311,8 @@
},
"kernelspec": {
"name": "python37364bit8d3b438fb5fc4430a93ac2cb74d693a7",
"display_name": "Python 3.7.0 64-bit ('3.7')"
"display_name": "Python 3.7.3 64-bit",
"language": "python"
},
"language_info": {
"codemirror_mode": {

@ -3,7 +3,21 @@
In this second Classification lesson, you will explore more ways to classify data, and the ramifications for choosing one over the other.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/21/)
✅ Todo: describe what we will learn
Scikit-Learn offers a similar but more granular cheat sheet that can further help you narrow down your estimators (Scikit-Learn's term for its models, including classifiers):
![ML Map from Scikit-Learn](images/map.png)
This map is very helpful as you can 'walk' along its paths to a decision:
- We have >50 samples
- We want to predict a category
- We have labeled data
- We have fewer than 100K samples
- We can choose a Linear SVC
- If that doesn't work, since we have numeric data, we can try a KNeighbors Classifier
- If that still doesn't work, try SVC and Ensemble Classifiers
This is a terrific trail to try. For our first foray, explore how well the map's first recommendation, a Linear SVC, performs on our cuisine data, as sketched below.
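A sketch of the first two stops on that path, assuming the cuisine feature and label frames prepared in the previous lesson (`recipes_feature_df` and `recipes_label_df`):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(
    recipes_feature_df, np.ravel(recipes_label_df), test_size=0.3)

# Walk the map: Linear SVC first, then KNeighbors if it underperforms.
for name, clf in [('Linear SVC', LinearSVC(max_iter=10000)),
                  ('KNeighbors', KNeighborsClassifier())]:
    score = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f'{name}: {score:.2f}')
```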
### Introduction

Binary image file added (743 KiB); not shown.
