# Recipe Classifiers 2

In this second Classification lesson, you will explore more ways to classify numeric data, and the ramifications for choosing one over the other.

## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/21/)

### Prerequisite

We assume that you have completed the previous lessons and have a cleaned dataset in your `data` folder called `cleaned_cuisine.csv` in the root of this 4-lesson folder.

### Preparation

We have loaded your `notebook.ipynb` file with the cleaned dataset and have divided it into X and y dataframes, ready for the model building process.
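
If you are starting from a fresh notebook instead, here is a minimal sketch of that preparation. The file path and the dropped index column name (`Unnamed: 0`) are assumptions that depend on how you saved the cleaned data in the previous lesson:

```python
import pandas as pd

# Assumed path and column names; adjust to match your own cleaned dataset.
recipes_df = pd.read_csv("../data/cleaned_cuisine.csv")

# y: the cuisine labels; X: everything except the label and the saved index column.
recipes_label_df = recipes_df['cuisine']
recipes_feature_df = recipes_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
```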

## A Classification Map

Previously, you learned about the various options you have when classifying data using Microsoft's cheat sheet. Scikit-Learn offers a similar, but more granular cheat sheet that can further help narrow down your estimators (another term for classifiers):

![map](images/map.png)

> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read documentation.

This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:

- We have >50 samples
- We want to predict a category
- We have labeled data
- We have fewer than 100K samples
- ✨ We can choose a Linear SVC
- If that doesn't work, since we have numeric data
  - We can try a ✨ KNeighbors Classifier
    - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers

This is a terrific trail to try. Following this path, we should start by importing some libraries to use:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
import numpy as np
```

Split your training and test data:

```python
X_train, X_test, y_train, y_test = train_test_split(recipes_feature_df, recipes_label_df, test_size=0.3)
```

## Linear SVC Classifier

Start by creating a dictionary of classifiers. You will add to it progressively as we test. Start with a Linear SVC:

```python
C = 10
# Create different classifiers.
classifiers = {
    'Linear SVC': SVC(kernel='linear', C=C, probability=True, random_state=0)
}
```

Train your model using the Linear SVC and print out a report:

```python
n_classifiers = len(classifiers)

for index, (name, classifier) in enumerate(classifiers.items()):
    classifier.fit(X_train, np.ravel(y_train))

    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
    print(classification_report(y_test, y_pred))
```

The result is pretty good:

```
Accuracy (train) for Linear SVC: 78.6%
              precision    recall  f1-score   support

     chinese       0.71      0.67      0.69       242
      indian       0.88      0.86      0.87       234
    japanese       0.79      0.74      0.76       254
      korean       0.85      0.81      0.83       242
        thai       0.71      0.86      0.78       227

    accuracy                           0.79      1199
   macro avg       0.79      0.79      0.79      1199
weighted avg       0.79      0.79      0.79      1199
```

✅ Learn about Linear SVC

Support-Vector Classification (SVC) is a child of the Support-Vector Machines family of ML techniques (learn more about these below). In this method, you can choose a 'kernel' to decide how to separate the labels. The 'C' parameter controls 'regularization': smaller values mean a smoother decision boundary, while larger values penalize misclassified training samples more heavily. The kernel can be one of [several](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC); here we set it to 'linear' to ensure that we leverage Linear SVC. Probability defaults to 'false'; here we set it to 'true' to gather probability estimates. We set the random state to '0' to control the shuffling used when generating those probabilities.
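
Since the Linear SVC was trained with `probability=True`, you can also peek at the model's confidence for an individual recipe. A minimal sketch, assuming the `classifiers` dictionary and the `X_test` split from above are in scope:

```python
# Pull the fitted Linear SVC back out of the classifiers dictionary.
linear_svc = classifiers['Linear SVC']

# Because probability=True was set, predict_proba returns one probability
# per cuisine for each row; here we look at the first test recipe.
probabilities = linear_svc.predict_proba(X_test.iloc[[0]])[0]

for cuisine, probability in zip(linear_svc.classes_, probabilities):
    print(f"{cuisine}: {probability:.2f}")
```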

## K-Neighbors Classifier

The previous classifier was good, and worked well with the data, but maybe we can get better accuracy. Try a K-Neighbors Classifier. Add a line to your classifiers dictionary (add a comma after the Linear SVC item):

```python
'KNN classifier': KNeighborsClassifier(C),
```

The result is a little worse:

```
Accuracy (train) for KNN classifier: 73.8%
              precision    recall  f1-score   support

     chinese       0.64      0.67      0.66       242
      indian       0.86      0.78      0.82       234
    japanese       0.66      0.83      0.74       254
      korean       0.94      0.58      0.72       242
        thai       0.71      0.82      0.76       227

    accuracy                           0.74      1199
   macro avg       0.76      0.74      0.74      1199
weighted avg       0.76      0.74      0.74      1199
```

✅ Learn about [K-Neighbors](https://scikit-learn.org/stable/modules/neighbors.html#neighbors)

K-Neighbors is part of the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, the label of a new data point is predicted from the labels of a predefined number (k) of its nearest neighbors in the training data.
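
The main parameter to experiment with here is the number of neighbors. As an optional side experiment (not part of the lesson's path), you could compare a few values of `n_neighbors` using the `cross_val_score` helper imported earlier, assuming `X_train` and `y_train` from above:

```python
# Hypothetical values of k; the best choice depends on your data.
for k in [3, 5, 10, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, np.ravel(y_train), cv=5)
    print(f"k={k}: mean cross-validated accuracy {scores.mean():.3f}")
```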

## Support Vector Classifier

Let's try for a little better accuracy with a Support Vector Classifier. Add a comma after the K-Neighbors item, and then add this line:

```python
'SVC': SVC(),
```

The result is quite good!

```
Accuracy (train) for SVC: 83.2%
              precision    recall  f1-score   support

     chinese       0.79      0.74      0.76       242
      indian       0.88      0.90      0.89       234
    japanese       0.87      0.81      0.84       254
      korean       0.91      0.82      0.86       242
        thai       0.74      0.90      0.81       227

    accuracy                           0.83      1199
   macro avg       0.84      0.83      0.83      1199
weighted avg       0.84      0.83      0.83      1199
```

✅ Learn about [Support-Vectors](https://scikit-learn.org/stable/modules/svm.html#svm)

Support-Vector Classifiers are part of the [Support-Vector Machine](https://en.wikipedia.org/wiki/Support-vector_machine) family of ML methods that are used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. Subsequent data are mapped into this space so that their category can be predicted.
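
Note that `SVC()` here falls back to scikit-learn's default 'rbf' kernel, unlike the 'linear' kernel we chose earlier. As an optional sketch, assuming `X_train`, `y_train` and `C` from above, you could compare a few kernels with cross-validation before settling on one:

```python
# Hypothetical comparison; scores will vary with your data and split.
for kernel in ['linear', 'poly', 'rbf']:
    svc = SVC(kernel=kernel, C=C)
    scores = cross_val_score(svc, X_train, np.ravel(y_train), cv=5)
    print(f"{kernel}: mean cross-validated accuracy {scores.mean():.3f}")
```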

## Ensemble Classifiers

Let's follow the path to the very end, even though the previous test was quite good. Let's try some Ensemble Classifiers, specifically Random Forest and AdaBoost:

```python
'RFST': RandomForestClassifier(n_estimators=100),
'ADA': AdaBoostClassifier(n_estimators=100)
```

The result is very good, especially for Random Forest:

```
Accuracy (train) for RFST: 84.5%
              precision    recall  f1-score   support

     chinese       0.80      0.77      0.78       242
      indian       0.89      0.92      0.90       234
    japanese       0.86      0.84      0.85       254
      korean       0.88      0.83      0.85       242
        thai       0.80      0.87      0.83       227

    accuracy                           0.84      1199
   macro avg       0.85      0.85      0.84      1199
weighted avg       0.85      0.84      0.84      1199

Accuracy (train) for ADA: 72.4%
              precision    recall  f1-score   support

     chinese       0.64      0.49      0.56       242
      indian       0.91      0.83      0.87       234
    japanese       0.68      0.69      0.69       254
      korean       0.73      0.79      0.76       242
        thai       0.67      0.83      0.74       227

    accuracy                           0.72      1199
   macro avg       0.73      0.73      0.72      1199
weighted avg       0.73      0.72      0.72      1199
```

✅ Learn about [Ensemble Classifiers](https://scikit-learn.org/stable/modules/ensemble.html)

This method of Machine Learning "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Forest and AdaBoost.

- [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter sets the number of trees. One way to peek inside the fitted forest is shown in the sketch after this list.
- [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) fits a classifier to a dataset and then fits copies of that classifier to the same dataset, focusing on the weights of incorrectly classified items and adjusting the fit of the next classifier to correct them.
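
As a small optional sketch, assuming the training loop above has already fitted every entry in the `classifiers` dictionary, you can ask the Random Forest which ingredient columns it leaned on most:

```python
# Retrieve the fitted Random Forest from the classifiers dictionary.
rfst = classifiers['RFST']

# feature_importances_ has one score per column of X_train; show the top 10.
top_indices = np.argsort(rfst.feature_importances_)[::-1][:10]
for i in top_indices:
    print(f"{X_train.columns[i]}: {rfst.feature_importances_[i]:.3f}")
```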

## 🚀Challenge

Each of these techniques has a large number of parameters that you can tweak. Research each one's default parameters and think about what tweaking these parameters would mean for the model's quality.
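
A quick way to see those defaults (a sketch, not required by the lesson) is to ask any scikit-learn estimator for its current parameters:

```python
# Every scikit-learn estimator exposes its parameters via get_params().
print(SVC().get_params())
print(RandomForestClassifier().get_params())
print(KNeighborsClassifier().get_params())
```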

## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/22/)

## Review & Self Study

There's a lot of jargon in these lessons, so take a minute to review [this list](https://docs.microsoft.com/en-us/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-15963-cxa) of useful terminology!

## Assignment

[Parameter Play](assignment.md)

# Parameter Play

## Instructions

There are a lot of parameters that are set by default when working with these classifiers. Intellisense in VS Code can help you dig into them. Adopt one of the ML classification techniques in this lesson and retrain models, tweaking various parameter values. Build a notebook explaining why some changes help the model quality while others degrade it. Be detailed in your answer.
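
For example (a sketch only, with hypothetical parameter values, assuming the training and test splits from the lesson notebook): if you chose Random Forest, you might retrain it with a few `max_depth` settings and compare accuracy on the test split:

```python
# Hypothetical experiment: vary max_depth and watch how test accuracy responds.
for depth in [3, 5, 10, None]:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth)
    model.fit(X_train, np.ravel(y_train))
    print(depth, accuracy_score(y_test, model.predict(X_test)))
```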

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A notebook is presented with a classifier fully built up, its parameters tweaked, and the changes explained in textboxes | A notebook is partially presented or poorly explained | A notebook is buggy or flawed |