# Cuisine classifiers 2

In this second classification lesson, you will explore more ways to classify numeric data. You will also learn about the ramifications of choosing one classifier over another.

## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/23/)

### Prerequisite

We assume that you have completed the previous lessons and have a cleaned dataset called _cleaned_cuisine.csv_ in the `data` folder at the root of this 4-lesson folder.

### Preparation

We have loaded your _notebook.ipynb_ file with the cleaned dataset and have divided it into X and y dataframes, ready for the model building process.
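
If your notebook is missing that step, the two dataframes can be rebuilt along these lines. This is a minimal sketch; the file path and the `Unnamed: 0` index column are assumptions carried over from how the previous lesson saved _cleaned_cuisine.csv_:

```python
import pandas as pd

# Load the cleaned dataset produced in the previous lesson
cuisines_df = pd.read_csv("../data/cleaned_cuisine.csv")

# y: the label we want to predict (the cuisine)
cuisines_label_df = cuisines_df['cuisine']

# X: the ingredient columns; drop the label and the leftover index column
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
```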

Previously, you learned about the various options you have when classifying data.

![ML Map from Scikit-learn](images/map.png)

> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read documentation.

### The plan

This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:

- We have >50 samples
- We want to predict a category
- We have labeled data
- We have fewer than 100K samples
- ✨ We can choose a Linear SVC
- If that doesn't work, since we have numeric data
    - We can try a ✨ KNeighbors Classifier
    - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers

This is a very helpful trail to follow.

## Exercise - split the data

Following this path, we should start by importing some libraries to use.

1. Import the needed libraries:

    ```python
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
    import numpy as np
    ```

1. Split your training and test data:

    ```python
    X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
    ```
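
Note that `train_test_split` shuffles the rows randomly, so every run produces a slightly different split, and therefore slightly different scores than the ones printed below. If you want reproducible numbers, you can pin the shuffle; this is an optional sketch, and the `random_state` value and the decision to stratify are arbitrary choices rather than part of the lesson:

```python
# Optional: pin the shuffle so results are reproducible, and keep the
# proportion of each cuisine the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    cuisines_feature_df, cuisines_label_df,
    test_size=0.3, random_state=0, stratify=cuisines_label_df
)
```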

## Linear SVC classifier

Support-Vector classification (SVC) is a child of the Support-Vector machines family of ML techniques (learn more about these below). In this method, you can choose a 'kernel' to decide how to cluster the labels. The 'C' parameter refers to 'regularization', which regulates the influence of the model's parameters. The kernel can be one of [several](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC); here we set it to 'linear' to ensure that we leverage linear SVC. Probability defaults to 'false'; here we set it to 'true' to gather probability estimates. We set the random state to '0' so that the shuffling used for those probability estimates is reproducible.

### Exercise - apply a linear SVC

Start by creating a dictionary of classifiers. You will add to this dictionary progressively as we test.

1. Start with a Linear SVC:

    ```python
    C = 10
    # Create different classifiers.
    classifiers = {
        'Linear SVC': SVC(kernel='linear', C=C, probability=True, random_state=0)
    }
    ```

2. Train your model using the Linear SVC and print out a report:

    ```python
    n_classifiers = len(classifiers)

    for index, (name, classifier) in enumerate(classifiers.items()):
        classifier.fit(X_train, np.ravel(y_train))

        y_pred = classifier.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
        print(classification_report(y_test, y_pred))
    ```

    The result is pretty good:

    ```output
    Accuracy (train) for Linear SVC: 78.6%
                  precision    recall  f1-score   support

         chinese       0.71      0.67      0.69       242
          indian       0.88      0.86      0.87       234
        japanese       0.79      0.74      0.76       254
          korean       0.85      0.81      0.83       242
            thai       0.71      0.86      0.78       227

        accuracy                           0.79      1199
       macro avg       0.79      0.79      0.79      1199
    weighted avg       0.79      0.79      0.79      1199
    ```
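
The report above is a formatted string. If you would rather work with these metrics programmatically, say to compare classifiers later, `classification_report` can also return a nested dictionary via its `output_dict` parameter. A small sketch, using the `y_pred` from the loop above:

```python
# Get the same metrics as a dictionary instead of a printed table
report = classification_report(y_test, y_pred, output_dict=True)

# For example, pull out the f1-score for a single cuisine
print(report['indian']['f1-score'])
```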

## K-Neighbors classifier

K-Neighbors is part of the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, a predefined number of points is created and data are gathered around these points such that generalized labels can be predicted for the data.

### Exercise - apply the K-Neighbors classifier

The previous classifier was good, and worked well with the data, but maybe we can get better accuracy. Try a K-Neighbors classifier.

1. Add a line to your classifier dictionary (add a comma after the Linear SVC item):

    ```python
    'KNN classifier': KNeighborsClassifier(C),
    ```

    The result is a little worse:

    ```output
    Accuracy (train) for KNN classifier: 73.8%
                  precision    recall  f1-score   support

         chinese       0.64      0.67      0.66       242
          indian       0.86      0.78      0.82       234
        japanese       0.66      0.83      0.74       254
          korean       0.94      0.58      0.72       242
            thai       0.71      0.82      0.76       227

        accuracy                           0.74      1199
       macro avg       0.76      0.74      0.74      1199
    weighted avg       0.76      0.74      0.74      1199
    ```

    ✅ Learn about [K-Neighbors](https://scikit-learn.org/stable/modules/neighbors.html#neighbors)
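
One detail worth noticing: `KNeighborsClassifier(C)` reuses the `C` variable (10) as its first positional argument, which for this class is `n_neighbors`, not a regularization constant. The neighbor count is the natural parameter to experiment with here; a quick sketch (the candidate values are an arbitrary choice):

```python
# C (=10) silently becomes n_neighbors above; try explicit values instead
for k in [3, 5, 10, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, np.ravel(y_train))
    print("k=%d accuracy: %.3f" % (k, knn.score(X_test, np.ravel(y_test))))
```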

## Support Vector Classifier

Support-Vector classifiers are part of the [Support-Vector Machine](https://wikipedia.org/wiki/Support-vector_machine) family of ML methods that are used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. Subsequent data is mapped into this space so their category can be predicted.

### Exercise - apply a Support Vector Classifier

Let's try for a little better accuracy with a Support Vector Classifier.

1. Add a comma after the K-Neighbors item, and then add this line:

    ```python
    'SVC': SVC(),
    ```

    The result is quite good!

    ```output
    Accuracy (train) for SVC: 83.2%
                  precision    recall  f1-score   support

         chinese       0.79      0.74      0.76       242
          indian       0.88      0.90      0.89       234
        japanese       0.87      0.81      0.84       254
          korean       0.91      0.82      0.86       242
            thai       0.74      0.90      0.81       227

        accuracy                           0.83      1199
       macro avg       0.84      0.83      0.83      1199
    weighted avg       0.84      0.83      0.83      1199
    ```

    ✅ Learn about [Support-Vectors](https://scikit-learn.org/stable/modules/svm.html#svm)
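
As an aside, `cross_val_score` was imported at the top of the lesson but has not been used yet. Cross-validation gives a more stable accuracy estimate than a single train/test split; here is a minimal sketch applied to the default SVC (the choice of 5 folds is arbitrary):

```python
# Evaluate the default SVC on 5 different train/test folds
scores = cross_val_score(SVC(), cuisines_feature_df,
                         np.ravel(cuisines_label_df), cv=5)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```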

## Ensemble Classifiers

Let's follow the path to the very end, even though the previous test was quite good. Let's try some 'Ensemble Classifiers', specifically Random Forest and AdaBoost:

```python
'RFST': RandomForestClassifier(n_estimators=100),
'ADA': AdaBoostClassifier(n_estimators=100)
```

The result is very good, especially for Random Forest:

```output
Accuracy (train) for RFST: 84.5%
              precision    recall  f1-score   support
...
Accuracy (train) for ADA: 72.4%
...
   macro avg       0.73      0.73      0.72      1199
weighted avg       0.73      0.72      0.72      1199
```

✅ Learn about [Ensemble Classifiers](https://scikit-learn.org/stable/modules/ensemble.html)

This method of Machine Learning "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Forest and AdaBoost.

- [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The `n_estimators` parameter is set to the number of trees; a sketch of what you can do with the fitted forest follows this list.
- [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) fits a classifier to a dataset and then fits copies of that classifier to the same dataset. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.
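
Since a fitted Random Forest exposes a `feature_importances_` attribute, you can also peek at which ingredients the forest leaned on most heavily. A short sketch, assuming the classifiers were trained by the loop above and that pandas is imported as `pd`:

```python
# Rank ingredient columns by how much the forest relied on them
rfst = classifiers['RFST']
importances = pd.Series(rfst.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```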

---

## 🚀Challenge

Each of these techniques has a large number of parameters that you can tweak. Research each one's default parameters and think about what tweaking these parameters would mean for the model's quality.
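
A quick way to start: every scikit-learn estimator reports its current (mostly default) parameters through `get_params()`. A small sketch using the `classifiers` dictionary built in this lesson:

```python
# Print the parameters of each classifier we tried
for name, classifier in classifiers.items():
    print(name, classifier.get_params())
```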

## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/24/)

## Review & Self Study

There's a lot of jargon in these lessons, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-15963-cxa) of useful terminology!

## Assignment

[Parameter play](assignment.md)
