11 KiB
Cuisine classifiers 2
In this second classification lesson, you will explore additional methods for classifying numeric data. You will also learn about the implications of choosing one classifier over another.
Pre-lecture quiz
Prerequisite
We assume that you have completed the previous lessons and have a cleaned dataset in your data
folder named cleaned_cuisines.csv in the root of this 4-lesson folder.
Preparation
We have preloaded your notebook.ipynb file with the cleaned dataset and divided it into X and y dataframes, ready for the model-building process.
A classification map
Previously, you learned about the various options available for classifying data using Microsoft's cheat sheet. Scikit-learn provides a similar but more detailed cheat sheet that can help you further narrow down your choice of estimators (another term for classifiers):
Tip: visit this map online and click along the path to read documentation.
The plan
This map is very useful once you have a clear understanding of your data, as you can follow its paths to make a decision:
- We have >50 samples
- We want to predict a category
- We have labeled data
- We have fewer than 100K samples
- ✨ We can choose a Linear SVC
- If that doesn't work, since we have numeric data:
- We can try a ✨ KNeighbors Classifier
- If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers
- We can try a ✨ KNeighbors Classifier
This is a very helpful guide to follow.
Exercise - split the data
Following this path, we should start by importing some libraries for use.
-
Import the necessary libraries:
from sklearn.neighbors import KNeighborsClassifier from sklearn.linear_model import LogisticRegression from sklearn.svm import SVC from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier from sklearn.model_selection import train_test_split, cross_val_score from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve import numpy as np
-
Split your training and test data:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
Linear SVC classifier
Support-Vector Clustering (SVC) is part of the Support-Vector Machines family of ML techniques (learn more about these below). In this method, you can choose a 'kernel' to determine how to cluster the labels. The 'C' parameter refers to 'regularization,' which controls the influence of parameters. The kernel can be one of several; here, we set it to 'linear' to ensure we use linear SVC. Probability defaults to 'false'; here, we set it to 'true' to gather probability estimates. We set the random state to '0' to shuffle the data and obtain probabilities.
Exercise - apply a linear SVC
Start by creating an array of classifiers. You will add to this array progressively as we test.
-
Start with a Linear SVC:
C = 10 # Create different classifiers. classifiers = { 'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0) }
-
Train your model using the Linear SVC and print out a report:
n_classifiers = len(classifiers) for index, (name, classifier) in enumerate(classifiers.items()): classifier.fit(X_train, np.ravel(y_train)) y_pred = classifier.predict(X_test) accuracy = accuracy_score(y_test, y_pred) print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100)) print(classification_report(y_test,y_pred))
The result is quite good:
Accuracy (train) for Linear SVC: 78.6% precision recall f1-score support chinese 0.71 0.67 0.69 242 indian 0.88 0.86 0.87 234 japanese 0.79 0.74 0.76 254 korean 0.85 0.81 0.83 242 thai 0.71 0.86 0.78 227 accuracy 0.79 1199 macro avg 0.79 0.79 0.79 1199 weighted avg 0.79 0.79 0.79 1199
K-Neighbors classifier
K-Neighbors belongs to the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, a predefined number of points is created, and data is grouped around these points so that generalized labels can be predicted for the data.
Exercise - apply the K-Neighbors classifier
The previous classifier performed well with the data, but perhaps we can achieve better accuracy. Try a K-Neighbors classifier.
-
Add a line to your classifier array (add a comma after the Linear SVC item):
'KNN classifier': KNeighborsClassifier(C),
The result is slightly worse:
Accuracy (train) for KNN classifier: 73.8% precision recall f1-score support chinese 0.64 0.67 0.66 242 indian 0.86 0.78 0.82 234 japanese 0.66 0.83 0.74 254 korean 0.94 0.58 0.72 242 thai 0.71 0.82 0.76 227 accuracy 0.74 1199 macro avg 0.76 0.74 0.74 1199 weighted avg 0.76 0.74 0.74 1199
✅ Learn about K-Neighbors
Support Vector Classifier
Support-Vector Classifiers are part of the Support-Vector Machine family of ML methods used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. Subsequent data is mapped into this space so their category can be predicted.
Exercise - apply a Support Vector Classifier
Let's aim for slightly better accuracy with a Support Vector Classifier.
-
Add a comma after the K-Neighbors item, and then add this line:
'SVC': SVC(),
The result is quite good!
Accuracy (train) for SVC: 83.2% precision recall f1-score support chinese 0.79 0.74 0.76 242 indian 0.88 0.90 0.89 234 japanese 0.87 0.81 0.84 254 korean 0.91 0.82 0.86 242 thai 0.74 0.90 0.81 227 accuracy 0.83 1199 macro avg 0.84 0.83 0.83 1199 weighted avg 0.84 0.83 0.83 1199
✅ Learn about Support-Vectors
Ensemble Classifiers
Let's follow the path to the very end, even though the previous test performed well. Let's try some 'Ensemble Classifiers,' specifically Random Forest and AdaBoost:
'RFST': RandomForestClassifier(n_estimators=100),
'ADA': AdaBoostClassifier(n_estimators=100)
The result is excellent, especially for Random Forest:
Accuracy (train) for RFST: 84.5%
precision recall f1-score support
chinese 0.80 0.77 0.78 242
indian 0.89 0.92 0.90 234
japanese 0.86 0.84 0.85 254
korean 0.88 0.83 0.85 242
thai 0.80 0.87 0.83 227
accuracy 0.84 1199
macro avg 0.85 0.85 0.84 1199
weighted avg 0.85 0.84 0.84 1199
Accuracy (train) for ADA: 72.4%
precision recall f1-score support
chinese 0.64 0.49 0.56 242
indian 0.91 0.83 0.87 234
japanese 0.68 0.69 0.69 254
korean 0.73 0.79 0.76 242
thai 0.67 0.83 0.74 227
accuracy 0.72 1199
macro avg 0.73 0.73 0.72 1199
weighted avg 0.73 0.72 0.72 1199
✅ Learn about Ensemble Classifiers
This Machine Learning method "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Trees and AdaBoost.
-
Random Forest, an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter specifies the number of trees.
-
AdaBoost fits a classifier to a dataset and then fits copies of that classifier to the same dataset. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.
🚀Challenge
Each of these techniques has numerous parameters that you can adjust. Research the default parameters for each one and consider how tweaking these parameters might affect the model's quality.
Post-lecture quiz
Review & Self Study
There is a lot of terminology in these lessons, so take a moment to review this list of useful terms!
Assignment
Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.