Cuisine classifiers 2

In this second classification lesson, you will explore additional methods for classifying numeric data. You will also learn about the implications of choosing one classifier over another.

Pre-lecture quiz

Prerequisite

We assume that you have completed the previous lessons and have a cleaned dataset in your data folder named cleaned_cuisines.csv in the root of this 4-lesson folder.

Preparation

We have preloaded your notebook.ipynb file with the cleaned dataset and divided it into X and y dataframes, ready for the model-building process.

A classification map

Previously, you learned about the various options available for classifying data using Microsoft's cheat sheet. Scikit-learn provides a similar but more detailed cheat sheet that can help you further narrow down your choice of estimators (another term for classifiers):

Tip: visit this map online and click along the path to read documentation.

The plan

This map is very useful once you have a clear understanding of your data, as you can follow its paths to make a decision:

We have >50 samples
We want to predict a category
We have labeled data
We have fewer than 100K samples
✨ We can choose a Linear SVC
If that doesn't work, since we have numeric data:
- We can try a ✨ KNeighbors Classifier
  - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers

This is a very helpful guide to follow.

Exercise - split the data

Following this path, we should start by importing some libraries for use.

Import the necessary libraries:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
import numpy as np

Split your training and test data:

X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
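The split above holds back 30% of the rows for testing. As a minimal sketch on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the lesson's dataframes), the example below adds two optional arguments that the lesson does not use: `random_state`, which makes the split reproducible, and `stratify`, which keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the lesson's data: 100 rows, 4 binary features,
# and 5 cuisine labels with 20 rows each.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(100, 4))
y_demo = np.repeat(["chinese", "indian", "japanese", "korean", "thai"], 20)

# stratify=y_demo keeps each cuisine equally represented in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
test_counts = dict(zip(*np.unique(y_te, return_counts=True)))
print(test_counts)
```

Without `random_state`, each run produces a different split, so the accuracy figures you see may differ slightly from those printed in this lesson.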

Linear SVC classifier

Support-Vector Classification (SVC) is part of the Support-Vector Machines family of ML techniques (learn more about these below). In this method, you choose a 'kernel' that determines how the model separates the labels. The 'C' parameter controls regularization: the strength of the regularization is inversely proportional to C, so a larger C penalizes misclassified training points more heavily. The kernel can be one of several; here, we set it to 'linear' so that we use a linear SVC. The probability parameter defaults to False; here, we set it to True to gather probability estimates. We set random_state to 0 so that the data shuffling used to compute those probability estimates is reproducible.

Exercise - apply a linear SVC

Start by creating a dictionary of classifiers. You will add to this dictionary progressively as we test.

Start with a Linear SVC:

C = 10
# Create different classifiers.
classifiers = {
    'Linear SVC': SVC(kernel='linear', C=C, probability=True, random_state=0)
}

Train your model using the Linear SVC and print out a report:

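n_classifiers = len(classifiers)

for index, (name, classifier) in enumerate(classifiers.items()):
    # np.ravel flattens the label dataframe into the 1-D array that fit() expects
    classifier.fit(X_train, np.ravel(y_train))

    # Evaluate on the held-out test set. The printed label says "train",
    # but this accuracy is measured on X_test.
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
    print(classification_report(y_test, y_pred))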

The result is quite good:

Accuracy (train) for Linear SVC: 78.6% 
              precision    recall  f1-score   support

     chinese       0.71      0.67      0.69       242
      indian       0.88      0.86      0.87       234
    japanese       0.79      0.74      0.76       254
      korean       0.85      0.81      0.83       242
        thai       0.71      0.86      0.78       227

    accuracy                           0.79      1199
   macro avg       0.79      0.79      0.79      1199
weighted avg       0.79      0.79      0.79      1199
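Setting `probability=True` on the classifier is what makes `predict_proba` available after fitting. The sketch below illustrates this on synthetic data from `make_blobs` (an assumption for illustration; `X_demo`, `y_demo`, and `clf` are hypothetical names, not the lesson's variables). Each row of the result holds one probability per class.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic three-class data standing in for the cuisine features.
X_demo, y_demo = make_blobs(n_samples=150, centers=3, random_state=0)

# Same settings as the lesson's Linear SVC entry.
clf = SVC(kernel='linear', C=10, probability=True, random_state=0)
clf.fit(X_demo, y_demo)

# One row per sample, one column per class (columns follow clf.classes_).
proba = clf.predict_proba(X_demo[:5])
print(clf.classes_)
print(np.round(proba, 3))
```

Probability estimates add training time, because scikit-learn fits them with an internal cross-validation, so it is reasonable to leave the option off when you only need class labels.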

K-Neighbors classifier

K-Neighbors belongs to the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, the model finds a predefined number of training points closest to a new data point and predicts its label from the labels of those neighbors.
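To make the idea concrete, here is a minimal from-scratch sketch of a nearest-neighbor vote using only the standard library. It is not how scikit-learn implements `KNeighborsClassifier`; the function `knn_predict` and the toy data are hypothetical and exist only for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two numeric features per recipe.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["thai", "thai", "thai", "korean", "korean", "korean"]

print(knn_predict(train_points, train_labels, (1, 1)))  # thai
print(knn_predict(train_points, train_labels, (5, 4)))  # korean
```

The value of `k` matters: a small `k` follows the training data closely, while a large `k` smooths the decision by consulting more neighbors.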

Exercise - apply the K-Neighbors classifier

The previous classifier performed well with the data, but perhaps we can achieve better accuracy. Try a K-Neighbors classifier.

Add a line to your classifiers dictionary (add a comma after the Linear SVC item). It reuses C (10) as the number of neighbors:

'KNN classifier': KNeighborsClassifier(n_neighbors=C),

The result is slightly worse:

Accuracy (train) for KNN classifier: 73.8% 
              precision    recall  f1-score   support

     chinese       0.64      0.67      0.66       242
      indian       0.86      0.78      0.82       234
    japanese       0.66      0.83      0.74       254
      korean       0.94      0.58      0.72       242
        thai       0.71      0.82      0.76       227

    accuracy                           0.74      1199
   macro avg       0.76      0.74      0.74      1199
weighted avg       0.76      0.74      0.74      1199
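In the report above, korean has high precision but low recall: the model is usually right when it predicts korean, yet it misses many korean recipes. A confusion matrix shows where the missed items went. The sketch below uses a tiny hand-written set of labels (hypothetical, not the lesson's predictions) with `confusion_matrix`, which the lesson already imports.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for ten recipes.
labels = ["japanese", "korean"]
y_true = ["korean"] * 5 + ["japanese"] * 5
y_pred = ["korean", "korean", "korean", "japanese", "japanese"] + ["japanese"] * 5

# Rows are true labels, columns are predicted labels, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Korean precision: correct korean predictions / all korean predictions.
korean_precision = cm[1, 1] / cm[:, 1].sum()
# Korean recall: correct korean predictions / all true korean recipes.
korean_recall = cm[1, 1] / cm[1, :].sum()
print(korean_precision, korean_recall)
```

In your notebook, you could call `confusion_matrix(y_test, y_pred)` inside the training loop to see which cuisines each classifier confuses.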

✅ Learn about K-Neighbors

Support Vector Classifier

Support-Vector Classifiers are part of the Support-Vector Machine family of ML methods used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. New data is then mapped into the same space so that its category can be predicted.
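When `SVC()` is created with no arguments, scikit-learn uses the 'rbf' kernel rather than the 'linear' kernel used earlier. The sketch below fits a default `SVC` on synthetic two-class data (`X_demo`, `y_demo`, and `svc` are hypothetical names for illustration) and inspects the support vectors, the training points that define the boundary between categories.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data standing in for the cuisine features.
X_demo, y_demo = make_blobs(n_samples=60, centers=2, random_state=0)

svc = SVC()  # all defaults, as in the lesson's 'SVC' entry
svc.fit(X_demo, y_demo)

print(svc.kernel)                  # default kernel name
print(svc.support_vectors_.shape)  # (number of support vectors, number of features)
print(svc.n_support_)              # support vectors per class
```

Only the support vectors influence the fitted boundary, which is why the number reported here is typically smaller than the size of the training set.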

Exercise - apply a Support Vector Classifier

Let's aim for slightly better accuracy with a Support Vector Classifier.

Add a comma after the K-Neighbors item, and then add this line:

'SVC': SVC(),

The result is quite good!

Accuracy (train) for SVC: 83.2% 
              precision    recall  f1-score   support

     chinese       0.79      0.74      0.76       242
      indian       0.88      0.90      0.89       234
    japanese       0.87      0.81      0.84       254
      korean       0.91      0.82      0.86       242
        thai       0.74      0.90      0.81       227

    accuracy                           0.83      1199
   macro avg       0.84      0.83      0.83      1199
weighted avg       0.84      0.83      0.83      1199

✅ Learn about Support-Vectors

Ensemble Classifiers

Let's follow the path to the very end, even though the previous test performed well. Let's try some 'Ensemble Classifiers,' specifically Random Forest and AdaBoost:

  'RFST': RandomForestClassifier(n_estimators=100),
  'ADA': AdaBoostClassifier(n_estimators=100)

Random Forest gives the best result so far, while AdaBoost scores lowest of the classifiers tried:

Accuracy (train) for RFST: 84.5% 
              precision    recall  f1-score   support

     chinese       0.80      0.77      0.78       242
      indian       0.89      0.92      0.90       234
    japanese       0.86      0.84      0.85       254
      korean       0.88      0.83      0.85       242
        thai       0.80      0.87      0.83       227

    accuracy                           0.84      1199
   macro avg       0.85      0.85      0.84      1199
weighted avg       0.85      0.84      0.84      1199

Accuracy (train) for ADA: 72.4% 
              precision    recall  f1-score   support

     chinese       0.64      0.49      0.56       242
      indian       0.91      0.83      0.87       234
    japanese       0.68      0.69      0.69       254
      korean       0.73      0.79      0.76       242
        thai       0.67      0.83      0.74       227

    accuracy                           0.72      1199
   macro avg       0.73      0.73      0.72      1199
weighted avg       0.73      0.72      0.72      1199

✅ Learn about Ensemble Classifiers

This Machine Learning method "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Forest and AdaBoost.

Random Forest, an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter specifies the number of trees.
AdaBoost fits a classifier to a dataset and then fits additional copies of that classifier to the same dataset, increasing the weights of incorrectly classified items so that each subsequent classifier focuses more on the difficult cases.
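The difference between the two approaches can be seen by inspecting fitted models. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration; `X_demo`, `y_demo`, `forest`, and `boost` are hypothetical names). It checks that the forest's class probabilities are the average of its trees' probabilities, and it shows that AdaBoost keeps a sequence of estimators, each with its own weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)
boost = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Averaging: the forest's class probabilities are the mean over its trees.
sample = X_demo[:5]
tree_mean = np.mean([tree.predict_proba(sample) for tree in forest.estimators_], axis=0)
forest_proba = forest.predict_proba(sample)
print(np.allclose(forest_proba, tree_mean))

# Boosting: estimators are fitted one after another, each with its own weight.
print(len(boost.estimators_), boost.estimator_weights_[:3])
```

AdaBoost may stop before reaching `n_estimators` if it fits the training data perfectly, so the number of fitted estimators can be smaller than the value you request.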

🚀Challenge

Each of these techniques has numerous parameters that you can adjust. Research the default parameters for each one and consider how tweaking these parameters might affect the model's quality.
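One way to explore parameters systematically is a grid search with cross-validation. The lesson does not use it, so treat the following as an optional sketch: `GridSearchCV` is a real scikit-learn class, but the parameter grid, the synthetic data, and the names `X_demo`, `y_demo`, and `search` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Try every combination of these K-Neighbors settings with 5-fold cross-validation.
param_grid = {"n_neighbors": [3, 5, 10], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern works for the other classifiers in this lesson: swap in the estimator and list the parameters you want to compare, such as `C` and `kernel` for `SVC` or `n_estimators` for `RandomForestClassifier`.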

Post-lecture quiz

Review & Self Study

There is a lot of terminology in these lessons, so take a moment to review this list of useful terms!

Assignment

Parameter play

Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
