Cuisine classifiers 2

In this second classification lesson, you will explore additional methods for classifying numeric data. You will also learn about the implications of choosing one classifier over another.

Pre-lecture quiz

Prerequisite

We assume that you have completed the previous lessons and have a cleaned dataset named cleaned_cuisines.csv in the data folder at the root of this four-lesson folder.

Preparation

We have preloaded your notebook.ipynb file with the cleaned dataset and divided it into X and y dataframes, ready for the model-building process.
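As a rough sketch of what that split might look like, the snippet below builds a tiny in-memory stand-in for cleaned_cuisines.csv and separates features from labels. The `cuisine` column name, the ingredient columns, and the rows are all assumptions made for illustration; only the `cuisines_feature_df` and `cuisines_label_df` names come from this lesson.

```python
import pandas as pd

# Hypothetical stand-in for cleaned_cuisines.csv: one row per recipe,
# one 0/1 column per ingredient, plus an assumed 'cuisine' label column.
cuisines_df = pd.DataFrame({
    "cuisine":   ["thai", "indian", "korean", "thai"],
    "almond":    [0, 1, 0, 0],
    "rice":      [1, 1, 1, 0],
    "soy_sauce": [1, 0, 1, 1],
})

# y: the label we want to predict.
cuisines_label_df = cuisines_df["cuisine"]
# X: every remaining (numeric) column.
cuisines_feature_df = cuisines_df.drop(columns=["cuisine"])

print(cuisines_feature_df.shape)
print(cuisines_label_df.unique())
```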

A classification map

Previously, you learned about the various options available for classifying data using Microsoft's cheat sheet. Scikit-learn provides a similar but more detailed cheat sheet that can help you further narrow down your choice of estimators (scikit-learn's term for models such as classifiers):

ML Map from Scikit-learn

Tip: visit this map online and click along the path to read documentation.

The plan

This map is very useful once you have a clear understanding of your data, as you can follow its paths to make a decision:

  • We have >50 samples
  • We want to predict a category
  • We have labeled data
  • We have fewer than 100K samples
  • We can choose a Linear SVC
  • If that doesn't work, since we have numeric data:
    • We can try a KNeighbors Classifier
      • If that doesn't work, try SVC and Ensemble Classifiers

This is a very helpful guide to follow.
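The path listed above can be written down as a small helper, purely as an illustration. `suggest_estimators` is a hypothetical function, not part of scikit-learn, and it encodes only this one path rather than the whole map; the sample count in the usage line is an illustrative figure.

```python
def suggest_estimators(n_samples, predicting_category, labeled, numeric_data):
    """Return candidate estimators, in the order the lesson tries them.

    Hypothetical helper: it only encodes the single path listed above,
    not the full scikit-learn map.
    """
    on_path = (
        n_samples > 50
        and predicting_category
        and labeled
        and n_samples < 100_000
    )
    if not on_path:
        return []  # other branches of the map are out of scope here

    candidates = ["Linear SVC"]
    if numeric_data:
        # Fallbacks to try, in order, if the Linear SVC does not work well.
        candidates += ["KNeighbors Classifier", "SVC", "Ensemble Classifiers"]
    return candidates


print(suggest_estimators(4000, True, True, True))
```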

Exercise - split the data

Following this path, start by importing the libraries you will need.

  1. Import the necessary libraries:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
    import numpy as np
    
  2. Split your training and test data:

    X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
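The call above does not set `random_state`, so each run draws a different split and the scores reported later will vary slightly from run to run. The sketch below uses synthetic stand-in data (not the cuisine dataset) to show one way to make the split repeatable; `random_state=0` and `stratify=y` are optional additions, not settings this lesson uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 4 binary features, 5 balanced classes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 4))
y = np.repeat(["chinese", "indian", "japanese", "korean", "thai"], 20)

# random_state makes the split repeatable; stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.unique(y_test, return_counts=True))
```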
    

Linear SVC classifier

Support-Vector Classification (SVC) is part of the Support-Vector Machines family of ML techniques (learn more about these below). In this method, you choose a 'kernel' that determines how the classes are separated. The 'C' parameter controls 'regularization': in scikit-learn, a smaller C means stronger regularization, and a larger C lets the model fit the training data more closely. The kernel can be one of several; here, we set it to 'linear' so that we use a linear SVC. The probability parameter defaults to False; here, we set it to True to gather probability estimates. We set random_state to 0 so that the internal data shuffling used to produce those probability estimates is repeatable.
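To see what `probability=True` provides, here is a minimal sketch on synthetic stand-in data. The blob centers are arbitrary assumptions; the classifier settings mirror the ones described above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic stand-in data: 3 classes in 2D around arbitrary centers.
X, y = make_blobs(
    n_samples=150, centers=[[0, 0], [5, 5], [0, 5]], cluster_std=1.0, random_state=0
)

# Linear kernel, C=10, probability estimates switched on.
clf = SVC(kernel="linear", C=10, probability=True, random_state=0)
clf.fit(X, y)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X[:3])
print(clf.classes_)
print(proba.round(3))
print(np.allclose(proba.sum(axis=1), 1.0))
```

Without `probability=True`, calling `predict_proba` is not available on an SVC; `predict` still works either way.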

Exercise - apply a linear SVC

Start by creating a dictionary of classifiers. You will add to this dictionary progressively as we test.

  1. Start with a Linear SVC:

    C = 10
    # Create different classifiers.
    classifiers = {
        'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0)
    }
    
  2. Train your model using the Linear SVC and print out a report:

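    n_classifiers = len(classifiers)
    
    for index, (name, classifier) in enumerate(classifiers.items()):
        # np.ravel flattens the labels into the 1-D array that fit() expects
        classifier.fit(X_train, np.ravel(y_train))
    
        # Evaluate on the held-out test split
        y_pred = classifier.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        # The label says "train", but this accuracy is measured on the test split
        print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
        print(classification_report(y_test,y_pred))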
    

    The result is quite good:

    Accuracy (train) for Linear SVC: 78.6% 
                  precision    recall  f1-score   support
    
         chinese       0.71      0.67      0.69       242
          indian       0.88      0.86      0.87       234
        japanese       0.79      0.74      0.76       254
          korean       0.85      0.81      0.83       242
            thai       0.71      0.86      0.78       227
    
        accuracy                           0.79      1199
       macro avg       0.79      0.79      0.79      1199
    weighted avg       0.79      0.79      0.79      1199
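To make the report's columns concrete, the sketch below derives precision, recall, and f1-score from a confusion matrix. The labels are hypothetical toy values, not taken from the cuisine dataset. The support column is simply the number of true samples in each class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels, not taken from the cuisine dataset.
y_true = ["thai", "thai", "thai", "korean", "korean", "indian"]
y_pred = ["thai", "thai", "korean", "korean", "korean", "indian"]
labels = ["indian", "korean", "thai"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# precision = correct / everything predicted as that class (column sum)
precision = cm.diagonal() / cm.sum(axis=0)
# recall = correct / everything that truly is that class (row sum)
recall = cm.diagonal() / cm.sum(axis=1)
# f1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(precision.round(2), recall.round(2), f1.round(2))
```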
    

K-Neighbors classifier

K-Neighbors belongs to the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, the classifier finds a predefined number of training samples closest to a new data point and predicts that point's label from the labels of those neighbors.
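The idea can be sketched by hand in a few lines and compared with scikit-learn's classifier. The toy points, the labels, and the `knn_predict` helper are hypothetical and exist only for this illustration.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two small groups of labeled 2D points.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])


def knn_predict(query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]


query = np.array([0.5, 0.5])
by_hand = knn_predict(query)

# scikit-learn's classifier applies the same idea.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
by_sklearn = knn.predict([query])[0]

print(by_hand, by_sklearn)
```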

Exercise - apply the K-Neighbors classifier

The previous classifier performed well with the data, but perhaps we can achieve better accuracy. Try a K-Neighbors classifier.

  1. Add a line to your classifiers dictionary (add a comma after the Linear SVC item):

    'KNN classifier': KNeighborsClassifier(C),  # C (10) is passed as n_neighbors here
    

    The result is slightly worse:

    Accuracy (train) for KNN classifier: 73.8% 
                  precision    recall  f1-score   support
    
         chinese       0.64      0.67      0.66       242
          indian       0.86      0.78      0.82       234
        japanese       0.66      0.83      0.74       254
          korean       0.94      0.58      0.72       242
            thai       0.71      0.82      0.76       227
    
        accuracy                           0.74      1199
       macro avg       0.76      0.74      0.74      1199
    weighted avg       0.76      0.74      0.74      1199
    

    Learn about K-Neighbors

Support Vector Classifier

Support-Vector Classifiers are part of the Support-Vector Machine family of ML methods used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. Subsequent data is mapped into this space so that its category can be predicted.
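A minimal sketch of "maximizing the distance between two categories", using hypothetical toy points and a linear kernel so that the margin can be read off directly. The data and the large C value are assumptions chosen to keep the example easy to check by hand.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable groups of 2D points.
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1000).fit(X, y)

# For a linear kernel, the margin width is 2 / ||w||.
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)

print(clf.support_vectors_)  # the training points that define the boundary
print(round(margin, 2))
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```

New points are classified by which side of the boundary they fall on, which is what the final `predict` call shows.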

Exercise - apply a Support Vector Classifier

Let's aim for slightly better accuracy with a Support Vector Classifier.

  1. Add a comma after the K-Neighbors item, and then add this line:

    'SVC': SVC(),
    

    The result is quite good!

    Accuracy (train) for SVC: 83.2% 
                  precision    recall  f1-score   support
    
         chinese       0.79      0.74      0.76       242
          indian       0.88      0.90      0.89       234
        japanese       0.87      0.81      0.84       254
          korean       0.91      0.82      0.86       242
            thai       0.74      0.90      0.81       227
    
        accuracy                           0.83      1199
       macro avg       0.84      0.83      0.83      1199
    weighted avg       0.84      0.83      0.83      1199
    

    Learn about Support-Vectors

Ensemble Classifiers

Let's follow the path to the very end, even though the previous test performed well. Let's try some 'Ensemble Classifiers,' specifically Random Forest and AdaBoost:

  'RFST': RandomForestClassifier(n_estimators=100),
  'ADA': AdaBoostClassifier(n_estimators=100)

Random Forest gives the best result so far (84.5%), while AdaBoost (72.4%) scores lowest of the classifiers tried:

Accuracy (train) for RFST: 84.5% 
              precision    recall  f1-score   support

     chinese       0.80      0.77      0.78       242
      indian       0.89      0.92      0.90       234
    japanese       0.86      0.84      0.85       254
      korean       0.88      0.83      0.85       242
        thai       0.80      0.87      0.83       227

    accuracy                           0.84      1199
   macro avg       0.85      0.85      0.84      1199
weighted avg       0.85      0.84      0.84      1199

Accuracy (train) for ADA: 72.4% 
              precision    recall  f1-score   support

     chinese       0.64      0.49      0.56       242
      indian       0.91      0.83      0.87       234
    japanese       0.68      0.69      0.69       254
      korean       0.73      0.79      0.76       242
        thai       0.67      0.83      0.74       227

    accuracy                           0.72      1199
   macro avg       0.73      0.73      0.72      1199
weighted avg       0.73      0.72      0.72      1199

Learn about Ensemble Classifiers

This Machine Learning method "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Forest and AdaBoost.

  • Random Forest, an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter specifies the number of trees.

  • AdaBoost fits a classifier to a dataset and then fits additional copies of that classifier to the same dataset. At each round it increases the weights of incorrectly classified items, so that the next classifier focuses on correcting them.
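A small sketch comparing the two ensemble methods on synthetic stand-in data. The scores it prints say nothing about the cuisine dataset, and `n_estimators=25`, `random_state=0`, and 5-fold cross-validation are assumptions chosen to keep the example quick and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; scores here say nothing about the cuisine dataset.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

models = {
    "RFST": RandomForestClassifier(n_estimators=25, random_state=0),
    "ADA": AdaBoostClassifier(n_estimators=25, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation gives a steadier estimate than a single split.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")

# n_estimators is the number of trees a fitted forest holds.
forest = models["RFST"].fit(X, y)
print(len(forest.estimators_))
```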


🚀Challenge

Each of these techniques has numerous parameters that you can adjust. Research the default parameters for each one and consider how tweaking these parameters might affect the model's quality.
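One way to start that research is to ask each estimator for its own parameters. The sketch below uses `get_params()`, which every scikit-learn estimator provides; the dictionary keys simply reuse the names from this lesson.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

estimators = {
    "SVC": SVC(),
    "KNN classifier": KNeighborsClassifier(),
    "RFST": RandomForestClassifier(),
    "ADA": AdaBoostClassifier(),
}

# get_params() returns every constructor parameter with its current value;
# on a freshly created estimator those are the defaults.
defaults = {name: est.get_params() for name, est in estimators.items()}

for name, params in defaults.items():
    print(name)
    for key, value in sorted(params.items()):
        print(f"  {key} = {value!r}")
```

From there, you could change one parameter at a time (for example `n_neighbors` or `C`) and rerun the training loop to see how the report changes.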

Post-lecture quiz

Review & Self Study

There is a lot of terminology in these lessons, so take a moment to review this list of useful terms!

Assignment

Parameter play


Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.