ML-For-Beginners/4-Classification/3-Classifiers-2/README.md

# Cuisine classifiers 2

In this second classification lesson, you will explore more ways to classify numeric data. You will also learn about the ramifications for choosing one classifier over the other.

## [Pre-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/23/)

### Prerequisite

We assume that you have completed the previous lessons and have a cleaned dataset in your `data` folder called _cleaned_cuisines.csv_ in the root of this 4-lesson folder.

### Preparation

We have loaded your _notebook.ipynb_ file with the cleaned dataset and have divided it into X and y dataframes, ready for the model building process.

## A classification map

Previously, you learned about the various options you have when classifying data using Microsoft's cheat sheet. Scikit-learn offers a similar, but more granular cheat sheet that can further help narrow down your estimators (another term for classifiers):

![ML Map from Scikit-learn](images/map.png)
> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read documentation.

### The plan

This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:

- We have >50 samples
- We want to predict a category
- We have labeled data
- We have fewer than 100K samples
- ✨ We can choose a Linear SVC
- If that doesn't work, since we have numeric data
    - We can try a ✨ KNeighbors Classifier 
      - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers

This is a very helpful trail to follow.

## Exercise - split the data

Following this path, we should start by importing some libraries to use.

1. Import the needed libraries:

    ```python
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
    import numpy as np
    ```

1. Split your training and test data:

    ```python
    X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
    ```

## Linear SVC classifier

Support-Vector clustering (SVC) is a child of the Support-Vector machines family of ML techniques (learn more about these below). In this method, you can choose a 'kernel' to decide how to cluster the labels. The 'C' parameter refers to 'regularization' which regulates the influence of parameters. The kernel can be one of [several](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC); here we set it to 'linear' to ensure that we leverage linear SVC. Probability defaults to 'false'; here we set it to 'true' to gather probability estimates. We set the random state to '0' to shuffle the data to get probabilities.

### Exercise - apply a linear SVC

Start by creating an array of classifiers. You will add progressively to this array as we test. 

1. Start with a Linear SVC:

    ```python
    C = 10
    # Create different classifiers.
    classifiers = {
        'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0)
    }
    ```

2. Train your model using the Linear SVC and print out a report:

    ```python
    n_classifiers = len(classifiers)
    
    for index, (name, classifier) in enumerate(classifiers.items()):
        classifier.fit(X_train, np.ravel(y_train))
    
        y_pred = classifier.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
        print(classification_report(y_test,y_pred))
    ```

    The result is pretty good:

    ```output
    Accuracy (train) for Linear SVC: 78.6% 
                  precision    recall  f1-score   support
    
         chinese       0.71      0.67      0.69       242
          indian       0.88      0.86      0.87       234
        japanese       0.79      0.74      0.76       254
          korean       0.85      0.81      0.83       242
            thai       0.71      0.86      0.78       227
    
        accuracy                           0.79      1199
       macro avg       0.79      0.79      0.79      1199
    weighted avg       0.79      0.79      0.79      1199
    ```

## K-Neighbors classifier

K-Neighbors is part of the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, a predefined number of points is created and data are gathered around these points such that generalized labels can be predicted for the data.

### Exercise - apply the K-Neighbors classifier

The previous classifier was good, and worked well with the data, but maybe we can get better accuracy. Try a K-Neighbors classifier.

1. Add a line to your classifier array (add a comma after the Linear SVC item):

    ```python
    'KNN classifier': KNeighborsClassifier(C),
    ```

    The result is a little worse:

    ```output
    Accuracy (train) for KNN classifier: 73.8% 
                  precision    recall  f1-score   support
    
         chinese       0.64      0.67      0.66       242
          indian       0.86      0.78      0.82       234
        japanese       0.66      0.83      0.74       254
          korean       0.94      0.58      0.72       242
            thai       0.71      0.82      0.76       227
    
        accuracy                           0.74      1199
       macro avg       0.76      0.74      0.74      1199
    weighted avg       0.76      0.74      0.74      1199
    ```

    ✅ Learn about [K-Neighbors](https://scikit-learn.org/stable/modules/neighbors.html#neighbors)

## Support Vector Classifier

Support-Vector classifiers are part of the [Support-Vector Machine](https://wikipedia.org/wiki/Support-vector_machine) family of ML methods that are used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. Subsequent data is mapped into this space so their category can be predicted.

### Exercise - apply a Support Vector Classifier

Let's try for a little better accuracy with a Support Vector Classifier.

1. Add a comma after the K-Neighbors item, and then add this line:

    ```python
    'SVC': SVC(),
    ```

    The result is quite good!

    ```output
    Accuracy (train) for SVC: 83.2% 
                  precision    recall  f1-score   support
    
         chinese       0.79      0.74      0.76       242
          indian       0.88      0.90      0.89       234
        japanese       0.87      0.81      0.84       254
          korean       0.91      0.82      0.86       242
            thai       0.74      0.90      0.81       227
    
        accuracy                           0.83      1199
       macro avg       0.84      0.83      0.83      1199
    weighted avg       0.84      0.83      0.83      1199
    ```

    ✅ Learn about [Support-Vectors](https://scikit-learn.org/stable/modules/svm.html#svm)

## Ensemble Classifiers

Let's follow the path to the very end, even though the previous test was quite good. Let's try some 'Ensemble Classifiers, specifically Random Forest and AdaBoost:

```python
  'RFST': RandomForestClassifier(n_estimators=100),
  'ADA': AdaBoostClassifier(n_estimators=100)
```

The result is very good, especially for Random Forest:

```output
Accuracy (train) for RFST: 84.5% 
              precision    recall  f1-score   support

     chinese       0.80      0.77      0.78       242
      indian       0.89      0.92      0.90       234
    japanese       0.86      0.84      0.85       254
      korean       0.88      0.83      0.85       242
        thai       0.80      0.87      0.83       227

    accuracy                           0.84      1199
   macro avg       0.85      0.85      0.84      1199
weighted avg       0.85      0.84      0.84      1199

Accuracy (train) for ADA: 72.4% 
              precision    recall  f1-score   support

     chinese       0.64      0.49      0.56       242
      indian       0.91      0.83      0.87       234
    japanese       0.68      0.69      0.69       254
      korean       0.73      0.79      0.76       242
        thai       0.67      0.83      0.74       227

    accuracy                           0.72      1199
   macro avg       0.73      0.73      0.72      1199
weighted avg       0.73      0.72      0.72      1199
```

✅ Learn about [Ensemble Classifiers](https://scikit-learn.org/stable/modules/ensemble.html)

This method of Machine Learning "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Trees and AdaBoost. 

- [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter is set to the number of trees.

- [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) fits a classifier to a dataset and then fits copies of that classifier to the same dataset. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.

---

## 🚀Challenge

Each of these techniques has a large number of parameters that you can tweak. Research each one's default parameters and think about what tweaking these parameters would mean for the model's quality.

## [Post-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/24/)

## Review & Self Study

There's a lot of jargon in these lessons, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-15963-cxa) of useful terminology!

## Assignment 

[Parameter play](assignment.md)
classification 3 audit 3 years ago			`# Cuisine classifiers 2`
lessons 4 years ago
editorial tweaking 3 years ago			`In this second classification lesson, you will explore more ways to classify numeric data. You will also learn about the ramifications for choosing one classifier over the other.`
editorial 3 years ago
path to quizzes needs to change 3 years ago			`## [Pre-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/23/)`
lessons 4 years ago
classifier 3! 3 years ago			`### Prerequisite`

changing classification file name to cleaned_cuisines 3 years ago			We assume that you have completed the previous lessons and have a cleaned dataset in your `data` folder called _cleaned_cuisines.csv_ in the root of this 4-lesson folder.
editorial 3 years ago
classifier 3! 3 years ago			`### Preparation`

classification 3 audit 3 years ago			`We have loaded your _notebook.ipynb_ file with the cleaned dataset and have divided it into X and y dataframes, ready for the model building process.`
classifier 3! 3 years ago
classification 3 audit 3 years ago			`## A classification map`
classifier 3! 3 years ago
Scikit-learn spelling audit 3 years ago			`Previously, you learned about the various options you have when classifying data using Microsoft's cheat sheet. Scikit-learn offers a similar, but more granular cheat sheet that can further help narrow down your estimators (another term for classifiers):`
classification 2 3 years ago
Scikit-learn spelling audit 3 years ago			`![ML Map from Scikit-learn](images/map.png)`
classifier 3! 3 years ago			`> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read documentation.`
classification 2 3 years ago
editorial 3 years ago			`### The plan`

classifier 3! 3 years ago			`This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:`
classification 2 3 years ago
			`- We have >50 samples`
			`- We want to predict a category`
			`- We have labeled data`
			`- We have fewer than 100K samples`
classifier 3! 3 years ago			`- ✨ We can choose a Linear SVC`
classification 2 3 years ago			`- If that doesn't work, since we have numeric data`
classifier 3! 3 years ago			`- We can try a ✨ KNeighbors Classifier`
			`- If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers`

editorial 3 years ago			`This is a very helpful trail to follow.`
classifier 3! 3 years ago
editorial 3 years ago			`## Exercise - split the data`
classification 2 3 years ago
editorial 3 years ago			`Following this path, we should start by importing some libraries to use.`
lessons 4 years ago
editorial 3 years ago			`1. Import the needed libraries:`
lessons 4 years ago
editorial 3 years ago			```python
			`from sklearn.neighbors import KNeighborsClassifier`
			`from sklearn.linear_model import LogisticRegression`
			`from sklearn.svm import SVC`
			`from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier`
			`from sklearn.model_selection import train_test_split, cross_val_score`
			`from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve`
			`import numpy as np`
			```
lessons 4 years ago
editorial 3 years ago			`1. Split your training and test data:`
lessons 4 years ago
editorial 3 years ago			```python
			`X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)`
			```
lessons 4 years ago
editorial 3 years ago			`## Linear SVC classifier`
lessons 4 years ago
editorial 3 years ago			Support-Vector clustering (SVC) is a child of the Support-Vector machines family of ML techniques (learn more about these below). In this method, you can choose a 'kernel' to decide how to cluster the labels. The 'C' parameter refers to 'regularization' which regulates the influence of parameters. The kernel can be one of [several](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC); here we set it to 'linear' to ensure that we leverage linear SVC. Probability defaults to 'false'; here we set it to 'true' to gather probability estimates. We set the random state to '0' to shuffle the data to get probabilities.
classifier 3! 3 years ago
editorial 3 years ago			`### Exercise - apply a linear SVC`

			`Start by creating an array of classifiers. You will add progressively to this array as we test.`

			`1. Start with a Linear SVC:`

			```python
			`C = 10`
			`# Create different classifiers.`
			`classifiers = {`
			`'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0)`
			`}`
			```

			`2. Train your model using the Linear SVC and print out a report:`

			```python
			`n_classifiers = len(classifiers)`

			`for index, (name, classifier) in enumerate(classifiers.items()):`
			`classifier.fit(X_train, np.ravel(y_train))`

			`y_pred = classifier.predict(X_test)`
			`accuracy = accuracy_score(y_test, y_pred)`
			`print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))`
			`print(classification_report(y_test,y_pred))`
			```

			`The result is pretty good:`

			```output
			`Accuracy (train) for Linear SVC: 78.6%`
			`precision recall f1-score support`

			`chinese 0.71 0.67 0.69 242`
			`indian 0.88 0.86 0.87 234`
			`japanese 0.79 0.74 0.76 254`
			`korean 0.85 0.81 0.83 242`
			`thai 0.71 0.86 0.78 227`

			`accuracy 0.79 1199`
			`macro avg 0.79 0.79 0.79 1199`
			`weighted avg 0.79 0.79 0.79 1199`
			```
classifier 3! 3 years ago
editorial 3 years ago			`## K-Neighbors classifier`
classifier 3! 3 years ago
editorial 3 years ago			`K-Neighbors is part of the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, a predefined number of points is created and data are gathered around these points such that generalized labels can be predicted for the data.`
lessons 4 years ago
editorial 3 years ago			`### Exercise - apply the K-Neighbors classifier`
lessons 4 years ago
editorial 3 years ago			`The previous classifier was good, and worked well with the data, but maybe we can get better accuracy. Try a K-Neighbors classifier.`
lessons 4 years ago
editorial 3 years ago			`1. Add a line to your classifier array (add a comma after the Linear SVC item):`
lessons 4 years ago
editorial 3 years ago			```python
			`'KNN classifier': KNeighborsClassifier(C),`
			```
classifier 3! 3 years ago
editorial 3 years ago			`The result is a little worse:`
classifier 3! 3 years ago
editorial 3 years ago			```output
			`Accuracy (train) for KNN classifier: 73.8%`
			`precision recall f1-score support`

			`chinese 0.64 0.67 0.66 242`
			`indian 0.86 0.78 0.82 234`
			`japanese 0.66 0.83 0.74 254`
			`korean 0.94 0.58 0.72 242`
			`thai 0.71 0.82 0.76 227`

			`accuracy 0.74 1199`
			`macro avg 0.76 0.74 0.74 1199`
			`weighted avg 0.76 0.74 0.74 1199`
			```

			`✅ Learn about [K-Neighbors](https://scikit-learn.org/stable/modules/neighbors.html#neighbors)`
lessons 4 years ago
classifier 3! 3 years ago			`## Support Vector Classifier`
lessons 4 years ago
editorial 3 years ago			`Support-Vector classifiers are part of the [Support-Vector Machine](https://wikipedia.org/wiki/Support-vector_machine) family of ML methods that are used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. Subsequent data is mapped into this space so their category can be predicted.`
lessons 4 years ago
editorial 3 years ago			`### Exercise - apply a Support Vector Classifier`
lessons 4 years ago
editorial 3 years ago			`Let's try for a little better accuracy with a Support Vector Classifier.`
classifier 3! 3 years ago
editorial 3 years ago			`1. Add a comma after the K-Neighbors item, and then add this line:`
classifier 3! 3 years ago
editorial 3 years ago			```python
			`'SVC': SVC(),`
			```
lessons 4 years ago
editorial 3 years ago			`The result is quite good!`

			```output
			`Accuracy (train) for SVC: 83.2%`
			`precision recall f1-score support`

			`chinese 0.79 0.74 0.76 242`
			`indian 0.88 0.90 0.89 234`
			`japanese 0.87 0.81 0.84 254`
			`korean 0.91 0.82 0.86 242`
			`thai 0.74 0.90 0.81 227`

			`accuracy 0.83 1199`
			`macro avg 0.84 0.83 0.83 1199`
			`weighted avg 0.84 0.83 0.83 1199`
			```

			`✅ Learn about [Support-Vectors](https://scikit-learn.org/stable/modules/svm.html#svm)`
lessons 4 years ago
classifier 3! 3 years ago			`## Ensemble Classifiers`
lessons 4 years ago
classifier 3! 3 years ago			`Let's follow the path to the very end, even though the previous test was quite good. Let's try some 'Ensemble Classifiers, specifically Random Forest and AdaBoost:`
Challenge typography edit 3 years ago
editorial 3 years ago			```python
add chapter 4-3 zh translation and tune some trans (#420) Co-authored-by: vicoooo26 <vico24826@gmail.com> 3 years ago			`'RFST': RandomForestClassifier(n_estimators=100),`
editorial 3 years ago			`'ADA': AdaBoostClassifier(n_estimators=100)`
classifier 3! 3 years ago			```
editorial 3 years ago
classifier 3! 3 years ago			`The result is very good, especially for Random Forest:`

editorial 3 years ago			```output
classifier 3! 3 years ago			`Accuracy (train) for RFST: 84.5%`
			`precision recall f1-score support`

			`chinese 0.80 0.77 0.78 242`
			`indian 0.89 0.92 0.90 234`
			`japanese 0.86 0.84 0.85 254`
			`korean 0.88 0.83 0.85 242`
			`thai 0.80 0.87 0.83 227`

			`accuracy 0.84 1199`
			`macro avg 0.85 0.85 0.84 1199`
			`weighted avg 0.85 0.84 0.84 1199`

			`Accuracy (train) for ADA: 72.4%`
			`precision recall f1-score support`

			`chinese 0.64 0.49 0.56 242`
			`indian 0.91 0.83 0.87 234`
			`japanese 0.68 0.69 0.69 254`
			`korean 0.73 0.79 0.76 242`
			`thai 0.67 0.83 0.74 227`

			`accuracy 0.72 1199`
			`macro avg 0.73 0.73 0.72 1199`
			`weighted avg 0.73 0.72 0.72 1199`
			```
editorial 3 years ago
classifier 3! 3 years ago			`✅ Learn about [Ensemble Classifiers](https://scikit-learn.org/stable/modules/ensemble.html)`

			`This method of Machine Learning "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Trees and AdaBoost.`

			`- [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter is set to the number of trees.`

			`- [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) fits a classifier to a dataset and then fits copies of that classifier to the same dataset. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.`
classification 3 audit 3 years ago
			`---`
editorial 3 years ago
classifier 3! 3 years ago			`## 🚀Challenge`
lessons 4 years ago
classifier 3! 3 years ago			`Each of these techniques has a large number of parameters that you can tweak. Research each one's default parameters and think about what tweaking these parameters would mean for the model's quality.`
lessons 4 years ago
path to quizzes needs to change 3 years ago			`## [Post-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/24/)`
editorial 3 years ago
lessons 4 years ago			`## Review & Self Study`

classification 3 audit 3 years ago			`There's a lot of jargon in these lessons, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-15963-cxa) of useful terminology!`
editorial 3 years ago
Assignment callout made more clear 3 years ago			`## Assignment`

classification 3 audit 3 years ago			`[Parameter play](assignment.md)`