You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
ML-For-Beginners/4-Classification/2-Classifiers-1/README.md

11 KiB

Recipe Classifiers 1

In this lesson, you will use the dataset you saved from the last lesson full of balanced, clean data all about recipes. You will use this dataset with a variety of classifiers to predict a given national cuisine based on a group of ingredients. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.

Pre-lecture quiz

Preparation

Assuming you completed Lesson 1, make sure that a cleaned_cuisines.csv file exists in the root /data folder for these four lessons.

Working in this lesson's notebook.ipynb folder, import that file along with the Pandas library:

import pandas as pd
recipes_df = pd.read_csv("../../data/cleaned_cuisine.csv")
recipes_df.head()

The data looks like this:

Unnamed: 0 cuisine almond angelica anise anise_seed apple apple_brandy apricot armagnac ... whiskey white_bread white_wine whole_grain_wheat_flour wine wood yam yeast yogurt zucchini
0 0 indian 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 indian 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 2 indian 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 3 indian 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 4 indian 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0

Now, import several more libraries:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np

Divide the X and y coordinates into two dataframes for training. cuisine can be the labels dataframe:

recipes_label_df = recipes_df['cuisine']
recipes_label_df.head()

It will look like this:

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

Drop that Unnamed: 0 column and the cuisine column and save the rest of the data as trainable features:

recipes_feature_df = recipes_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
recipes_feature_df.head()

Your features look like this:

almond angelica anise anise_seed apple apple_brandy apricot armagnac artemisia artichoke ... whiskey white_bread white_wine whole_grain_wheat_flour wine wood yam yeast yogurt zucchini
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0

Now you are ready to train your model!

Choosing your classifier

Now that your data is clean and ready for training, you have to decide which algorithm to use for the job.

Scikit-Learn groups Classification under Supervised Learning, and in that category you will find many ways to classify. The variety is quite bewildering at first sight. The following methods all include classification techniques:

  • Linear Models
  • Support Vector Machines
  • Stochastic Gradient Descent
  • Nearest Neighbors
  • Gaussian Processes
  • Decision Trees
  • Ensemble methods (voting Classifier)
  • Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)

You can also use neural networks to classify, but that is outside the scope of this lesson.

So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-Learn offers a side-by-side comparison on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscrinationAnalysis, showing the results visualized:

comparison of classifiers

AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it here

Todo: knowledge check

Train your model

Let's train that model. Split your data into training and testing groups:

X_train, X_test, y_train, y_test = train_test_split(recipes_feature_df, recipes_label_df, test_size=0.3)

Use LogisticRegression with a multiclass setting and the lbfgs solver to train.

Todo: explain ravel

lr = LogisticRegression(multi_class='ovr',solver='lbfgs')
model = lr.fit(X_train, np.ravel(y_train))

accuracy = model.score(X_test, y_test)
print ("Accuracy is {}".format(accuracy))

The accuracy is good at over 80%!

You can see this model in action by testing one row of data (#50):

print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')

The result is printed:

ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
cuisine: indian

Try a different row number!

Digging deeper, you can check for the accuracy of this prediction:

test= X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()

The result is printed - Indian cuisine is its best guess, with good probability:

0
indian 0.715851
chinese 0.229475
japanese 0.029763
korean 0.017277
thai 0.007634

Can you explain why the model is pretty sure this is an Indian recipe?

Get more detail by printing a classification report, as you did in the Regression lessons:

y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))
precision recall f1-score support
chinese 0.73 0.71 0.72 229
indian 0.91 0.93 0.92 254
japanese 0.70 0.75 0.72 220
korean 0.86 0.76 0.81 242
thai 0.79 0.85 0.82 254
accuracy 0.80 1199
macro avg 0.80 0.80 0.80 1199
weighted avg 0.80 0.80 0.80 1199

🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

Optional: add a screenshot of the completed lesson's UI if appropriate

Post-lecture quiz

Review & Self Study

Assignment

Assignment Name