Cuisine classifiers 1

In this lesson, you will use the dataset you saved from the previous lesson, which contains balanced and clean data about cuisines.

You will use this dataset with various classifiers to predict the national cuisine based on a set of ingredients. Along the way, you'll learn more about how algorithms can be applied to classification tasks.

Pre-lecture quiz

Preparation

If you completed Lesson 1, ensure that a cleaned_cuisines.csv file exists in the root /data folder for these four lessons.

Exercise - predict a national cuisine

  1. Working in this lesson's notebook.ipynb file, import the dataset along with the Pandas library:

    import pandas as pd
    cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
    cuisines_df.head()
    

    The data looks like this:

       Unnamed: 0  cuisine  almond  angelica  anise  anise_seed  apple  apple_brandy  apricot  armagnac  ...  whiskey  white_bread  white_wine  whole_grain_wheat_flour  wine  wood  yam  yeast  yogurt  zucchini
    0           0  indian        0         0      0           0      0             0        0         0  ...        0            0           0                        0     0     0    0      0       0         0
    1           1  indian        1         0      0           0      0             0        0         0  ...        0            0           0                        0     0     0    0      0       0         0
    2           2  indian        0         0      0           0      0             0        0         0  ...        0            0           0                        0     0     0    0      0       0         0
    3           3  indian        0         0      0           0      0             0        0         0  ...        0            0           0                        0     0     0    0      0       0         0
    4           4  indian        0         0      0           0      0             0        0         0  ...        0            0           0                        0     0     0    0      0       1         0
  2. Next, import several additional libraries:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
    from sklearn.svm import SVC
    import numpy as np
    
  3. Split the data into features (X) and labels (y) as two separate dataframes for training. Use the cuisine column as the labels dataframe:

    cuisines_label_df = cuisines_df['cuisine']
    cuisines_label_df.head()
    

    It will look like this:

    0    indian
    1    indian
    2    indian
    3    indian
    4    indian
    Name: cuisine, dtype: object
    
  4. Drop the Unnamed: 0 column and the cuisine column using drop(). Save the remaining data as trainable features:

    cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
    cuisines_feature_df.head()
    

    Your features will look like this:

       almond  angelica  anise  anise_seed  apple  apple_brandy  apricot  armagnac  artemisia  artichoke  ...  whiskey  white_bread  white_wine  whole_grain_wheat_flour  wine  wood  yam  yeast  yogurt  zucchini
    0       0         0      0           0      0             0        0         0          0          0  ...        0            0           0                        0     0     0    0      0       0         0
    1       1         0      0           0      0             0        0         0          0          0  ...        0            0           0                        0     0     0    0      0       0         0
    2       0         0      0           0      0             0        0         0          0          0  ...        0            0           0                        0     0     0    0      0       0         0
    3       0         0      0           0      0             0        0         0          0          0  ...        0            0           0                        0     0     0    0      0       0         0
    4       0         0      0           0      0             0        0         0          0          0  ...        0            0           0                        0     0     0    0      0       1         0
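
As an optional sanity check (not part of the original lesson), you can confirm that the features and labels line up row for row, and that the classes are still balanced from the previous lesson's cleaning:

    # Both dataframes should report the same number of rows
    print(cuisines_feature_df.shape)
    print(cuisines_label_df.shape)

    # The previous lesson balanced the classes, so the counts should be roughly equal
    print(cuisines_label_df.value_counts())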

Now you're ready to train your model!

Choosing your classifier

With your data clean and ready for training, it's time to decide which algorithm to use for the task.

Scikit-learn categorizes classification under Supervised Learning, offering a wide range of classification methods. The options can seem overwhelming at first glance. These methods include:

  • Linear Models
  • Support Vector Machines
  • Stochastic Gradient Descent
  • Nearest Neighbors
  • Gaussian Processes
  • Decision Trees
  • Ensemble Methods (Voting Classifier)
  • Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)

You can also use neural networks for classification, but that is beyond the scope of this lesson.

Which classifier should you choose?

So, how do you decide on a classifier? Often, testing several options and comparing results is a good approach. Scikit-learn provides a side-by-side comparison on a sample dataset, showcasing KNeighbors, SVC (two variations), GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB, and QuadraticDiscriminantAnalysis, with visualized results:

comparison of classifiers

Plots generated from Scikit-learn's documentation

AutoML simplifies this process by running these comparisons in the cloud, helping you select the best algorithm for your data. Try it here
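
If you'd like to run a rough comparison yourself, here is a minimal sketch using the cross_val_score import from earlier. The three candidate models and cv=5 are illustrative choices, not the lesson's prescription:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Compare a few classifier families with 5-fold cross-validation
    candidates = {
        'logistic regression': LogisticRegression(max_iter=1000),
        'k-nearest neighbors': KNeighborsClassifier(),
        'decision tree': DecisionTreeClassifier(),
    }
    for name, clf in candidates.items():
        scores = cross_val_score(clf, cuisines_feature_df, np.ravel(cuisines_label_df), cv=5)
        print(f"{name}: mean accuracy {scores.mean():.2f}")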

A more informed approach

Instead of guessing, you can refer to this downloadable ML Cheat Sheet. For our multiclass problem, it suggests several options:

cheatsheet for multiclass problems

A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options

Download this cheat sheet, print it out, and keep it handy!

Reasoning

Let's evaluate different approaches based on our constraints:

  • Neural networks are too resource-intensive. Given our clean but small dataset and the fact that we're training locally in notebooks, neural networks are not ideal for this task.
  • Avoid two-class classifiers. Since this is not a binary classification problem, the cheat sheet's two-class classifiers are not suitable, which rules out the one-vs-all branch.
  • Decision tree or logistic regression could work. Both decision trees and logistic regression are viable options for multiclass data.
  • Multiclass Boosted Decision Trees are not suitable. These are better for nonparametric tasks like ranking, which is not relevant here.

Using Scikit-learn

We'll use Scikit-learn to analyze our data. Logistic regression in Scikit-learn offers several options. Check out the parameters you can configure.

Two key parameters to set are multi_class and solver. These determine the behavior and algorithm used for logistic regression. Not all solvers are compatible with all multi_class values.

According to the documentation, for multiclass classification:

  • The one-vs-rest (OvR) scheme is used if multi_class is set to ovr.
  • Cross-entropy loss is used if multi_class is set to multinomial. (The multinomial option is supported only by the lbfgs, sag, saga, and newton-cg solvers.)

🎓 The 'scheme' refers to how logistic regression handles multiclass classification. It can be 'ovr' (one-vs-rest) or 'multinomial'. These schemes adapt logistic regression, which is primarily designed for binary classification, to handle multiclass tasks. source

🎓 The 'solver' is the algorithm used to optimize the problem. source.

Scikit-learn provides this table to explain how solvers handle different challenges based on data structures:

solvers
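
To make the scheme/solver pairing concrete, here is an illustrative sketch. (A caveat: in recent Scikit-learn releases the multi_class parameter is deprecated, so this assumes a version where it is still accepted.)

    # liblinear only supports the one-vs-rest scheme
    ovr_lr = LogisticRegression(multi_class='ovr', solver='liblinear')

    # lbfgs, newton-cg, sag, and saga also support the multinomial scheme;
    # max_iter=1000 is an assumed value to give the optimizer room to converge
    multinomial_lr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)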

Exercise - split the data

Let's start with logistic regression for your first training attempt, since you learned about it in a previous lesson. Split your data into training and testing sets using train_test_split():

    X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
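
If you want reproducible, class-balanced splits, a common variant (an addition to the lesson's call, with an arbitrary seed) passes random_state and stratify:

    # Illustrative variant: fix the seed and stratify so each cuisine keeps
    # the same proportion in the train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        cuisines_feature_df, cuisines_label_df,
        test_size=0.3, random_state=42, stratify=cuisines_label_df)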

Exercise - apply logistic regression

Since this is a multiclass problem, you need to choose a scheme and a solver. Use LogisticRegression with a multiclass setting and the liblinear solver for training.

  1. Create a logistic regression model with multi_class set to ovr and the solver set to liblinear:

    lr = LogisticRegression(multi_class='ovr', solver='liblinear')
    model = lr.fit(X_train, np.ravel(y_train))

    accuracy = model.score(X_test, y_test)
    print("Accuracy is {}".format(accuracy))
    

    Try using a different solver, like lbfgs, which is the default in current versions of Scikit-learn (a sketch follows after this step).

Note: use NumPy's ravel function to flatten your labels when needed. The accuracy is good at over 80%!
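
As a sketch of the suggested variation: lbfgs pairs with the multinomial scheme, and max_iter=1000 is an assumed value to help convergence on this dataset.

    lr_lbfgs = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
    model_lbfgs = lr_lbfgs.fit(X_train, np.ravel(y_train))
    print("lbfgs accuracy is {}".format(model_lbfgs.score(X_test, y_test)))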

  2. You can see this model in action by testing one row of data (#50):

    print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
    print(f'cuisine: {y_test.iloc[50]}')
    

    The result is printed:

    ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
    cuisine: indian
    

    Try a different row number and check the results.

  3. Digging deeper, you can check the accuracy of this prediction:

    test = X_test.iloc[50].values.reshape(-1, 1).T
    proba = model.predict_proba(test)
    classes = model.classes_
    resultdf = pd.DataFrame(data=proba, columns=classes)

    topPrediction = resultdf.T.sort_values(by=[0], ascending=[False])
    topPrediction.head()
    

    The result is printed - Indian cuisine is its best guess, with good probability:

                     0
    indian    0.715851
    chinese   0.229475
    japanese  0.029763
    korean    0.017277
    thai      0.007634

    Can you explain why the model is quite confident this is an Indian cuisine?
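
One way to explore that question (a sketch, not part of the original lesson) is to inspect the model's per-class coefficients and see which ingredients weigh most heavily toward 'indian':

    # For the ovr/liblinear model, coef_ has one row of ingredient weights per class
    indian_idx = list(model.classes_).index('indian')
    indian_weights = pd.Series(model.coef_[indian_idx], index=X_train.columns)

    # The highest-weighted ingredients push predictions toward Indian cuisine
    print(indian_weights.sort_values(ascending=False).head(10))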

  4. Get more detail by printing a classification report, as you did in the regression lessons:

    y_pred = model.predict(X_test)
    print(classification_report(y_test,y_pred))
    
                  precision    recall  f1-score   support

         chinese       0.73      0.71      0.72       229
          indian       0.91      0.93      0.92       254
        japanese       0.70      0.75      0.72       220
          korean       0.86      0.76      0.81       242
            thai       0.79      0.85      0.82       254

        accuracy                           0.80      1199
       macro avg       0.80      0.80      0.80      1199
    weighted avg       0.80      0.80      0.80      1199
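
The confusion_matrix imported earlier goes unused in this lesson's code; as a final sketch, you could apply it to see which cuisines the model confuses with one another:

    # Rows are true cuisines, columns are predictions, in model.classes_ order
    cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
    print(pd.DataFrame(cm, index=model.classes_, columns=model.classes_))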

🚀Challenge

In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.

Post-lecture quiz

Review & Self Study

Dig a little more into the math behind logistic regression in this lesson

Assignment

Study the solvers

