Cuisine classifiers 1

In this lesson, you will use the dataset you saved from the last lesson, full of balanced, clean data all about cuisines.

You will use this dataset with a variety of classifiers to predict a given national cuisine based on a group of ingredients. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.

Pre-lecture quiz

Preparation

Assuming you completed Lesson 1, make sure that a cleaned_cuisines.csv file exists in the root /data folder for these four lessons.
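If you want a quick sanity check before diving in, here is a minimal sketch (assuming the same relative path the read_csv call below uses):

    from pathlib import Path

    # fail early if Lesson 1's output is missing
    data_file = Path("../data/cleaned_cuisines.csv")
    assert data_file.exists(), "cleaned_cuisines.csv not found - complete Lesson 1 first"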

Exercise - predict a national cuisine

  1. Working in this lesson's notebook.ipynb folder, import that file along with the Pandas library:

    import pandas as pd
    cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
    cuisines_df.head()
    

    The data looks like this:

| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
  2. Now, import more libraries:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
    from sklearn.svm import SVC
    import numpy as np
    
  3. Divide the X and y coordinates into two dataframes for training. cuisine can be the labels dataframe:

    cuisines_label_df = cuisines_df['cuisine']
    cuisines_label_df.head()
    

    It looks like this:

    0    indian
    1    indian
    2    indian
    3    indian
    4    indian
    Name: cuisine, dtype: object
    
  4. Drop the Unnamed: 0 column and the cuisine column, calling drop(). Save the rest of the data as trainable features:

    cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
    cuisines_feature_df.head()
    

    Your features look like this:

| | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

Now you are ready to train your model!

Choosing your classifier

Now that your data is clean and ready for training, you have to decide which algorithm to use for the job.

Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. The variety is quite bewildering at first sight. The following methods all include classification techniques:

  • Linear Models
  • Support Vector Machines
  • Stochastic Gradient Descent
  • Nearest Neighbors
  • Gaussian Processes
  • Decision Trees
  • Ensemble methods (voting Classifier)
  • Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)

You can also use neural networks to classify data, but that is outside the scope of this lesson.

Which classifier to choose?

So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-learn offers a side-by-side comparison on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, showing the results visualized:

comparison of classifiers

Plots generated on Scikit-learn's documentation

AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it here
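If you would rather run a scaled-down version of that comparison locally, here is a minimal sketch. It assumes the cuisines_feature_df and cuisines_label_df dataframes built above; the shortlist of classifiers is just an example:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    import numpy as np

    # hypothetical shortlist - swap in any classifiers you want to compare
    candidates = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(),
        "K-Nearest Neighbors": KNeighborsClassifier(),
    }

    for name, clf in candidates.items():
        # 5-fold cross-validation gives a rough, comparable accuracy per model
        scores = cross_val_score(clf, cuisines_feature_df, np.ravel(cuisines_label_df), cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")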

A better approach

A better way than wildly guessing, however, is to follow the ideas on this downloadable ML Cheat sheet. Here, we discover that, for our multiclass problem, we have some choices:

cheatsheet for multiclass problems

A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options

Download this cheat sheet, print it out, and hang it on your wall!

Reasoning

Let's see if we can reason our way through different approaches given the constraints we have:

  • Neural networks are too heavy. Given our clean but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
  • No two-class classifier. We do not use a two-class classifier, so that rules out one-vs-all.
  • Decision tree or logistic regression could work. A decision tree might work, or logistic regression for multiclass data.
  • Multiclass Boosted Decision Trees solve a different problem. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.

Using Scikit-learn

We will use Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the parameters to pass.

Essentially there are two important parameters - multi_class and solver - that we need to specify when we ask Scikit-learn to perform a logistic regression. The multi_class value applies a certain behavior. The value of the solver is what algorithm to use. Not all solvers can be paired with all multi_class values.

According to the docs, in the multiclass case, the training algorithm:

  • Uses the one-vs-rest (OvR) scheme, if the multi_class option is set to ovr
  • Uses the cross-entropy loss, if the multi_class option is set to multinomial. (Currently the multinomial option is supported only by the lbfgs, sag, saga and newton-cg solvers.)

🎓 The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. source

🎓 The 'solver' is defined as "the algorithm to use in the optimization problem". source.

Scikit-learn offers this table to explain how solvers handle the different challenges presented by different kinds of data structures:

solvers
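To make that pairing rule concrete, here is a minimal illustrative sketch (instantiation only; training happens in the exercises below). Note that newer scikit-learn releases deprecate the multi_class parameter, so you may see a warning:

    from sklearn.linear_model import LogisticRegression

    # one-vs-rest scheme: liblinear is a compatible solver
    ovr_model = LogisticRegression(multi_class='ovr', solver='liblinear')

    # multinomial (cross-entropy) scheme: needs lbfgs, sag, saga or newton-cg
    multinomial_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

    # an incompatible pair, e.g. multi_class='multinomial' with solver='liblinear',
    # raises a ValueError once you call .fit()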

Exercise - split the data

We can focus on logistic regression for our first training trial since you recently learned about it in a previous lesson. Split your data into training and testing groups by calling train_test_split():

X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)

Exercise - apply logistic regression

Since you are using the multiclass case, you need to choose what scheme to apply and what solver to set. Use LogisticRegression with a multiclass setting and the liblinear solver to train.

  1. Create a logistic regression with multi_class set to ovr and the solver set to liblinear:

    # fit a one-vs-rest logistic regression using the liblinear solver
    lr = LogisticRegression(multi_class='ovr', solver='liblinear')
    model = lr.fit(X_train, np.ravel(y_train))

    # score the trained model against the held-out test set
    accuracy = model.score(X_test, y_test)
    print("Accuracy is {}".format(accuracy))
    

    Try a different solver like lbfgs, which is often set as the default

    Note, use the ravel function to flatten your data when needed; here np.ravel() flattens y_train before fitting.

    The accuracy is good at over 80%!
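    To see what that flattening does, here is a tiny illustrative sketch (y_train is already 1-dimensional in this lesson; the 2-D frame is built here only for demonstration):

    y_2d = y_train.to_frame()       # a single-column DataFrame, shape (n, 1)
    print(np.ravel(y_2d).shape)     # (n,) - flattened to one dimension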

  2. You can see this model in action by testing one row of data (#50):

    print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
    print(f'cuisine: {y_test.iloc[50]}')
    

    The result is printed:

    ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
    cuisine: indian
    

Try a different row number and check the results

  3. Digging deeper, you can check for the accuracy of this prediction:

    # reshape the single row into the 2-D shape predict_proba expects
    test = X_test.iloc[50].values.reshape(-1, 1).T
    proba = model.predict_proba(test)
    classes = model.classes_
    resultdf = pd.DataFrame(data=proba, columns=classes)

    # rank the cuisines by predicted probability
    topPrediction = resultdf.T.sort_values(by=[0], ascending=[False])
    topPrediction.head()
    

    The result is printed - Indian cuisine is its best guess, with good probability:

                   0
    indian    0.715851
    chinese   0.229475
    japanese  0.029763
    korean    0.017277
    thai      0.007634

    Can you explain why the model is pretty sure it's Indian food?
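    One way to probe that confidence is to inspect the learned weights - a minimal sketch, assuming the ovr model trained above (for a one-vs-rest LogisticRegression, model.coef_ holds one row of ingredient coefficients per cuisine class):

    # pick out the coefficient row for the 'indian' class
    idx = list(model.classes_).index('indian')
    coefs = pd.Series(model.coef_[idx], index=X_test.columns)

    # show the weights of just the ingredients present in row #50
    row = X_test.iloc[50]
    print(coefs[row[row != 0].index].sort_values(ascending=False))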

  4. Get more detail by printing a classification report, as you did in the regression lessons:

    y_pred = model.predict(X_test)
    print(classification_report(y_test,y_pred))
    
                  precision    recall  f1-score   support

         chinese       0.73      0.71      0.72       229
          indian       0.91      0.93      0.92       254
        japanese       0.70      0.75      0.72       220
          korean       0.86      0.76      0.81       242
            thai       0.79      0.85      0.82       254

        accuracy                           0.80      1199
       macro avg       0.80      0.80      0.80      1199
    weighted avg       0.80      0.80      0.80      1199
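    Since confusion_matrix is already imported above, you can also cross-tabulate which cuisines get mistaken for which. A minimal sketch reusing y_pred:

    # rows are true cuisines, columns are predicted cuisines
    cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
    print(pd.DataFrame(cm, index=model.classes_, columns=model.classes_))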

🚀Challenge

In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.

Post-lecture quiz

Review & Self Study

Dig a little more into the math behind logistic regression in this lesson

Assignment

Study the solvers


Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.