Cuisine classifiers 1

In this lesson, you will use the dataset you saved from the last lesson, full of balanced, clean data all about cuisines.

You will use this dataset with a variety of classifiers to predict a given national cuisine based on a group of ingredients. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.

Pre-lecture quiz

Preparation

Assuming you completed Lesson 1, make sure that a cleaned_cuisines.csv file exists in the root /data folder for these four lessons.
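If you want a quick sanity check before diving in, here is a minimal sketch (assuming the same relative path the read_csv call below uses):

    from pathlib import Path

    # fail early if Lesson 1's output is missing
    data_file = Path("../data/cleaned_cuisines.csv")
    assert data_file.exists(), "cleaned_cuisines.csv not found - complete Lesson 1 first"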

Exercise - predict a national cuisine

  1. Working in this lesson's notebook.ipynb folder, import that file along with the Pandas library:

    import pandas as pd
    cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
    cuisines_df.head()
    

    The data looks like this:

| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
  2. Now, import more libraries:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
    from sklearn.svm import SVC
    import numpy as np
    
  3. Divide the X and y coordinates into two dataframes for training. cuisine can be the labels dataframe:

    cuisines_label_df = cuisines_df['cuisine']
    cuisines_label_df.head()
    

    It looks like this:

    0    indian
    1    indian
    2    indian
    3    indian
    4    indian
    Name: cuisine, dtype: object
    
  4. Drop the Unnamed: 0 column and the cuisine column, calling drop(). Save the rest of the data as trainable features:

    cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
    cuisines_feature_df.head()
    

    Your features look like this:

| | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

Now you are ready to train your model!

Choosing your classifier

Now that your data is clean and ready for training, you have to decide which algorithm to use for the job.

Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. The variety is quite bewildering at first sight. The following methods all include classification techniques:

  • Linear Models
  • Support Vector Machines
  • Stochastic Gradient Descent
  • Nearest Neighbors
  • Gaussian Processes
  • Decision Trees
  • Ensemble methods (voting Classifier)
  • Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)

You can also use neural networks to classify data, but that is outside the scope of this lesson.

Which classifier to choose?

So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-learn offers a side-by-side comparison on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, showing the results visualized:

comparison of classifiers

Plots generated on Scikit-learn's documentation

AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it here
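If you would rather run a scaled-down version of that comparison locally, here is a minimal sketch. It assumes the cuisines_feature_df and cuisines_label_df dataframes built above; the shortlist of classifiers is just an example:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    import numpy as np

    # hypothetical shortlist - swap in any classifiers you want to compare
    candidates = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(),
        "K-Nearest Neighbors": KNeighborsClassifier(),
    }

    for name, clf in candidates.items():
        # 5-fold cross-validation gives a rough, comparable accuracy per model
        scores = cross_val_score(clf, cuisines_feature_df, np.ravel(cuisines_label_df), cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")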

A better approach

A better way than wildly guessing, however, is to follow the ideas on this downloadable ML Cheat sheet. Here, we discover that, for our multiclass problem, we have some choices:

cheatsheet for multiclass problems

A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options

Download this cheat sheet, print it out, and hang it on your wall!

Reasoning

Let's see if we can reason our way through different approaches given the constraints we have:

  • Neural networks are too heavy. Given our clean but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
  • No two-class classifier. We do not use a two-class classifier, so that rules out one-vs-all.
  • Decision tree or logistic regression could work. A decision tree might work, or logistic regression for multiclass data.
  • Multiclass Boosted Decision Trees solve a different problem. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.

Using Scikit-learn

We will use Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the parameters to pass.

Essentially there are two important parameters - multi_class and solver - that we need to specify when we ask Scikit-learn to perform a logistic regression. The multi_class value applies a certain behavior. The value of the solver is what algorithm to use. Not all solvers can be paired with all multi_class values.

According to the docs, in the multiclass case, the training algorithm:

  • Uses the one-vs-rest (OvR) scheme, if the multi_class option is set to ovr
  • Uses the cross-entropy loss, if the multi_class option is set to multinomial. (Currently the multinomial option is supported only by the lbfgs, sag, saga and newton-cg solvers.)

🎓 The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. source

🎓 The 'solver' is defined as "the algorithm to use in the optimization problem". source.

Scikit-learn offers this table to explain how solvers handle the different challenges presented by different kinds of data structures:

solvers
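To make that pairing rule concrete, here is a minimal illustrative sketch (instantiation only; training happens in the exercises below). Note that newer scikit-learn releases deprecate the multi_class parameter, so you may see a warning:

    from sklearn.linear_model import LogisticRegression

    # one-vs-rest scheme: liblinear is a compatible solver
    ovr_model = LogisticRegression(multi_class='ovr', solver='liblinear')

    # multinomial (cross-entropy) scheme: needs lbfgs, sag, saga or newton-cg
    multinomial_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

    # an incompatible pair, e.g. multi_class='multinomial' with solver='liblinear',
    # raises a ValueError once you call .fit()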

Exercise - split the data

We can focus on logistic regression for our first training trial since you recently learned about it in a previous lesson. Split your data into training and testing groups by calling train_test_split():

X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)

Exercise - apply logistic regression

Since you are using the multiclass case, you need to choose what scheme to apply and what solver to set. Use LogisticRegression with a multiclass setting and the liblinear solver to train.

  1. Create a logistic regression with multi_class set to ovr and the solver set to liblinear:

    # fit a one-vs-rest logistic regression using the liblinear solver
    lr = LogisticRegression(multi_class='ovr', solver='liblinear')
    model = lr.fit(X_train, np.ravel(y_train))

    # score the trained model against the held-out test set
    accuracy = model.score(X_test, y_test)
    print("Accuracy is {}".format(accuracy))
    

    Try a different solver like lbfgs, which is often set as the default

    Note, use the ravel function to flatten your data when needed; here np.ravel() flattens y_train before fitting.

    The accuracy is good at over 80%!
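    To see what that flattening does, here is a tiny illustrative sketch (y_train is already 1-dimensional in this lesson; the 2-D frame is built here only for demonstration):

    y_2d = y_train.to_frame()       # a single-column DataFrame, shape (n, 1)
    print(np.ravel(y_2d).shape)     # (n,) - flattened to one dimension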

  2. You can see this model in action by testing one row of data (#50):

    print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
    print(f'cuisine: {y_test.iloc[50]}')
    

    The result is printed:

    ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
    cuisine: indian
    

Try a different row number and check the results

  3. Digging deeper, you can check for the accuracy of this prediction:

    # reshape the single row into the 2-D shape predict_proba expects
    test = X_test.iloc[50].values.reshape(-1, 1).T
    proba = model.predict_proba(test)
    classes = model.classes_
    resultdf = pd.DataFrame(data=proba, columns=classes)

    # rank the cuisines by predicted probability
    topPrediction = resultdf.T.sort_values(by=[0], ascending=[False])
    topPrediction.head()
    

    The result is printed - Indian cuisine is its best guess, with good probability:

                   0
    indian    0.715851
    chinese   0.229475
    japanese  0.029763
    korean    0.017277
    thai      0.007634

    Can you explain why the model is pretty sure it's Indian food?
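    One way to probe that confidence is to inspect the learned weights - a minimal sketch, assuming the ovr model trained above (for a one-vs-rest LogisticRegression, model.coef_ holds one row of ingredient coefficients per cuisine class):

    # pick out the coefficient row for the 'indian' class
    idx = list(model.classes_).index('indian')
    coefs = pd.Series(model.coef_[idx], index=X_test.columns)

    # show the weights of just the ingredients present in row #50
    row = X_test.iloc[50]
    print(coefs[row[row != 0].index].sort_values(ascending=False))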

  4. Get more detail by printing a classification report, as you did in the regression lessons:

    y_pred = model.predict(X_test)
    print(classification_report(y_test,y_pred))
    
                  precision    recall  f1-score   support

         chinese       0.73      0.71      0.72       229
          indian       0.91      0.93      0.92       254
        japanese       0.70      0.75      0.72       220
          korean       0.86      0.76      0.81       242
            thai       0.79      0.85      0.82       254

        accuracy                           0.80      1199
       macro avg       0.80      0.80      0.80      1199
    weighted avg       0.80      0.80      0.80      1199
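    Since confusion_matrix is already imported above, you can also cross-tabulate which cuisines get mistaken for which. A minimal sketch reusing y_pred:

    # rows are true cuisines, columns are predicted cuisines
    cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
    print(pd.DataFrame(cm, index=model.classes_, columns=model.classes_))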

🚀Challenge

In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.

Post-lecture quiz

Review & Self Study

Dig a little more into the math behind logistic regression in this lesson

Assignment

Study the solvers


Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.