# Recipe Classifiers 1
In this lesson, you will use the dataset of balanced, clean recipe data that you saved in the last lesson. You will use it with a variety of classifiers to predict a given national cuisine from a group of ingredients. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
## Pre-lecture quiz
### Preparatory steps to start this lesson
Assuming you completed Lesson 1, make sure that a `cleaned_cuisines.csv` file exists in the root `/data` folder for these four lessons. Working in this lesson's `notebook.ipynb` file, import that file along with the Pandas library:
```python
import pandas as pd
recipes_df = pd.read_csv("../../data/cleaned_cuisines.csv")
recipes_df.head()
```
The data looks like this:
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
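Since the last lesson left this data balanced across cuisines, you can quickly verify that before training. This is an optional sanity check, not part of the original lesson:

```python
# Optional sanity check: confirm the cuisines are still roughly balanced
print(recipes_df.shape)
print(recipes_df['cuisine'].value_counts())
```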
Now, import several more libraries:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np
```
Divide the data into features (X) and labels (y) for training. The `cuisine` column can be the labels dataframe:
```python
recipes_label_df = recipes_df['cuisine']
recipes_label_df.head()
```
It will look like this:
```
0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object
```
Drop the `Unnamed: 0` column and the `cuisine` column, and save the rest of the data as trainable features:
```python
recipes_feature_df = recipes_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
recipes_feature_df.head()
```
Your features look like this:
| | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Now you are ready to train your model!
## Choosing your classifier
Now that your data is clean and ready for training, you have to decide which algorithm to use for the job.
Scikit-learn offers many classifiers to choose from: linear models such as logistic regression, support vector machines, nearest neighbors, naive Bayes, decision trees, and ensemble methods. Which one works best for a given dataset often comes down to the shape of your data and some experimentation.
✅ Knowledge check: why might you try several different classifiers on the same dataset before settling on one?
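Rather than guessing, you can also let the data help you decide: the `cross_val_score` helper imported earlier runs a cross-validated comparison. Below is a minimal sketch, not part of the original lesson, comparing logistic regression with a linear-kernel SVC on the feature and label dataframes built above; `max_iter=1000` is an assumption added to help convergence, and your exact scores will vary:

```python
# Rough comparison of two candidate classifiers via 5-fold cross-validation
for name, clf in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Linear SVC", SVC(kernel='linear')),
]:
    scores = cross_val_score(clf, recipes_feature_df, np.ravel(recipes_label_df), cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```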
## Train your model
Let's train that model. Split your data into training and testing groups:
```python
X_train, X_test, y_train, y_test = train_test_split(recipes_feature_df, recipes_label_df, test_size=0.3)
```
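Note that `train_test_split` shuffles the rows randomly, so your numbers will differ slightly from those shown below. If you want reproducible results, you can pass a fixed `random_state`, an optional tweak not in the original lesson:

```python
# Optional: pin the split with a seed so results are reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    recipes_feature_df, recipes_label_df, test_size=0.3, random_state=0
)
```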
Use `LogisticRegression` with a multiclass setting and the `lbfgs` solver to train. Note the call to `np.ravel()` below: it flattens the labels into the one-dimensional array that `fit()` expects.
```python
lr = LogisticRegression(multi_class='ovr', solver='lbfgs')
model = lr.fit(X_train, np.ravel(y_train))

accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))
```
The accuracy should come in at over 80%! (Your exact number will vary, since the train/test split is random.)
You can see this model in action by testing one row of data (#50):
```python
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')
```
The result is printed:
```
ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
cuisine: indian
```
✅ Try a different row number!
Digging deeper, you can check for the accuracy of this prediction:
```python
test = X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

topPrediction = resultdf.T.sort_values(by=[0], ascending=[False])
topPrediction.head()
```
The result is printed, and Indian cuisine is the model's best guess, with good probability:
| | 0 |
| --- | --- |
| indian | 0.715851 |
| chinese | 0.229475 |
| japanese | 0.029763 |
| korean | 0.017277 |
| thai | 0.007634 |
✅ Can you explain why the model is pretty sure this is an Indian recipe?
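If you'd like to repeat this probability check for any row, one option is to wrap the steps above in a small helper function. This is a convenience sketch, not part of the original lesson; `top_cuisines` is a hypothetical name:

```python
def top_cuisines(row_index, k=5):
    """Print the model's top-k cuisine guesses for one test row."""
    row = X_test.iloc[row_index].values.reshape(1, -1)
    proba = model.predict_proba(row)
    resultdf = pd.DataFrame(data=proba, columns=model.classes_)
    print(f"actual cuisine: {y_test.iloc[row_index]}")
    print(resultdf.T.sort_values(by=[0], ascending=False).head(k))

top_cuisines(50)
```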
Get more detail by printing a classification report, as you did in the Regression lessons:
```python
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
```
              precision    recall  f1-score   support

     chinese       0.73      0.71      0.72       229
      indian       0.91      0.93      0.92       254
    japanese       0.70      0.75      0.72       220
      korean       0.86      0.76      0.81       242
        thai       0.79      0.85      0.82       254

    accuracy                           0.80      1199
   macro avg       0.80      0.80      0.80      1199
weighted avg       0.80      0.80      0.80      1199
```
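The `confusion_matrix` function imported earlier offers another view: which cuisines get mistaken for which. Here is a short sketch, not part of the original lesson, that labels the raw matrix with the model's class names for readability:

```python
# Rows are true cuisines, columns are predicted cuisines
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print(pd.DataFrame(cm, index=model.classes_, columns=model.classes_))
```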
## 🚀Challenge
In this lesson you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data, and dig deeper into the concept of 'solver' to understand what goes on behind the scenes.