classification 2

pull/34/head
Jen Looper 3 years ago
parent f0c4490404
commit 8b2ebdb064

File diff suppressed because one or more lines are too long

@ -1,44 +1,160 @@
# Recipe Classifiers 1
In this lesson, you will use the dataset you saved from the last lesson full of balanced, clean data all about recipes. You will use this dataset with a variety of classifiers to predict a given national cuisine based on a group of ingredients. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/19/)
# Preparatory steps to start this lesson
Assuming you completed Lesson 1, make sure that a `cleaned_cuisines.csv` file exists in the root `/data` folder for these four lessons.
Working in this lesson's `notebook.ipynb` folder, import that file along with the Pandas library:
```python
import pandas as pd
recipes_df = pd.read_csv("../../data/cleaned_cuisine.csv")
recipes_df.head()
```
The data looks like this:
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Now, import several more libraries:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np
```
Divide the X and y coordinates into two dataframes for training. `cuisine` can be the labels dataframe:
```python
recipes_label_df = recipes_df['cuisine']
recipes_label_df.head()
```
Describe what we will learn
It will look like this:
### Introduction
```
0 indian
1 indian
2 indian
3 indian
4 indian
Name: cuisine, dtype: object
```
Describe what will be covered
Drop that `Unnamed: 0` column and the `cuisine` column and save the rest of the data as trainable features:
> Notes
```python
recipes_feature_df = recipes_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
recipes_feature_df.head()
```
### Prerequisite
Your features look like this:
What steps should have been covered before this lesson?
| almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini | |
| -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | ---: | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | ---: | ----: | -----: | -------: | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
### Preparation
Now you are ready to train your model!
Preparatory steps to start this lesson
## Choosing your classifier
---
Now that your data is clean and ready for training, you have to decide which algorithm to use for the job.
[Step through content in blocks]
TODO: discuss the types
## [Topic 1]
✅ Todo: knowledge check
### Task:
## Train your model
Work together to progressively enhance your codebase to build the project with shared code:
Let's train that model. Split your data into training and testing groups:
```html
code blocks
```python
X_train, X_test, y_train, y_test = train_test_split(recipes_feature_df, recipes_label_df, test_size=0.3)
```
✅ Knowledge Check - use this moment to stretch students' knowledge with open questions
Use LogisticRegression with a multiclass setting and the lbfgs solver to train.
✅ Todo: explain ravel
```python
lr = LogisticRegression(multi_class='ovr',solver='lbfgs')
model = lr.fit(X_train, np.ravel(y_train))
accuracy = model.score(X_test, y_test)
print ("Accuracy is {}".format(accuracy))
```
## [Topic 2]
The accuracy is good at over 80%!
## [Topic 3]
You can see this model in action by testing one row of data (#50):
```python
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')
```
The result is printed:
```
ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
cuisine: indian
```
✅ Try a different row number!
Digging deeper, you can check for the accuracy of this prediction:
```python
test= X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)
topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()
```
The result is printed - Indian cuisine is its best guess, with good probability:
| | 0 | | | | | | | | | | | | | | | | | | | | |
| -------: | -------: | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| indian | 0.715851 | | | | | | | | | | | | | | | | | | | | |
| chinese | 0.229475 | | | | | | | | | | | | | | | | | | | | |
| japanese | 0.029763 | | | | | | | | | | | | | | | | | | | | |
| korean | 0.017277 | | | | | | | | | | | | | | | | | | | | |
| thai | 0.007634 | | | | | | | | | | | | | | | | | | | | |
✅ Can you explain why the model is pretty sure this is an Indian recipe?
Get more detail by printing a classification report, as you did in the Regression lessons:
```python
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))
```
| precision | recall | f1-score | support | | | | | | | | | | | | | | | | | | |
| ------------ | ------ | -------- | ------- | ---- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| chinese | 0.73 | 0.71 | 0.72 | 229 | | | | | | | | | | | | | | | | | |
| indian | 0.91 | 0.93 | 0.92 | 254 | | | | | | | | | | | | | | | | | |
| japanese | 0.70 | 0.75 | 0.72 | 220 | | | | | | | | | | | | | | | | | |
| korean | 0.86 | 0.76 | 0.81 | 242 | | | | | | | | | | | | | | | | | |
| thai | 0.79 | 0.85 | 0.82 | 254 | | | | | | | | | | | | | | | | | |
| accuracy | 0.80 | 1199 | | | | | | | | | | | | | | | | | | | |
| macro avg | 0.80 | 0.80 | 0.80 | 1199 | | | | | | | | | | | | | | | | | |
| weighted avg | 0.80 | 0.80 | 0.80 | 1199 | | | | | | | | | | | | | | | | | |
## 🚀Challenge
Add a challenge for students to work on collaboratively in class to enhance the project

@ -0,0 +1,28 @@
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": 3
},
"orig_nbformat": 2
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"source": [
"# Build Classification Models"
],
"cell_type": "markdown",
"metadata": {}
}
]
}

File diff suppressed because it is too large Load Diff
Loading…
Cancel
Save