classification 1

pull/34/head
Jen Looper 4 years ago
parent b7927cee63
commit 71b883285a

@ -2,6 +2,8 @@
[![ML, AI, Deep Learning - What's the difference?](https://img.youtube.com/vi/lTd9RSxS9ZE/0.jpg)](https://youtu.be/lTd9RSxS9ZE "ML, AI, Deep Learning - What's the difference?")
python path: https://docs.microsoft.com/en-us/learn/paths/python-language/
> 🎥 Click the image above for a video discussing the difference between Machine Learning, AI, and Deep Learning.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/1/)
### Introduction

@ -1,32 +1,205 @@
# Introduction to Classification
In these four lessons, you will discover the 'meat and potatoes' of classic machine learning - Classification. No pun intended - we will walk through using various classification algorithms with a dataset all about the brilliant cuisines of Asia. Hope you're hungry!
In these four lessons, you will discover the 'meat and potatoes' of classic machine learning - Classification. No pun intended - we will walk through using various classification algorithms with a dataset all about the brilliant cuisines of Asia and India. Hope you're hungry!
Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that bears a lot in common with Regression techniques. If machine learning is all about assigning names to things via datasets, then classification generally falls into two groups: binary classification and multiclass classfication.
Remember, Linear Regression helped you predict relationships between variables and make accurate predictions on where a new datapoint would fall in relationship to that line. So, you could predict what price a pumpkin would be in September vs. December, for example. Logistic Regression helped you discover binary categories: at this price point, is this pumpkin orange or not-orange?
Classification uses various algorithms to determine other ways of determining a data point's label or class. Let's work with this recipe data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.
[![Introduction to Classification](https://img.youtube.com/vi/eg8DJYwdMyg/0.jpg)](https://youtu.be/eg8DJYwdMyg "Introduction to Classification")
> 🎥 Click the image above for a video: MIT's John Guttag introduces Classification
Remember, Linear Regression helped you predict relationships between variables and make accurate predictions on where a new datapoint would fall in relationship to that line. So, you could predict what price a pumpkin would be in September vs. December, for example. Logistic Regression helped you discover binary categories: at this price point, is this pumpkin orange or not-orange?
Classification uses various algorithms to determine other ways of determining a data point's label or class. Let's work with this recipe data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.
## [Pre-lecture quiz](link-to-quiz-app)
### Introduction
Before working to clean the data and prepare it for analysis, it's useful to understand several of the algorithms that you will use.
Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value ("is this email spam or not?") to complex image classification and segmentation using computer vision, it's always useful to be able to sort data into classes and ask questions of it. Or, to state the process in a more scientific way, your classification method creates a predictive model that enables you to map the relationship between input variables to output variables.
Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.
Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as 'smoker','weight', and 'age' to determine 'likelihood of developing X disease'. As a supervised learning technique similar to the Regression exercises you performed earlier, your data is labeled and the ML algorithms use those labels to classify and predict classes (or 'features') of a dataset and assign them to a group or outcome.
✅ Take a moment to imagine a dataset about recipes. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to contain Fenugreek? What if you wanted to see if, given a present of a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?
## Hello 'classifier'
The question we want to ask of this recipe dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?
Scikit-Learn offers several different algorithms to use to classify data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.
## Clean and Balance Your Data
The first task at hand before starting this project is to clean and **balance** your data to get better results. Start with the blank `notebook.ipynb` file ini the root of this folder.
The first think to install is [imblearn](https://imbalanced-learn.org/stable/). This is a Scikit-Learn package that will allow you to better balance the data (you will learn more about this task in a minute).
```python
pip install imblearn
```
Then, import the packages you need to import your data and visualize it. Import SMOTE from imblearn.
```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from imblearn.over_sampling import SMOTE
```
The next task will be to import the data:
```python
df = pd.read_csv('../data/recipes.csv')
```
Check the data's shape:
```python
df.head()
```
- Support-vector machines
- Naive Bayes
- Decision trees
- K-nearest neighbor algorithm
The first five rows look like this:
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 65 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 66 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 67 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 68 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 69 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
✅ Knowledge Check - use this moment to stretch students' knowledge with open questions
Get info about this data:
```python
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2448 entries, 0 to 2447
Columns: 385 entries, Unnamed: 0 to zucchini
dtypes: int64(384), object(1)
memory usage: 7.2+ MB
```
## Learning about cuisines
Now the work starts to become more interesting. Let's discover the distribution of data, per cuisine:
```python
df.cuisine.value_counts().plot.barh()
```
![cuisine data distribution](images/cuisine-dist.png)
There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more. How much data exactly is available per cuisine?
```python
thai_df = df[(df.cuisine == "thai")]
japanese_df = df[(df.cuisine == "japanese")]
chinese_df = df[(df.cuisine == "chinese")]
indian_df = df[(df.cuisine == "indian")]
korean_df = df[(df.cuisine == "korean")]
print(f'thai df: {thai_df.shape}')
print(f'japanese df: {japanese_df.shape}')
print(f'chinese df: {chinese_df.shape}')
print(f'indian df: {indian_df.shape}')
print(f'korean df: {korean_df.shape}')
```
thai df: (289, 385)
japanese df: (320, 385)
chinese df: (442, 385)
indian df: (598, 385)
korean df: (799, 385)
## Discovering ingredients
Now you can dig deeper into the data and learn what are the typical ingredients per cuisine. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.
Create a function in Python to create an ingredient dataframe. This function will start by dropping an unhelpful column and sort through ingredients by their count:
```python
def create_ingredient_df(df):
ingredient_df = df.T.drop(['cuisine','Unnamed: 0']).sum(axis=1).to_frame('value')
ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
ingredient_df = ingredient_df.sort_values(by='value', ascending=False
inplace=False)
return ingredient_df
```
Now you can use that function to get an idea of top ten most popular ingredients by cuisine:
```python
thai_ingredient_df = create_ingredient_df(thai_df)
thai_ingredient_df.head(10).plot.barh()
```
![thai](images/thai.png)
```python
japanese_ingredient_df = create_ingredient_df(japanese_df)
japanese_ingredient_df.head(10).plot.barh()
```
![japanese](images/japanese.png)
```python
chinese_ingredient_df = create_ingredient_df(chinese_df)
chinese_ingredient_df.head(10).plot.barh()
```
![chinese](images/chinese.png)
```python
indian_ingredient_df = create_ingredient_df(indian_df)
indian_ingredient_df.head(10).plot.barh()
```
![indian](images/indian.png)
```python
korean_ingredient_df = create_ingredient_df(korean_df)
korean_ingredient_df.head(10).plot.barh()
```
![korean](images/korean.png)
Now, drop the most common ingredients that create confusion between distinct cuisines. Everyone loves rice, garlic and ginger!
```python
feature_df= df.drop(['cuisine','Unnamed: 0','rice','garlic','ginger'], axis=1)
labels_df = df.cuisine #.unique()
feature_df.head()
```
## Balance the dataset
Now that you have cleaned the data, use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) - "Synthetic Minority Over-sampling Technique" - to balance it. This strategy generates new samples by interpolation.
```python
oversample = SMOTE()
transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
```
By balancing your data, you'll have better results when classifying it. Now you can check the numbers of labels per ingredient:
```python
print(f'new label count: {transformed_label_df.value_counts()}')
print(f'old label count: {df.cuisine.value_counts()}')
```
new label count: korean 799
chinese 799
indian 799
japanese 799
thai 799
Name: cuisine, dtype: int64
old label count: korean 799
indian 598
chinese 442
japanese 320
thai 289
Name: cuisine, dtype: int64
The data is nice and clean, balanced, and very delicious! You can take one more look at the data using `transformed_df.head()` and `transformed_df.info()`. Save a copy of this data for use in future lessons:
```python
transformed_df.to_csv("../../data/cleaned_cuisine.csv")
```
This fresh CSV can now be found in the root data folder.
## 🚀Challenge
## [Post-lecture quiz](link-to-quiz-app)

Binary file not shown.

After

Width:  |  Height:  |  Size: 4.8 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 8.0 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.6 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 8.9 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.7 KiB

@ -0,0 +1,28 @@
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": 3
},
"orig_nbformat": 2
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"source": [
"# Delicious Asian Recipes "
],
"cell_type": "markdown",
"metadata": {}
}
]
}

File diff suppressed because one or more lines are too long

@ -1,563 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Build Classification Model"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\r\n",
"from sklearn.model_selection import train_test_split, cross_val_score\r\n",
"from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve\r\n",
"from sklearn.svm import SVC\r\n",
"import pandas as pd\r\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>almond</th>\n",
" <th>angelica</th>\n",
" <th>anise</th>\n",
" <th>anise_seed</th>\n",
" <th>apple</th>\n",
" <th>apple_brandy</th>\n",
" <th>apricot</th>\n",
" <th>armagnac</th>\n",
" <th>artemisia</th>\n",
" <th>artichoke</th>\n",
" <th>...</th>\n",
" <th>whiskey</th>\n",
" <th>white_bread</th>\n",
" <th>white_wine</th>\n",
" <th>whole_grain_wheat_flour</th>\n",
" <th>wine</th>\n",
" <th>wood</th>\n",
" <th>yam</th>\n",
" <th>yeast</th>\n",
" <th>yogurt</th>\n",
" <th>zucchini</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 380 columns</p>\n",
"</div>"
],
"text/plain": [
" almond angelica anise anise_seed apple apple_brandy apricot \\\n",
"0 0 0 0 0 0 0 0 \n",
"1 1 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 0 \n",
"\n",
" armagnac artemisia artichoke ... whiskey white_bread white_wine \\\n",
"0 0 0 0 ... 0 0 0 \n",
"1 0 0 0 ... 0 0 0 \n",
"2 0 0 0 ... 0 0 0 \n",
"3 0 0 0 ... 0 0 0 \n",
"4 0 0 0 ... 0 0 0 \n",
"\n",
" whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n",
"0 0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 1 0 \n",
"\n",
"[5 rows x 380 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformed_feature_df = pd.read_csv(\".data/features_dataset.csv\")\r\n",
"transformed_feature_df= transformed_feature_df.drop(['Unnamed: 0'], axis=1)\r\n",
"transformed_feature_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cuisine</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>indian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>indian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>indian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>indian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>indian</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cuisine\n",
"0 indian\n",
"1 indian\n",
"2 indian\n",
"3 indian\n",
"4 indian"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformed_label_df = pd.read_csv(\".data/labels_dataset.csv\")\r\n",
"transformed_label_df= transformed_label_df.drop(['Unnamed: 0'], axis=1)\r\n",
"transformed_label_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(transformed_feature_df, transformed_label_df, test_size=0.3)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy is 0.8031693077564637\n"
]
}
],
"source": [
"lr = LogisticRegression(multi_class='ovr',solver='lbfgs')\r\n",
"model = lr.fit(X_train, np.ravel(y_train))\r\n",
"\r\n",
"accuracy = model.score(X_test, y_test)\r\n",
"print (\"Accuracy is {}\".format(accuracy))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ingredients: Index(['corn'], dtype='object')\n",
"cusine: cuisine thai\n",
"Name: 3816, dtype: object\n"
]
}
],
"source": [
"# test an item\r\n",
"print(f'ingredients: {X_test.iloc[20][X_test.iloc[20]!=0].keys()}')\r\n",
"print(f'cusine: {y_test.iloc[20]}')"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>thai</th>\n",
" <td>0.475724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>chinese</th>\n",
" <td>0.201912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>japanese</th>\n",
" <td>0.152046</td>\n",
" </tr>\n",
" <tr>\n",
" <th>korean</th>\n",
" <td>0.110980</td>\n",
" </tr>\n",
" <tr>\n",
" <th>indian</th>\n",
" <td>0.059338</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"thai 0.475724\n",
"chinese 0.201912\n",
"japanese 0.152046\n",
"korean 0.110980\n",
"indian 0.059338"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#rehsape to 2d array and transpose\r\n",
"test= X_test.iloc[20].values.reshape(-1, 1).T\r\n",
"# predict with score\r\n",
"proba = model.predict_proba(test)\r\n",
"classes = model.classes_\r\n",
"# create df with classes and scores\r\n",
"resultdf = pd.DataFrame(data=proba, columns=classes)\r\n",
"\r\n",
"# create df to show results\r\n",
"topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])\r\n",
"topPrediction.head()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" chinese 0.77 0.70 0.74 239\n",
" indian 0.88 0.88 0.88 240\n",
" japanese 0.76 0.79 0.77 227\n",
" korean 0.86 0.78 0.82 240\n",
" thai 0.75 0.86 0.80 253\n",
"\n",
" accuracy 0.80 1199\n",
" macro avg 0.81 0.80 0.80 1199\n",
"weighted avg 0.81 0.80 0.80 1199\n",
"\n"
]
}
],
"source": [
"y_pred = model.predict(X_test)\r\n",
"print(classification_report(y_test,y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Try different classifiers"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"\r\n",
"C = 10\r\n",
"# Create different classifiers.\r\n",
"classifiers = {\r\n",
" 'L1 logistic': LogisticRegression(C=C, penalty='l1',\r\n",
" solver='saga',\r\n",
" multi_class='multinomial',\r\n",
" max_iter=10000),\r\n",
" 'L2 logistic (Multinomial)': LogisticRegression(C=C, penalty='l2',\r\n",
" solver='saga',\r\n",
" multi_class='multinomial',\r\n",
" max_iter=10000),\r\n",
" 'L2 logistic (OvR)': LogisticRegression(C=C, penalty='l2',\r\n",
" solver='saga',\r\n",
" multi_class='ovr',\r\n",
" max_iter=10000),\r\n",
" 'Linear SVC': SVC(kernel='linear', C=C, probability=True,\r\n",
" random_state=0)\r\n",
"}\r\n"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy (train) for L1 logistic: 79.9% \n",
"Accuracy (train) for L2 logistic (Multinomial): 79.7% \n",
"Accuracy (train) for L2 logistic (OvR): 79.8% \n",
"Accuracy (train) for Linear SVC: 77.9% \n"
]
}
],
"source": [
"n_classifiers = len(classifiers)\r\n",
"\r\n",
"for index, (name, classifier) in enumerate(classifiers.items()):\r\n",
" classifier.fit(X_train, np.ravel(y_train))\r\n",
"\r\n",
" y_pred = classifier.predict(X_test)\r\n",
" accuracy = accuracy_score(y_test, y_pred)\r\n",
" print(\"Accuracy (train) for %s: %0.1f%% \" % (name, accuracy * 100))\r\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "dd61f40108e2a19f4ef0d3ebbc6b6eea57ab3c4bc13b15fe6f390d3d86442534"
},
"kernelspec": {
"display_name": "Python 3.8.5 64-bit ('onnxwine': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

@ -0,0 +1,336 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Build Classification Model"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split, cross_val_score\n",
"from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve\n",
"from sklearn.svm import SVC\n",
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Unnamed: 0 cuisine almond angelica anise anise_seed apple \\\n",
"0 0 indian 0 0 0 0 0 \n",
"1 1 indian 1 0 0 0 0 \n",
"2 2 indian 0 0 0 0 0 \n",
"3 3 indian 0 0 0 0 0 \n",
"4 4 indian 0 0 0 0 0 \n",
"\n",
" apple_brandy apricot armagnac ... whiskey white_bread white_wine \\\n",
"0 0 0 0 ... 0 0 0 \n",
"1 0 0 0 ... 0 0 0 \n",
"2 0 0 0 ... 0 0 0 \n",
"3 0 0 0 ... 0 0 0 \n",
"4 0 0 0 ... 0 0 0 \n",
"\n",
" whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n",
"0 0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 1 0 \n",
"\n",
"[5 rows x 382 columns]"
],
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Unnamed: 0</th>\n <th>cuisine</th>\n <th>almond</th>\n <th>angelica</th>\n <th>anise</th>\n <th>anise_seed</th>\n <th>apple</th>\n <th>apple_brandy</th>\n <th>apricot</th>\n <th>armagnac</th>\n <th>...</th>\n <th>whiskey</th>\n <th>white_bread</th>\n <th>white_wine</th>\n <th>whole_grain_wheat_flour</th>\n <th>wine</th>\n <th>wood</th>\n <th>yam</th>\n <th>yeast</th>\n <th>yogurt</th>\n <th>zucchini</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>indian</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>indian</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>indian</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>indian</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>indian</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 382 columns</p>\n</div>"
},
"metadata": {},
"execution_count": 26
}
],
"source": [
"recipes_df = pd.read_csv(\"../../data/cleaned_cuisine.csv\")\n",
"recipes_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 indian\n",
"1 indian\n",
"2 indian\n",
"3 indian\n",
"4 indian\n",
"Name: cuisine, dtype: object"
]
},
"metadata": {},
"execution_count": 27
}
],
"source": [
"recipes_label_df = recipes_df['cuisine']\n",
"recipes_label_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" almond angelica anise anise_seed apple apple_brandy apricot \\\n",
"0 0 0 0 0 0 0 0 \n",
"1 1 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 0 \n",
"\n",
" armagnac artemisia artichoke ... whiskey white_bread white_wine \\\n",
"0 0 0 0 ... 0 0 0 \n",
"1 0 0 0 ... 0 0 0 \n",
"2 0 0 0 ... 0 0 0 \n",
"3 0 0 0 ... 0 0 0 \n",
"4 0 0 0 ... 0 0 0 \n",
"\n",
" whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n",
"0 0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 1 0 \n",
"\n",
"[5 rows x 380 columns]"
],
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>almond</th>\n <th>angelica</th>\n <th>anise</th>\n <th>anise_seed</th>\n <th>apple</th>\n <th>apple_brandy</th>\n <th>apricot</th>\n <th>armagnac</th>\n <th>artemisia</th>\n <th>artichoke</th>\n <th>...</th>\n <th>whiskey</th>\n <th>white_bread</th>\n <th>white_wine</th>\n <th>whole_grain_wheat_flour</th>\n <th>wine</th>\n <th>wood</th>\n <th>yam</th>\n <th>yeast</th>\n <th>yogurt</th>\n <th>zucchini</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 380 columns</p>\n</div>"
},
"metadata": {},
"execution_count": 28
}
],
"source": [
"recipes_feature_df = recipes_df.drop(['Unnamed: 0', 'cuisine'], axis=1)\n",
"recipes_feature_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(recipes_feature_df, recipes_label_df, test_size=0.3)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Accuracy is 0.810675562969141\n"
]
}
],
"source": [
"lr = LogisticRegression(multi_class='ovr',solver='lbfgs')\n",
"model = lr.fit(X_train, np.ravel(y_train))\n",
"\n",
"accuracy = model.score(X_test, y_test)\n",
"print (\"Accuracy is {}\".format(accuracy))"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"ingredients: Index(['bean', 'coriander', 'cumin', 'fenugreek', 'pepper', 'turmeric',\n 'vegetable_oil'],\n dtype='object')\ncusine: thai\n"
]
}
],
"source": [
"# test an item\n",
"print(f'ingredients: {X_test.iloc[20][X_test.iloc[20]!=0].keys()}')\n",
"print(f'cuisine: {y_test.iloc[20]}')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" 0\n",
"indian 0.530435\n",
"thai 0.344293\n",
"japanese 0.108792\n",
"chinese 0.015001\n",
"korean 0.001480"
],
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>indian</th>\n <td>0.530435</td>\n </tr>\n <tr>\n <th>thai</th>\n <td>0.344293</td>\n </tr>\n <tr>\n <th>japanese</th>\n <td>0.108792</td>\n </tr>\n <tr>\n <th>chinese</th>\n <td>0.015001</td>\n </tr>\n <tr>\n <th>korean</th>\n <td>0.001480</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 32
}
],
"source": [
"#rehsape to 2d array and transpose\r\n",
"test= X_test.iloc[20].values.reshape(-1, 1).T\r\n",
"# predict with score\r\n",
"proba = model.predict_proba(test)\r\n",
"classes = model.classes_\r\n",
"# create df with classes and scores\r\n",
"resultdf = pd.DataFrame(data=proba, columns=classes)\r\n",
"\r\n",
"# create df to show results\r\n",
"topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])\r\n",
"topPrediction.head()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" precision recall f1-score support\n\n chinese 0.75 0.67 0.70 231\n indian 0.91 0.90 0.90 255\n japanese 0.77 0.82 0.79 260\n korean 0.83 0.83 0.83 220\n thai 0.79 0.83 0.81 233\n\n accuracy 0.81 1199\n macro avg 0.81 0.81 0.81 1199\nweighted avg 0.81 0.81 0.81 1199\n\n"
]
}
],
"source": [
"y_pred = model.predict(X_test)\r\n",
"print(classification_report(y_test,y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Try different classifiers"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"\r\n",
"C = 10\r\n",
"# Create different classifiers.\r\n",
"classifiers = {\r\n",
" 'L1 logistic': LogisticRegression(C=C, penalty='l1',\r\n",
" solver='saga',\r\n",
" multi_class='multinomial',\r\n",
" max_iter=10000),\r\n",
" 'L2 logistic (Multinomial)': LogisticRegression(C=C, penalty='l2',\r\n",
" solver='saga',\r\n",
" multi_class='multinomial',\r\n",
" max_iter=10000),\r\n",
" 'L2 logistic (OvR)': LogisticRegression(C=C, penalty='l2',\r\n",
" solver='saga',\r\n",
" multi_class='ovr',\r\n",
" max_iter=10000),\r\n",
" 'Linear SVC': SVC(kernel='linear', C=C, probability=True,\r\n",
" random_state=0)\r\n",
"}\r\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Accuracy (train) for L1 logistic: 79.4% \n",
"Accuracy (train) for L2 logistic (Multinomial): 79.2% \n",
"Accuracy (train) for L2 logistic (OvR): 80.2% \n",
"Accuracy (train) for Linear SVC: 79.1% \n"
]
}
],
"source": [
"n_classifiers = len(classifiers)\r\n",
"\r\n",
"for index, (name, classifier) in enumerate(classifiers.items()):\r\n",
" classifier.fit(X_train, np.ravel(y_train))\r\n",
"\r\n",
" y_pred = classifier.predict(X_test)\r\n",
" accuracy = accuracy_score(y_test, y_pred)\r\n",
" print(\"Accuracy (train) for %s: %0.1f%% \" % (name, accuracy * 100))\r\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "dd61f40108e2a19f4ef0d3ebbc6b6eea57ab3c4bc13b15fe6f390d3d86442534"
},
"kernelspec": {
"name": "python37364bit8d3b438fb5fc4430a93ac2cb74d693a7",
"display_name": "Python 3.7.0 64-bit ('3.7')"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
},
"metadata": {
"interpreter": {
"hash": "70b38d7a306a849643e446cd70466270a13445e5987dfa1344ef2b127438fa4d"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}

@ -1,20 +1,19 @@
# Getting Started with Classification
## Regional topic: Delicious Asian Recipes 🍜
In Asia, food traditions are extremely diverse, and very delicious! Let's look at data about regional recipes to try to guess where they originated.
In Asia and India, food traditions are extremely diverse, and very delicious! Let's look at data about regional recipes to try to guess where they originated.
![Thai food seller](./images/thai-food.jpg)
> Photo by <a href="https://unsplash.com/@changlisheng?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Lisheng Chang</a> on <a href="https://unsplash.com/s/photos/asian-food?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## What you will learn
In this section, you will build on the skills you learned in Lesson 1 (Regression) to learn about more classifiers you can use that will help you learn about your data.
In this section, you will build on the skills you learned in Lesson 1 (Regression) to learn about other classifiers you can use that will help you learn about your data.
## Lessons
1. [Introduction to Classification](1-Introduction/README.md)
2. [Build a Discriminative Model](2-Discriminative/README.md)
3. [Build a Generative Model](3-Generative/README.md)
2. [More Classifiers](2-Classifiers-1/README.md)
3. [Yet Other Classifiers](3-Classifiers-2/README.md)
4. [Applied ML: Build a Web App](4-Applied/README.md)
## Credits

Loading…
Cancel
Save