In these four lessons, you will explore a fundamental focus of classic machine learning - _classification_. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!
Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. If machine learning is all about predicting values or assigning names to things by using datasets, then classification generally falls into two groups: _binary classification_ and _multiclass classification_.
> 🎥 Click the image above for a video: MIT's John Guttag introduces classification
Remember:
- **Linear regression** helped you predict relationships between variables and make accurate predictions about where a new datapoint would fall in relation to that line. So, you could predict _what price a pumpkin would be in September vs. December_, for example.
- **Logistic regression** helped you discover "binary categories": at this price point, _is this pumpkin orange or not-orange_?
Classification uses various algorithms to determine a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.
Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value ("is this email spam or not?"), to complex image classification and segmentation using computer vision, it's always useful to be able to sort data into classes and ask questions of it.
To state the process in a more scientific way, your classification method creates a predictive model that maps input variables to output variables.
Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.
Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age`, to determine the _likelihood of developing X disease_. As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled and the ML algorithms use those labels to classify and predict classes (or 'labels') of a dataset and assign them to a group or outcome.
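To make that disease example concrete, here is a minimal, hypothetical sketch in scikit-learn. All feature values and labels below are invented purely for illustration, and `LogisticRegression` stands in for any classifier:

```python
from sklearn.linear_model import LogisticRegression

# Toy feature rows: [smoker (0/1), weight (kg), age (years)] -- invented values
X = [[1, 90, 60], [0, 70, 30], [1, 85, 55], [0, 60, 25],
     [1, 95, 65], [0, 75, 35], [1, 88, 58], [0, 65, 28]]
# Labels: 1 = developed the disease, 0 = did not (also invented)
y = [1, 0, 1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)

# Predict the class of a new, unseen person
print(model.predict([[1, 92, 62]]))
```

Multiclass classification works the same way, except `y` contains more than two distinct labels.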
✅ Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see whether, given a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?
> 🎥 Click the image above for a video. The whole premise of the show 'Chopped' is the 'mystery basket', where chefs have to make a dish out of a random choice of ingredients. Surely an ML model would have helped!
The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?
Scikit-learn offers several algorithms for classifying data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.
The first task at hand, before starting this project, is to clean and **balance** your data to get better results. Start with the blank _notebook.ipynb_ file in the root of this folder.
The first thing to install is [imblearn](https://imbalanced-learn.org/stable/). This is a Scikit-learn package that will allow you to better balance the data (you will learn more about this task in a minute).
```output
memory usage: 7.2+ MB
```
## Exercise - learning about cuisines
Now the work starts to become more interesting. Let's discover the distribution of data, per cuisine.
1. Plot the data as bars by calling `barh()`:
```python
df.cuisine.value_counts().plot.barh()
```
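If you'd like to see what `value_counts()` returns before plotting, here is a tiny self-contained sketch; the rows are invented, standing in for the real cuisines dataframe:

```python
import pandas as pd

# Toy stand-in for the cuisines dataframe -- invented rows
df = pd.DataFrame({'cuisine': ['korean', 'korean', 'thai', 'indian']})

# value_counts() tallies rows per cuisine, largest count first;
# .plot.barh() then draws one horizontal bar per cuisine
print(df.cuisine.value_counts())
```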


There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.
1. Find out how much data is available per cuisine and print it out:
```python
thai_df = df[(df.cuisine == "thai")]
japanese_df = df[(df.cuisine == "japanese")]
chinese_df = df[(df.cuisine == "chinese")]
indian_df = df[(df.cuisine == "indian")]
korean_df = df[(df.cuisine == "korean")]

print(f'thai df: {thai_df.shape}')
print(f'japanese df: {japanese_df.shape}')
print(f'chinese df: {chinese_df.shape}')
print(f'indian df: {indian_df.shape}')
print(f'korean df: {korean_df.shape}')
```
The output looks like this:
```output
thai df: (289, 385)
...
korean df: (799, 385)
```
## Discovering ingredients
Now you can dig deeper into the data and learn what the typical ingredients are for each cuisine. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.
1. Create a function `create_ingredient()` in Python to create an ingredient dataframe. This function will start by dropping an unhelpful column and sorting ingredients by their count:
```python
feature_df.head()
```
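As a rough sketch of what such a `create_ingredient()` function might look like (the `cuisine` and `Unnamed: 0` column names are assumptions about this dataset's layout):

```python
import pandas as pd

def create_ingredient(df):
    # Transpose so ingredients become rows, drop the non-ingredient columns,
    # then sum how often each ingredient appears across all recipes
    ingredient_df = df.T.drop(['cuisine', 'Unnamed: 0']).sum(axis=1).to_frame('value')
    # Keep only ingredients that appear at least once
    ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
    # Sort from most to least frequent
    return ingredient_df.sort_values(by='value', ascending=False)
```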
## Balance the dataset
Now that you have cleaned the data, use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) - "Synthetic Minority Over-sampling Technique" - to balance it.
By balancing your data, you'll get better results when classifying it. Think about binary classification: if most of your data belongs to one class, an ML model will predict that class more frequently, just because there is more data for it. Balancing the data corrects this skew.
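Before reaching for SMOTE, it may help to see the idea with the most naive balancing strategy, random oversampling, which simply duplicates minority-class rows; SMOTE goes further and synthesizes new points between minority-class neighbors. The tiny dataframe below is invented for illustration:

```python
import pandas as pd

# Toy imbalanced dataset: 6 korean rows vs. 2 thai rows -- invented
df = pd.DataFrame({'cuisine': ['korean'] * 6 + ['thai'] * 2,
                   'rice':    [1, 0, 1, 1, 0, 1, 1, 1]})

majority = df.cuisine.value_counts().max()

# Resample every class (with replacement) up to the majority-class size
balanced = pd.concat(
    group.sample(majority, replace=True, random_state=0)
    for _, group in df.groupby('cuisine')
)

print(balanced.cuisine.value_counts())  # each class now has 6 rows
```

SMOTE's `fit_resample(X, y)` from `imblearn.over_sampling` plays the same role, but generates synthetic feature rows instead of duplicates.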
1. You can take one more look at the data using `transformed_df.head()` and `transformed_df.info()`. Save a copy of this data for use in future lessons:
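The save step can be as simple as a `to_csv()` call. In this sketch the filename and the contents of `transformed_df` are assumptions (a toy dataframe stands in for your balanced data):

```python
import pandas as pd

# Toy stand-in for the balanced dataframe
transformed_df = pd.DataFrame({'cuisine': ['thai', 'korean'], 'rice': [1, 0]})

# Filename and path are assumptions -- point this at your data folder
transformed_df.to_csv('cleaned_cuisines.csv', index=False)
```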
This fresh CSV can now be found in the root data folder.
---
## 🚀Challenge
This curriculum contains several interesting datasets. Dig through the `data` folders and see whether any contain datasets that would be appropriate for binary or multiclass classification. What questions would you ask of these datasets?