Update README.zh-cn.md

pull/124/head
feiyun0112 4 years ago committed by GitHub
parent 24442a9c3b
commit 677aa8ac43
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -1,85 +1,86 @@
# Build a regression model using Scikit-learn: prepare and visualize data # 使用Scikit-learn构建回归模型准备和可视化数据
> ![Data visualization infographic](./images/data-visualization.png) > ![数据可视化信息图](../images/data-visualization.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded) > 作者[Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/11/) ## [课前测](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/11/)
## Introduction ## 介绍
Now that you are set up with the tools you need to start tackling machine learning model building with Scikit-learn, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potential of your dataset. 既然你已经设置了开始使用Scikit-learn处理机器学习模型构建所需的工具你就可以开始对数据提出问题了。当你处理数据并应用ML解决方案时了解如何提出正确的问题以正确释放数据集的潜力非常重要。
In this lesson, you will learn: 在本课中,你将学习:
- How to prepare your data for model-building. - 如何为模型构建准备数据。
- How to use Matplotlib for data visualization. - 如何使用Matplotlib进行数据可视化。
## Asking the right question of your data ## 对你的数据提出正确的问题
The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data. 你需要回答的问题将决定你将使用哪种类型的ML算法。你得到的答案的质量将在很大程度上取决于你的数据的性质。
Take a look at the [data](../data/US-pumpkins.csv) provided for this lesson. You can open this .csv file in VS Code. A quick skim immediately shows that there are blanks and a mix of strings and numeric data. There's also a strange column called 'Package' where the data is a mix between 'sacks', 'bins' and other values. The data, in fact, is a bit of a mess. 查看为本课程提供的[数据](../data/US-pumpkins.csv)。你可以在VS Code中打开这个.csv文件。快速浏览一下就会发现有空格还有字符串和数字数据的混合。还有一个奇怪的列叫做“Package”其中的数据是“sacks”、“bins”和其他值的混合。事实上数据有点乱。
In fact, it is not very common to be gifted a dataset that is completely ready to use to create a ML model out of the box. In this lesson, you will learn how to prepare a raw dataset using standard Python libraries. You will also learn various techniques to visualize the data. 事实上获得一个完全准备好用于创建开箱即用的ML模型的数据集并不是很常见。在本课中你将学习如何使用标准Python库准备原始数据集。你还将学习各种技术来可视化数据。
## Case study: 'the pumpkin market' ## 案例研究:“南瓜市场”
In this folder you will find a .csv file in the root `data` folder called [US-pumpkins.csv](../data/US-pumpkins.csv) which includes 1757 lines of data about the market for pumpkins, sorted into groupings by city. This is raw data extracted from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture. 你将在`data`文件夹中找到一个名为[US-pumpkins.csv](../data/US-pumpkins.csv)的.csv 文件其中包含有关南瓜市场的1757行数据已 按城市排序分组。这是从美国农业部分发的[特种作物终端市场标准报告](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice)中提取的原始数据。
### Preparing data ### 准备数据
This data is in the public domain. It can be downloaded in many separate files, per city, from the USDA web site. To avoid too many separate files, we have concatenated all the city data into one spreadsheet, thus we have already _prepared_ the data a bit. Next, let's take a closer look at the data. 这些数据属于公共领域。它可以从美国农业部网站下载,每个城市有许多不同的文件。为了避免太多单独的文件,我们将所有城市数据合并到一个电子表格中,因此我们已经准备了一些数据。接下来,让我们仔细看看数据。
### The pumpkin data - early conclusions ### 南瓜数据 - 早期结论
What do you notice about this data? You already saw that there is a mix of strings, numbers, blanks and strange values that you need to make sense of. 你对这些数据有什么看法?你已经看到了无法理解的字符串、数字、空格和奇怪值的混合体。
What question can you ask of this data, using a Regression technique? What about "Predict the price of a pumpkin for sale during a given month". Looking again at the data, there are some changes you need to make to create the data structure necessary for the task. 你可以使用回归技术对这些数据提出什么问题?“预测给定月份内待售南瓜的价格”怎么样?再次查看数据,你需要进行一些更改才能创建任务所需的数据结构。
## Exercise - analyze the pumpkin data ## 练习 - 分析南瓜数据
Let's use [Pandas](https://pandas.pydata.org/) (the name stands for `Python Data Analysis`), a tool very useful for shaping data, to analyze and prepare this pumpkin data. 让我们使用[Pandas](https://pandas.pydata.org/)(“Python 数据分析”的意思),一个非常有用的工具,用于分析和准备南瓜数据。
### First, check for missing dates ### 首先,检查遗漏的日期
You will first need to take steps to check for missing dates: 你首先需要采取以下步骤来检查缺少的日期:
1. Convert the dates to a month format (these are US dates, so the format is `MM/DD/YYYY`). 1. 将日期转换为月份格式(这些是美国日期,因此格式为`MM/DD/YYYY`)。
2. Extract the month to a new column. 2. 将月份提取到新列。
Open the _notebook.ipynb_ file in Visual Studio Code and import the spreadsheet into a new Pandas dataframe. 在 Visual Studio Code 中打开notebook.ipynb文件并将电子表格导入到新的Pandas dataframe中。
1. Use the `head()` function to view the first five rows. 1. 使用 `head()`函数查看前五行。
```python ```python
import pandas as pd import pandas as pd
pumpkins = pd.read_csv('../data/US-pumpkins.csv') pumpkins = pd.read_csv('../../data/US-pumpkins.csv')
pumpkins.head() pumpkins.head()
``` ```
What function would you use to view the last five rows? 使用什么函数来查看最后五行?
1. Check if there is missing data in the current dataframe: 2. 检查当前dataframe中是否缺少数据
```python ```python
pumpkins.isnull().sum() pumpkins.isnull().sum()
``` ```
There is missing data, but maybe it won't matter for the task at hand. 有数据丢失,但可能对手头的任务来说无关紧要。
1. To make your dataframe easier to work with, drop several of its columns, using `drop()`, keeping only the columns you need: 3. 为了让你的dataframe更容易使用使用`drop()`删除它的几个列,只保留你需要的列:
```python ```python
new_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date'] new_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1) pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
``` ```
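The missing-data check and the column selection can be tried on a small frame before running them on the real file. This is a minimal sketch, assuming a made-up three-row table; the row values are invented for illustration and are not taken from US-pumpkins.csv.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    'City Name': ['BALTIMORE', 'BALTIMORE', 'BOSTON'],
    'Package': ['24 inch bins', None, '1/2 bushel cartons'],
    'Date': ['09/24/2016', '09/24/2016', '10/01/2016'],
    'Low Price': [160.0, 160.0, 15.0],
    'High Price': [180.0, 160.0, 18.0],
})

# Count the missing values in each column.
missing = toy.isnull().sum()
print(missing)

# Drop every column that is not in the keep list. Names in the list that
# do not exist in the frame (here 'Month') are simply never matched.
keep = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
trimmed = toy.drop(columns=[c for c in toy.columns if c not in keep])
print(trimmed.columns.tolist())
```

`drop(columns=...)` is equivalent to `drop(..., axis=1)`; the keyword form just makes the intent easier to read.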
### Second, determine average price of pumpkin ### 然后,确定南瓜的平均价格
Think about how to determine the average price of a pumpkin in a given month. What columns would you pick for this task? Hint: you'll need 3 columns. 考虑如何确定给定月份南瓜的平均价格。你会为此任务选择哪些列提示你需要3列。
Solution: take the average of the `Low Price` and `High Price` columns to populate the new Price column, and convert the Date column to only show the month. Fortunately, according to the check above, there is no missing data for dates or prices. 解决方案:取`Low Price`和`High Price`列的平均值来填充新的Price列将Date列转换成只显示月份。幸运的是根据上面的检查没有丢失日期或价格的数据。
1. To calculate the average, add the following code: 1. 要计算平均值,请添加以下代码:
```python ```python
price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2 price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2
@ -88,37 +89,37 @@ Solution: take the average of the `Low Price` and `High Price` columns to popula
``` ```
Feel free to print any data you'd like to check using `print(month)`. 请随意使用`print(month)`打印你想检查的任何数据。
2. Now, copy your converted data into a fresh Pandas dataframe: 2. 现在将转换后的数据复制到新的Pandas dataframe中
```python ```python
new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'],'High Price': pumpkins['High Price'], 'Price': price}) new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'],'High Price': pumpkins['High Price'], 'Price': price})
``` ```
Printing out your dataframe will show you a clean, tidy dataset on which you can build your new regression model. 打印出的dataframe将向你展示一个干净整洁的数据集你可以在此数据集上构建新的回归模型。
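The month extraction and the price average can be sketched on toy data as follows. This is one possible approach, assuming `pd.to_datetime` with an explicit `MM/DD/YYYY` format string; the conversion line the lesson itself uses is not visible in this hunk, and the rows below are invented.

```python
import pandas as pd

# Hypothetical rows in the MM/DD/YYYY format described in the lesson.
toy = pd.DataFrame({
    'Date': ['09/24/2016', '10/01/2016', '10/15/2016'],
    'Low Price': [15.0, 18.0, 20.0],
    'High Price': [17.0, 18.0, 24.0],
})

# Parse the US-style dates explicitly, then pull out the month number.
month = pd.to_datetime(toy['Date'], format='%m/%d/%Y').dt.month

# Average the low and high price for each row.
price = (toy['Low Price'] + toy['High Price']) / 2

summary = pd.DataFrame({'Month': month, 'Price': price})
print(summary)
```

Passing `format` explicitly avoids any ambiguity between month-first and day-first dates.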
### But wait! There's something odd here ### 但是等等!这里有点奇怪
If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in '1 1/9 bushel' measures, and some in '1/2 bushel' measures, some per pumpkin, some per pound, and some in big boxes with varying widths. 如果你看看`Package`(包装)一栏南瓜有很多不同的配置。有的以1 1/9蒲式耳的尺寸出售有的以1/2蒲式耳的尺寸出售有的以每只南瓜出售有的以每磅出售有的以不同宽度的大盒子出售。
> Pumpkins seem very hard to weigh consistently > 南瓜似乎很难统一称重方式
Digging into the original data, it's interesting that anything with `Unit of Sale` equalling 'EACH' or 'PER BIN' also has the `Package` type per inch, per bin, or 'each'. Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string 'bushel' in their `Package` column. 深入研究原始数据,有趣的是,任何`Unit of Sale`等于“EACH”或“PER BIN”的东西也具有每英寸、每箱或“每个”的`Package`类型。南瓜似乎很难采用统一称重方式,因此让我们通过仅选择`Package`列中带有字符串“蒲式耳”的南瓜来过滤它们。
1. Add a filter at the top of the file, under the initial .csv import: 1. 在初始.csv导入下添加过滤器
```python ```python
pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)] pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]
``` ```
If you print the data now, you can see that you are only getting the 415 or so rows of data containing pumpkins by the bushel. 如果你现在打印数据,你可以看到你只获得了 415 行左右包含按蒲式耳计算的南瓜的数据。
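To see what the bushel filter keeps and discards, here is a minimal sketch on invented `Package` values. The `na=False` argument is an addition that is not in the lesson's code: it treats a missing `Package` as a non-match, which matters if the column contains blanks.

```python
import pandas as pd

# Hypothetical package descriptions, including one missing value.
toy = pd.DataFrame({
    'Package': ['24 inch bins', '1 1/9 bushel cartons', '1/2 bushel cartons', 'each', None],
    'Low Price': [160.0, 15.0, 15.0, 3.0, 10.0],
})

# Keep only rows whose Package text contains the literal string 'bushel'.
# regex=False does a plain substring match; na=False drops missing values.
mask = toy['Package'].str.contains('bushel', case=True, regex=False, na=False)
bushels = toy[mask]
print(bushels)
```

A plain substring match is enough here because 'bushel' contains no regular-expression metacharacters.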
### But wait! There's one more thing to do ### 可是等等! 还有一件事要做
Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, so do some math to standardize it. 你是否注意到每行的蒲式耳数量不同?你需要对定价进行标准化,以便显示每蒲式耳的定价,因此请进行一些数学计算以对其进行标准化。
1. Add these lines after the block creating the new_pumpkins dataframe: 1. 在创建 new_pumpkins dataframe的代码块之后添加这些行
```python ```python
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9) new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9)
@ -126,34 +127,35 @@ Did you notice that the bushel amount varies per row? You need to normalize the
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2) new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2)
``` ```
According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data! 根据 [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308)蒲式耳的重量取决于产品的类型因为它是一种体积测量。“例如一蒲式耳西红柿应该重56 磅……叶子和蔬菜占据更多空间重量更轻所以一蒲式耳菠菜只有20磅。” 这一切都相当复杂!让我们不要费心进行蒲式耳到磅的转换,而是按蒲式耳定价。然而,所有这些对蒲式耳南瓜的研究表明,了解数据的性质是多么重要!
Now, you can analyze the pricing per unit based on their bushel measurement. If you print out the data one more time, you can see how it's standardized. 现在,你可以根据蒲式耳测量来分析每单位的定价。如果你再打印一次数据,你可以看到它是如何标准化的。
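The arithmetic behind the per-bushel normalization can be checked on a toy frame: a package priced at 15 that holds 1 1/9 bushels works out to 13.50 per bushel, and the same price for half a bushel works out to 30 per bushel. The sketch below is an alternative formulation, assuming invented rows and a helper column of package sizes (`size` is a hypothetical name, not from the lesson).

```python
import pandas as pd

# Hypothetical rows that all share the same package price.
toy = pd.DataFrame({
    'Package': ['1 1/9 bushel cartons', '1/2 bushel cartons', 'bushel cartons'],
    'Price': [15.0, 15.0, 15.0],
})

# Bushels per package. Rows matching neither pattern are assumed
# to hold exactly one bushel.
size = pd.Series(1.0, index=toy.index)
size.loc[toy['Package'].str.contains('1 1/9', regex=False)] = 1 + 1/9
size.loc[toy['Package'].str.contains('1/2', regex=False)] = 1/2

# Divide the package price by the package size to get a price per bushel.
toy['Price per bushel'] = toy['Price'] / size
print(toy)
```

Building the size column first and dividing once keeps the unit conversion in a single place, which makes it easier to add other package sizes later.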
Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin. 你有没有注意到半蒲式耳卖的南瓜很贵?你能弄清楚为什么吗?提示:小南瓜比大南瓜贵得多,这可能是因为考虑到一个大的空心馅饼南瓜占用的未使用空间,每蒲式耳的南瓜要多得多。
## Visualization Strategies ## 可视化策略
Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover. 数据科学家的部分职责是展示他们使用的数据的质量和性质。为此,他们通常会创建有趣的可视化或绘图、图形和图表,以显示数据的不同方面。通过这种方式,他们能够直观地展示难以发现的关系和差距。
Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise. 可视化还可以帮助确定最适合数据的机器学习技术。例如,似乎沿着一条线的散点图表明该数据是线性回归练习的良好候选者。
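A visual impression that points "follow a line" can be backed up with a number. The following sketch, using invented values, computes the Pearson correlation coefficient with `Series.corr`; a value close to 1 or -1 suggests a roughly linear relationship, though it is only a rough indicator and does not replace looking at the plot.

```python
import pandas as pd

# Hypothetical measurements that lie close to a straight line.
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1])

# Series.corr uses the Pearson method by default.
r = x.corr(y)
print(round(r, 3))
```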
One data visualization library that works well in Jupyter notebooks is [Matplotlib](https://matplotlib.org/) (which you also saw in the previous lesson). 一个在Jupyter notebooks中运行良好的数据可视化库是[Matplotlib](https://matplotlib.org/)(你在上一课中也看到过)。
> Get more experience with data visualization in [these tutorials](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-15963-cxa). > 在[这些教程](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-15963-cxa)中获得更多数据可视化经验。
## Exercise - experiment with Matplotlib ## 练习 - 使用 Matplotlib 进行实验
Try to create some basic plots to display the new dataframe you just created. What would a basic line plot show? 尝试创建一些基本图形来显示你刚刚创建的新dataframe。基本线图会显示什么
1. Import Matplotlib at the top of the file, under the Pandas import: 1. 在文件顶部导入Matplotlib
```python ```python
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
``` ```
1. Rerun the entire notebook to refresh. 2. 重新刷新以运行整个notebook。
1. At the bottom of the notebook, add a cell to plot the data as a scatterplot: 3. 在notebook底部添加一个单元格以绘制数据:
```python ```python
price = new_pumpkins.Price price = new_pumpkins.Price
@ -162,39 +164,39 @@ Try to create some basic plots to display the new dataframe you just created. Wh
plt.show() plt.show()
``` ```
![A scatterplot showing price to month relationship](./images/scatterplot.png) ![显示价格与月份关系的散点图](../images/scatterplot.png)
Is this a useful plot? Does anything about it surprise you? 这是一个有用的图吗?有什么让你吃惊的吗?
It's not particularly useful, as all it does is display your data as a spread of points in a given month. 它并不是特别有用,因为它所做的只是在你的数据中显示为给定月份的点数分布。
### Make it useful ### 让它有用
To get charts to display useful data, you usually need to group the data somehow. Let's try creating a plot where the x axis shows the months and the y axis shows the average price for each month. 为了让图表显示有用的数据,你通常需要以某种方式对数据进行分组。让我们尝试创建一个图,其中x轴显示月份,y轴显示每个月的平均价格。
1. Add a cell to create a grouped bar chart: 1. 添加单元格以创建分组条形图:
```python ```python
new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar') new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Pumpkin Price") plt.ylabel("Pumpkin Price")
``` ```
![A bar chart showing price to month relationship](./images/barchart.png) ![显示价格与月份关系的条形图](../images/barchart.png)
This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not? 这是一个更有用的数据可视化似乎表明南瓜的最高价格出现在9月和10月。这符合你的期望吗为什么为什么不
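The grouping step is what makes the bar chart readable, and it can be inspected on its own before plotting. This is a minimal sketch on invented month and price values; the `Agg` backend line is an assumption that lets the code run without a display and should be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prices for three months.
toy = pd.DataFrame({
    'Month': [9, 9, 10, 10, 11],
    'Price': [14.0, 16.0, 18.0, 22.0, 12.0],
})

# One mean price per month; the month becomes the index of the result.
monthly = toy.groupby('Month')['Price'].mean()
print(monthly)

# Vertical bars: months along the x axis, mean price along the y axis.
ax = monthly.plot(kind='bar')
ax.set_xlabel('Month')
ax.set_ylabel('Pumpkin Price')
plt.close('all')  # in a notebook, call plt.show() instead
```

Printing `monthly` before plotting makes it easy to confirm that each bar corresponds to the mean of that month's rows.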
--- ---
## 🚀Challenge ## 🚀挑战
Explore the different types of visualization that Matplotlib offers. Which types are most appropriate for regression problems? 探索Matplotlib提供的不同类型的可视化。哪种类型最适合回归问题
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/12/) ## [课后测](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/12/)
## Review & Self Study ## 复习与自学
Take a look at the many ways to visualize data. Make a list of the various libraries available and note which are best for given types of tasks, for example 2D visualizations vs. 3D visualizations. What do you discover? 请看一下可视化数据的多种方法。列出各种可用的库并注意哪些库最适合给定类型的任务例如2D可视化与3D可视化。你发现了什么
## Assignment ## 任务
[Exploring visualization](assignment.md) [探索可视化](../assignment.md)
