Now that you are set up with the tools you need to start building machine learning models with Scikit-learn, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potential of your dataset.
The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.
Take a look at the [data](../data/US-pumpkins.csv) provided for this lesson. You can open this .csv file in VS Code. A quick skim immediately shows that there are blanks and a mix of strings and numeric data. There's also a strange column called 'Package' where the data is a mix between 'sacks', 'bins' and other values. The data, in fact, is a bit of a mess.
It is uncommon to be handed a dataset that is completely ready for building an ML model out of the box. In this lesson, you will learn how to prepare a raw dataset using standard Python libraries. You will also learn various techniques to visualize the data.
In this folder you will find a .csv file in the root `data` folder called [US-pumpkins.csv](../data/US-pumpkins.csv) which includes 1757 lines of data about the market for pumpkins, sorted into groupings by city. This is raw data extracted from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture.
This data is in the public domain. It can be downloaded in many separate files, per city, from the USDA web site. To avoid too many separate files, we have concatenated all the city data into one spreadsheet, thus we have already _prepared_ the data a bit. Next, let's take a closer look at the data.
What do you notice about this data? You already saw that there is a mix of strings, numbers, blanks and strange values that you need to make sense of.
What question can you ask of this data, using a Regression technique? What about "Predict the price of a pumpkin for sale during a given month". Looking again at the data, there are some changes you need to make to create the data structure necessary for the task.
Let's use [Pandas](https://pandas.pydata.org/) (the name stands for `Python Data Analysis`), a tool that is very useful for shaping data, to analyze and prepare this pumpkin data.
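As a rough sketch of the first steps with Pandas, the snippet below reads CSV text into a dataframe and counts the missing values in each column. The inline rows are made-up stand-ins for `US-pumpkins.csv` so that the snippet runs on its own; with the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for ../data/US-pumpkins.csv.
# Only column names mentioned in this lesson are used; the values are invented.
sample_csv = io.StringIO(
    "Package,Date,Low Price,High Price,Unit of Sale\n"
    "24 inch bins,9/24/16,160,160,PER BIN\n"
    "1 1/9 bushel cartons,9/24/16,15,15,\n"
    "1/2 bushel cartons,10/1/16,18,20,\n"
)

pumpkins = pd.read_csv(sample_csv)

# Show the first rows to get a feel for the mix of strings, numbers and blanks.
print(pumpkins.head())

# Count the blank (NaN) cells in every column.
missing_per_column = pumpkins.isnull().sum()
print(missing_per_column)
```

A column with a high missing count is a candidate for dropping, while columns with none (here the dates and prices) can be used directly.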
```python
# Drop every column that is not listed in new_columns
pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
```
### Second, determine average price of pumpkin
Think about how to determine the average price of a pumpkin in a given month. What columns would you pick for this task? Hint: you'll need 3 columns.
Solution: take the average of the `Low Price` and `High Price` columns to populate the new Price column, and convert the Date column to only show the month. Fortunately, according to the check above, there is no missing data for dates or prices.
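The solution described above could be sketched as follows. The sample rows are invented, and the `%m/%d/%y` date format is an assumption about how the `Date` column is written; adjust it to match the real file.

```python
import pandas as pd

# Hypothetical rows; the real lesson works on the full pumpkins dataframe.
pumpkins = pd.DataFrame({
    "Package": ["24 inch bins", "1 1/9 bushel cartons", "1/2 bushel cartons"],
    "Date": ["9/24/16", "9/24/16", "10/1/16"],
    "Low Price": [160.0, 15.0, 18.0],
    "High Price": [160.0, 15.0, 20.0],
})

# Average the low and high price to get a single Price value per row.
price = (pumpkins["Low Price"] + pumpkins["High Price"]) / 2

# Parse the dates and keep only the month number (assumed month/day/year format).
month = pd.to_datetime(pumpkins["Date"], format="%m/%d/%y").dt.month

# Assemble a fresh dataframe holding only what the regression task needs.
new_pumpkins = pd.DataFrame({
    "Month": month,
    "Package": pumpkins["Package"],
    "Low Price": pumpkins["Low Price"],
    "High Price": pumpkins["High Price"],
    "Price": price,
})

print(new_pumpkins)
```

The three source columns used are `Low Price`, `High Price` and `Date`, which matches the hint.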
Printing out your dataframe will show you a clean, tidy dataset on which you can build your new regression model.
### But wait! There's something odd here
If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in '1 1/9 bushel' measures, and some in '1/2 bushel' measures, some per pumpkin, some per pound, and some in big boxes with varying widths.
Digging into the original data, it's interesting that anything with `Unit of Sale` equalling 'EACH' or 'PER BIN' also has a `Package` type of per inch, per bin, or 'each'. Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string 'bushel' in their `Package` column.
1. Add a filter at the top of the file, under the initial .csv import:
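A minimal sketch of such a filter, run here on a few made-up rows rather than the real file. The sample values and the choice of a plain-text match (`regex=False`) are assumptions.

```python
import pandas as pd

# Hypothetical rows covering several kinds of Package values.
pumpkins = pd.DataFrame({
    "Package": ["24 inch bins", "1 1/9 bushel cartons", "1/2 bushel cartons", "each"],
    "Low Price": [160.0, 15.0, 18.0, 3.0],
})

# Boolean mask: True where the Package text contains the word 'bushel'.
bushel_mask = pumpkins["Package"].str.contains("bushel", case=True, regex=False)

# Keep only the rows sold by the bushel.
pumpkins = pumpkins[bushel_mask]

print(pumpkins)
```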
If you print the data now, you can see that you are only getting the 415 or so rows of data containing pumpkins by the bushel.
### But wait! There's one more thing to do
Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, so do some math to standardize it.
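One way to do that math is to divide each price by the fraction of a bushel named in its `Package` value. This is a sketch on invented rows; it assumes only the '1 1/9' and '1/2' bushel sizes mentioned earlier need converting.

```python
import pandas as pd

# Hypothetical rows: prices are per package, not yet per bushel.
new_pumpkins = pd.DataFrame({
    "Package": ["1 1/9 bushel cartons", "1/2 bushel cartons", "bushel cartons"],
    "Price": [15.0, 19.0, 20.0],
})

# Work out which rows use which bushel size before changing any prices.
is_one_and_a_ninth = new_pumpkins["Package"].str.contains("1 1/9", regex=False)
is_half = new_pumpkins["Package"].str.contains("1/2", regex=False)

# A 1 1/9 bushel package holds more than one bushel, so its per-bushel price is lower.
new_pumpkins.loc[is_one_and_a_ninth, "Price"] = (
    new_pumpkins.loc[is_one_and_a_ninth, "Price"] / (1 + 1 / 9)
)

# A 1/2 bushel package holds half a bushel, so its per-bushel price doubles.
new_pumpkins.loc[is_half, "Price"] = new_pumpkins.loc[is_half, "Price"] / (1 / 2)

print(new_pumpkins)
```

Rows whose package is already exactly one bushel are left untouched.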
✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!
Now, you can analyze the pricing per unit based on their bushel measurement. If you print out the data one more time, you can see how it's standardized.
✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.
Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.
Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.
One data visualization library that works well in Jupyter notebooks is [Matplotlib](https://matplotlib.org/) (which you also saw in the previous lesson).
> Get more experience with data visualization in [these tutorials](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-15963-cxa).
Try to create some basic plots to display the new dataframe you just created. What would a basic line plot show?
1. Import Matplotlib at the top of the file, under the Pandas import:
```python
import matplotlib.pyplot as plt
```
1. Rerun the entire notebook to refresh.
1. At the bottom of the notebook, add a cell to plot the data as a box:
```python
price = new_pumpkins.Price
# ...
plt.show()
```
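The middle of that cell is not shown here. As a self-contained sketch, assuming the plot is a scatter of price against month (which is what the figure below depicts), it might look like this. The sample values and axis labels are invented, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-bushel prices with their month numbers.
new_pumpkins = pd.DataFrame({
    "Month": [9, 9, 10, 10, 11],
    "Price": [13.5, 38.0, 15.0, 36.0, 14.0],
})

price = new_pumpkins.Price
month = new_pumpkins.Month

# One point per row: price on the x axis, month on the y axis.
fig, ax = plt.subplots()
points = ax.scatter(price, month)
ax.set_xlabel("Price")
ax.set_ylabel("Month")
plt.show()
```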
![A scatterplot showing price to month relationship](./images/scatterplot.png)
Is this a useful plot? Does anything about it surprise you?
It's not particularly useful, as all it does is display your data as a spread of points in a given month.
### Make it useful
To get charts to display useful data, you usually need to group the data somehow. Let's try creating a plot where one axis shows the months and the bars show how price is distributed across them.
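One possible grouping is the mean price per month, drawn as a bar chart. This is a sketch on invented values; the choice of the mean as the aggregate and the axis label are assumptions, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-bushel prices with their month numbers.
new_pumpkins = pd.DataFrame({
    "Month": [9, 9, 10, 10, 11],
    "Price": [13.5, 38.0, 15.0, 36.0, 14.0],
})

# Collapse the rows to one value per month: the average price.
mean_price_by_month = new_pumpkins.groupby("Month")["Price"].mean()

# Draw one bar per month.
ax = mean_price_by_month.plot(kind="bar")
ax.set_ylabel("Pumpkin Price")
plt.show()
```

With `kind="bar"` the months run along the horizontal axis; `kind="barh"` would place them on the vertical axis instead.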
![A bar chart showing price to month relationship](./images/barchart.png)
This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
Take a look at the many ways to visualize data. Make a list of the various libraries available and note which are best for given types of tasks, for example 2D visualizations vs. 3D visualizations. What do you discover?