Update README.zh-cn.md

pull/123/head
feiyun0112 4 years ago committed by GitHub
parent b028149817
commit 773edc1f74

@ -1,105 +1,104 @@
# Get started with Python and Scikit-learn for regression models # 开始使用Python和Scikit学习回归模型
![Summary of regressions in a sketchnote](../../sketchnotes/ml-regression.png) ![回归](../../sketchnotes/ml-regression.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac) > 作者[Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/9/) ## [课前测](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/9/)
## Introduction ## 介绍
In these four lessons, you will discover how to build regression models. We will discuss what these are for shortly. But before you do anything, make sure you have the right tools in place to start the process! 在这四节课中,你将了解如何构建回归模型。我们将很快讨论这些是什么。但在你做任何事情之前,请确保你有合适的工具来开始这个过程!
In this lesson, you will learn how to: 在本课中,你将学习如何:
- Configure your computer for local machine learning tasks. - 为本地机器学习任务配置你的计算机。
- Work with Jupyter notebooks. - 使用Jupyter notebooks。
- Use Scikit-learn, including installation. - 使用Scikit-learn包括安装。
- Explore linear regression with a hands-on exercise. - 通过动手练习探索线性回归。
## Installations and configurations ## 安装和配置
[![Using Python with Visual Studio Code](https://img.youtube.com/vi/7EXd4_ttIuw/0.jpg)](https://youtu.be/7EXd4_ttIuw "Using Python with Visual Studio Code") [![在 Visual Studio Code中使用 Python](https://img.youtube.com/vi/7EXd4_ttIuw/0.jpg)](https://youtu.be/7EXd4_ttIuw "在 Visual Studio Code中使用 Python")
> 🎥 Click the image above for a video: using Python within VS Code. > 🎥 单击上图观看视频在VS Code中使用Python。
1. **Install Python**. Ensure that [Python](https://www.python.org/downloads/) is installed on your computer. You will use Python for many data science and machine learning tasks. Most computer systems already include a Python installation. There are useful [Python Coding Packs](https://code.visualstudio.com/learn/educators/installers?WT.mc_id=academic-15963-cxa) available as well, to ease the setup for some users. 1. **安装 Python**。确保你的计算机上安装了[Python](https://www.python.org/downloads/)。你将在许多数据科学和机器学习任务中使用 Python。大多数计算机系统已经安装了Python。也有一些有用的[Python编码包](https://code.visualstudio.com/learn/educators/installers?WT.mc_id=academic-15963-cxa)可用于简化某些用户的设置。
Some usages of Python, however, require one version of the software, whereas others require a different version. For this reason, it's useful to work within a [virtual environment](https://docs.python.org/3/library/venv.html). 然而Python的某些用法需要一个版本的软件而其他用法则需要另一个不同的版本。 因此,在[虚拟环境](https://docs.python.org/3/library/venv.html)中工作很有用。
2. **Install Visual Studio Code**. Make sure you have Visual Studio Code installed on your computer. Follow these instructions to [install Visual Studio Code](https://code.visualstudio.com/) for the basic installation. You are going to use Python in Visual Studio Code in this course, so you might want to brush up on how to [configure Visual Studio Code](https://docs.microsoft.com/learn/modules/python-install-vscode?WT.mc_id=academic-15963-cxa) for Python development. 2. **安装 Visual Studio Code**。确保你的计算机上安装了Visual Studio Code。按照这些说明[安装 Visual Studio Code](https://code.visualstudio.com/)进行基本安装。在本课程中你将在Visual Studio Code中使用Python因此你可能想复习如何[配置 Visual Studio Code](https://docs.microsoft.com/learn/modules/python-install-vscode?WT.mc_id=academic-15963-cxa)用于Python开发。
> Get comfortable with Python by working through this collection of [Learn modules](https://docs.microsoft.com/users/jenlooper-2911/collections/mp1pagggd5qrq7?WT.mc_id=academic-15963-cxa) > 通过学习这一系列的 [学习模块](https://docs.microsoft.com/users/jenlooper-2911/collections/mp1pagggd5qrq7?WT.mc_id=academic-15963-cxa)熟悉Python
3. **Install Scikit-learn**, by following [these instructions](https://scikit-learn.org/stable/install.html). Since you need to ensure that you use Python 3, it's recommended that you use a virtual environment. Note, if you are installing this library on an M1 Mac, there are special instructions on the page linked above. 3. **安装Scikit-learn**,按照[这些说明](https://scikit-learn.org/stable/install.html)操作。由于你需要确保使用Python 3,因此建议你使用虚拟环境。注意,如果你是在M1 Mac上安装这个库,在上面链接的页面上有特别的说明。
1. **Install Jupyter Notebook**. You will need to [install the Jupyter package](https://pypi.org/project/jupyter/). 4. **安装Jupyter Notebook**。你需要[安装Jupyter包](https://pypi.org/project/jupyter/)。
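The installation steps above can be sanity-checked from Python itself. The following is a minimal sketch, not part of the original lesson: the helper names `in_virtual_environment` and `is_installed` are made up for illustration, and the package list is an assumption based on the libraries this lesson imports later.

```python
import importlib.util
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(package_name: str) -> bool:
    """Return True when a top-level package can be found without importing it."""
    return importlib.util.find_spec(package_name) is not None


print("Python version:", sys.version.split()[0])
print("Inside a virtual environment:", in_virtual_environment())

# Hypothetical checklist: the import names of the libraries used in this lesson.
for name in ("sklearn", "numpy", "matplotlib"):
    print(f"{name} installed:", is_installed(name))
```

If any package is reported as missing, install it inside the active virtual environment before opening the notebook.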
## Your ML authoring environment ## 你的ML工作环境
You are going to use **notebooks** to develop your Python code and create machine learning models. This type of file is a common tool for data scientists, and they can be identified by their suffix or extension `.ipynb`. 你将使用**notebooks**开发Python代码并创建机器学习模型。这种类型的文件是数据科学家的常用工具可以通过后缀或扩展名`.ipynb`来识别它们。
Notebooks are an interactive environment that allows the developer to write code, add notes, and write documentation around that code, which is quite helpful for experimental or research-oriented projects. Notebooks是一个交互式环境,允许开发人员编写代码并添加注释并围绕代码编写文档,这对于实验或面向研究的项目非常有帮助。
### Exercise - work with a notebook ### 练习 - 使用notebook
In this folder, you will find the file _notebook.ipynb_.
1. Open _notebook.ipynb_ in Visual Studio Code. 1. 在Visual Studio Code中打开_notebook.ipynb_。
A Jupyter server will start, running Python 3+. You will find areas of the notebook that can be `run`: these are pieces of code. You can run a code block by selecting the icon that looks like a play button. Jupyter服务器将以python3+启动。你会发现notebook可以“运行”的区域、代码块。你可以通过选择看起来像播放按钮的图标来运行代码块。
1. Select the `md` icon and add a bit of markdown with the following text: **# Welcome to your notebook**. 2. 选择`md`图标并添加一点markdown,输入文字**# Welcome to your notebook**。
Next, add some Python code. 接下来,添加一些Python代码。
1. Type **print('hello notebook')** in the code block. 1. 在代码块中输入**print("hello notebook")**。
1. Select the arrow to run the code. 2. 选择箭头运行代码。
You should see the printed statement: 你应该看到打印的语句:
```output ```output
hello notebook hello notebook
``` ```
![VS Code with a notebook open](images/notebook.png) ![打开notebook的VS Code](../images/notebook.png)
You can interleave your code with comments to self-document the notebook. 你可以为你的代码添加注释以便notebook可以自描述。
Think for a minute how different a web developer's working environment is versus that of a data scientist. 想一想web开发人员的工作环境与数据科学家的工作环境有多大的不同。
## Up and running with Scikit-learn ## 启动并运行Scikit-learn
Now that Python is set up in your local environment, and you are comfortable with Jupyter notebooks, let's get equally comfortable with Scikit-learn (pronounce it `sci` as in `science`). Scikit-learn provides an [extensive API](https://scikit-learn.org/stable/modules/classes.html#api-ref) to help you perform ML tasks. 现在Python已在你的本地环境中设置好并且你对Jupyter notebook感到满意让我们同样熟悉Scikit-learn在“science”中发音为“sci”。 Scikit-learn提供了[大量的API](https://scikit-learn.org/stable/modules/classes.html#api-ref)来帮助你执行ML任务。
According to their [website](https://scikit-learn.org/stable/getting_started.html), "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities." 根据他们的[网站](https://scikit-learn.org/stable/getting_started.html)“Scikit-learn是一个开源机器学习库支持有监督和无监督学习。它还提供了各种模型拟合工具、数据预处理、模型选择和评估以及许多其他实用程序。”
In this course, you will use Scikit-learn and other tools to build machine learning models to perform what we call 'traditional machine learning' tasks. We have deliberately avoided neural networks and deep learning, as they are better covered in our forthcoming 'AI for Beginners' curriculum. 在本课程中你将使用Scikit-learn和其他工具来构建机器学习模型以执行我们所谓的“传统机器学习”任务。我们特意避免了神经网络和深度学习因为它们在我们即将推出的“面向初学者的人工智能”课程中得到了更好的介绍。
Scikit-learn makes it straightforward to build models and evaluate them for use. It is primarily focused on using numeric data and contains several ready-made datasets for use as learning tools. It also includes pre-built models for students to try. Let's explore the process of loading prepackaged data and using a built-in estimator to train a first ML model with Scikit-learn on some basic data. Scikit-learn使构建模型和评估它们的使用变得简单。它主要侧重于使用数字数据,并包含几个现成的数据集用作学习工具。它还包括供学生尝试的预建模型。让我们探索加载预先打包的数据,并使用内置的estimator在一些基本数据上用Scikit-learn训练第一个ML模型的过程。
## Exercise - your first Scikit-learn notebook ## 练习 - 你的第一个Scikit-learn notebook
> This tutorial was inspired by the [linear regression example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) on Scikit-learn's web site. > 本教程的灵感来自Scikit-learn网站上的[线性回归示例](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py)。
In the _notebook.ipynb_ file associated with this lesson, clear out all the cells by pressing the 'trash can' icon. 在与本课程相关的_notebook.ipynb_文件中通过点击“垃圾桶”图标清除所有单元格。
In this section, you will work with a small dataset about diabetes that is built into Scikit-learn for learning purposes. Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials. 在本节中你将使用一个关于糖尿病的小数据集该数据集内置于Scikit-learn中以用于学习目的。想象一下你想为糖尿病患者测试一种治疗方法。机器学习模型可能会帮助你根据变量组合确定哪些患者对治疗反应更好。即使是非常基本的回归模型在可视化时也可能会显示有助于组织理论临床试验的变量信息。
There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use linear regression, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use logistic regression. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate. 回归方法有很多种,你选择哪一种取决于你正在寻找的答案。如果你想预测给定年龄的人的可能身高,你可以使用线性回归,因为你正在寻找**数值**。如果你有兴趣了解某种菜肴是否应被视为素食主义者,那么你正在寻找**类别分配**,以便使用逻辑回归。稍后你将了解有关逻辑回归的更多信息。想一想你可以对数据提出的一些问题,以及这些方法中的哪一个更合适。
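To make the distinction between a **numeric value** and a **category assignment** concrete, here is a small sketch. The ages, heights, and the "tall" label are made-up toy values chosen only for illustration; they do not come from the lesson or from any real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up toy data, for illustration only.
ages = np.array([[2], [4], [6], [8], [10], [12]])
heights_cm = np.array([86.0, 102.0, 115.0, 128.0, 138.0, 149.0])

# Hypothetical category derived from the toy heights: 1 = "tall", 0 = "not tall".
is_tall = (heights_cm > 120).astype(int)

# Linear regression answers "how much?" with a continuous number.
height_model = LinearRegression().fit(ages, heights_cm)
predicted_height = height_model.predict([[7]])[0]

# Logistic regression answers "which category?" with a class label.
category_model = LogisticRegression().fit(ages, is_tall)
predicted_class = category_model.predict([[7]])[0]

print("Predicted height (a number):", predicted_height)
print("Predicted category (a label):", predicted_class)
```

The first model returns a floating-point estimate, while the second returns one of the labels it saw during training.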
Let's get started on this task. 让我们开始这项任务。
### Import libraries ### 导入库
For this task we will import some libraries: 对于此任务,我们将导入一些库:
- **matplotlib**. It's a useful [graphing tool](https://matplotlib.org/) and we will use it to create a line plot. - **matplotlib**。这是一个有用的[绘图工具](https://matplotlib.org/),我们将使用它来创建线图。
- **numpy**. [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html) is a useful library for handling numeric data in Python. - **numpy**。 [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html)是一个有用的库用于在Python中处理数字数据。
- **sklearn**. This is the Scikit-learn library. - **sklearn**。这是Scikit-learn库。
Import some libraries to help with your tasks. 导入一些库来帮助你完成任务。
1. Add imports by typing the following code: 1. 通过输入以下代码添加导入:
```python ```python
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
@ -107,26 +106,26 @@ Import some libraries to help with your tasks.
from sklearn import datasets, linear_model, model_selection from sklearn import datasets, linear_model, model_selection
``` ```
Above you are importing `matplotlib`, `numpy` and you are importing `datasets`, `linear_model` and `model_selection` from `sklearn`. `model_selection` is used for splitting data into training and test sets. 在上面的代码中,你正在导入`matplotlib`、`numpy`,你正在从`sklearn`导入`datasets`、`linear_model`和`model_selection`。 `model_selection`用于将数据拆分为训练集和测试集。
### The diabetes dataset ### 糖尿病数据集
The built-in [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) includes 442 samples of data around diabetes, with 10 feature variables, some of which include: 内置的[糖尿病数据集](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset)包含442个围绕糖尿病的数据样本具有10个特征变量其中包括
age: age in years age:岁数
bmi: body mass index bmi:体重指数
bp: average blood pressure bp:平均血压
s1 tc: total serum cholesterol s1 tc:总血清胆固醇
This dataset includes the concept of 'sex' as a feature variable important to research around diabetes. Many medical datasets include this type of binary classification. Think a bit about how categorizations such as this might exclude certain parts of a population from treatments. 该数据集包括“性别”的概念,作为对糖尿病研究很重要的特征变量。许多医学数据集包括这种类型的二元分类。想一想诸如此类的分类如何将人群的某些部分排除在治疗之外。
Now, load up the X and y data. 现在加载X和y数据。
> 🎓 Remember, this is supervised learning, and we need a named 'y' target. > 🎓 请记住这是监督学习我们需要一个命名为“y”的目标。
In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The input `return_X_y=True` signals that `X` will be a data matrix, and `y` will be the regression target. 在新的代码单元中,通过调用`load_diabetes()`加载糖尿病数据集。输入`return_X_y=True`表示`X`将是一个数据矩阵,而`y`将是回归目标。
1. Add some print commands to show the shape of the data matrix and its first element: 1. 添加一些打印命令来显示数据矩阵的形状及其第一个元素:
```python ```python
X, y = datasets.load_diabetes(return_X_y=True) X, y = datasets.load_diabetes(return_X_y=True)
@ -134,9 +133,9 @@ In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The
print(X[0]) print(X[0])
``` ```
What you get back is a tuple with two elements. You assign these two values to `X` and `y` respectively. Learn more [about tuples](https://wikipedia.org/wiki/Tuple). 作为响应返回的是一个包含两个元素的元组。你正在做的是将元组的两个值分别分配给`X`和`y`。了解更多 [关于元组](https://wikipedia.org/wiki/Tuple)。
You can see that this data has 442 items shaped in arrays of 10 elements: 你可以看到这个数据有442个项目组成了10个元素的数组
```text ```text
(442, 10) (442, 10)
@ -144,38 +143,38 @@ In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The
-0.04340085 -0.00259226 0.01990842 -0.01764613] -0.04340085 -0.00259226 0.01990842 -0.01764613]
``` ```
Think a bit about the relationship between the data and the regression target. Linear regression predicts relationships between feature X and target variable y. Can you find the [target](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) for the diabetes dataset in the documentation? What is this dataset demonstrating, given that target? 稍微思考一下数据和回归目标之间的关系。线性回归预测特征X和目标变量y之间的关系。你能在文档中找到糖尿病数据集的[目标](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset)吗?鉴于该目标,该数据集展示了什么?
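One way to explore the relationship between the data and the target is to load the dataset a second time without `return_X_y=True`, which returns an object that also carries the feature names. This is a sketch for exploration only; the variable name `diabetes` is chosen here for illustration.

```python
from sklearn import datasets

# The tuple form used in the lesson: data matrix and regression target.
X, y = datasets.load_diabetes(return_X_y=True)

# Without return_X_y, the loader returns a Bunch object with extra metadata.
diabetes = datasets.load_diabetes()

print("Feature names:", diabetes.feature_names)
print("Data shape:", X.shape)
print("Target shape:", y.shape)
print("First five target values:", y[:5])
```

Each row of `X` has one matching entry in `y`, which is what makes this a supervised learning problem.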
2. Next, select a portion of this dataset to plot by arranging it into a new array using numpy's `newaxis` index object. Here you keep only the column at index 2. We are going to use linear regression to generate a line between values in this data, according to a pattern it determines. 2. 接下来,通过使用numpy的`newaxis`索引对象将其排列到一个新数组中来选择要绘制的该数据集的一部分。这里只保留索引为2的列。我们将使用线性回归根据它确定的模式在此数据中的值之间生成一条线。
```python ```python
X = X[:, np.newaxis, 2] X = X[:, np.newaxis, 2]
``` ```
At any time, print out the data to check its shape. 随时打印数据以检查其形状。
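Checking the shape is especially useful for the `newaxis` step, because Scikit-learn estimators expect a two-dimensional feature matrix even when there is only one feature. The sketch below compares three ways of selecting the same column; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)

# Plain integer indexing drops the column dimension: shape (442,).
column_1d = X[:, 2]

# np.newaxis re-inserts a dimension of length 1: shape (442, 1).
column_2d = X[:, np.newaxis, 2]

# Indexing with a list keeps the dimension and gives the same result.
column_2d_alt = X[:, [2]]

print(column_1d.shape)
print(column_2d.shape)
print(np.array_equal(column_2d, column_2d_alt))
```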
3. Now that you have data ready to be plotted, you can see if a machine can help determine a linear trend in this dataset. To do this, you need to split both the data (X) and the target (y) into test and training sets. Scikit-learn has a straightforward way to do this: `train_test_split()` holds out a given fraction of the samples (here 33%) as the test set. 3. 现在你已准备好绘制数据,你可以查看机器是否可以帮助确定此数据集中的线性趋势。为此,你需要将数据(X)和目标(y)拆分为测试集和训练集。Scikit-learn有一个简单的方法来做到这一点:`train_test_split()`会留出给定比例的样本(此处为33%)作为测试集。
```python ```python
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33) X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33)
``` ```
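The split above is random, so the rows that land in each set change from run to run. If you want a repeatable split, `train_test_split` accepts a `random_state` argument. The sketch below adds `random_state=0`, which is an assumption made here for reproducibility and is not used in the lesson's own code.

```python
import numpy as np
from sklearn import datasets, model_selection

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]

# random_state=0 is an illustrative choice; any fixed integer makes the split repeatable.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Splitting again with the same seed yields exactly the same rows.
X_train_again, X_test_again, _, _ = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0
)

print("Training samples:", len(X_train))
print("Test samples:", len(X_test))
print("Same split both times:", np.array_equal(X_test, X_test_again))
```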
4. Now you are ready to train your model! Load up the linear regression model and train it with your X and y training sets using `model.fit()`: 4. 现在你已准备好训练你的模型!加载线性回归模型并使用`model.fit()`使用X和y训练集对其进行训练
```python ```python
model = linear_model.LinearRegression() model = linear_model.LinearRegression()
model.fit(X_train, y_train) model.fit(X_train, y_train)
``` ```
`model.fit()` is a function you'll see in many ML libraries such as TensorFlow `model.fit()`是一个你会在许多机器学习库(例如 TensorFlow中看到的函数
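After fitting, a `LinearRegression` model stores the line it learned in its `coef_` and `intercept_` attributes. The following sketch reads them back and reproduces the model's predictions by hand; the fixed `random_state=0` is an assumption added for repeatability.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0  # seed chosen only for illustration
)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# With a single feature, the model is the line y = slope * x + intercept.
slope = model.coef_[0]
intercept = model.intercept_
manual_predictions = slope * X_test[:, 0] + intercept

print("Slope:", slope)
print("Intercept:", intercept)
print("Matches model.predict():", np.allclose(manual_predictions, model.predict(X_test)))
```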
5. Then, create predictions for the test data using the function `predict()`. These predictions will be used to draw the regression line through the data. 5. 然后,使用函数`predict()`对测试数据创建预测。这些预测将用于绘制穿过数据的回归线。
```python ```python
y_pred = model.predict(X_test) y_pred = model.predict(X_test)
``` ```
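Before plotting, it can help to summarize how close the predictions are to the held-out targets. This goes slightly beyond the lesson's steps: the sketch uses `mean_squared_error` and `r2_score` from `sklearn.metrics`, and again assumes a fixed `random_state=0` so the numbers are repeatable.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection
from sklearn.metrics import mean_squared_error, r2_score

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0  # seed chosen only for illustration
)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Mean squared error: average squared gap between prediction and target.
mse = mean_squared_error(y_test, y_pred)

# R^2: 1.0 is a perfect fit, 0.0 is no better than always predicting the mean.
r2 = r2_score(y_test, y_pred)

print("Mean squared error:", mse)
print("R^2 score:", r2)
```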
6. Now it's time to show the data in a plot. Matplotlib is a very useful tool for this task. Create a scatterplot of all the X and y test data, and use the predictions to draw the line that best fits the data. 6. 现在是时候在图中显示数据了。Matplotlib是完成此任务的非常有用的工具。创建所有X和y测试数据的散点图,并使用预测绘制最贴合数据的直线。
```python ```python
plt.scatter(X_test, y_test, color='black') plt.scatter(X_test, y_test, color='black')
@ -183,24 +182,24 @@ In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The
plt.show() plt.show()
``` ```
![a scatterplot showing datapoints around diabetes](./images/scatterplot.png) ![显示糖尿病周围数据点的散点图](../images/scatterplot.png)
Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model. 想一想这里发生了什么。一条直线穿过许多小数据点但它到底在做什么你能看到你应该如何使用这条线来预测一个新的、未见过的数据点对应的y轴值吗尝试用语言描述该模型的实际用途。
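One way to put the practical use into code is to ask the model for predictions at feature values it was not trained on. The sketch below predicts the target at the smallest and largest feature values in the test set, which are the two ends of the plotted line. The `random_state=0` seed and the variable names are assumptions for illustration. Note that the features in this dataset are already centered and scaled, so new inputs must be on that same scale.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0  # seed chosen only for illustration
)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# Two query points on the same (scaled) axis as the training feature.
x_low, x_high = X_test.min(), X_test.max()
endpoints = np.array([[x_low], [x_high]])

# predict() expects a 2D array: one row per query, one column per feature.
low_pred, high_pred = model.predict(endpoints)

print(f"Feature value {x_low:.4f} -> predicted target {low_pred:.1f}")
print(f"Feature value {x_high:.4f} -> predicted target {high_pred:.1f}")
```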
Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot! 恭喜,你构建了第一个线性回归模型,使用它创建了预测,并将其显示在绘图中!
--- ---
## 🚀Challenge ## 🚀挑战
Plot a different variable from this dataset. Hint: edit this line: `X = X[:, np.newaxis, 2]`. Given this dataset's target, what are you able to discover about the progression of diabetes as a disease? 从这个数据集中绘制一个不同的变量。提示:编辑这一行:`X = X[:, np.newaxis, 2]`。鉴于此数据集的目标,你能够发现糖尿病作为一种疾病的进展情况吗?
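As a starting point for the challenge, the sketch below fits one single-feature model per column and reports its R^2 score on the test set, so you can decide which variables are worth plotting. It is one possible approach rather than the lesson's solution, and the fixed `random_state=0` is an assumption so that every feature is scored on the same split.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

diabetes = datasets.load_diabetes()
X_all, y = diabetes.data, diabetes.target

scores = {}
for column_index, feature_name in enumerate(diabetes.feature_names):
    # Same selection as in the lesson, but with a varying column index.
    X_one = X_all[:, np.newaxis, column_index]
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X_one, y, test_size=0.33, random_state=0  # seed chosen only for illustration
    )
    model = linear_model.LinearRegression().fit(X_train, y_train)
    # score() returns the R^2 of the predictions on the given data.
    scores[feature_name] = model.score(X_test, y_test)

for feature_name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature_name}: R^2 = {score:.3f}")
```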
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/10/) ## [课后测](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/10/)
## Review & Self Study ## 复习与自学
In this tutorial, you worked with simple (single-feature) linear regression, rather than multiple linear regression. Read a little about the differences between these methods, or take a look at [this video](https://www.coursera.org/lecture/quantifying-relationships-regression-models/linear-vs-nonlinear-categorical-variables-ai2Ef). 在本教程中,你使用了简单(单特征)线性回归,而不是多元线性回归。阅读一些关于这些方法之间差异的信息,或查看[此视频](https://www.coursera.org/lecture/quantifying-relationships-regression-models/linear-vs-nonlinear-categorical-variables-ai2Ef)。
Read more about the concept of regression and think about what kinds of questions can be answered by this technique. Take this [tutorial](https://docs.microsoft.com/learn/modules/train-evaluate-regression-models?WT.mc_id=academic-15963-cxa) to deepen your understanding. 阅读有关回归概念的更多信息,并思考这种技术可以回答哪些类型的问题。用这个[教程](https://docs.microsoft.com/learn/modules/train-evaluate-regression-models?WT.mc_id=academic-15963-cxa)加深你的理解。
## Assignment ## 任务
[A different dataset](assignment.md) [不同的数据集](../assignment.md)
