{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "rQ8UhzFpgRra" }, "source": [ "# 数据准备\n", "\n", "[原始笔记本来源于 *数据科学:Python和机器学习工作室中的数据科学机器学习入门,作者 Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n", "\n", "## 探索 `DataFrame` 信息\n", "\n", "> **学习目标:** 在本小节结束时,您应该能够熟练地查找存储在 pandas DataFrame 中的数据的一般信息。\n", "\n", "当您将数据加载到 pandas 中时,它很可能会以 `DataFrame` 的形式存在。然而,如果您的 `DataFrame` 中的数据集有60,000行和400列,您该如何开始了解您正在处理的内容呢?幸运的是,pandas 提供了一些方便的工具,可以快速查看 `DataFrame` 的整体信息,以及前几行和后几行数据。\n", "\n", "为了探索这些功能,我们将导入 Python 的 scikit-learn 库,并使用一个每位数据科学家都见过数百次的经典数据集:英国生物学家 Ronald Fisher 在其1936年的论文《多重测量在分类问题中的应用》中使用的 *Iris* 数据集:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "id": "hB1RofhdgRrp", "trusted": false }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])" ] }, { "cell_type": "markdown", "metadata": { "id": "AGA0A_Y8hMdz" }, "source": [ "### `DataFrame.shape`\n", "我们已经将鸢尾花数据集加载到变量 `iris_df` 中。在深入分析数据之前,了解我们拥有的数据点数量以及数据集的整体规模是很有价值的。查看我们正在处理的数据量是很有帮助的。\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LOe5jQohhulf", "outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9" }, "outputs": [ { "data": { "text/plain": [ "(150, 4)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "smE7AGzOhxk2" }, "source": [ "所以,我们正在处理包含150行和4列的数据。每一行代表一个数据点,每一列代表与数据框相关的一个特征。基本上,这里有150个数据点,每个数据点包含4个特征。\n", "\n", "`shape`在这里是数据框的一个属性,而不是一个函数,这就是为什么它的末尾没有一对括号。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "d3AZKs0PinGP" }, "source": [ "### `DataFrame.columns`\n", "现在让我们来看这4列数据。它们分别具体代表什么呢?`columns` 属性会告诉我们数据框中列的名称。\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YPGh_ziji-CY", "outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0" }, "outputs": [ { "data": { "text/plain": [ "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n", " 'petal width (cm)'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.columns" ] }, { "cell_type": "markdown", "metadata": { "id": "TsobcU_VjCC_" }, "source": [ "正如我们所见,有四(4)列。`columns` 属性告诉我们列的名称,除此之外基本没有其他信息。当我们想要识别数据集包含的特征时,这个属性变得重要。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "2UTlvkjmgRrs" }, "source": [ "### `DataFrame.info`\n", "通过 `shape` 属性提供的数据量以及通过 `columns` 属性提供的特征或列名,可以让我们对数据集有一些初步了解。现在,我们希望更深入地探索数据集。`DataFrame.info()` 函数在这方面非常有用。\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dHHRyG0_gRrt", "outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5", "trusted": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 150 entries, 0 to 149\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 sepal length (cm) 150 non-null float64\n", " 1 sepal width (cm) 150 non-null float64\n", " 2 petal length (cm) 150 non-null float64\n", " 3 petal width (cm) 150 non-null float64\n", "dtypes: float64(4)\n", "memory usage: 4.8 KB\n" ] } ], "source": [ "iris_df.info()" ] }, { "cell_type": 
"markdown", "metadata": { "id": "1XgVMpvigRru" }, "source": [ "从这里,我们可以得出一些观察:\n", "\n", "1. 每列的数据类型:在这个数据集中,所有数据都存储为64位浮点数。\n", "2. 非空值的数量:处理空值是数据准备中的一个重要步骤。这将在后续的笔记本中处理。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "IYlyxbpWFEF4" }, "source": [ "### DataFrame.describe()\n", "假设我们的数据集中有大量的数值数据。可以对每一列单独进行单变量统计计算,例如均值、中位数、四分位数等。`DataFrame.describe()` 函数为数据集的数值列提供统计摘要。\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "tWV-CMstFIRA", "outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
count150.000000150.000000150.000000150.000000
mean5.8433333.0573333.7580001.199333
std0.8280660.4358661.7652980.762238
min4.3000002.0000001.0000000.100000
25%5.1000002.8000001.6000000.300000
50%5.8000003.0000004.3500001.300000
75%6.4000003.3000005.1000001.800000
max7.9000004.4000006.9000002.500000
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "count 150.000000 150.000000 150.000000 150.000000\n", "mean 5.843333 3.057333 3.758000 1.199333\n", "std 0.828066 0.435866 1.765298 0.762238\n", "min 4.300000 2.000000 1.000000 0.100000\n", "25% 5.100000 2.800000 1.600000 0.300000\n", "50% 5.800000 3.000000 4.350000 1.300000\n", "75% 6.400000 3.300000 5.100000 1.800000\n", "max 7.900000 4.400000 6.900000 2.500000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.describe()" ] }, { "cell_type": "markdown", "metadata": { "id": "zjjtW5hPGMuM" }, "source": [ "上面的输出显示了每列的总数据点数、均值、标准差、最小值、下四分位数(25%)、中位数(50%)、上四分位数(75%)和最大值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-lviAu99gRrv" }, "source": [ "### `DataFrame.head`\n", "通过以上所有函数和属性,我们已经对数据集有了一个总体的了解。我们知道数据点的数量、特征的数量、每个特征的数据类型以及每个特征的非空值数量。\n", "\n", "现在是时候查看数据本身了。让我们看看我们的 `DataFrame` 的前几行(前几个数据点)是什么样子的:\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DZMJZh0OgRrw", "outputId": "d9393ee5-c106-4797-f815-218f17160e00", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "EBHEimZuEFQK" }, "source": [ "在这里的输出中,我们可以看到数据集的五(5)条记录。如果我们查看左侧的索引,我们会发现这些是前五行。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oj7GkrTdgRry" }, "source": [ "### 练习:\n", "\n", "从上面的例子可以看出,默认情况下,`DataFrame.head` 会返回 `DataFrame` 的前五行。在下面的代码单元中,你能找到一种方法来显示超过五行的数据吗?\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "id": "EKRmRFFegRrz", "trusted": false }, "outputs": [], "source": [ "# Hint: Consult the documentation by using iris_df.head?" ] }, { "cell_type": "markdown", "metadata": { "id": "BJ_cpZqNgRr1" }, "source": [ "### `DataFrame.tail`\n", "另一种查看数据的方法是从末尾开始(而不是从开头)。`DataFrame.head` 的反面是 `DataFrame.tail`,它会返回 `DataFrame` 的最后五行:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "heanjfGWgRr2", "outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
1456.73.05.22.3
1466.32.55.01.9
1476.53.05.22.0
1486.23.45.42.3
1495.93.05.11.8
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "145 6.7 3.0 5.2 2.3\n", "146 6.3 2.5 5.0 1.9\n", "147 6.5 3.0 5.2 2.0\n", "148 6.2 3.4 5.4 2.3\n", "149 5.9 3.0 5.1 1.8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.tail()" ] }, { "cell_type": "markdown", "metadata": { "id": "31kBWfyLgRr3" }, "source": [ "在实际操作中,能够轻松查看 `DataFrame` 的前几行或后几行非常有用,尤其是在检查有序数据集中的异常值时。\n", "\n", "上面通过代码示例展示的所有函数和属性,都能帮助我们快速了解数据的外观和特性。\n", "\n", "> **要点:** 仅仅通过查看 `DataFrame` 的元数据,或者查看其前几行和后几行的值,就可以快速了解数据的大小、形状以及内容。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "TvurZyLSDxq_" }, "source": [ "### 缺失数据\n", "让我们深入了解缺失数据。缺失数据是指某些列中没有存储任何值的情况。\n", "\n", "举个例子:假设某人对自己的体重很敏感,因此在调查中没有填写体重字段。那么,这个人的体重值就会缺失。\n", "\n", "在现实世界的数据集中,缺失值是非常常见的。\n", "\n", "**Pandas如何处理缺失数据**\n", "\n", "Pandas处理缺失值有两种方式。第一种方式你在之前的章节中已经见过:`NaN`,即“非数字”。实际上,这是一种特殊值,是IEEE浮点规范的一部分,仅用于表示缺失的浮点值。\n", "\n", "对于非浮点类型的缺失值,Pandas使用Python的`None`对象。虽然同时遇到两种不同类型的缺失值可能会让人感到困惑,但这种设计选择有其合理的编程原因。在实际应用中,这种方式使Pandas能够在绝大多数情况下提供一个良好的折中方案。尽管如此,`None`和`NaN`都存在一些限制,需要注意它们在使用时的规则和约束。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lOHqUlZFgRr5" }, "source": [ "### `None`:非浮点类型的缺失数据\n", "由于 `None` 来自 Python,它不能用于数据类型不是 `'object'` 的 NumPy 和 pandas 数组中。请记住,NumPy 数组(以及 pandas 中的数据结构)只能包含一种类型的数据。这种特性赋予了它们在大规模数据和计算工作中的强大能力,但也限制了它们的灵活性。这类数组必须向“最低公分母”数据类型进行提升,即能够包含数组中所有内容的数据类型。当数组中包含 `None` 时,这意味着你正在处理 Python 对象。\n", "\n", "为了更直观地理解这一点,请看以下示例数组(注意它的 `dtype`):\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QIoNdY4ngRr7", "outputId": "92779f18-62f4-4a03-eca2-e9a101604336", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "array([2, None, 6, 8], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "example1 = np.array([2, None, 6, 8])\n", "example1" ] }, { "cell_type": "markdown", "metadata": { "id": "pdlgPNbhgRr7" }, "source": [ "上升数据类型的现实带来了两个副作用。首先,操作将在解释的 Python 代码层面上进行,而不是编译的 NumPy 代码层面。这基本上意味着,任何涉及包含 `None` 的 `Series` 或 `DataFrame` 的操作都会变得更慢。虽然你可能不会注意到这种性能下降,但对于大型数据集来说,这可能会成为一个问题。\n", "\n", "第二个副作用源于第一个副作用。由于 `None` 本质上将 `Series` 或 `DataFrame` 拉回到原生 Python 的世界,因此在包含 `None` 值的数组上使用 NumPy/pandas 的聚合函数(如 `sum()` 或 `min()`)通常会产生错误:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 292 }, "id": "gWbx-KB9gRr8", "outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50", "trusted": false }, "outputs": [ { "ename": "TypeError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" ] } ], "source": [ "example1.sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "LcEwO8UogRr9" }, "source": [ "**关键点**:整数与`None`值之间的加法(以及其他操作)是未定义的,这可能会限制您对包含它们的数据集所能执行的操作。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "pWvVHvETgRr9" }, "source": [ "### `NaN`:缺失的浮点值\n", "\n", "与 `None` 不同,NumPy(因此也包括 pandas)支持 `NaN`,以便进行快速的矢量化操作和 ufuncs。坏消息是,任何涉及 `NaN` 的算术运算结果都会是 `NaN`。例如:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rcFYfMG9gRr9", "outputId": "699e81b7-5c11-4b46-df1d-06071768690f", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan + 1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BW3zQD2-gRr-", "outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan * 0" ] }, { "cell_type": "markdown", "metadata": { "id": "fU5IPRcCgRr-" }, "source": [ "好消息:在包含 `NaN` 的数组上运行聚合不会出现错误。坏消息:结果并不总是有用:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LCInVgSSgRr_", "outputId": "fa06495a-0930-4867-87c5-6023031ea8b5", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "(nan, nan, nan)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ] }, { "cell_type": "markdown", "metadata": { "id": "nhlnNJT7gRr_" }, "source": [ "### 练习:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "id": "yan3QRaOgRr_", "trusted": false }, "outputs": [], "source": [ "# What happens if you add np.nan and None together?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_iDvIRC8gRsA" }, "source": [ "请记住:`NaN` 仅用于表示缺失的浮点值;整数、字符串或布尔值没有 `NaN` 的等价物。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "kj6EKdsAgRsA" }, "source": [ "### `NaN` 和 `None`:pandas 中的空值\n", "\n", "尽管 `NaN` 和 `None` 的行为可能略有不同,但 pandas 仍然能够将它们互换处理。为了更好地理解这一点,来看一个整数的 `Series`:\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Nji-KGdNgRsA", "outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ] }, { "cell_type": "markdown", "metadata": { "id": "WklCzqb8gRsB" }, "source": [ "### 
练习:\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true, "id": "Cy-gqX5-gRsB", "trusted": false }, "outputs": [], "source": [ "# Now set an element of int_series equal to None.\n", "# How does that element show up in the Series?\n", "# What is the dtype of the Series?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "WjMQwltNgRsB" }, "source": [ "在将数据类型向上转换以实现 `Series` 和 `DataFrame` 数据同质化的过程中,pandas 会自如地在 `None` 和 `NaN` 之间切换缺失值。由于这一设计特性,将 `None` 和 `NaN` 视为 pandas 中两种不同形式的“空值”是很有帮助的。事实上,你会发现 pandas 中一些用于处理缺失值的核心方法在命名上也反映了这一点:\n", "\n", "- `isnull()`:生成一个布尔掩码,用于标识缺失值\n", "- `notnull()`:与 `isnull()` 相反\n", "- `dropna()`:返回过滤后的数据版本\n", "- `fillna()`:返回填充或插补缺失值后的数据副本\n", "\n", "这些方法非常重要,掌握并熟练使用它们是关键。接下来,我们将深入了解每个方法。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Yh5ifd9FgRsB" }, "source": [ "### 检测空值\n", "\n", "既然我们已经了解了缺失值的重要性,那么在处理它们之前,我们需要先在数据集中检测这些值。 \n", "`isnull()` 和 `notnull()` 是检测空数据的主要方法。它们都会返回数据上的布尔掩码。\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true, "id": "e-vFp5lvgRsC", "trusted": false }, "outputs": [], "source": [ "example3 = pd.Series([0, np.nan, '', None])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1XdaJJ7PgRsC", "outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 True\n", "dtype: bool" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull()" ] }, { "cell_type": "markdown", "metadata": { "id": "PaSZ0SQygRsC" }, "source": [ "仔细看看输出结果,有没有让你感到意外的地方?虽然 `0` 是一个算术上的空值,但它仍然是一个完全合法的整数,pandas 也将其视为整数。`''` 则稍微复杂一些。虽然我们在第1节中用它来表示一个空字符串值,但它仍然是一个字符串对象,而不是 pandas 所认为的空值。\n", "\n", "现在,让我们换个角度,用这些方法以更接近实际使用的方式来操作。你可以直接将布尔掩码用作 ``Series`` 或 ``DataFrame`` 的索引,这在处理孤立的缺失值(或存在值)时非常有用。\n", "\n", "如果我们想要缺失值的总数,只需对 `isnull()` 方法生成的掩码进行求和即可。\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JCcQVoPkHDUv", "outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d" }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "PlBqEo3mgRsC" }, "source": [ "### 练习:\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true, "id": "ggDVf5uygRsD", "trusted": false }, "outputs": [], "source": [ "# Try running example3[example3.notnull()].\n", "# Before you do so, what do you expect to see?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "D_jWN7mHgRsD" }, "source": [ "**关键点**:在使用 DataFrame 时,`isnull()` 和 `notnull()` 方法会产生类似的结果:它们显示结果及其索引,这将在您处理数据时为您提供极大的帮助。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "BvnoojWsgRr4" }, "source": [ "### 处理缺失数据\n", "\n", "> **学习目标:** 在本小节结束时,你应该了解如何以及何时在 DataFrame 中替换或移除空值。\n", "\n", "机器学习模型本身无法处理缺失数据。因此,在将数据传入模型之前,我们需要处理这些缺失值。\n", "\n", "如何处理缺失数据涉及一些微妙的权衡,这可能会影响你的最终分析结果以及实际应用的效果。\n", "\n", "处理缺失数据主要有两种方法:\n", "\n", "1. 删除包含缺失值的行\n", "2. 
用其他值替换缺失值\n", "\n", "我们将详细讨论这两种方法及其优缺点。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3VaYC1TvgRsD" }, "source": [ "### 删除空值\n", "\n", "我们传递给模型的数据量会直接影响其性能。删除空值意味着我们减少了数据点的数量,从而减小了数据集的规模。因此,当数据集相当大时,建议删除包含空值的行。\n", "\n", "另一种情况可能是某一行或某一列有大量缺失值。在这种情况下,它们可能会被删除,因为对于我们的分析来说,这些行/列的大部分数据都缺失,无法提供太多价值。\n", "\n", "除了识别缺失值之外,pandas 提供了一种方便的方法来从 `Series` 和 `DataFrame` 中移除空值。为了更好地理解这一点,让我们回到 `example3`。`DataFrame.dropna()` 函数可以帮助删除包含空值的行。\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7uIvS097gRsD", "outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "2 \n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3 = example3.dropna()\n", "example3" ] }, { "cell_type": "markdown", "metadata": { "id": "hil2cr64gRsD" }, "source": [ "请注意,这应该看起来像您的输出 `example3[example3.notnull()]`。这里的区别在于,`dropna` 不只是对掩码值进行索引,而是从 `Series` `example3` 中移除了那些缺失值。\n", "\n", "由于 DataFrame 是二维的,因此在删除数据时提供了更多选项。\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "an-l74sPgRsE", "outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN7
12.05.08
2NaN6.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 7\n", "1 2.0 5.0 8\n", "2 NaN 6.0 9" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4 = pd.DataFrame([[1, np.nan, 7], \n", " [2, 5, 8], \n", " [np.nan, 6, 9]])\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "66wwdHZrgRsE" }, "source": [ "(你注意到 pandas 将两列数据提升为浮点型以容纳 `NaN` 值了吗?)\n", "\n", "你无法从 `DataFrame` 中删除单个值,因此必须删除整行或整列。根据你的需求,你可能会选择其中一种操作,因此 pandas 为你提供了两种选项。在数据科学中,列通常代表变量,行通常代表观测值,因此你更可能会删除数据行;`dropna()` 的默认设置是删除包含任何空值的所有行:\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "jAVU24RXgRsE", "outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
12.05.08
\n", "
" ], "text/plain": [ " 0 1 2\n", "1 2.0 5.0 8" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna()" ] }, { "cell_type": "markdown", "metadata": { "id": "TrQRBuTDgRsE" }, "source": [ "如果需要,您可以从列中删除NA值。使用`axis=1`来实现:\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "GrBhxu9GgRsE", "outputId": "ff4001f3-2e61-4509-d60e-0093d1068437", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2
07
18
29
\n", "
" ], "text/plain": [ " 2\n", "0 7\n", "1 8\n", "2 9" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "id": "KWXiKTfMgRsF" }, "source": [ "请注意,这可能会丢失大量数据,尤其是在较小的数据集中。如果您只想删除包含多个甚至全部空值的行或列,该怎么办?您可以在 `dropna` 中通过设置 `how` 和 `thresh` 参数来指定这些条件。\n", "\n", "默认情况下,`how='any'`(如果您想自行检查或查看该方法的其他参数,可以在代码单元中运行 `example4.dropna?`)。您也可以选择指定 `how='all'`,这样只会删除包含所有空值的行或列。让我们扩展示例 `DataFrame`,在接下来的练习中看看这是如何运作的。\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "Bcf_JWTsgRsF", "outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4[3] = np.nan\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "pNZer7q9JPNC" }, "source": [ "> 关键要点:\n", "1. 只有在数据集足够大的情况下,删除空值才是一个好主意。\n", "2. 如果某些行或列的大部分数据缺失,可以将整行或整列删除。\n", "3. `DataFrame.dropna(axis=)` 方法可以用来删除空值。`axis` 参数表示是删除行还是列。\n", "4. 还可以使用 `how` 参数。默认情况下,它被设置为 `any`,因此只会删除包含任意空值的行/列。可以将其设置为 `all`,以指定仅删除所有值均为空的行/列。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oXXSfQFHgRsF" }, "source": [ "### 练习:\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true, "id": "ExUwQRxpgRsF", "trusted": false }, "outputs": [], "source": [ "# How might you go about dropping just column 3?\n", "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "38kwAihWgRsG" }, "source": [ "`thresh` 参数为您提供了更细粒度的控制:您可以设置一行或一列需要具有的 *非空* 值的数量,以便保留:\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "M9dCNMaagRsG", "outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
12.05.08NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "1 2.0 5.0 8 NaN" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='rows', thresh=3)" ] }, { "cell_type": "markdown", "metadata": { "id": "fmSFnzZegRsG" }, "source": [ "这里,第一行和最后一行已被删除,因为它们仅包含两个非空值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "mCcxLGyUgRsG" }, "source": [ "### 填充空值\n", "\n", "有时候,用可能有效的值来填充缺失值是有意义的。填充空值有几种方法。第一种是利用领域知识(即对数据集所基于主题的了解)来近似填补缺失值。\n", "\n", "你可以使用 `isnull` 来原地完成这项工作,但如果需要填充的值很多,这可能会非常繁琐。由于这是数据科学中非常常见的任务,pandas 提供了 `fillna` 方法,它会返回一个 `Series` 或 `DataFrame` 的副本,其中的缺失值被替换为你选择的值。让我们创建另一个示例 `Series` 来看看它在实际中的工作方式。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "CE8S7louLezV" }, "source": [ "### 分类数据(非数值型)\n", "首先,让我们考虑非数值型数据。在数据集中,我们会有一些包含分类数据的列。例如:性别、True 或 False 等。\n", "\n", "在大多数情况下,我们会用该列的`众数`来替换缺失值。假设我们有100个数据点,其中90个是True,8个是False,还有2个未填写。那么,我们可以用True填补这2个缺失值,因为True是该列的众数。\n", "\n", "同样,在这里我们也可以利用领域知识。让我们来看一个用众数填补的例子。\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "MY5faq4yLdpQ", "outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
012True
134None
256False
378True
4910True
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 None\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", " [3,4,None],\n", " [5,6,\"False\"],\n", " [7,8,\"True\"],\n", " [9,10,\"True\"]])\n", "\n", "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "MLAoMQOfNPlA" }, "source": [] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WKy-9Y2tN5jv", "outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf" }, "outputs": [ { "data": { "text/plain": [ "True 3\n", "False 1\n", "Name: 2, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode[2].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "6iNz_zG_OKrx" }, "source": [] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "TxPKteRvNPOs" }, "outputs": [], "source": [ "fill_with_mode[2].fillna('True',inplace=True)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "tvas7c9_OPWE", "outputId": "ec3c8e44-d644-475e-9e22-c65101965850" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
012True
134True
256False
378True
4910True
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 True\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "SktitLxxOR16" }, "source": [ "正如我们所见,空值已被替换。不用说,我们本可以用任何内容替代`'True'`,它都会被替换。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "heYe1I0dOmQ_" }, "source": [ "### 数值数据\n", "现在来说说数值数据。这里有两种常见的方法来替换缺失值:\n", "\n", "1. 用行的中位数替换\n", "2. 用行的平均值替换\n", "\n", "当数据存在偏态分布且有异常值时,我们使用中位数替换。这是因为中位数对异常值具有鲁棒性。\n", "\n", "当数据经过归一化处理后,我们可以使用平均值,因为在这种情况下,平均值和中位数会非常接近。\n", "\n", "首先,让我们选取一个正态分布的列,并用该列的平均值填充缺失值。\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "09HM_2feOj5Y", "outputId": "7e309013-9acb-411c-9b06-4de795bbeeff" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-2.001
1-1.023
2NaN45
31.067
42.089
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 NaN 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [np.nan,4,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "ka7-wNfzSxbx" }, "source": [ "该列的平均值是\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XYtYEf5BSxFL", "outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70" }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(fill_with_mean[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "oBSRGxKRS39K" }, "source": [] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "FzncQLmuS5jh", "outputId": "00f74fff-01f4-4024-c261-796f50f01d2e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-2.001
1-1.023
20.045
31.067
42.089
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 0.0 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "CwpVFCrPTC5z" }, "source": [ "正如我们所见,缺失值已被其平均值替换。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jIvF13a1i00Z" }, "source": [ "现在让我们尝试另一个数据框,这次我们将用列的中位数替换None值。\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DA59Bqo3jBYZ", "outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-20.01
1-12.03
20NaN5
316.07
428.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 NaN 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [0,np.nan,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "mM1GpXYmjHnc" }, "source": [ "第二列的中位数是\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uiDy5v3xjHHX", "outputId": "564b6b74-2004-4486-90d4-b39330a64b88" }, "outputs": [ { "data": { "text/plain": [ "4.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].median()" ] }, { "cell_type": "markdown", "metadata": { "id": "z9PLF75Jj_1s" }, "source": [] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "lFKbOxCMkBbg", "outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-20.01
1-12.03
204.05
316.07
428.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 4.0 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "8JtQ53GSkKWC" }, "source": [ "正如我们所见,NaN值已被该列的中位数替换\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0ybtWLDdgRsG", "outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b NaN\n", "c 2.0\n", "d NaN\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "example5" ] }, { "cell_type": "markdown", "metadata": { "id": "yrsigxRggRsH" }, "source": [ "您可以用一个值(例如 `0`)填充所有的空条目:\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KXMIPsQdgRsH", "outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 0.0\n", "c 2.0\n", "d 0.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(0)" ] }, { "cell_type": "markdown", "metadata": { "id": "RRlI5f_hkfKe" }, "source": [ "> 关键要点:\n", "1. 当数据较少或有填补缺失数据的策略时,应进行缺失值填补。\n", "2. 可以利用领域知识通过近似的方法填补缺失值。\n", "3. 对于分类数据,缺失值通常用该列的众数替代。\n", "4. 对于数值型数据,缺失值通常用均值(针对标准化数据集)或该列的中位数填补。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "FI9MmqFJgRsH" }, "source": [ "### 练习:\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true, "id": "af-ezpXdgRsH", "trusted": false }, "outputs": [], "source": [ "# What happens if you try to fill null values with a string, like ''?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "kq3hw1kLgRsI" }, "source": [ "您可以使用**前向填充**空值,即使用上一个有效值来填充空值:\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vO3BuNrggRsI", "outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 1.0\n", "c 2.0\n", "d 2.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='ffill')" ] }, { "cell_type": "markdown", "metadata": { "id": "nDXeYuHzgRsI" }, "source": [ "您还可以**向后填充**,将下一个有效值向后传播以填充空值:\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4M5onHcEgRsI", "outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 2.0\n", "c 2.0\n", "d 3.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='bfill')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "id": "MbBzTom5gRsI" }, "source": [ "正如你可能猜到的,这与 DataFrame 的工作方式相同,但你还可以指定一个 `axis` 来填充空值:\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "aRpIvo4ZgRsI", "outputId": "905a980a-a808-4eca-d0ba-224bd7d85955", "trusted": false }, "outputs": [ { "data": { 
"text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "VM1qtACAgRsI", "outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.01.07.07.0
12.05.08.08.0
2NaN6.09.09.0
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 1.0 7.0 7.0\n", "1 2.0 5.0 8.0 8.0\n", "2 NaN 6.0 9.0 9.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(method='ffill', axis=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZeMc-I1EgRsI" }, "source": [ "注意,当没有可用于向前填充的先前值时,空值将保留。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "eeAoOU0RgRsJ" }, "source": [ "### 练习:\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true, "id": "e8S-CjW8gRsJ", "trusted": false }, "outputs": [], "source": [ "# What output does example4.fillna(method='bfill', axis=1) produce?\n", "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YHgy0lIrgRsJ" }, "source": [ "你可以创造性地使用`fillna`。例如,我们再来看`example4`,但这次我们用`DataFrame`中所有值的平均值来填充缺失值:\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "OtYVErEygRsJ", "outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.05.57NaN
12.05.08NaN
21.56.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 5.5 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 1.5 6.0 9 NaN" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(example4.mean())" ] }, { "cell_type": "markdown", "metadata": { "id": "zpMvCkLSgRsJ" }, "source": [ "请注意,第3列仍然没有值:默认的方向是按行填充值。\n", "\n", "> **要点:** 处理数据集中缺失值的方法有很多种。具体采用哪种策略(删除、替换,甚至如何替换)应根据数据的具体情况来决定。随着你处理和接触更多的数据集,你会逐渐培养出更好的直觉来应对缺失值问题。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bauDnESIl9FH" }, "source": [ "### 编码分类数据\n", "\n", "机器学习模型只能处理数字和任何形式的数值数据。它无法区分“是”和“否”,但可以区分0和1。因此,在填充缺失值之后,我们需要将分类数据编码为某种数值形式,以便模型能够理解。\n", "\n", "编码可以通过两种方式完成。接下来我们将讨论这两种方法。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "uDq9SxB7mu5i" }, "source": [ "**标签编码**\n", "\n", "标签编码本质上是将每个类别转换为一个数字。例如,假设我们有一个航空乘客的数据集,其中有一列表示他们的舱位类别,包括以下几种:['商务舱', '经济舱', '头等舱']。如果对其进行标签编码,这些类别将被转换为 [0,1,2]。让我们通过代码来看一个示例。由于我们将在后续的笔记本中学习 `scikit-learn`,这里我们不会使用它。\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "1vGz7uZyoWHL", "outputId": "9e252855-d193-4103-a54d-028ea7787b34" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "IDHnkwTYov-h" }, "source": [ "要对第一列进行标签编码,我们必须首先描述从每个类别到数字的映射,然后再进行替换\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZC5URJG3o1ES", "outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
0100
1202
2301
3401
4501
5600
\n", "
" ], "text/plain": [ " ID class\n", "0 10 0\n", "1 20 2\n", "2 30 1\n", "3 40 1\n", "4 50 1\n", "5 60 0" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_labels = {'business class':0,'economy class':1,'first class':2}\n", "label['class'] = label['class'].replace(class_labels)\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "ftnF-TyapOPt" }, "source": [ "正如我们所见,输出与我们预期的一致。那么,我们什么时候使用标签编码呢?标签编码在以下一种或两种情况下使用: \n", "1. 当类别数量较多时 \n", "2. 当类别具有顺序关系时 \n" ] }, { "cell_type": "markdown", "metadata": { "id": "eQPAPVwsqWT7" }, "source": [ "**独热编码**\n", "\n", "另一种编码方式是独热编码。在这种编码方式中,每个列的类别都会作为一个单独的列添加到数据集中,并且每个数据点根据是否包含该类别被赋值为0或1。因此,如果有n个不同的类别,就会向数据框中添加n列。\n", "\n", "例如,我们继续使用飞机舱位的例子。类别是:['business class', 'economy class', 'first class']。如果我们进行独热编码,以下三列将被添加到数据集中:['class_business class', 'class_economy class', 'class_first class']。\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZM0eVh0ArKUL", "outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "one_hot" ] }, { "cell_type": "markdown", "metadata": { "id": "aVnZ7paDrWmb" }, "source": [ "让我们对第一列进行独热编码\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "RUPxf7egrYKr" }, "outputs": [], "source": [ "one_hot_data = pd.get_dummies(one_hot,columns=['class'])" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "TM37pHsFr4ge", "outputId": "7be15f53-79b2-447a-979c-822658339a9e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass_business classclass_economy classclass_first class
010100
120001
230010
340010
450010
560100
\n", "
" ], "text/plain": [ " ID class_business class class_economy class class_first class\n", "0 10 1 0 0\n", "1 20 0 0 1\n", "2 30 0 1 0\n", "3 40 0 1 0\n", "4 50 0 1 0\n", "5 60 1 0 0" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot_data" ] }, { "cell_type": "markdown", "metadata": { "id": "_zXRLOjXujdA" }, "source": [ "每个独热编码列包含0或1,用于指定该数据点是否存在该类别。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bDnC4NQOu0qr" }, "source": [ "我们什么时候使用独热编码?独热编码在以下一种或两种情况下使用:\n", "\n", "1. 当类别数量较少且数据集规模较小时。\n", "2. 当类别之间没有特定顺序时。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XnUmci_4uvyu" }, "source": [ "> 关键要点:\n", "1. 编码用于将非数值数据转换为数值数据。\n", "2. 编码分为两种类型:标签编码和独热编码,可以根据数据集的需求进行选择。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "K8UXOJYRgRsJ" }, "source": [ "## 删除重复数据\n", "\n", "> **学习目标:** 在本小节结束时,您应该能够熟练识别并删除 DataFrame 中的重复值。\n", "\n", "除了缺失数据外,您在处理真实世界的数据集时经常会遇到重复数据。幸运的是,pandas 提供了一种简单的方法来检测和删除重复条目。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qrEG-Wa0gRsJ" }, "source": [ "### 识别重复值:`duplicated`\n", "\n", "你可以使用 pandas 中的 `duplicated` 方法轻松识别重复值。该方法会返回一个布尔掩码,指示 `DataFrame` 中某个条目是否是之前条目的重复值。让我们创建另一个示例 `DataFrame` 来实际演示这一点。\n" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "ZLu6FEnZgRsJ", "outputId": "376512d1-d842-4db1-aea3-71052aeeecaf", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
2A1
3B3
4B3
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "2 A 1\n", "3 B 3\n", "4 B 3" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cIduB5oBgRsK", "outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "4 True\n", "dtype: bool" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.duplicated()" ] }, { "cell_type": "markdown", "metadata": { "id": "0eDRJD4SgRsK" }, "source": [ "### 删除重复值:`drop_duplicates`\n", "`drop_duplicates` 会返回一个数据副本,其中所有 `duplicated` 值均为 `False`:\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "w_YPpqIqgRsK", "outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
3B3
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "3 B 3" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": { "id": "69AqoCZAgRsK" }, "source": [ "`duplicated` 和 `drop_duplicates` 默认会考虑所有列,但您可以指定它们仅检查 `DataFrame` 中的部分列:\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "BILjDs67gRsK", "outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates(['letters'])" ] }, { "cell_type": "markdown", "metadata": { "id": "GvX4og1EgRsL" }, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**免责声明**: \n本文档使用AI翻译服务[Co-op Translator](https://github.com/Azure/co-op-translator)进行翻译。尽管我们努力确保准确性,但请注意,自动翻译可能包含错误或不准确之处。应以原始语言的文档作为权威来源。对于关键信息,建议使用专业人工翻译。对于因使用本翻译而引起的任何误解或误读,我们概不负责。\n" ] } ], "metadata": { "anaconda-cloud": {}, "colab": { "name": "notebook.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "coopTranslator": { "original_hash": "8533b3a2230311943339963fc7f04c21", "translation_date": "2025-09-02T08:22:06+00:00", "source_file": "2-Working-With-Data/08-data-preparation/notebook.ipynb", "language_code": "zh" } }, "nbformat": 4, "nbformat_minor": 0 }