{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "rQ8UhzFpgRra" }, "source": [ "# 數據準備\n", "\n", "[原始筆記本來源:*數據科學:Python 和機器學習工作室的數據科學機器學習入門,作者 Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n", "\n", "## 探索 `DataFrame` 資訊\n", "\n", "> **學習目標:** 在完成本小節後,您應該能夠熟練地查找 pandas DataFrame 中存儲的數據的一般資訊。\n", "\n", "當您將數據加載到 pandas 中後,數據很可能會以 `DataFrame` 的形式存在。然而,如果您的 `DataFrame` 中的數據集有 60,000 行和 400 列,您該如何開始了解自己正在處理的內容呢?幸運的是,pandas 提供了一些方便的工具,可以快速查看 `DataFrame` 的整體資訊,以及前幾行和後幾行的內容。\n", "\n", "為了探索這些功能,我們將導入 Python 的 scikit-learn 庫,並使用一個每位數據科學家都見過數百次的經典數據集:英國生物學家 Ronald Fisher 在他 1936 年的論文《多重測量在分類學問題中的應用》中使用的 *Iris* 數據集:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "id": "hB1RofhdgRrp", "trusted": false }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])" ] }, { "cell_type": "markdown", "metadata": { "id": "AGA0A_Y8hMdz" }, "source": [ "### `DataFrame.shape`\n", "我們已將 Iris 數據集載入到變數 `iris_df` 中。在深入分析數據之前,了解我們擁有的數據點數量以及整個數據集的大小是很有價值的。檢視我們正在處理的數據量是很有幫助的。\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LOe5jQohhulf", "outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9" }, "outputs": [ { "data": { "text/plain": [ "(150, 4)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "smE7AGzOhxk2" }, "source": [ "所以,我們正在處理150行和4列的數據。每一行代表一個數據點,每一列代表與數據框相關的一個特徵。基本上,這裡有150個數據點,每個數據點包含4個特徵。\n", "\n", "`shape` 在這裡是數據框的一個屬性,而不是一個函數,這就是為什麼它的結尾沒有一對括號。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "d3AZKs0PinGP" }, "source": [ "### `DataFrame.columns`\n", "現在讓我們來看看這四個數據欄位。每個欄位究竟代表什麼?`columns` 屬性會提供我們數據框中欄位的名稱。\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YPGh_ziji-CY", "outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0" }, "outputs": [ { "data": { "text/plain": [ "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n", " 'petal width (cm)'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.columns" ] }, { "cell_type": "markdown", "metadata": { "id": "TsobcU_VjCC_" }, "source": [ "如我們所見,有四(4)列。`columns`屬性告訴我們列的名稱,基本上沒有其他內容。當我們想要識別數據集包含的特徵時,這個屬性就顯得重要。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "2UTlvkjmgRrs" }, "source": [ "### `DataFrame.info`\n", "透過 `shape` 屬性提供的數據量,以及透過 `columns` 屬性提供的特徵或欄位名稱,可以讓我們對數據集有初步的了解。現在,我們希望更深入地探索這個數據集。`DataFrame.info()` 函數在這方面非常有用。\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dHHRyG0_gRrt", "outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5", "trusted": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 150 entries, 0 to 149\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 sepal length (cm) 150 non-null float64\n", " 1 sepal width (cm) 150 non-null float64\n", " 2 petal length (cm) 150 non-null float64\n", " 3 petal width (cm) 150 non-null float64\n", "dtypes: float64(4)\n", "memory usage: 4.8 KB\n" ] } ], "source": [ "iris_df.info()" ] }, { "cell_type": "markdown", "metadata": { "id": "1XgVMpvigRru" }, "source": [ "從這裡,我們可以做出一些觀察:\n", "1. 每個欄位的資料類型:在這個資料集中,所有的資料都以64位元浮點數形式儲存。\n", "2. 非空值的數量:處理空值是資料準備中的重要步驟,稍後會在筆記本中進行處理。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "IYlyxbpWFEF4" }, "source": [ "### DataFrame.describe()\n", "假設我們的數據集中有許多數值型資料。像平均值、中位數、四分位數等單變量統計計算可以針對每個欄位單獨進行。`DataFrame.describe()` 函數為我們提供了數據集中數值欄位的統計摘要。\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "tWV-CMstFIRA", "outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
count150.000000150.000000150.000000150.000000
mean5.8433333.0573333.7580001.199333
std0.8280660.4358661.7652980.762238
min4.3000002.0000001.0000000.100000
25%5.1000002.8000001.6000000.300000
50%5.8000003.0000004.3500001.300000
75%6.4000003.3000005.1000001.800000
max7.9000004.4000006.9000002.500000
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "count 150.000000 150.000000 150.000000 150.000000\n", "mean 5.843333 3.057333 3.758000 1.199333\n", "std 0.828066 0.435866 1.765298 0.762238\n", "min 4.300000 2.000000 1.000000 0.100000\n", "25% 5.100000 2.800000 1.600000 0.300000\n", "50% 5.800000 3.000000 4.350000 1.300000\n", "75% 6.400000 3.300000 5.100000 1.800000\n", "max 7.900000 4.400000 6.900000 2.500000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.describe()" ] }, { "cell_type": "markdown", "metadata": { "id": "zjjtW5hPGMuM" }, "source": [ "上述輸出顯示了每列的數據點總數、平均值、標準差、最小值、下四分位數(25%)、中位數(50%)、上四分位數(75%)和最大值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-lviAu99gRrv" }, "source": [ "### `DataFrame.head`\n", "透過以上所有的函數和屬性,我們已經對數據集有了一個高層次的概覽。我們知道數據集中有多少數據點,有多少特徵,每個特徵的數據類型,以及每個特徵中非空值的數量。\n", "\n", "現在是時候來看看數據本身了。讓我們來看看我們的 `DataFrame` 的前幾行(前幾個數據點)是什麼樣子的:\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DZMJZh0OgRrw", "outputId": "d9393ee5-c106-4797-f815-218f17160e00", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "EBHEimZuEFQK" }, "source": [ "作為輸出,我們可以看到數據集的五(5)個條目。如果我們查看左側的索引,我們會發現這是前五行。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oj7GkrTdgRry" }, "source": [ "### 練習:\n", "\n", "從上述範例可以看出,預設情況下,`DataFrame.head` 會返回 `DataFrame` 的前五行。在下面的程式碼區塊中,你能找到一種方法來顯示超過五行的內容嗎?\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "id": "EKRmRFFegRrz", "trusted": false }, "outputs": [], "source": [ "# Hint: Consult the documentation by using iris_df.head?" ] }, { "cell_type": "markdown", "metadata": { "id": "BJ_cpZqNgRr1" }, "source": [ "### `DataFrame.tail`\n", "另一種查看數據的方法是從結尾開始(而不是從開頭)。`DataFrame.head` 的相反方法是 `DataFrame.tail`,它會返回 `DataFrame` 的最後五行:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "heanjfGWgRr2", "outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
1456.73.05.22.3
1466.32.55.01.9
1476.53.05.22.0
1486.23.45.42.3
1495.93.05.11.8
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "145 6.7 3.0 5.2 2.3\n", "146 6.3 2.5 5.0 1.9\n", "147 6.5 3.0 5.2 2.0\n", "148 6.2 3.4 5.4 2.3\n", "149 5.9 3.0 5.1 1.8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.tail()" ] }, { "cell_type": "markdown", "metadata": { "id": "31kBWfyLgRr3" }, "source": [ "在實際操作中,能夠輕鬆檢視 `DataFrame` 的前幾行或後幾行非常有用,特別是在檢查有序數據集中的異常值時。\n", "\n", "上面透過程式碼範例展示的所有函數和屬性,幫助我們快速了解數據的外觀和感覺。\n", "\n", "> **重點提示:** 即使僅僅透過查看 `DataFrame` 中的元數據或前幾個和後幾個值,也能立即對您正在處理的數據的大小、形狀和內容有一個初步的概念。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "TvurZyLSDxq_" }, "source": [ "### 缺失資料\n", "讓我們深入探討缺失資料。缺失資料是指某些欄位中沒有儲存任何值。\n", "\n", "舉個例子:假設某人對自己的體重很在意,因此在問卷中沒有填寫體重欄位。那麼,這個人的體重值就會是缺失的。\n", "\n", "在現實世界的數據集中,缺失值是非常常見的。\n", "\n", "**Pandas 如何處理缺失資料**\n", "\n", "Pandas 有兩種方式來處理缺失值。第一種方式你可能在之前的章節中已經見過:`NaN`,即「非數值」(Not a Number)。這其實是一個特殊的值,屬於 IEEE 浮點數規範的一部分,專門用來表示缺失的浮點數值。\n", "\n", "對於非浮點數的缺失值,Pandas 使用 Python 的 `None` 物件。雖然遇到兩種不同的值來表示相同的概念可能會讓人感到困惑,但這種設計選擇有其合理的程式邏輯原因。在實際應用中,這樣的設計能夠在絕大多數情況下提供良好的平衡。不過,無論是 `None` 還是 `NaN`,它們都帶有一些限制,必須注意它們在使用上的差異。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lOHqUlZFgRr5" }, "source": [ "### `None`: 非浮點型缺失資料\n", "由於 `None` 來自 Python,它無法用於非 `'object'` 資料類型的 NumPy 和 pandas 陣列。請記住,NumPy 陣列(以及 pandas 中的資料結構)只能包含一種資料類型。這正是它們在大規模資料和計算工作中展現強大效能的原因,但同時也限制了它們的靈活性。這類陣列必須提升為「最低共同分母」,即能包含陣列中所有內容的資料類型。當陣列中包含 `None` 時,表示您正在處理 Python 物件。\n", "\n", "以下範例陣列可以幫助您理解這一點(注意它的 `dtype`):\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QIoNdY4ngRr7", "outputId": "92779f18-62f4-4a03-eca2-e9a101604336", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "array([2, None, 6, 8], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "example1 = np.array([2, None, 6, 8])\n", "example1" ] }, { "cell_type": "markdown", "metadata": { "id": "pdlgPNbhgRr7" }, "source": [ "上升數據類型的現實情況帶來了兩個副作用。首先,操作將在解釋執行的 Python 代碼層面進行,而不是在編譯執行的 NumPy 代碼層面進行。基本上,這意味著任何涉及包含 `None` 的 `Series` 或 `DataFrame` 的操作都會變得較慢。雖然你可能不會注意到這種性能下降,但對於大型數據集來說,這可能會成為一個問題。\n", "\n", "第二個副作用源於第一個副作用。由於 `None` 本質上將 `Series` 或 `DataFrame` 拉回到原生 Python 的世界中,因此對包含 `None` 值的數組使用 NumPy/pandas 的聚合函數(例如 `sum()` 或 `min()`)通常會產生錯誤:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 292 }, "id": "gWbx-KB9gRr8", "outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50", "trusted": false }, "outputs": [ { "ename": "TypeError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" ] } ], "source": [ "example1.sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "LcEwO8UogRr9" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "pWvVHvETgRr9" }, "source": [ "### `NaN`:缺失的浮點值\n", "\n", "與 `None` 不同,NumPy(因此也包括 pandas)支援 `NaN`,以便進行快速的向量化操作和通用函數(ufuncs)。壞消息是,任何對 `NaN` 進行的算術運算結果都會是 `NaN`。例如:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rcFYfMG9gRr9", "outputId": "699e81b7-5c11-4b46-df1d-06071768690f", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan + 1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BW3zQD2-gRr-", "outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan * 0" ] }, { "cell_type": "markdown", "metadata": { "id": "fU5IPRcCgRr-" }, "source": [ "好消息:在包含 `NaN` 的數組上運行的聚合不會出現錯誤。壞消息:結果並非一律有用:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LCInVgSSgRr_", "outputId": "fa06495a-0930-4867-87c5-6023031ea8b5", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "(nan, nan, nan)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ] }, { "cell_type": "markdown", "metadata": { "id": "nhlnNJT7gRr_" }, "source": [] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "id": "yan3QRaOgRr_", "trusted": false }, "outputs": [], "source": [ "# What happens if you add np.nan and None together?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_iDvIRC8gRsA" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "kj6EKdsAgRsA" }, "source": [ "### `NaN` 和 `None`:pandas 中的空值\n", "\n", "儘管 `NaN` 和 `None` 的行為可能略有不同,但 pandas 仍然設計成可以互換處理它們。為了說明這一點,請考慮一個整數的 `Series`:\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Nji-KGdNgRsA", "outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ] }, { "cell_type": "markdown", "metadata": { "id": "WklCzqb8gRsB" }, "source": [] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true, "id": "Cy-gqX5-gRsB", "trusted": false }, "outputs": [], "source": [ "# Now set an element of int_series equal to None.\n", "# How does that element show up in the Series?\n", "# What is the dtype of the Series?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "WjMQwltNgRsB" }, "source": [ "在將資料類型向上轉型以建立 `Series` 和 `DataFrame` 的資料一致性時,pandas 會自動在 `None` 和 `NaN` 之間切換缺失值。由於這種設計特性,將 `None` 和 `NaN` 視為 pandas 中兩種不同形式的「空值」是很有幫助的。事實上,pandas 中一些處理缺失值的核心方法名稱就反映了這一點:\n", "\n", "- `isnull()`: 生成一個布林遮罩,指示缺失值的位置\n", "- `notnull()`: 與 `isnull()` 相反\n", "- `dropna()`: 返回一個過濾後的資料版本\n", "- `fillna()`: 返回一個填補或推補缺失值後的資料副本\n", "\n", "這些方法非常重要,值得熟練掌握。因此,接下來我們將深入探討每一個方法。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Yh5ifd9FgRsB" }, "source": [ "### 偵測空值\n", "\n", "現在我們已經了解缺失值的重要性,在處理它們之前,我們需要在資料集中偵測它們。\n", "`isnull()` 和 `notnull()` 是偵測空值的主要方法。這兩個方法都會在資料上返回布林遮罩。\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true, "id": "e-vFp5lvgRsC", "trusted": false }, "outputs": [], "source": [ "example3 = pd.Series([0, np.nan, '', None])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1XdaJJ7PgRsC", "outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 True\n", "dtype: bool" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull()" ] }, { "cell_type": "markdown", "metadata": { "id": "PaSZ0SQygRsC" }, "source": [ "仔細看看輸出結果,有沒有什麼讓你感到驚訝的地方?雖然 `0` 是一個算術上的「空值」,但它仍然是一個完全有效的整數,pandas 也將其視為如此。而 `''` 則稍微微妙一些。雖然我們在第 1 節中使用它來表示空字串值,但對 pandas 而言,它仍然是一個字串物件,而不是空值的表示。\n", "\n", "現在,讓我們換個角度,將這些方法以更接近實際使用的方式來應用。你可以直接將布林遮罩作為 ``Series`` 或 ``DataFrame`` 的索引,這在處理孤立的缺失值(或存在值)時非常有用。\n", "\n", "如果我們想要計算缺失值的總數,只需對 `isnull()` 方法生成的遮罩進行一次加總即可。\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JCcQVoPkHDUv", "outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d" }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "PlBqEo3mgRsC" }, "source": [] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true, "id": "ggDVf5uygRsD", "trusted": false }, "outputs": [], "source": [ "# Try running example3[example3.notnull()].\n", "# Before you do so, what do you expect to see?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "D_jWN7mHgRsD" }, "source": [ "**關鍵要點**:當您在 DataFrame 中使用 `isnull()` 和 `notnull()` 方法時,兩者會產生類似的結果:它們顯示結果及其索引,這將在您處理數據時對您有極大的幫助。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "BvnoojWsgRr4" }, "source": [ "### 處理遺漏數據\n", "\n", "> **學習目標:** 在本小節結束時,您應該了解如何以及何時替換或移除 DataFrame 中的空值。\n", "\n", "機器學習模型無法直接處理遺漏數據。因此,在將數據傳入模型之前,我們需要先處理這些遺漏值。\n", "\n", "如何處理遺漏數據涉及微妙的取捨,可能會影響您的最終分析結果以及實際應用的效果。\n", "\n", "主要有兩種處理遺漏數據的方法:\n", "\n", "1. 移除包含遺漏值的行\n", "2. 用其他值替換遺漏值\n", "\n", "我們將詳細討論這兩種方法及其優缺點。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3VaYC1TvgRsD" }, "source": [ "### 刪除空值\n", "\n", "我們傳遞給模型的數據量會直接影響其性能。刪除空值意味著我們減少了數據點的數量,從而縮小了數據集的規模。因此,當數據集相當大時,建議刪除包含空值的行。\n", "\n", "另一種情況可能是某一行或列有大量缺失值。在這種情況下,這些行或列可能會被刪除,因為它們對我們的分析幾乎沒有幫助,因為該行/列的大部分數據都是缺失的。\n", "\n", "除了識別缺失值之外,pandas 還提供了一種方便的方法來從 `Series` 和 `DataFrame` 中移除空值。為了實際了解這一點,我們可以回到 `example3`。`DataFrame.dropna()` 函數可以幫助刪除包含空值的行。\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7uIvS097gRsD", "outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "2 \n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3 = example3.dropna()\n", "example3" ] }, { "cell_type": "markdown", "metadata": { "id": "hil2cr64gRsD" }, "source": [ "請注意,這應該看起來像是您從 `example3[example3.notnull()]` 的輸出。這裡的不同之處在於,`dropna` 並非僅僅基於遮罩值進行索引,而是從 `Series` 的 `example3` 中移除了那些缺失值。\n", "\n", "由於 DataFrame 是二維的,因此在刪除數據時提供了更多選擇。\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "an-l74sPgRsE", "outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN7
12.05.08
2NaN6.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 7\n", "1 2.0 5.0 8\n", "2 NaN 6.0 9" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4 = pd.DataFrame([[1, np.nan, 7], \n", " [2, 5, 8], \n", " [np.nan, 6, 9]])\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "66wwdHZrgRsE" }, "source": [ "(你是否注意到 pandas 將其中兩列提升為浮點型,以容納 `NaN`?)\n", "\n", "你無法從 `DataFrame` 中刪除單一值,因此你必須刪除整行或整列。根據你的需求,你可能會選擇其中一種方式,因此 pandas 提供了兩種選項。由於在數據科學中,列通常代表變數,行則代表觀測值,因此你更可能刪除數據的行;`dropna()` 的預設設定是刪除所有包含任何空值的行:\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "jAVU24RXgRsE", "outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
12.05.08
\n", "
" ], "text/plain": [ " 0 1 2\n", "1 2.0 5.0 8" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna()" ] }, { "cell_type": "markdown", "metadata": { "id": "TrQRBuTDgRsE" }, "source": [ "如果需要,您可以從列中刪除 NA 值。使用 `axis=1` 來執行:\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "GrBhxu9GgRsE", "outputId": "ff4001f3-2e61-4509-d60e-0093d1068437", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2
07
18
29
\n", "
" ], "text/plain": [ " 2\n", "0 7\n", "1 8\n", "2 9" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "id": "KWXiKTfMgRsF" }, "source": [ "注意,這可能會丟失許多您可能想保留的數據,特別是在較小的數據集中。如果您只想刪除包含幾個甚至全部空值的行或列該怎麼辦?您可以在 `dropna` 中使用 `how` 和 `thresh` 參數來指定這些設置。\n", "\n", "預設情況下,`how='any'`(如果您想自行檢查或查看該方法的其他參數,可以在程式碼單元中執行 `example4.dropna?`)。您也可以選擇指定 `how='all'`,這樣只會刪除包含所有空值的行或列。我們來擴展我們的範例 `DataFrame`,在下一個練習中看看這是如何運作的。\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "Bcf_JWTsgRsF", "outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4[3] = np.nan\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "pNZer7q9JPNC" }, "source": [ "> 關鍵重點:\n", "1. 僅在資料集足夠大的情況下,刪除空值才是個好主意。\n", "2. 如果整行或整列的大部分資料都缺失,可以考慮刪除這些行或列。\n", "3. `DataFrame.dropna(axis=)` 方法可用於刪除空值。`axis` 參數用於指定是刪除行還是列。\n", "4. `how` 參數也可以使用。預設值為 `any`,因此它只會刪除包含任意空值的行或列。可以將其設置為 `all`,以指定僅刪除所有值皆為空的行或列。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oXXSfQFHgRsF" }, "source": [] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true, "id": "ExUwQRxpgRsF", "trusted": false }, "outputs": [], "source": [ "# How might you go about dropping just column 3?\n", "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "38kwAihWgRsG" }, "source": [ "`thresh` 參數為您提供更細緻的控制:您可以設定一行或一列需要具有的*非空*值的數量,以便保留:\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "M9dCNMaagRsG", "outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
12.05.08NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "1 2.0 5.0 8 NaN" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='rows', thresh=3)" ] }, { "cell_type": "markdown", "metadata": { "id": "fmSFnzZegRsG" }, "source": [ "這裡,第一行和最後一行已被刪除,因為它們僅包含兩個非空值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "mCcxLGyUgRsG" }, "source": [ "### 填補空值\n", "\n", "有時候,用一些可能合理的值來填補缺失值是有意義的。填補空值有幾種技術。第一種方法是利用領域知識(即對數據集所基於主題的了解)來大致估算缺失值。\n", "\n", "你可以使用 `isnull` 來直接進行填補,但這可能會很繁瑣,特別是當你需要填補大量值時。由於這在數據科學中是一個非常常見的任務,pandas 提供了 `fillna` 方法,該方法會返回一個 `Series` 或 `DataFrame` 的副本,並將缺失值替換為你選擇的值。我們來創建另一個 `Series` 的例子,看看這在實際操作中是如何運作的。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "CE8S7louLezV" }, "source": [ "### 類別型資料(非數值型)\n", "首先,我們來討論非數值型資料。在數據集中,我們有一些包含類別型資料的欄位。例如:性別、True 或 False 等。\n", "\n", "在大多數情況下,我們會用該欄位的「眾數」來替換缺失值。假設我們有 100 筆數據,其中 90 筆是 True,8 筆是 False,還有 2 筆未填寫。那麼,我們可以將這 2 筆未填寫的數據填為 True,基於整個欄位的情況來考量。\n", "\n", "此外,我們也可以在這裡運用領域知識。讓我們來看一個使用眾數填補的例子。\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "MY5faq4yLdpQ", "outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
012True
134None
256False
378True
4910True
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 None\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", " [3,4,None],\n", " [5,6,\"False\"],\n", " [7,8,\"True\"],\n", " [9,10,\"True\"]])\n", "\n", "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "MLAoMQOfNPlA" }, "source": [] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WKy-9Y2tN5jv", "outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf" }, "outputs": [ { "data": { "text/plain": [ "True 3\n", "False 1\n", "Name: 2, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode[2].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "6iNz_zG_OKrx" }, "source": [] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "TxPKteRvNPOs" }, "outputs": [], "source": [ "fill_with_mode[2].fillna('True',inplace=True)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "tvas7c9_OPWE", "outputId": "ec3c8e44-d644-475e-9e22-c65101965850" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
012True
134True
256False
378True
4910True
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 True\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "SktitLxxOR16" }, "source": [ "正如我們所見,空值已被替換。毋庸置疑,我們本可以在 `'True'` 的位置寫任何內容,它都會被替換。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "heYe1I0dOmQ_" }, "source": [ "### 數值資料\n", "現在來談談數值資料。在這裡,我們有兩種常見的方法來替換缺失值:\n", "\n", "1. 用該行的中位數替換 \n", "2. 用該行的平均值替換 \n", "\n", "當資料有偏態且存在異常值時,我們會選擇用中位數替換。這是因為中位數對異常值具有穩健性。\n", "\n", "當資料已經正規化時,我們可以使用平均值,因為在這種情況下,平均值和中位數會非常接近。\n", "\n", "首先,我們來選擇一個呈正態分佈的欄位,並用該欄的平均值填補缺失值。\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "09HM_2feOj5Y", "outputId": "7e309013-9acb-411c-9b06-4de795bbeeff" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-2.001
1-1.023
2NaN45
31.067
42.089
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 NaN 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [np.nan,4,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "ka7-wNfzSxbx" }, "source": [ "該列的平均值是\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XYtYEf5BSxFL", "outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70" }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(fill_with_mean[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "oBSRGxKRS39K" }, "source": [] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "FzncQLmuS5jh", "outputId": "00f74fff-01f4-4024-c261-796f50f01d2e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-2.001
1-1.023
20.045
31.067
42.089
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 0.0 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "CwpVFCrPTC5z" }, "source": [ "正如我們所見,缺失值已被其平均值取代。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jIvF13a1i00Z" }, "source": [ "現在讓我們嘗試另一個數據框,這次我們將用該列的中位數替換 None 值。\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DA59Bqo3jBYZ", "outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-20.01
1-12.03
20NaN5
316.07
428.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 NaN 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [0,np.nan,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "mM1GpXYmjHnc" }, "source": [ "第二列的中位數是\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uiDy5v3xjHHX", "outputId": "564b6b74-2004-4486-90d4-b39330a64b88" }, "outputs": [ { "data": { "text/plain": [ "4.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].median()" ] }, { "cell_type": "markdown", "metadata": { "id": "z9PLF75Jj_1s" }, "source": [] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "lFKbOxCMkBbg", "outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-20.01
1-12.03
204.05
316.07
428.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 4.0 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "8JtQ53GSkKWC" }, "source": [ "正如我們所見,NaN 值已被該列的中位數取代\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0ybtWLDdgRsG", "outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b NaN\n", "c 2.0\n", "d NaN\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "example5" ] }, { "cell_type": "markdown", "metadata": { "id": "yrsigxRggRsH" }, "source": [ "您可以使用單一值(例如 `0`)填充所有的空值條目:\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KXMIPsQdgRsH", "outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 0.0\n", "c 2.0\n", "d 0.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(0)" ] }, { "cell_type": "markdown", "metadata": { "id": "RRlI5f_hkfKe" }, "source": [ "> 關鍵重點:\n", "1. 當數據較少或有策略填補缺失數據時,應進行缺失值填補。\n", "2. 可以利用領域知識來估算並填補缺失值。\n", "3. 對於分類數據,通常使用該列的眾數來替代缺失值。\n", "4. 對於數值型數據,缺失值通常使用平均值(針對正規化數據集)或該列的中位數來填補。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "FI9MmqFJgRsH" }, "source": [] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true, "id": "af-ezpXdgRsH", "trusted": false }, "outputs": [], "source": [ "# What happens if you try to fill null values with a string, like ''?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "kq3hw1kLgRsI" }, "source": [] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vO3BuNrggRsI", "outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 1.0\n", "c 2.0\n", "d 2.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='ffill')" ] }, { "cell_type": "markdown", "metadata": { "id": "nDXeYuHzgRsI" }, "source": [ "您也可以使用**回填**將下一個有效值向後傳播以填補空值:\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4M5onHcEgRsI", "outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 2.0\n", "c 2.0\n", "d 3.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='bfill')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "id": "MbBzTom5gRsI" }, "source": [ "正如你可能猜到的,這與 DataFrames 的操作方式相同,但你也可以指定一個 `axis` 來填充空值:\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "aRpIvo4ZgRsI", "outputId": "905a980a-a808-4eca-d0ba-224bd7d85955", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "VM1qtACAgRsI", "outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.01.07.07.0
12.05.08.08.0
2NaN6.09.09.0
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 1.0 7.0 7.0\n", "1 2.0 5.0 8.0 8.0\n", "2 NaN 6.0 9.0 9.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(method='ffill', axis=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZeMc-I1EgRsI" }, "source": [ "請注意,當先前的值無法用於向前填充時,空值將保留。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "eeAoOU0RgRsJ" }, "source": [] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true, "id": "e8S-CjW8gRsJ", "trusted": false }, "outputs": [], "source": [ "# What output does example4.fillna(method='bfill', axis=1) produce?\n", "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YHgy0lIrgRsJ" }, "source": [ "您可以創意地使用 `fillna`。例如,我們再來看看 `example4`,但這次我們用 `DataFrame` 中所有值的平均值來填補缺失值:\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "OtYVErEygRsJ", "outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.05.57NaN
12.05.08NaN
21.56.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 5.5 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 1.5 6.0 9 NaN" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(example4.mean())" ] }, { "cell_type": "markdown", "metadata": { "id": "zpMvCkLSgRsJ" }, "source": [ "注意,第 3 欄仍然是空的:預設的方向是按行填充數值。\n", "\n", "> **重點提示:** 處理數據集中缺失值的方法有很多種。具體採用哪種策略(刪除、替換,甚至是如何替換)應該根據該數據的具體情況來決定。隨著你處理和分析更多數據集,你將對如何處理缺失值有更好的理解和感覺。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bauDnESIl9FH" }, "source": [ "### 編碼分類資料\n", "\n", "機器學習模型只能處理數字以及任何形式的數值資料。它無法分辨「是」和「否」,但能區分 0 和 1。因此,在填補缺失值之後,我們需要將分類資料編碼成某種數值形式,讓模型能夠理解。\n", "\n", "編碼可以透過兩種方式完成。我們接下來會討論這些方法。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "uDq9SxB7mu5i" }, "source": [ "**標籤編碼**\n", "\n", "標籤編碼基本上是將每個類別轉換為一個數字。例如,假設我們有一個航空乘客的數據集,其中有一列包含他們的艙等,艙等包括以下類別:['商務艙', '經濟艙', '頭等艙']。如果對這些類別進行標籤編碼,則會被轉換為 [0,1,2]。讓我們通過程式碼來看一個例子。由於我們會在接下來的筆記本中學習 `scikit-learn`,所以這裡不使用它。\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "1vGz7uZyoWHL", "outputId": "9e252855-d193-4103-a54d-028ea7787b34" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "IDHnkwTYov-h" }, "source": [ "要對第一列執行標籤編碼,我們必須先描述每個類別到數字的映射,然後再進行替換\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZC5URJG3o1ES", "outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
0100
1202
2301
3401
4501
5600
\n", "
" ], "text/plain": [ " ID class\n", "0 10 0\n", "1 20 2\n", "2 30 1\n", "3 40 1\n", "4 50 1\n", "5 60 0" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_labels = {'business class':0,'economy class':1,'first class':2}\n", "label['class'] = label['class'].replace(class_labels)\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "ftnF-TyapOPt" }, "source": [ "正如我們所見,輸出結果與我們預期的一致。那麼,我們什麼時候應該使用標籤編碼呢?標籤編碼適用於以下一種或兩種情況:\n", "\n", "1. 當類別數量很大時 \n", "2. 當類別具有順序性時 \n" ] }, { "cell_type": "markdown", "metadata": { "id": "eQPAPVwsqWT7" }, "source": [ "**獨熱編碼**\n", "\n", "另一種編碼方式是獨熱編碼(One Hot Encoding)。在這種編碼方式中,每個欄位的類別會被新增為一個獨立的欄位,而每個數據點會根據是否包含該類別被賦予 0 或 1 的值。因此,如果有 n 種不同的類別,則會在資料框中新增 n 個欄位。\n", "\n", "例如,我們以相同的飛機艙等為例。類別為:['business class', 'economy class', 'first class']。那麼,如果我們執行獨熱編碼,以下三個欄位將被新增到資料集中:['class_business class', 'class_economy class', 'class_first class']。\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZM0eVh0ArKUL", "outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "one_hot" ] }, { "cell_type": "markdown", "metadata": { "id": "aVnZ7paDrWmb" }, "source": [ "讓我們對第一列執行獨熱編碼\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "RUPxf7egrYKr" }, "outputs": [], "source": [ "one_hot_data = pd.get_dummies(one_hot,columns=['class'])" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "TM37pHsFr4ge", "outputId": "7be15f53-79b2-447a-979c-822658339a9e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass_business classclass_economy classclass_first class
010100
120001
230010
340010
450010
560100
\n", "
" ], "text/plain": [ " ID class_business class class_economy class class_first class\n", "0 10 1 0 0\n", "1 20 0 0 1\n", "2 30 0 1 0\n", "3 40 0 1 0\n", "4 50 0 1 0\n", "5 60 1 0 0" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot_data" ] }, { "cell_type": "markdown", "metadata": { "id": "_zXRLOjXujdA" }, "source": [ "每個獨熱編碼欄位包含 0 或 1,指定該數據點是否存在該類別。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bDnC4NQOu0qr" }, "source": [ "什麼時候使用獨熱編碼?獨熱編碼通常在以下情況之一或兩者中使用:\n", "\n", "1. 當類別數量和數據集的大小較小時。\n", "2. 當類別沒有特定的順序時。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XnUmci_4uvyu" }, "source": [ "> 關鍵重點:\n", "1. 編碼是將非數值型資料轉換為數值型資料的過程。\n", "2. 編碼分為兩種類型:標籤編碼和獨熱編碼,可以根據資料集的需求進行選擇。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "K8UXOJYRgRsJ" }, "source": [ "## 移除重複資料\n", "\n", "> **學習目標:** 在本小節結束時,您應該能夠輕鬆識別並移除 DataFrame 中的重複值。\n", "\n", "除了遺漏資料之外,您在處理真實世界的數據集時,經常會遇到重複的資料。幸運的是,pandas 提供了一種簡單的方法來檢測和移除重複的條目。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qrEG-Wa0gRsJ" }, "source": [ "### 識別重複值:`duplicated`\n", "\n", "你可以使用 pandas 的 `duplicated` 方法輕鬆找出重複值。該方法會返回一個布林遮罩,指示 `DataFrame` 中某個條目是否與之前的條目重複。我們來建立另一個範例 `DataFrame`,看看這個方法的實際應用。\n" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "ZLu6FEnZgRsJ", "outputId": "376512d1-d842-4db1-aea3-71052aeeecaf", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
2A1
3B3
4B3
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "2 A 1\n", "3 B 3\n", "4 B 3" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cIduB5oBgRsK", "outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "4 True\n", "dtype: bool" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.duplicated()" ] }, { "cell_type": "markdown", "metadata": { "id": "0eDRJD4SgRsK" }, "source": [ "### 刪除重複值:`drop_duplicates`\n", "`drop_duplicates` 會返回一份數據的副本,其中所有 `duplicated` 值均為 `False`:\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "w_YPpqIqgRsK", "outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
3B3
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "3 B 3" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": { "id": "69AqoCZAgRsK" }, "source": [ "`duplicated` 和 `drop_duplicates` 預設會考慮所有列,但您可以指定它們僅檢查 `DataFrame` 中的一部分列:\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "BILjDs67gRsK", "outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates(['letters'])" ] }, { "cell_type": "markdown", "metadata": { "id": "GvX4og1EgRsL" }, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**免責聲明**: \n本文件使用 AI 翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。我們致力於提供準確的翻譯,但請注意,自動翻譯可能包含錯誤或不準確之處。應以原始語言的文件作為權威來源。對於關鍵資訊,建議尋求專業人工翻譯。我們對於因使用此翻譯而引起的任何誤解或錯誤解讀概不負責。 \n" ] } ], "metadata": { "anaconda-cloud": {}, "colab": { "name": "notebook.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "coopTranslator": { "original_hash": "8533b3a2230311943339963fc7f04c21", "translation_date": "2025-09-02T07:08:15+00:00", "source_file": "2-Working-With-Data/08-data-preparation/notebook.ipynb", "language_code": "mo" } }, "nbformat": 4, "nbformat_minor": 0 }