{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "rQ8UhzFpgRra" }, "source": [ "# 數據準備\n", "\n", "[原始筆記本來源:*Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n", "\n", "## 探索 `DataFrame` 資訊\n", "\n", "> **學習目標:** 在完成本小節後,您應該能夠熟悉如何查找存儲在 pandas DataFrame 中的數據的一般資訊。\n", "\n", "當您將數據載入到 pandas 中後,數據通常會以 `DataFrame` 的形式存在。然而,如果您的 `DataFrame` 中的數據集有 60,000 行和 400 列,您該如何開始了解您正在處理的內容呢?幸運的是,pandas 提供了一些方便的工具,可以快速查看 `DataFrame` 的整體資訊,以及前幾行和後幾行的數據。\n", "\n", "為了探索這些功能,我們將導入 Python 的 scikit-learn 庫,並使用一個每位數據科學家都看過數百次的經典數據集:英國生物學家 Ronald Fisher 在他1936年的論文《多重測量在分類問題中的應用》中使用的 *Iris* 數據集:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "id": "hB1RofhdgRrp", "trusted": false }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])" ] }, { "cell_type": "markdown", "metadata": { "id": "AGA0A_Y8hMdz" }, "source": [ "### `DataFrame.shape`\n", "我們已將鳶尾花數據集載入到變數 `iris_df` 中。在深入分析數據之前,了解我們擁有的數據點數量以及數據集的整體大小是很有價值的。這有助於我們了解正在處理的數據量。\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LOe5jQohhulf", "outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9" }, "outputs": [ { "data": { "text/plain": [ "(150, 4)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "smE7AGzOhxk2" }, "source": [ "所以,我們正在處理150行4列的數據。每一行代表一個數據點,每一列代表與數據框相關的一個特徵。基本上,就是有150個數據點,每個數據點包含4個特徵。\n", "\n", "`shape`在這裡是數據框的一個屬性,而不是一個函數,這就是為什麼它的後面沒有一對括號。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "d3AZKs0PinGP" }, "source": [ "### `DataFrame.columns`\n", "現在讓我們來看看這 4 個數據欄位。每個欄位究竟代表什麼呢?`columns` 屬性會提供我們數據框中欄位的名稱。\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YPGh_ziji-CY", "outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0" }, "outputs": [ { "data": { "text/plain": [ "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n", " 'petal width (cm)'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.columns" ] }, { "cell_type": "markdown", "metadata": { "id": "TsobcU_VjCC_" }, "source": [ "正如我們所見,有四(4)列。`columns` 屬性告訴我們列的名稱,基本上沒有其他內容。當我們想要識別數據集包含的特徵時,這個屬性變得重要。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "2UTlvkjmgRrs" }, "source": [ "### `DataFrame.info`\n", "透過 `shape` 屬性提供的數據量以及透過 `columns` 屬性提供的特徵或欄位名稱,可以讓我們了解一些關於資料集的資訊。現在,我們希望更深入地探索資料集。`DataFrame.info()` 函數在這方面非常有用。\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dHHRyG0_gRrt", "outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5", "trusted": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 150 entries, 0 to 149\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 sepal length (cm) 150 non-null float64\n", " 1 sepal width (cm) 150 non-null float64\n", " 2 petal length (cm) 150 non-null float64\n", " 3 petal width (cm) 150 non-null float64\n", "dtypes: float64(4)\n", "memory usage: 4.8 KB\n" ] } ], "source": [ "iris_df.info()" ] }, { "cell_type": "markdown", "metadata": { "id": "1XgVMpvigRru" }, "source": [ "從這裡,我們可以做出一些觀察: \n", "1. 每列的資料類型:在這個數據集中,所有數據都以64位浮點數的形式存儲。 \n", "2. 非空值的數量:處理空值是數據準備中的一個重要步驟,稍後會在筆記本中處理。 \n" ] }, { "cell_type": "markdown", "metadata": { "id": "IYlyxbpWFEF4" }, "source": [ "### DataFrame.describe()\n", "假設我們的數據集中有許多數值型數據。像平均值、中位數、四分位數等單變量統計計算可以分別對每一列進行操作。`DataFrame.describe()` 函數為我們提供了數據集中數值列的統計摘要。\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "tWV-CMstFIRA", "outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
count150.000000150.000000150.000000150.000000
mean5.8433333.0573333.7580001.199333
std0.8280660.4358661.7652980.762238
min4.3000002.0000001.0000000.100000
25%5.1000002.8000001.6000000.300000
50%5.8000003.0000004.3500001.300000
75%6.4000003.3000005.1000001.800000
max7.9000004.4000006.9000002.500000
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "count 150.000000 150.000000 150.000000 150.000000\n", "mean 5.843333 3.057333 3.758000 1.199333\n", "std 0.828066 0.435866 1.765298 0.762238\n", "min 4.300000 2.000000 1.000000 0.100000\n", "25% 5.100000 2.800000 1.600000 0.300000\n", "50% 5.800000 3.000000 4.350000 1.300000\n", "75% 6.400000 3.300000 5.100000 1.800000\n", "max 7.900000 4.400000 6.900000 2.500000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.describe()" ] }, { "cell_type": "markdown", "metadata": { "id": "zjjtW5hPGMuM" }, "source": [ "上面的輸出顯示了每列的數據點總數、平均值、標準差、最小值、下四分位數(25%)、中位數(50%)、上四分位數(75%)和最大值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-lviAu99gRrv" }, "source": [ "### `DataFrame.head`\n", "透過上述所有函數和屬性,我們已經對數據集有了一個高層次的概覽。我們知道數據點的數量、特徵的數量、每個特徵的數據類型以及每個特徵的非空值數量。\n", "\n", "現在是時候查看數據本身了。讓我們看看我們的 `DataFrame` 的前幾行(前幾個數據點)是什麼樣子:\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DZMJZh0OgRrw", "outputId": "d9393ee5-c106-4797-f815-218f17160e00", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "EBHEimZuEFQK" }, "source": [ "作為輸出,我們可以看到數據集的五(5)個條目。如果我們查看左側的索引,我們會發現這是前五行。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oj7GkrTdgRry" }, "source": [ "### 練習:\n", "\n", "從上面的例子可以看出,預設情況下,`DataFrame.head` 會返回 `DataFrame` 的前五行。在下面的程式碼區塊中,你能找到一種方法來顯示超過五行嗎?\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "id": "EKRmRFFegRrz", "trusted": false }, "outputs": [], "source": [ "# Hint: Consult the documentation by using iris_df.head?" ] }, { "cell_type": "markdown", "metadata": { "id": "BJ_cpZqNgRr1" }, "source": [ "### `DataFrame.tail`\n", "另一種查看數據的方式是從結尾開始(而不是從開頭)。`DataFrame.head` 的相反方法是 `DataFrame.tail`,它會返回 `DataFrame` 的最後五行:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "heanjfGWgRr2", "outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
1456.73.05.22.3
1466.32.55.01.9
1476.53.05.22.0
1486.23.45.42.3
1495.93.05.11.8
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "145 6.7 3.0 5.2 2.3\n", "146 6.3 2.5 5.0 1.9\n", "147 6.5 3.0 5.2 2.0\n", "148 6.2 3.4 5.4 2.3\n", "149 5.9 3.0 5.1 1.8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.tail()" ] }, { "cell_type": "markdown", "metadata": { "id": "31kBWfyLgRr3" }, "source": [ "在實際操作中,能夠輕鬆檢視 `DataFrame` 的前幾行或後幾行非常有用,特別是在檢查有序數據集中的異常值時。\n", "\n", "上面透過程式碼範例展示的所有函數和屬性,幫助我們了解數據的外觀和感覺。\n", "\n", "> **重點提示:** 僅僅透過查看 `DataFrame` 中的資訊元數據或其前幾個和後幾個值,就能立即對您正在處理的數據的大小、形狀和內容有一個初步的了解。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "TvurZyLSDxq_" }, "source": [ "### 缺失資料\n", "讓我們來探討缺失資料。當某些欄位中沒有儲存任何值時,就會出現缺失資料。\n", "\n", "舉個例子:假設某人對自己的體重很在意,因此在調查問卷中沒有填寫體重欄位。那麼,這個人的體重值就會是缺失的。\n", "\n", "在現實世界的數據集中,缺失值是非常常見的。\n", "\n", "**Pandas 如何處理缺失資料**\n", "\n", "Pandas 有兩種方式來處理缺失值。第一種方式你在之前的章節中可能已經見過:`NaN`,即「非數值」(Not a Number)。這其實是一個特殊值,屬於 IEEE 浮點數規範的一部分,專門用來表示缺失的浮點數值。\n", "\n", "對於浮點數以外的缺失值,Pandas 使用的是 Python 的 `None` 物件。雖然遇到兩種不同的值來表示相同的概念可能會讓人感到困惑,但這種設計選擇背後有其合理的程式邏輯原因。實際上,這樣的做法使得 Pandas 能夠在大多數情況下提供一個良好的折衷方案。儘管如此,`None` 和 `NaN` 都有各自的限制,特別是在使用它們時需要注意的地方。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lOHqUlZFgRr5" }, "source": [ "### `None`: 非浮點數的缺失資料\n", "由於 `None` 來自 Python,因此它無法用於非 `'object'` 資料類型的 NumPy 和 pandas 陣列。請記住,NumPy 陣列(以及 pandas 中的資料結構)只能包含一種類型的資料。這正是它們在大規模資料和計算工作中具有強大功能的原因,但也因此限制了它們的靈活性。這類陣列必須向上轉型為“最低共同分母”,即能夠涵蓋陣列中所有內容的資料類型。當陣列中包含 `None` 時,這表示您正在處理 Python 物件。\n", "\n", "為了更清楚地了解這一點,請參考以下範例陣列(注意其 `dtype`):\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QIoNdY4ngRr7", "outputId": "92779f18-62f4-4a03-eca2-e9a101604336", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "array([2, None, 6, 8], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "example1 = np.array([2, None, 6, 8])\n", "example1" ] }, { "cell_type": "markdown", "metadata": { "id": "pdlgPNbhgRr7" }, "source": [ "上升數據類型的現實伴隨著兩個副作用。首先,操作將在解釋的 Python 代碼層面進行,而不是編譯的 NumPy 代碼層面。基本上,這意味著任何涉及包含 `None` 的 `Series` 或 `DataFrame` 的操作都會變得較慢。雖然你可能不會注意到這種性能影響,但對於大型數據集來說,這可能會成為一個問題。\n", "\n", "第二個副作用源於第一個副作用。由於 `None` 本質上將 `Series` 或 `DataFrame` 拉回到原生 Python 的世界,因此在包含 ``None`` 值的數組上使用像 `sum()` 或 `min()` 這樣的 NumPy/pandas 聚合操作通常會產生錯誤:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 292 }, "id": "gWbx-KB9gRr8", "outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50", "trusted": false }, "outputs": [ { "ename": "TypeError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" ] } ], "source": [ "example1.sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "LcEwO8UogRr9" }, "source": [ "**主要結論**:整數與 `None` 值之間的加法(以及其他操作)是未定義的,這可能會限制您對包含這些值的數據集所能執行的操作。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "pWvVHvETgRr9" }, "source": [ "### `NaN`:缺失的浮點值\n", "\n", "與 `None` 不同,NumPy(因此也包括 pandas)支援使用 `NaN` 來進行快速的向量化操作和 ufuncs。壞消息是,任何對 `NaN` 進行的算術運算結果都會是 `NaN`。例如:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rcFYfMG9gRr9", "outputId": "699e81b7-5c11-4b46-df1d-06071768690f", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan + 1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BW3zQD2-gRr-", "outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan * 0" ] }, { "cell_type": "markdown", "metadata": { "id": "fU5IPRcCgRr-" }, "source": [ "好消息:在包含 `NaN` 的數組上運行的聚合不會出現錯誤。壞消息:結果並非一律有用:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LCInVgSSgRr_", "outputId": "fa06495a-0930-4867-87c5-6023031ea8b5", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "(nan, nan, nan)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ] }, { "cell_type": "markdown", "metadata": { "id": "nhlnNJT7gRr_" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "id": "yan3QRaOgRr_", "trusted": false }, "outputs": [], "source": [ "# What happens if you add np.nan and None together?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_iDvIRC8gRsA" }, "source": [ "請記住:`NaN` 只是用於缺失的浮點值;整數、字串或布林值沒有 `NaN` 的對應值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "kj6EKdsAgRsA" }, "source": [ "### `NaN` 和 `None`:pandas 中的空值\n", "\n", "即使 `NaN` 和 `None` 的行為略有不同,pandas 仍然設計為可以互換處理它們。為了說明這一點,請考慮一個整數的 `Series`:\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Nji-KGdNgRsA", "outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ] }, { "cell_type": "markdown", "metadata": { "id": "WklCzqb8gRsB" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true, "id": "Cy-gqX5-gRsB", "trusted": false }, "outputs": [], "source": [ "# Now set an element of int_series equal to None.\n", "# How does that element show up in the Series?\n", "# What is the dtype of the Series?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "WjMQwltNgRsB" }, "source": [ "在將資料型別向上轉型以建立 `Series` 和 `DataFrame` 的資料一致性時,pandas 會自動在 `None` 和 `NaN` 之間切換缺失值。由於這種設計特性,將 `None` 和 `NaN` 視為 pandas 中兩種不同形式的「空值」是很有幫助的。事實上,pandas 中一些處理缺失值的核心方法名稱就反映了這個概念:\n", "\n", "- `isnull()`: 生成一個布林遮罩,標示出缺失值\n", "- `notnull()`: 與 `isnull()` 相反\n", "- `dropna()`: 返回一個過濾後的資料版本\n", "- `fillna()`: 返回一個填補或推算缺失值後的資料副本\n", "\n", "這些方法非常重要,必須熟練掌握並運用自如,因此我們將逐一深入探討它們。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Yh5ifd9FgRsB" }, "source": [ "### 偵測空值\n", "\n", "現在我們已經了解了缺失值的重要性,在處理它們之前,我們需要在數據集中偵測它們。 \n", "`isnull()` 和 `notnull()` 是偵測空值的主要方法。這兩個方法都會對數據返回布林遮罩。\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true, "id": "e-vFp5lvgRsC", "trusted": false }, "outputs": [], "source": [ "example3 = pd.Series([0, np.nan, '', None])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1XdaJJ7PgRsC", "outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 True\n", "dtype: bool" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull()" ] }, { "cell_type": "markdown", "metadata": { "id": "PaSZ0SQygRsC" }, "source": [ "仔細看看輸出的內容。有什麼讓你感到意外的嗎?雖然 `0` 是一個算術上的「空值」,但它仍然是一個完全有效的整數,而 pandas 也將其視為如此。`''` 則稍微微妙一些。雖然我們在第 1 節中使用它來表示一個空字串值,但對 pandas 而言,它仍然是一個字串物件,而不是空值的表示。\n", "\n", "現在,讓我們換個角度,將這些方法以更接近實際使用的方式來應用。你可以直接將布林遮罩用作 ``Series`` 或 ``DataFrame`` 的索引,這在處理孤立的缺失值(或存在值)時非常有用。\n", "\n", "如果我們想要計算缺失值的總數,只需對 `isnull()` 方法生成的遮罩進行一次加總即可。\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JCcQVoPkHDUv", "outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d" }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "PlBqEo3mgRsC" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true, "id": "ggDVf5uygRsD", "trusted": false }, "outputs": [], "source": [ "# Try running example3[example3.notnull()].\n", "# Before you do so, what do you expect to see?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "D_jWN7mHgRsD" }, "source": [ "**主要收穫**:當您在 DataFrame 中使用 `isnull()` 和 `notnull()` 方法時,它們會產生類似的結果:顯示結果及其索引,這將在您處理數據時對您有極大的幫助。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "BvnoojWsgRr4" }, "source": [ "### 處理缺失數據\n", "\n", "> **學習目標:** 在本小節結束時,您應該了解如何以及何時替換或移除 DataFrame 中的空值。\n", "\n", "機器學習模型本身無法處理缺失數據。因此,在將數據傳入模型之前,我們需要先處理這些缺失值。\n", "\n", "處理缺失數據的方法涉及微妙的取捨,可能會影響您的最終分析結果以及實際應用的效果。\n", "\n", "主要有兩種處理缺失數據的方法:\n", "\n", "1. 移除包含缺失值的行\n", "2. 用其他值替換缺失值\n", "\n", "我們將詳細討論這兩種方法及其優缺點。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3VaYC1TvgRsD" }, "source": [ "### 刪除空值\n", "\n", "我們傳遞給模型的數據量會直接影響其性能。刪除空值意味著我們減少了數據點的數量,從而縮小了數據集的規模。因此,當數據集相當大時,建議刪除包含空值的行。\n", "\n", "另一種情況可能是某一行或列有大量缺失值。在這種情況下,可以刪除它們,因為這些行/列的大部分數據都缺失,對我們的分析幫助不大。\n", "\n", "除了識別缺失值之外,pandas 還提供了一種方便的方法來從 `Series` 和 `DataFrame` 中移除空值。為了實際了解這一點,我們回到 `example3`。`DataFrame.dropna()` 函數可以幫助刪除包含空值的行。\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7uIvS097gRsD", "outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "2 \n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3 = example3.dropna()\n", "example3" ] }, { "cell_type": "markdown", "metadata": { "id": "hil2cr64gRsD" }, "source": [ "請注意,這應該看起來像是您從 `example3[example3.notnull()]` 的輸出。這裡的不同之處在於,`dropna` 並非僅僅索引遮罩值,而是從 `Series` `example3` 中移除了那些缺失值。\n", "\n", "由於 DataFrame 是二維的,因此在刪除數據時提供了更多選擇。\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "an-l74sPgRsE", "outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN7
12.05.08
2NaN6.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 7\n", "1 2.0 5.0 8\n", "2 NaN 6.0 9" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4 = pd.DataFrame([[1, np.nan, 7], \n", " [2, 5, 8], \n", " [np.nan, 6, 9]])\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "66wwdHZrgRsE" }, "source": [ "(你是否注意到 pandas 將其中兩列提升為浮點數類型以容納 `NaN` 值?)\n", "\n", "你無法僅刪除 `DataFrame` 中的一個值,因此必須刪除整行或整列。根據你的需求,你可能會選擇刪除其中一種,而 pandas 提供了這兩種選項。由於在資料科學中,列通常代表變數,行則代表觀測值,因此你更有可能刪除資料的行;`dropna()` 的預設設定是刪除所有包含任何空值的行:\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "jAVU24RXgRsE", "outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
12.05.08
\n", "
" ], "text/plain": [ " 0 1 2\n", "1 2.0 5.0 8" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna()" ] }, { "cell_type": "markdown", "metadata": { "id": "TrQRBuTDgRsE" }, "source": [ "如果需要,您可以從列中刪除 NA 值。使用 `axis=1` 來執行:\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "GrBhxu9GgRsE", "outputId": "ff4001f3-2e61-4509-d60e-0093d1068437", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2
07
18
29
\n", "
" ], "text/plain": [ " 2\n", "0 7\n", "1 8\n", "2 9" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "id": "KWXiKTfMgRsF" }, "source": [ "請注意,這可能會刪除許多您可能希望保留的數據,特別是在較小的數據集中。如果您只想刪除包含幾個甚至所有空值的行或列該怎麼辦?您可以在 `dropna` 中使用 `how` 和 `thresh` 參數來指定這些設置。\n", "\n", "預設情況下,`how='any'`(如果您想自行檢查或查看該方法的其他參數,可以在程式碼單元中執行 `example4.dropna?`)。或者,您可以指定 `how='all'`,這樣只會刪除包含所有空值的行或列。讓我們擴展我們的範例 `DataFrame`,在下一個練習中看看這是如何運作的。\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "Bcf_JWTsgRsF", "outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4[3] = np.nan\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "pNZer7q9JPNC" }, "source": [ "> 關鍵重點:\n", "1. 僅在資料集足夠大的情況下,刪除空值才是一個好主意。\n", "2. 如果某些行或列的大部分數據都缺失,可以刪除整行或整列。\n", "3. `DataFrame.dropna(axis=)` 方法可以用來刪除空值。`axis` 參數表示是要刪除行還是列。\n", "4. 還可以使用 `how` 參數。預設值為 `any`,因此它只會刪除包含任意空值的行/列。可以將其設置為 `all`,以指定僅刪除所有值皆為空的行/列。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oXXSfQFHgRsF" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true, "id": "ExUwQRxpgRsF", "trusted": false }, "outputs": [], "source": [ "# How might you go about dropping just column 3?\n", "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "38kwAihWgRsG" }, "source": [ "`thresh` 參數提供更細緻的控制:您可以設定一行或一列需要具有的*非空*值數量,以便保留:\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "M9dCNMaagRsG", "outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
12.05.08NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "1 2.0 5.0 8 NaN" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='rows', thresh=3)" ] }, { "cell_type": "markdown", "metadata": { "id": "fmSFnzZegRsG" }, "source": [ "這裡,第一行和最後一行已被刪除,因為它們僅包含兩個非空值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "mCcxLGyUgRsG" }, "source": [ "### 填補空值\n", "\n", "有時候,用可能有效的值來填補缺失值是有意義的。填補空值有幾種技術。第一種是使用領域知識(即對數據集所基於主題的了解)來某種程度上估算缺失值。\n", "\n", "你可以使用 `isnull` 直接進行操作,但這可能會很繁瑣,特別是當你有很多值需要填補時。由於這在數據科學中是一個非常常見的任務,pandas 提供了 `fillna`,它會返回一個 `Series` 或 `DataFrame` 的副本,將缺失值替換為你選擇的值。我們來創建另一個範例 `Series`,看看這在實際操作中是如何運作的。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "CE8S7louLezV" }, "source": [ "### 類別型資料(非數值型)\n", "首先,我們來討論非數值型資料。在數據集中,我們可能會有包含類別型資料的欄位,例如性別、True 或 False 等。\n", "\n", "在大多數情況下,我們會用該欄位的「眾數」來替換缺失值。假設我們有 100 筆數據,其中 90 筆為 True,8 筆為 False,還有 2 筆未填寫。那麼,我們可以將這 2 筆未填寫的數據補為 True,因為 True 是該欄位的眾數。\n", "\n", "此外,在這裡我們也可以運用領域知識來進行補值。以下是一個使用眾數進行補值的例子。\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "MY5faq4yLdpQ", "outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
012True
134None
256False
378True
4910True
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 None\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", " [3,4,None],\n", " [5,6,\"False\"],\n", " [7,8,\"True\"],\n", " [9,10,\"True\"]])\n", "\n", "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "MLAoMQOfNPlA" }, "source": [] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WKy-9Y2tN5jv", "outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf" }, "outputs": [ { "data": { "text/plain": [ "True 3\n", "False 1\n", "Name: 2, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode[2].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "6iNz_zG_OKrx" }, "source": [] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "TxPKteRvNPOs" }, "outputs": [], "source": [ "fill_with_mode[2].fillna('True',inplace=True)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "tvas7c9_OPWE", "outputId": "ec3c8e44-d644-475e-9e22-c65101965850" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
012True
134True
256False
378True
4910True
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 True\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "SktitLxxOR16" }, "source": [ "正如我們所見,空值已被替換。毋庸置疑,我們本可以在 `'True'` 的位置寫任何內容,它都會被替代。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "heYe1I0dOmQ_" }, "source": [ "### 數值資料\n", "現在來談談數值資料。在這裡,我們有兩種常見的方法來替換缺失值:\n", "\n", "1. 用該行的中位數替換\n", "2. 用該行的平均值替換\n", "\n", "當資料存在偏態且有異常值時,我們使用中位數進行替換。這是因為中位數對異常值具有穩健性。\n", "\n", "當資料已經正規化時,我們可以使用平均值,因為在這種情況下,平均值和中位數會非常接近。\n", "\n", "首先,我們選擇一個呈正態分佈的欄位,並用該欄的平均值填補缺失值。\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "09HM_2feOj5Y", "outputId": "7e309013-9acb-411c-9b06-4de795bbeeff" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-2.001
1-1.023
2NaN45
31.067
42.089
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 NaN 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [np.nan,4,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "ka7-wNfzSxbx" }, "source": [ "該列的平均值是\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XYtYEf5BSxFL", "outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70" }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(fill_with_mean[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "oBSRGxKRS39K" }, "source": [] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "FzncQLmuS5jh", "outputId": "00f74fff-01f4-4024-c261-796f50f01d2e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-2.001
1-1.023
20.045
31.067
42.089
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 0.0 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "CwpVFCrPTC5z" }, "source": [ "正如我們所見,缺失值已被其平均值取代。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jIvF13a1i00Z" }, "source": [ "現在讓我們嘗試另一個數據框,這次我們將用該列的中位數替換 None 值。\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DA59Bqo3jBYZ", "outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-20.01
1-12.03
20NaN5
316.07
428.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 NaN 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [0,np.nan,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "mM1GpXYmjHnc" }, "source": [ "第二列的中位數是\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uiDy5v3xjHHX", "outputId": "564b6b74-2004-4486-90d4-b39330a64b88" }, "outputs": [ { "data": { "text/plain": [ "4.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].median()" ] }, { "cell_type": "markdown", "metadata": { "id": "z9PLF75Jj_1s" }, "source": [] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "lFKbOxCMkBbg", "outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-20.01
1-12.03
204.05
316.07
428.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 4.0 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "8JtQ53GSkKWC" }, "source": [ "正如我們所見,NaN 值已被該列的中位數取代\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0ybtWLDdgRsG", "outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b NaN\n", "c 2.0\n", "d NaN\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "example5" ] }, { "cell_type": "markdown", "metadata": { "id": "yrsigxRggRsH" }, "source": [ "您可以使用單一值(例如 `0`)填充所有的空值條目:\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KXMIPsQdgRsH", "outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 0.0\n", "c 2.0\n", "d 0.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(0)" ] }, { "cell_type": "markdown", "metadata": { "id": "RRlI5f_hkfKe" }, "source": [ "> 關鍵要點:\n", "1. 當數據較少或有策略填補缺失數據時,應進行缺失值填補。\n", "2. 可以利用領域知識通過近似方法填補缺失值。\n", "3. 對於分類數據,通常使用該列的眾數來替代缺失值。\n", "4. 對於數值數據,缺失值通常用平均值(針對正規化數據集)或該列的中位數來填補。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "FI9MmqFJgRsH" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true, "id": "af-ezpXdgRsH", "trusted": false }, "outputs": [], "source": [ "# What happens if you try to fill null values with a string, like ''?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "kq3hw1kLgRsI" }, "source": [ "您可以使用**向前填充**空值,即使用最後一個有效值來填充空值:\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vO3BuNrggRsI", "outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 1.0\n", "c 2.0\n", "d 2.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='ffill')" ] }, { "cell_type": "markdown", "metadata": { "id": "nDXeYuHzgRsI" }, "source": [ "您也可以使用**回填**將下一個有效值向後傳播以填補空值:\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4M5onHcEgRsI", "outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 2.0\n", "c 2.0\n", "d 3.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='bfill')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "id": "MbBzTom5gRsI" }, "source": [ "正如你可能猜到的,這與 DataFrame 的操作方式相同,但你也可以指定一個 `axis` 來填充空值:\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "aRpIvo4ZgRsI", "outputId": "905a980a-a808-4eca-d0ba-224bd7d85955", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "VM1qtACAgRsI", "outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.01.07.07.0
12.05.08.08.0
2NaN6.09.09.0
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 1.0 7.0 7.0\n", "1 2.0 5.0 8.0 8.0\n", "2 NaN 6.0 9.0 9.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(method='ffill', axis=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZeMc-I1EgRsI" }, "source": [ "請注意,當先前的值無法用於向前填充時,空值將保留。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "eeAoOU0RgRsJ" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true, "id": "e8S-CjW8gRsJ", "trusted": false }, "outputs": [], "source": [ "# What output does example4.fillna(method='bfill', axis=1) produce?\n", "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YHgy0lIrgRsJ" }, "source": [ "您可以創意地使用 `fillna`。例如,我們再來看看 `example4`,但這次讓我們用 `DataFrame` 中所有值的平均值來填補缺失值:\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "OtYVErEygRsJ", "outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.05.57NaN
12.05.08NaN
21.56.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 5.5 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 1.5 6.0 9 NaN" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(example4.mean())" ] }, { "cell_type": "markdown", "metadata": { "id": "zpMvCkLSgRsJ" }, "source": [ "注意,第 3 欄仍然是空的:預設方向是按行填充值。\n", "\n", "> **重點提示:** 處理數據集中缺失值的方法有很多種。具體採用的策略(移除、替換,甚至替換的方式)應該根據該數據的特點來決定。隨著你處理和接觸更多數據集,你將更好地掌握如何應對缺失值的技巧。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bauDnESIl9FH" }, "source": [ "### 編碼分類數據\n", "\n", "機器學習模型只能處理數字以及任何形式的數值數據。它無法分辨「是」和「否」,但能區分 0 和 1。因此,在填補缺失值之後,我們需要將分類數據編碼為某種數字形式,讓模型能夠理解。\n", "\n", "編碼可以通過兩種方式完成。我們接下來將討論這些方法。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "uDq9SxB7mu5i" }, "source": [ "**標籤編碼**\n", "\n", "標籤編碼基本上是將每個類別轉換為一個數字。例如,假設我們有一個航空乘客的數據集,其中有一列包含他們的艙等,分別是 ['business class', 'economy class', 'first class']。如果對這些數據進行標籤編碼,則會被轉換為 [0, 1, 2]。讓我們通過程式碼來看一個例子。由於我們會在接下來的筆記本中學習 `scikit-learn`,這裡我們不會使用它。\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "1vGz7uZyoWHL", "outputId": "9e252855-d193-4103-a54d-028ea7787b34" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "IDHnkwTYov-h" }, "source": [ "要對第一列執行標籤編碼,我們必須先描述每個類別到數字的映射,然後再進行替換\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZC5URJG3o1ES", "outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
0100
1202
2301
3401
4501
5600
\n", "
" ], "text/plain": [ " ID class\n", "0 10 0\n", "1 20 2\n", "2 30 1\n", "3 40 1\n", "4 50 1\n", "5 60 0" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_labels = {'business class':0,'economy class':1,'first class':2}\n", "label['class'] = label['class'].replace(class_labels)\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "ftnF-TyapOPt" }, "source": [ "正如我們所見,輸出結果與我們預期的一致。那麼,我們什麼時候使用標籤編碼呢?標籤編碼適用於以下一種或兩種情況:\n", "\n", "1. 當類別數量很多時 \n", "2. 當類別具有順序性時 \n" ] }, { "cell_type": "markdown", "metadata": { "id": "eQPAPVwsqWT7" }, "source": [ "**獨熱編碼**\n", "\n", "另一種編碼方式是獨熱編碼(One Hot Encoding)。在這種編碼方式中,每個欄位的類別會被新增為一個單獨的欄位,並且每個數據點根據是否包含該類別,分別被賦值為 0 或 1。因此,如果有 n 種不同的類別,則會在資料框中新增 n 個欄位。\n", "\n", "例如,我們以相同的飛機艙等為例。類別為:['business class', 'economy class', 'first class']。那麼,如果我們執行獨熱編碼,數據集中將新增以下三個欄位:['class_business class', 'class_economy class', 'class_first class']。\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZM0eVh0ArKUL", "outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "one_hot" ] }, { "cell_type": "markdown", "metadata": { "id": "aVnZ7paDrWmb" }, "source": [ "讓我們對第一列執行獨熱編碼\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "RUPxf7egrYKr" }, "outputs": [], "source": [ "one_hot_data = pd.get_dummies(one_hot,columns=['class'])" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "TM37pHsFr4ge", "outputId": "7be15f53-79b2-447a-979c-822658339a9e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass_business classclass_economy classclass_first class
010100
120001
230010
340010
450010
560100
\n", "
" ], "text/plain": [ " ID class_business class class_economy class class_first class\n", "0 10 1 0 0\n", "1 20 0 0 1\n", "2 30 0 1 0\n", "3 40 0 1 0\n", "4 50 0 1 0\n", "5 60 1 0 0" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot_data" ] }, { "cell_type": "markdown", "metadata": { "id": "_zXRLOjXujdA" }, "source": [ "每個獨熱編碼欄位包含 0 或 1,表示該數據點是否存在該類別。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bDnC4NQOu0qr" }, "source": [ "什麼時候使用獨熱編碼(One Hot Encoding)? \n", "獨熱編碼通常在以下一種或兩種情況下使用:\n", "\n", "1. 當類別數量和數據集的規模較小時。\n", "2. 當類別之間沒有特定的順序時。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XnUmci_4uvyu" }, "source": [ "> 關鍵重點: \n", "1. 編碼的目的是將非數值型資料轉換為數值型資料。 \n", "2. 編碼主要有兩種類型:標籤編碼(Label encoding)和獨熱編碼(One Hot encoding),可以根據資料集的需求選擇使用。 \n" ] }, { "cell_type": "markdown", "metadata": { "id": "K8UXOJYRgRsJ" }, "source": [ "## 移除重複數據\n", "\n", "> **學習目標:** 在本小節結束時,您應該能夠熟練地識別並移除 DataFrame 中的重複值。\n", "\n", "除了缺失數據之外,您在處理真實世界的數據集時,經常會遇到重複的數據。幸運的是,pandas 提供了一種簡單的方法來檢測和移除重複的條目。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qrEG-Wa0gRsJ" }, "source": [ "### 識別重複值:`duplicated`\n", "\n", "你可以使用 pandas 的 `duplicated` 方法輕鬆找出重複的值。該方法會返回一個布林遮罩,指示 `DataFrame` 中某個條目是否是之前條目的重複值。讓我們創建另一個範例 `DataFrame` 來實際看看這個方法的效果。\n" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "ZLu6FEnZgRsJ", "outputId": "376512d1-d842-4db1-aea3-71052aeeecaf", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
2A1
3B3
4B3
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "2 A 1\n", "3 B 3\n", "4 B 3" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cIduB5oBgRsK", "outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "4 True\n", "dtype: bool" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.duplicated()" ] }, { "cell_type": "markdown", "metadata": { "id": "0eDRJD4SgRsK" }, "source": [ "### 刪除重複值:`drop_duplicates`\n", "`drop_duplicates` 會返回一份數據的副本,其中所有 `duplicated` 值均為 `False`:\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "w_YPpqIqgRsK", "outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
3B3
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "3 B 3" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": { "id": "69AqoCZAgRsK" }, "source": [ "`duplicated` 和 `drop_duplicates` 預設會考慮所有列,但您可以指定它們僅檢查 `DataFrame` 中的一部分列:\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "BILjDs67gRsK", "outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates(['letters'])" ] }, { "cell_type": "markdown", "metadata": { "id": "GvX4og1EgRsL" }, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**免責聲明**: \n本文件使用 AI 翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。我們致力於提供準確的翻譯,但請注意,自動翻譯可能包含錯誤或不準確之處。應以原始語言的文件作為權威來源。對於關鍵資訊,建議尋求專業人工翻譯。我們對因使用此翻譯而產生的任何誤解或錯誤解讀概不負責。\n" ] } ], "metadata": { "anaconda-cloud": {}, "colab": { "name": "notebook.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "coopTranslator": { "original_hash": "8533b3a2230311943339963fc7f04c21", "translation_date": "2025-09-02T08:15:52+00:00", "source_file": "2-Working-With-Data/08-data-preparation/notebook.ipynb", "language_code": "tw" } }, "nbformat": 4, "nbformat_minor": 0 }