{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "rQ8UhzFpgRra" }, "source": [ "# 數據準備\n", "\n", "[原始筆記本來源:*Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n", "\n", "## 探索 `DataFrame` 資訊\n", "\n", "> **學習目標:** 完成本小節後,您應該能夠熟悉如何查找存儲在 pandas DataFrame 中的數據的一般資訊。\n", "\n", "當您將數據載入到 pandas 中時,數據通常會以 `DataFrame` 的形式存在。然而,如果您的 `DataFrame` 中的數據集有 60,000 行和 400 列,您該如何開始了解自己正在處理的內容呢?幸運的是,pandas 提供了一些方便的工具,可以快速查看 `DataFrame` 的整體資訊,以及前幾行和後幾行的數據。\n", "\n", "為了探索這些功能,我們將導入 Python 的 scikit-learn 庫,並使用一個每位數據科學家都看過數百次的經典數據集:英國生物學家 Ronald Fisher 在他1936年的論文《多重測量在分類問題中的應用》中使用的 *Iris* 數據集:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "id": "hB1RofhdgRrp", "trusted": false }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])" ] }, { "cell_type": "markdown", "metadata": { "id": "AGA0A_Y8hMdz" }, "source": [ "### `DataFrame.shape`\n", "我們已將 Iris 數據集加載到變數 `iris_df` 中。在深入分析數據之前,了解我們擁有的數據點數量以及整個數據集的大小是很有價值的。這有助於我們了解正在處理的數據量。\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LOe5jQohhulf", "outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9" }, "outputs": [ { "data": { "text/plain": [ "(150, 4)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "smE7AGzOhxk2" }, "source": [ "所以,我們正在處理150行和4列的數據。每一行代表一個數據點,而每一列則代表與數據框相關的一個特徵。基本上,就是150個數據點,每個數據點包含4個特徵。\n", "\n", "`shape` 在這裡是數據框的一個屬性,而不是一個函數,所以它的結尾並沒有一對括號。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "d3AZKs0PinGP" }, "source": [ "### `DataFrame.columns`\n", "現在讓我們看看這個數據框的四個欄位。每個欄位到底代表什麼?`columns` 屬性會告訴我們數據框中欄位的名稱。\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YPGh_ziji-CY", "outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0" }, "outputs": [ { "data": { "text/plain": [ "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n", " 'petal width (cm)'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.columns" ] }, { "cell_type": "markdown", "metadata": { "id": "TsobcU_VjCC_" }, "source": [ "正如我們所見,有四(4)列。`columns` 屬性告訴我們列的名稱,基本上沒有其他內容。當我們想要識別數據集包含的特徵時,這個屬性就顯得重要了。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "2UTlvkjmgRrs" }, "source": [ "### `DataFrame.info`\n", "透過 `shape` 屬性提供的數據量以及 `columns` 屬性提供的特徵或欄位名稱,可以讓我們了解一些關於數據集的資訊。現在,我們希望更深入地探索這個數據集。`DataFrame.info()` 函數非常有用。\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dHHRyG0_gRrt", "outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5", "trusted": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 150 entries, 0 to 149\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 sepal length (cm) 150 non-null float64\n", " 1 sepal width (cm) 150 non-null float64\n", " 2 petal length (cm) 150 non-null float64\n", " 3 petal width (cm) 150 non-null float64\n", "dtypes: float64(4)\n", "memory usage: 4.8 KB\n" ] } ], "source": [ "iris_df.info()" ] }, { "cell_type": "markdown", "metadata": { "id": "1XgVMpvigRru" }, "source": [ "從這裡,我們可以作出以下幾點觀察: \n", "1. 每列的數據類型:在這個數據集中,所有數據都以64位浮點數的形式存儲。 \n", "2. 非空值的數量:處理空值是數據準備中的一個重要步驟,稍後會在筆記本中處理。 \n" ] }, { "cell_type": "markdown", "metadata": { "id": "IYlyxbpWFEF4" }, "source": [ "### DataFrame.describe()\n", "假設我們的數據集包含大量數值數據。可以對每個欄位單獨進行單變量統計計算,例如平均值、中位數、四分位數等。`DataFrame.describe()` 函數為我們提供了數據集中數值欄位的統計摘要。\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "tWV-CMstFIRA", "outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
count150.000000150.000000150.000000150.000000
mean5.8433333.0573333.7580001.199333
std0.8280660.4358661.7652980.762238
min4.3000002.0000001.0000000.100000
25%5.1000002.8000001.6000000.300000
50%5.8000003.0000004.3500001.300000
75%6.4000003.3000005.1000001.800000
max7.9000004.4000006.9000002.500000
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "count 150.000000 150.000000 150.000000 150.000000\n", "mean 5.843333 3.057333 3.758000 1.199333\n", "std 0.828066 0.435866 1.765298 0.762238\n", "min 4.300000 2.000000 1.000000 0.100000\n", "25% 5.100000 2.800000 1.600000 0.300000\n", "50% 5.800000 3.000000 4.350000 1.300000\n", "75% 6.400000 3.300000 5.100000 1.800000\n", "max 7.900000 4.400000 6.900000 2.500000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.describe()" ] }, { "cell_type": "markdown", "metadata": { "id": "zjjtW5hPGMuM" }, "source": [ "上面的輸出顯示了每列的數據點總數、平均值、標準差、最小值、下四分位數(25%)、中位數(50%)、上四分位數(75%)和最大值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-lviAu99gRrv" }, "source": [ "### `DataFrame.head`\n", "透過以上所有的函數和屬性,我們已經對數據集有了一個高層次的概覽。我們知道數據集中有多少個數據點,有多少個特徵,每個特徵的數據類型,以及每個特徵中非空值的數量。\n", "\n", "現在是時候看看數據本身了。讓我們看看我們的 `DataFrame` 的前幾行(前幾個數據點)是什麼樣子:\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DZMJZh0OgRrw", "outputId": "d9393ee5-c106-4797-f815-218f17160e00", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "EBHEimZuEFQK" }, "source": [ "作為這裡的輸出,我們可以看到數據集的五(5)個條目。如果我們查看左側的索引,我們會發現這是前五行。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oj7GkrTdgRry" }, "source": [ "### 練習:\n", "\n", "從上述範例可以看出,預設情況下,`DataFrame.head` 會返回 `DataFrame` 的前五行。在下面的程式碼區塊中,你能找到一種方法來顯示多於五行的內容嗎?\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "id": "EKRmRFFegRrz", "trusted": false }, "outputs": [], "source": [ "# Hint: Consult the documentation by using iris_df.head?" ] }, { "cell_type": "markdown", "metadata": { "id": "BJ_cpZqNgRr1" }, "source": [ "### `DataFrame.tail`\n", "另一種查看數據的方法是從結尾開始(而不是從開頭)。`DataFrame.head` 的相反方法是 `DataFrame.tail`,它會返回 `DataFrame` 的最後五行:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "heanjfGWgRr2", "outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
1456.73.05.22.3
1466.32.55.01.9
1476.53.05.22.0
1486.23.45.42.3
1495.93.05.11.8
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "145 6.7 3.0 5.2 2.3\n", "146 6.3 2.5 5.0 1.9\n", "147 6.5 3.0 5.2 2.0\n", "148 6.2 3.4 5.4 2.3\n", "149 5.9 3.0 5.1 1.8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.tail()" ] }, { "cell_type": "markdown", "metadata": { "id": "31kBWfyLgRr3" }, "source": [ "在實際操作中,能夠輕鬆查看 `DataFrame` 的前幾行或後幾行非常有用,特別是在檢查有序數據集中的異常值時。\n", "\n", "上面展示的所有函數和屬性,通過代碼示例的幫助,讓我們能夠快速了解數據的概況。\n", "\n", "> **重點提示:** 即使僅僅查看 `DataFrame` 中的元數據或前幾個和後幾個值,也能讓你立即了解所處理數據的大小、形狀和內容。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "TvurZyLSDxq_" }, "source": [ "### 缺失數據\n", "讓我們深入了解缺失數據。缺失數據是指某些欄位中沒有存儲任何值。\n", "\n", "舉個例子:假設某人對自己的體重非常在意,因此在調查中沒有填寫體重欄位。那麼,該人的體重值就會是缺失的。\n", "\n", "在現實世界的數據集中,缺失值是非常常見的。\n", "\n", "**Pandas 如何處理缺失數據**\n", "\n", "Pandas 有兩種方式來處理缺失值。第一種方式你可能在之前的章節中已經見過:`NaN`,即「非數值」(Not a Number)。這實際上是一個特殊值,是 IEEE 浮點規範的一部分,僅用於表示缺失的浮點值。\n", "\n", "對於非浮點類型的缺失值,Pandas 使用 Python 的 `None` 對象。雖然遇到兩種不同的值來表示同樣的事情可能會讓人感到困惑,但這種設計選擇有其合理的編程原因。在實際應用中,這種方式使得 Pandas 能夠在絕大多數情況下提供良好的平衡。不過,`None` 和 `NaN` 都有一些限制,使用時需要注意它們的適用範圍和使用方式。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lOHqUlZFgRr5" }, "source": [ "### `None`:非浮點類型的缺失數據\n", "由於 `None` 是來自 Python 的,因此它無法用於數據類型不是 `'object'` 的 NumPy 和 pandas 陣列。請記住,NumPy 陣列(以及 pandas 中的數據結構)只能包含一種類型的數據。這正是它們在大規模數據和計算工作中具有強大功能的原因,但同時也限制了它們的靈活性。這類陣列必須向“最低公分母”進行上轉型,也就是能夠涵蓋陣列中所有內容的數據類型。如果陣列中包含 `None`,這意味著你正在處理 Python 對象。\n", "\n", "以下是一個示例陣列(注意它的 `dtype`)來展示這一點:\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QIoNdY4ngRr7", "outputId": "92779f18-62f4-4a03-eca2-e9a101604336", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "array([2, None, 6, 8], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "example1 = np.array([2, None, 6, 8])\n", "example1" ] }, { "cell_type": "markdown", "metadata": { "id": "pdlgPNbhgRr7" }, "source": [ "上轉數據類型的現實情況帶來了兩個副作用。首先,操作將在解釋執行的 Python 代碼層面進行,而不是在編譯的 NumPy 代碼層面進行。本質上,這意味著任何涉及包含 `None` 的 `Series` 或 `DataFrame` 的操作都會變得較慢。雖然你可能不會注意到這種性能影響,但對於大型數據集來說,這可能會成為一個問題。\n", "\n", "第二個副作用源於第一個副作用。由於 `None` 本質上將 `Series` 或 `DataFrame` 拉回到原生 Python 的世界,因此對包含 `None` 值的數組使用 NumPy/pandas 的聚合函數(例如 `sum()` 或 `min()`)通常會產生錯誤:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 292 }, "id": "gWbx-KB9gRr8", "outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50", "trusted": false }, "outputs": [ { "ename": "TypeError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" ] } ], "source": [ "example1.sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "LcEwO8UogRr9" }, "source": [ "**主要重點**:整數與 `None` 值之間的加法(以及其他運算)是未定義的,這可能會限制您對包含這些值的數據集所能執行的操作。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "pWvVHvETgRr9" }, "source": [ "### `NaN`:缺失的浮點值\n", "\n", "與 `None` 不同,NumPy(因此也包括 pandas)支援使用 `NaN` 來進行快速的向量化操作和 ufuncs。壞消息是,任何對 `NaN` 進行的算術運算,結果都會是 `NaN`。例如:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rcFYfMG9gRr9", "outputId": "699e81b7-5c11-4b46-df1d-06071768690f", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan + 1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BW3zQD2-gRr-", "outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan * 0" ] }, { "cell_type": "markdown", "metadata": { "id": "fU5IPRcCgRr-" }, "source": [ "好消息:在包含 `NaN` 的數組上運行的聚合不會出現錯誤。壞消息:結果並非一律有用:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LCInVgSSgRr_", "outputId": "fa06495a-0930-4867-87c5-6023031ea8b5", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "(nan, nan, nan)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ] }, { "cell_type": "markdown", "metadata": { "id": "nhlnNJT7gRr_" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "id": "yan3QRaOgRr_", "trusted": false }, "outputs": [], "source": [ "# What happens if you add np.nan and None together?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_iDvIRC8gRsA" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "kj6EKdsAgRsA" }, "source": [ "### `NaN` 和 `None`:pandas 中的空值\n", "\n", "雖然 `NaN` 和 `None` 的行為可能有些不同,但 pandas 仍然能夠將它們互換處理。為了更清楚地了解,請看以下一個整數的 `Series`:\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Nji-KGdNgRsA", "outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ] }, { "cell_type": "markdown", "metadata": { "id": "WklCzqb8gRsB" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true, "id": "Cy-gqX5-gRsB", "trusted": false }, "outputs": [], "source": [ "# Now set an element of int_series equal to None.\n", "# How does that element show up in the Series?\n", "# What is the dtype of the Series?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "WjMQwltNgRsB" }, "source": [ "在將資料類型向上轉型以建立 `Series` 和 `DataFrame` 的資料一致性時,pandas 會自動在 `None` 和 `NaN` 之間切換缺失值。由於這種設計特性,將 `None` 和 `NaN` 視為 pandas 中兩種不同形式的「空值」會更容易理解。事實上,pandas 中一些處理缺失值的核心方法名稱也反映了這個概念:\n", "\n", "- `isnull()`: 生成一個布林遮罩,標示出缺失值的位置\n", "- `notnull()`: 與 `isnull()` 相反\n", "- `dropna()`: 返回一個已過濾掉缺失值的資料版本\n", "- `fillna()`: 返回一個已填補或補全缺失值的資料副本\n", "\n", "這些方法非常重要,熟練掌握它們對於處理缺失值至關重要,因此我們將逐一深入探討這些方法。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Yh5ifd9FgRsB" }, "source": [ "### 檢測空值\n", "\n", "既然我們已經了解缺失值的重要性,在處理它們之前,我們需要先在數據集中檢測它們。 \n", "`isnull()` 和 `notnull()` 是檢測空值的主要方法。這兩個方法都會對數據返回布林掩碼。\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true, "id": "e-vFp5lvgRsC", "trusted": false }, "outputs": [], "source": [ "example3 = pd.Series([0, np.nan, '', None])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1XdaJJ7PgRsC", "outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 True\n", "dtype: bool" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull()" ] }, { "cell_type": "markdown", "metadata": { "id": "PaSZ0SQygRsC" }, "source": [ "仔細看看輸出內容。有沒有任何部分讓你感到意外?雖然 `0` 是一個算術上的「空值」,但它仍然是一個完全有效的整數,而 pandas 也將其視為如此。至於 `''`,情況稍微微妙一點。雖然我們在第 1 節中使用它來表示一個空字串值,但對 pandas 而言,它仍然是一個字串物件,而不是空值的表示。\n", "\n", "現在,讓我們換個角度,將這些方法以更接近實際使用的方式來應用。你可以直接將布林遮罩用作 ``Series`` 或 ``DataFrame`` 的索引,這在處理孤立的缺失值(或存在值)時非常有用。\n", "\n", "如果我們想要計算缺失值的總數,只需對 `isnull()` 方法生成的遮罩進行一次加總即可。\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JCcQVoPkHDUv", "outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d" }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "PlBqEo3mgRsC" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true, "id": "ggDVf5uygRsD", "trusted": false }, "outputs": [], "source": [ "# Try running example3[example3.notnull()].\n", "# Before you do so, what do you expect to see?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "D_jWN7mHgRsD" }, "source": [ "**關鍵重點**:當你在 DataFrame 中使用 `isnull()` 和 `notnull()` 方法時,它們會產生相似的結果:顯示結果及其索引,這將在你處理數據時對你有極大的幫助。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "BvnoojWsgRr4" }, "source": [ "### 處理缺失數據\n", "\n", "> **學習目標:** 完成本小節後,你應該了解如何以及何時替換或移除 DataFrame 中的空值。\n", "\n", "機器學習模型無法直接處理缺失數據。因此,在將數據傳入模型之前,我們需要先處理這些缺失值。\n", "\n", "處理缺失數據的方法涉及微妙的取捨,可能會影響你的最終分析結果以及實際應用的效果。\n", "\n", "主要有兩種處理缺失數據的方法:\n", "\n", "1. 刪除包含缺失值的行\n", "2. 用其他值替換缺失值\n", "\n", "我們將詳細討論這兩種方法及其優缺點。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3VaYC1TvgRsD" }, "source": [ "### 刪除空值\n", "\n", "我們傳遞給模型的數據量會直接影響其性能。刪除空值意味著我們減少了數據點的數量,從而縮小了數據集的規模。因此,當數據集相當大的時候,建議刪除包含空值的行。\n", "\n", "另一種情況可能是某一行或某一列有大量缺失值。在這種情況下,它們可能會被刪除,因為該行/列的大部分數據都缺失,對我們的分析幫助不大。\n", "\n", "除了識別缺失值之外,pandas 還提供了一種方便的方法來從 `Series` 和 `DataFrame` 中移除空值。為了實際了解這一點,我們可以回到 `example3`。`DataFrame.dropna()` 函數可以幫助刪除包含空值的行。\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7uIvS097gRsD", "outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "2 \n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3 = example3.dropna()\n", "example3" ] }, { "cell_type": "markdown", "metadata": { "id": "hil2cr64gRsD" }, "source": [ "請注意,這應該看起來像你的輸出 `example3[example3.notnull()]`。這裡的不同之處在於,`dropna` 不僅僅是基於遮罩值進行索引,而是從 `Series` `example3` 中移除了那些缺失值。\n", "\n", "由於 DataFrame 是二維的,因此它提供了更多刪除數據的選項。\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "an-l74sPgRsE", "outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN7
12.05.08
2NaN6.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 7\n", "1 2.0 5.0 8\n", "2 NaN 6.0 9" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4 = pd.DataFrame([[1, np.nan, 7], \n", " [2, 5, 8], \n", " [np.nan, 6, 9]])\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "66wwdHZrgRsE" }, "source": [ "(你有留意到 pandas 將其中兩列數據提升為浮點數以容納 `NaN` 嗎?)\n", "\n", "你無法從 `DataFrame` 中刪除單個值,因此你必須刪除整行或整列。根據你的需求,你可能會選擇其中一種方式,因此 pandas 提供了兩種選項。由於在數據科學中,列通常代表變量,而行代表觀測值,因此你更有可能刪除數據行;`dropna()` 的預設設定是刪除所有包含任何空值的行:\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "jAVU24RXgRsE", "outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
12.05.08
\n", "
" ], "text/plain": [ " 0 1 2\n", "1 2.0 5.0 8" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna()" ] }, { "cell_type": "markdown", "metadata": { "id": "TrQRBuTDgRsE" }, "source": [ "如果有需要,你可以從列中刪除 NA 值。使用 `axis=1` 來執行:\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "GrBhxu9GgRsE", "outputId": "ff4001f3-2e61-4509-d60e-0093d1068437", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2
07
18
29
\n", "
" ], "text/plain": [ " 2\n", "0 7\n", "1 8\n", "2 9" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "id": "KWXiKTfMgRsF" }, "source": [ "請注意,這可能會刪除許多您可能想保留的數據,特別是在較小的數據集中。如果您只想刪除包含多個甚至全部空值的行或列,該怎麼辦?您可以在 `dropna` 中使用 `how` 和 `thresh` 參數來指定這些設置。\n", "\n", "預設情況下,`how='any'`(如果您想自行檢查或查看該方法的其他參數,可以在程式碼單元中執行 `example4.dropna?`)。您也可以選擇指定 `how='all'`,這樣只會刪除包含所有空值的行或列。我們來擴展我們的範例 `DataFrame`,在下一個練習中看看這是如何運作的。\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "Bcf_JWTsgRsF", "outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4[3] = np.nan\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "pNZer7q9JPNC" }, "source": [ "> 關鍵要點:\n", "1. 只有在數據集足夠大的情況下,刪除空值才是明智的選擇。\n", "2. 如果整行或整列的大部分數據都缺失,可以考慮刪除。\n", "3. `DataFrame.dropna(axis=)` 方法可用於刪除空值。`axis` 參數表示是刪除行還是列。\n", "4. `how` 參數也可以使用。默認情況下設置為 `any`,因此它只會刪除包含任何空值的行或列。可以將其設置為 `all`,以指定僅刪除所有值均為空的行或列。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oXXSfQFHgRsF" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true, "id": "ExUwQRxpgRsF", "trusted": false }, "outputs": [], "source": [ "# How might you go about dropping just column 3?\n", "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "38kwAihWgRsG" }, "source": [ "`thresh` 參數為您提供更細緻的控制:您可以設定一行或一列需要擁有的*非空*值的數量,才能被保留:\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "M9dCNMaagRsG", "outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
12.05.08NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "1 2.0 5.0 8 NaN" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='rows', thresh=3)" ] }, { "cell_type": "markdown", "metadata": { "id": "fmSFnzZegRsG" }, "source": [ "這裡,第一行和最後一行已被刪除,因為它們僅包含兩個非空值。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "mCcxLGyUgRsG" }, "source": [ "### 填充空值\n", "\n", "有時候填補缺失值為可能有效的值是合理的。有幾種方法可以用來填充空值。第一種方法是使用領域知識(即對數據集所基於的主題的了解)來大致估算缺失值。\n", "\n", "你可以使用 `isnull` 直接進行操作,但這可能會很繁瑣,特別是當你需要填充大量值時。由於這在數據科學中是一個非常常見的任務,pandas 提供了 `fillna` 方法,它會返回一個 `Series` 或 `DataFrame` 的副本,並將缺失值替換為你選擇的值。讓我們創建另一個例子 `Series`,來看看這在實際操作中是如何工作的。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "CE8S7louLezV" }, "source": [ "### 類別型數據(非數字)\n", "首先,我們來看看非數字型數據。在數據集裡,我們有一些包含類別型數據的欄位,例如性別、True 或 False 等。\n", "\n", "在大多數情況下,我們會用該欄位的 `眾數` 來替換缺失值。假設我們有 100 個數據點,其中 90 個選擇 True,8 個選擇 False,還有 2 個未填寫。那麼,我們可以用 True 填補這 2 個缺失值,基於整個欄位的情況。\n", "\n", "此外,我們也可以在這裡運用領域知識。讓我們來看一個用眾數填補的例子。\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "MY5faq4yLdpQ", "outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
012True
134None
256False
378True
4910True
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 None\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", " [3,4,None],\n", " [5,6,\"False\"],\n", " [7,8,\"True\"],\n", " [9,10,\"True\"]])\n", "\n", "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "MLAoMQOfNPlA" }, "source": [] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WKy-9Y2tN5jv", "outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf" }, "outputs": [ { "data": { "text/plain": [ "True 3\n", "False 1\n", "Name: 2, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode[2].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "6iNz_zG_OKrx" }, "source": [] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "TxPKteRvNPOs" }, "outputs": [], "source": [ "fill_with_mode[2].fillna('True',inplace=True)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "tvas7c9_OPWE", "outputId": "ec3c8e44-d644-475e-9e22-c65101965850" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
012True
134True
256False
378True
4910True
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 True\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "SktitLxxOR16" }, "source": [ "正如我們所見,空值已被替換。毋庸置疑,我們本可以在 `'True'` 的位置寫任何內容,並且它都會被替換。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "heYe1I0dOmQ_" }, "source": [ "### 數值數據\n", "現在來談談數值數據。在這裡,我們有兩種常見的方法來替換缺失值:\n", "\n", "1. 用該行的中位數替換 \n", "2. 用該行的平均值替換 \n", "\n", "當數據存在偏態且有異常值時,我們會選擇用中位數替換。這是因為中位數對異常值具有穩健性。\n", "\n", "當數據已經正規化時,我們可以使用平均值,因為在這種情況下,平均值和中位數會非常接近。\n", "\n", "首先,我們選擇一個呈正態分佈的列,並用該列的平均值填補缺失值。\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "09HM_2feOj5Y", "outputId": "7e309013-9acb-411c-9b06-4de795bbeeff" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-2.001
1-1.023
2NaN45
31.067
42.089
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 NaN 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [np.nan,4,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "ka7-wNfzSxbx" }, "source": [ "該列的平均值是\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XYtYEf5BSxFL", "outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70" }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(fill_with_mean[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "oBSRGxKRS39K" }, "source": [] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "FzncQLmuS5jh", "outputId": "00f74fff-01f4-4024-c261-796f50f01d2e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-2.001
1-1.023
20.045
31.067
42.089
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 0.0 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "CwpVFCrPTC5z" }, "source": [ "正如我們所見,缺失值已被其平均值取代。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jIvF13a1i00Z" }, "source": [ "現在讓我們嘗試另一個數據框,這次我們將用該列的中位數替換 None 值。\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DA59Bqo3jBYZ", "outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-20.01
1-12.03
20NaN5
316.07
428.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 NaN 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [0,np.nan,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "mM1GpXYmjHnc" }, "source": [ "第二列的中位數是\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uiDy5v3xjHHX", "outputId": "564b6b74-2004-4486-90d4-b39330a64b88" }, "outputs": [ { "data": { "text/plain": [ "4.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].median()" ] }, { "cell_type": "markdown", "metadata": { "id": "z9PLF75Jj_1s" }, "source": [] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "lFKbOxCMkBbg", "outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-20.01
1-12.03
204.05
316.07
428.09
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 4.0 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "8JtQ53GSkKWC" }, "source": [ "正如我們所見,NaN 值已被該列的中位數取代\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0ybtWLDdgRsG", "outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b NaN\n", "c 2.0\n", "d NaN\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "example5" ] }, { "cell_type": "markdown", "metadata": { "id": "yrsigxRggRsH" }, "source": [ "您可以使用單一值(例如 `0`)填充所有的空值條目:\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KXMIPsQdgRsH", "outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 0.0\n", "c 2.0\n", "d 0.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(0)" ] }, { "cell_type": "markdown", "metadata": { "id": "RRlI5f_hkfKe" }, "source": [ "> 關鍵重點:\n", "1. 當數據較少或有策略填補缺失數據時,應進行缺失值填補。\n", "2. 可以利用領域知識來估算並填補缺失值。\n", "3. 對於分類數據,缺失值通常以該列的眾數替代。\n", "4. 對於數值數據,缺失值通常以平均值(針對已正規化的數據集)或該列的中位數填補。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "FI9MmqFJgRsH" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true, "id": "af-ezpXdgRsH", "trusted": false }, "outputs": [], "source": [ "# What happens if you try to fill null values with a string, like ''?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "kq3hw1kLgRsI" }, "source": [ "您可以使用**前向填充**空值,即使用最後一個有效值來填充空值:\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vO3BuNrggRsI", "outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 1.0\n", "c 2.0\n", "d 2.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='ffill')" ] }, { "cell_type": "markdown", "metadata": { "id": "nDXeYuHzgRsI" }, "source": [ "您還可以**向後填充**,將下一個有效值向後傳播以填充空值:\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4M5onHcEgRsI", "outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 2.0\n", "c 2.0\n", "d 3.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='bfill')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "id": "MbBzTom5gRsI" }, "source": [ "正如你可能猜到的,這與 DataFrame 的操作方式相同,但你也可以指定一個 `axis` 來填充空值:\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "aRpIvo4ZgRsI", "outputId": "905a980a-a808-4eca-d0ba-224bd7d85955", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "VM1qtACAgRsI", "outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.01.07.07.0
12.05.08.08.0
2NaN6.09.09.0
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 1.0 7.0 7.0\n", "1 2.0 5.0 8.0 8.0\n", "2 NaN 6.0 9.0 9.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(method='ffill', axis=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZeMc-I1EgRsI" }, "source": [ "請注意,當前向填充時沒有可用的先前值,空值將保持不變。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "eeAoOU0RgRsJ" }, "source": [ "### 運動:\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true, "id": "e8S-CjW8gRsJ", "trusted": false }, "outputs": [], "source": [ "# What output does example4.fillna(method='bfill', axis=1) produce?\n", "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YHgy0lIrgRsJ" }, "source": [ "您可以創意地使用 `fillna`。例如,我們再看看 `example4`,但這次我們用 `DataFrame` 中所有值的平均值來填補缺失值:\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "OtYVErEygRsJ", "outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.05.57NaN
12.05.08NaN
21.56.09NaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 5.5 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 1.5 6.0 9 NaN" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(example4.mean())" ] }, { "cell_type": "markdown", "metadata": { "id": "zpMvCkLSgRsJ" }, "source": [ "注意,第 3 欄仍然是空的:預設方向是按行填充值。\n", "\n", "> **重點提示:** 處理數據集中缺失值的方法有很多。你採用的具體策略(移除、替換,甚至替換的方式)應該根據該數據的具體情況來決定。隨著你處理和接觸更多數據集,你將更能掌握如何應對缺失值的技巧。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bauDnESIl9FH" }, "source": [ "### 編碼分類數據\n", "\n", "機器學習模型只能處理數字和任何形式的數值數據。它無法分辨 Yes 和 No,但能夠區分 0 和 1。因此,在填補缺失值之後,我們需要將分類數據編碼成某種數值形式,讓模型能夠理解。\n", "\n", "編碼可以通過兩種方式完成。我們接下來會討論這些方法。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "uDq9SxB7mu5i" }, "source": [ "**標籤編碼**\n", "\n", "標籤編碼基本上是將每個類別轉換為一個數字。例如,假設我們有一個航空乘客的數據集,其中有一列包含以下類別:['商務艙', '經濟艙', '頭等艙']。如果對這些類別進行標籤編碼,則會被轉換為 [0,1,2]。讓我們通過程式碼來看一個例子。由於我們會在接下來的筆記本中學習 `scikit-learn`,所以這裡不會使用它。\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "1vGz7uZyoWHL", "outputId": "9e252855-d193-4103-a54d-028ea7787b34" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "IDHnkwTYov-h" }, "source": [ "要對第一列進行標籤編碼,我們必須先描述每個類別到數字的映射,然後再進行替換\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZC5URJG3o1ES", "outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
0100
1202
2301
3401
4501
5600
\n", "
" ], "text/plain": [ " ID class\n", "0 10 0\n", "1 20 2\n", "2 30 1\n", "3 40 1\n", "4 50 1\n", "5 60 0" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_labels = {'business class':0,'economy class':1,'first class':2}\n", "label['class'] = label['class'].replace(class_labels)\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "ftnF-TyapOPt" }, "source": [ "正如我們所見,輸出結果與我們預期的一致。那麼,我們什麼時候應該使用標籤編碼呢?標籤編碼適用於以下其中一種或兩種情況:\n", "1. 當分類數量很多時\n", "2. 當分類具有順序性時\n" ] }, { "cell_type": "markdown", "metadata": { "id": "eQPAPVwsqWT7" }, "source": [ "**獨熱編碼(One Hot Encoding)**\n", "\n", "另一種編碼方式是獨熱編碼。在這種編碼方式中,每個欄位的類別會被新增為一個獨立的欄位,而每個數據點會根據是否包含該類別被賦予 0 或 1。因此,如果有 n 個不同的類別,數據框架中將會新增 n 個欄位。\n", "\n", "例如,我們以飛機艙等的例子來說。類別包括:['商務艙', '經濟艙', '頭等艙']。那麼,如果我們執行獨熱編碼,數據集中將會新增以下三個欄位:['class_business class', 'class_economy class', 'class_first class']。\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZM0eVh0ArKUL", "outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "one_hot" ] }, { "cell_type": "markdown", "metadata": { "id": "aVnZ7paDrWmb" }, "source": [ "讓我們對第一列進行獨熱編碼\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "RUPxf7egrYKr" }, "outputs": [], "source": [ "one_hot_data = pd.get_dummies(one_hot,columns=['class'])" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "TM37pHsFr4ge", "outputId": "7be15f53-79b2-447a-979c-822658339a9e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDclass_business classclass_economy classclass_first class
010100
120001
230010
340010
450010
560100
\n", "
" ], "text/plain": [ " ID class_business class class_economy class class_first class\n", "0 10 1 0 0\n", "1 20 0 0 1\n", "2 30 0 1 0\n", "3 40 0 1 0\n", "4 50 0 1 0\n", "5 60 1 0 0" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot_data" ] }, { "cell_type": "markdown", "metadata": { "id": "_zXRLOjXujdA" }, "source": [ "每個獨熱編碼列包含 0 或 1,指定該數據點是否存在該類別。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bDnC4NQOu0qr" }, "source": [ "什麼時候使用獨熱編碼?獨熱編碼通常在以下其中一種或兩種情況下使用:\n", "\n", "1. 當分類數量和數據集的規模較小時。\n", "2. 當分類沒有特定的順序時。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XnUmci_4uvyu" }, "source": [ "> 關鍵重點:\n", "1. 編碼的目的是將非數值數據轉換為數值數據。\n", "2. 編碼主要分為兩種類型:標籤編碼和獨熱編碼,兩者可根據數據集的需求進行選擇和應用。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "K8UXOJYRgRsJ" }, "source": [ "## 移除重複數據\n", "\n", "> **學習目標:** 完成本小節後,你應該能夠熟練地識別並移除 DataFrame 中的重複值。\n", "\n", "除了缺失數據外,你在處理真實世界的數據集時,經常會遇到重複的數據。幸運的是,pandas 提供了一個簡便的方法來檢測和移除重複的條目。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qrEG-Wa0gRsJ" }, "source": [ "### 識別重複項目:`duplicated`\n", "\n", "你可以使用 pandas 的 `duplicated` 方法輕鬆找出重複的值。該方法會返回一個布林遮罩,指示 `DataFrame` 中某個條目是否是之前條目的重複項。讓我們建立另一個範例 `DataFrame` 來看看這個方法的運作方式。\n" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "ZLu6FEnZgRsJ", "outputId": "376512d1-d842-4db1-aea3-71052aeeecaf", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
2A1
3B3
4B3
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "2 A 1\n", "3 B 3\n", "4 B 3" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cIduB5oBgRsK", "outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "4 True\n", "dtype: bool" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.duplicated()" ] }, { "cell_type": "markdown", "metadata": { "id": "0eDRJD4SgRsK" }, "source": [ "### 刪除重複值:`drop_duplicates`\n", "`drop_duplicates` 會返回一份數據的副本,其中所有 `duplicated` 值均為 `False`:\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "w_YPpqIqgRsK", "outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
3B3
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "3 B 3" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": { "id": "69AqoCZAgRsK" }, "source": [ "`duplicated` 和 `drop_duplicates` 預設會考慮所有列,但你可以指定它們僅檢查 `DataFrame` 中的一部分列:\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "BILjDs67gRsK", "outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lettersnumbers
0A1
1B2
\n", "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates(['letters'])" ] }, { "cell_type": "markdown", "metadata": { "id": "GvX4og1EgRsL" }, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**免責聲明**: \n本文件已使用人工智能翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。雖然我們致力於提供準確的翻譯,但請注意,自動翻譯可能包含錯誤或不準確之處。原始語言的文件應被視為權威來源。對於重要信息,建議使用專業人工翻譯。我們對因使用此翻譯而引起的任何誤解或錯誤解釋概不負責。\n" ] } ], "metadata": { "anaconda-cloud": {}, "colab": { "name": "notebook.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "coopTranslator": { "original_hash": "8533b3a2230311943339963fc7f04c21", "translation_date": "2025-09-02T08:25:17+00:00", "source_file": "2-Working-With-Data/08-data-preparation/notebook.ipynb", "language_code": "hk" } }, "nbformat": 4, "nbformat_minor": 0 }