fillna for Categorical columns added

3 years ago · 87ef4f0875
parent d58a5f67e2
commit 87ef4f0875
1 changed files with 294 additions and 0 deletions
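For readers skimming the diff, here is a minimal standalone sketch of what the added cells do. It is not part of the commit: it computes the mode with `Series.mode()` rather than hard-coding `'True'`, and the `values` series used for the median and mean fills is a hypothetical example, since the notebook's numeric cell is still empty.

```python
import pandas as pd

# Same toy frame as the notebook cell in this diff; column 2 is categorical.
fill_with_mode = pd.DataFrame([[1, 2, "True"],
                               [3, 4, None],
                               [5, 6, "False"],
                               [7, 8, "True"],
                               [9, 10, "True"]])

# Series.mode() returns a Series (ties are possible), so take the first entry.
most_common = fill_with_mode[2].mode()[0]

# Assign the result back rather than calling fillna(..., inplace=True)
# on a column selection, which newer pandas versions warn about.
fill_with_mode[2] = fill_with_mode[2].fillna(most_common)

# Hypothetical numeric column (not in the notebook): one gap, one outlier.
values = pd.Series([1.0, 2.0, None, 3.0, 100.0])

# The median ignores the outlier's magnitude; the mean is pulled toward it.
filled_with_median = values.fillna(values.median())
filled_with_mean = values.fillna(values.mean())
```

On this toy data the median fill stays close to the typical values, while the mean fill is dragged upward by the single outlier. That is the reason the added text prefers the median for skewed data.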
--- a/2-Working-With-Data/08-data-preparation/notebook.ipynb
+++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb
@@ -1614,6 +1614,300 @@
        "You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice."
      ]
    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "CE8S7louLezV"
+      },
+      "source": [
+        "First, let us consider non-numeric data. Datasets often have columns with categorical data, e.g. gender, or True/False answers.\n",
+        "\n",
+        "In most of these cases, we replace missing values with the `mode` of the column, that is, its most frequent value. Say we have 100 data points: 90 answered True, 8 answered False, and 2 left the field blank. Then we can fill the 2 blanks with True, the most common value in the column.\n",
+        "\n",
+        "Again, domain knowledge can guide this choice. Let us consider an example of filling with the mode."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "MY5faq4yLdpQ",
+        "outputId": "c3838b07-0d15-471e-8dad-370de91d4bdc",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 204
+        }
+      },
+      "source": [
+        "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n",
+        "                               [3,4,None],\n",
+        "                               [5,6,\"False\"],\n",
+        "                               [7,8,\"True\"],\n",
+        "                               [9,10,\"True\"]])\n",
+        "\n",
+        "fill_with_mode"
+      ],
+      "execution_count": 28,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>0</th>\n",
+              "      <th>1</th>\n",
+              "      <th>2</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>1</td>\n",
+              "      <td>2</td>\n",
+              "      <td>True</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>3</td>\n",
+              "      <td>4</td>\n",
+              "      <td>None</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>5</td>\n",
+              "      <td>6</td>\n",
+              "      <td>False</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>7</td>\n",
+              "      <td>8</td>\n",
+              "      <td>True</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>9</td>\n",
+              "      <td>10</td>\n",
+              "      <td>True</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "   0   1      2\n",
+              "0  1   2   True\n",
+              "1  3   4   None\n",
+              "2  5   6  False\n",
+              "3  7   8   True\n",
+              "4  9  10   True"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 28
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "MLAoMQOfNPlA"
+      },
+      "source": [
+        "Now, let's first find the mode before filling the `None` value with it."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "WKy-9Y2tN5jv",
+        "outputId": "41f5064e-502d-4aec-dc2d-86f885068b4f",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
+      },
+      "source": [
+        "fill_with_mode[2].value_counts()"
+      ],
+      "execution_count": 29,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "True     3\n",
+              "False    1\n",
+              "Name: 2, dtype: int64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 29
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "6iNz_zG_OKrx"
+      },
+      "source": [
+        "The mode is `'True'`, so we will replace `None` with `'True'`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "TxPKteRvNPOs"
+      },
+      "source": [
+        "fill_with_mode[2] = fill_with_mode[2].fillna('True')"
+      ],
+      "execution_count": 30,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "tvas7c9_OPWE",
+        "outputId": "7282c4f7-0e59-4398-b4f2-5919baf61164",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 204
+        }
+      },
+      "source": [
+        "fill_with_mode"
+      ],
+      "execution_count": 31,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>0</th>\n",
+              "      <th>1</th>\n",
+              "      <th>2</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>1</td>\n",
+              "      <td>2</td>\n",
+              "      <td>True</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>3</td>\n",
+              "      <td>4</td>\n",
+              "      <td>True</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>5</td>\n",
+              "      <td>6</td>\n",
+              "      <td>False</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>7</td>\n",
+              "      <td>8</td>\n",
+              "      <td>True</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>9</td>\n",
+              "      <td>10</td>\n",
+              "      <td>True</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "   0   1      2\n",
+              "0  1   2   True\n",
+              "1  3   4   True\n",
+              "2  5   6  False\n",
+              "3  7   8   True\n",
+              "4  9  10   True"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 31
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "SktitLxxOR16"
+      },
+      "source": [
+        "As we can see, the null value has been replaced. Needless to say, we could have passed any other value in place of `'True'` and it would have been substituted."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "heYe1I0dOmQ_"
+      },
+      "source": [
+        "Now, coming to numeric data. Here, we have two common ways of replacing missing values:\n",
+        "\n",
+        "1. Replace with the median of the column\n",
+        "2. Replace with the mean of the column\n",
+        "\n",
+        "We replace with the median when the data is skewed or has outliers, because the median is robust to outliers.\n",
+        "\n",
+        "When the data is roughly symmetric (for example, normally distributed), we can use the mean, since in that case the mean and median are close."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "09HM_2feOj5Y"
+      },
+      "source": [
+        ""
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
    {
      "cell_type": "code",
      "metadata": {