Enhanced dropna

4 years ago · d58a5f67e2
parent fb02303073
commit d58a5f67e2
1 changed files with 377 additions and 40 deletions
--- a/2-Working-With-Data/08-data-preparation/notebook.ipynb
+++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb
@ -654,10 +654,10 @@
      "metadata": {
        "trusted": false,
        "id": "QIoNdY4ngRr7",
-        "outputId": "e2ea93a4-b967-4319-904b-85479c36b169",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "e2ea93a4-b967-4319-904b-85479c36b169"
      },
      "source": [
        "import numpy as np\n",
@ -695,11 +695,11 @@
      "metadata": {
        "trusted": false,
        "id": "gWbx-KB9gRr8",
-        "outputId": "ff2a899b-5419-4a5c-b054-bc1e6ab906c5",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 292
-        }
+        },
+        "outputId": "ff2a899b-5419-4a5c-b054-bc1e6ab906c5"
      },
      "source": [
        "example1.sum()"
@ -745,10 +745,10 @@
      "metadata": {
        "trusted": false,
        "id": "rcFYfMG9gRr9",
-        "outputId": "a452b675-2131-47a7-ff38-2b4d6e923d50",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "a452b675-2131-47a7-ff38-2b4d6e923d50"
      },
      "source": [
        "np.nan + 1"
@ -772,10 +772,10 @@
      "metadata": {
        "trusted": false,
        "id": "BW3zQD2-gRr-",
-        "outputId": "6956b57f-8ae7-4880-cc1d-0cf54edfe6ee",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "6956b57f-8ae7-4880-cc1d-0cf54edfe6ee"
      },
      "source": [
        "np.nan * 0"
@ -808,10 +808,10 @@
      "metadata": {
        "trusted": false,
        "id": "LCInVgSSgRr_",
-        "outputId": "57ad3201-3958-48c6-924b-d46b61d4aeba",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "57ad3201-3958-48c6-924b-d46b61d4aeba"
      },
      "source": [
        "example2 = np.array([2, np.nan, 6, 8]) \n",
@ -878,10 +878,10 @@
      "metadata": {
        "trusted": false,
        "id": "Nji-KGdNgRsA",
-        "outputId": "8dbdf129-cd8b-40b5-96ba-21a7f3fa0044",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "8dbdf129-cd8b-40b5-96ba-21a7f3fa0044"
      },
      "source": [
        "int_series = pd.Series([1, 2, 3], dtype=int)\n",
@ -974,10 +974,10 @@
      "metadata": {
        "trusted": false,
        "id": "1XdaJJ7PgRsC",
-        "outputId": "1fd6c6af-19e0-4568-e837-985d571604f4",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "1fd6c6af-19e0-4568-e837-985d571604f4"
      },
      "source": [
        "example3.isnull()"
@ -1016,11 +1016,11 @@
    {
      "cell_type": "code",
      "metadata": {
-        "id": "JCcQVoPkHDUv",
-        "outputId": "c0002689-f529-4e3e-c73b-41ac513c59d3",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "id": "JCcQVoPkHDUv",
+        "outputId": "c0002689-f529-4e3e-c73b-41ac513c59d3"
      },
      "source": [
        "example3.isnull().sum()"
@ -1103,21 +1103,42 @@
      "source": [
        "### Dropping null values\n",
        "\n",
-        "Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) To see this in action, let's return to `example3`:"
+        "The amount of data we pass to a model has a direct effect on its performance. Dropping null values reduces the number of data points, and hence the size of the dataset. It is therefore advisable to drop rows with null values only when the dataset is large enough to absorb the loss.\n",
+        "\n",
+        "Another case is when a particular row or column has many missing values. It may then be dropped, because it adds little value to the analysis when most of its data is missing.\n",
+        "\n",
+        "Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. To see this in action, let's return to `example3`. The `dropna()` method, available on both `Series` and `DataFrame`, returns a copy with the null values dropped."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "7uIvS097gRsD"
+        "id": "7uIvS097gRsD",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "3d2d43e7-99ca-45ca-adc4-cef2c737e5bf"
      },
      "source": [
        "example3 = example3.dropna()\n",
        "example3"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 21,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "0    0\n",
+              "2     \n",
+              "dtype: object"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 21
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@ -1127,14 +1148,19 @@
      "source": [
        "Note that this should look like your output from `example3[example3.notnull()]`. The difference here is that, rather than just indexing on the masked values, `dropna` has removed those missing values from the `Series` `example3`.\n",
        "\n",
-        "Because `DataFrame`s have two dimensions, they afford more options for dropping data."
+        "Because DataFrames have two dimensions, they afford more options for dropping data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "an-l74sPgRsE"
+        "id": "an-l74sPgRsE",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 142
+        },
+        "outputId": "961427aa-9bce-445b-d230-61d02bc16c92"
      },
      "source": [
        "example4 = pd.DataFrame([[1,      np.nan, 7], \n",
@ -1142,8 +1168,69 @@
        "                         [np.nan, 6,      9]])\n",
        "example4"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 22,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>0</th>\n",
+              "      <th>1</th>\n",
+              "      <th>2</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>1.0</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>7</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2.0</td>\n",
+              "      <td>5.0</td>\n",
+              "      <td>8</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>NaN</td>\n",
+              "      <td>6.0</td>\n",
+              "      <td>9</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "     0    1  2\n",
+              "0  1.0  NaN  7\n",
+              "1  2.0  5.0  8\n",
+              "2  NaN  6.0  9"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 22
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@ -1160,13 +1247,65 @@
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "jAVU24RXgRsE"
+        "id": "jAVU24RXgRsE",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 80
+        },
+        "outputId": "aaeac6bc-ca6f-4eda-de0c-119e0c50ba83"
      },
      "source": [
        "example4.dropna()"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 23,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>0</th>\n",
+              "      <th>1</th>\n",
+              "      <th>2</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2.0</td>\n",
+              "      <td>5.0</td>\n",
+              "      <td>8</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "     0    1  2\n",
+              "1  2.0  5.0  8"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 23
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@ -1181,13 +1320,71 @@
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "GrBhxu9GgRsE"
+        "id": "GrBhxu9GgRsE",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 142
+        },
+        "outputId": "89fee273-d71b-4400-9484-b4bf93b69ee5"
      },
      "source": [
        "example4.dropna(axis='columns')"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 24,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>2</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>7</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>8</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>9</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "   2\n",
+              "0  7\n",
+              "1  8\n",
+              "2  9"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 24
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@ -1197,21 +1394,104 @@
      "source": [
        "Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those setting in `dropna` with the `how` and `thresh` parameters.\n",
        "\n",
-        "By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action."
+        "By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action in the next exercise."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "Bcf_JWTsgRsF"
+        "id": "Bcf_JWTsgRsF",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 142
+        },
+        "outputId": "07e8f4eb-18c8-4e5d-9317-6a9a3db38b73"
      },
      "source": [
        "example4[3] = np.nan\n",
        "example4"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 25,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>0</th>\n",
+              "      <th>1</th>\n",
+              "      <th>2</th>\n",
+              "      <th>3</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>1.0</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>7</td>\n",
+              "      <td>NaN</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2.0</td>\n",
+              "      <td>5.0</td>\n",
+              "      <td>8</td>\n",
+              "      <td>NaN</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>NaN</td>\n",
+              "      <td>6.0</td>\n",
+              "      <td>9</td>\n",
+              "      <td>NaN</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "     0    1  2   3\n",
+              "0  1.0  NaN  7 NaN\n",
+              "1  2.0  5.0  8 NaN\n",
+              "2  NaN  6.0  9 NaN"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 25
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pNZer7q9JPNC"
+      },
+      "source": [
+        "> Key takeaways: \n",
+        "1. Dropping null values is a good idea only if the dataset is large enough.\n",
+        "2. Full rows or columns can be dropped if they have most of their data missing.\n",
+        "3. The `DataFrame.dropna()` method drops null values. Its `axis` argument specifies whether rows (the default) or columns are dropped.\n",
+        "4. The `how` argument controls when a row or column is dropped. By default it is set to `'any'`, which drops every row or column that contains at least one null value. Setting it to `'all'` drops only those rows or columns in which every value is null."
+      ]
    },
    {
      "cell_type": "markdown",
@ -1233,7 +1513,7 @@
        "# How might you go about dropping just column 3?\n",
        "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n"
      ],
-      "execution_count": null,
+      "execution_count": 26,
      "outputs": []
    },
    {
@ -1249,13 +1529,67 @@
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "M9dCNMaagRsG"
+        "id": "M9dCNMaagRsG",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 80
+        },
+        "outputId": "b2c00415-95a6-4a5c-e3f9-781ff5cc8625"
      },
      "source": [
        "example4.dropna(axis='rows', thresh=3)"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 27,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>0</th>\n",
+              "      <th>1</th>\n",
+              "      <th>2</th>\n",
+              "      <th>3</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>2.0</td>\n",
+              "      <td>5.0</td>\n",
+              "      <td>8</td>\n",
+              "      <td>NaN</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "     0    1  2   3\n",
+              "1  2.0  5.0  8 NaN"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 27
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@ -1274,7 +1608,10 @@
      "source": [
        "### Filling null values\n",
        "\n",
-        "Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice."
+        "It sometimes makes sense to fill in missing values with plausible ones. There are a few techniques for filling null values. The first is to use domain knowledge (knowledge of the subject on which the dataset is based) to approximate the missing values.\n",
+        "\n",
+        "\n",
+        "You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice."
      ]
    },
    {