Added missing data and null-value detection code, and enhanced explanations

3 years ago · fb02303073
parent 0827651f78
commit fb02303073
1 changed files with 220 additions and 47 deletions
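
A minimal standalone sketch of the null-detection cells this commit adds, assuming pandas and NumPy are installed. It reuses the notebook's `example3` values; the names `null_mask`, `null_count`, and `present` are illustrative only and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Same values as the notebook's example3: only NaN and None count as null;
# 0 and the empty string are ordinary values.
example3 = pd.Series([0, np.nan, "", None])

# Boolean mask that is True wherever a value is missing.
null_mask = example3.isnull()

# True counts as 1, so summing the mask gives the number of missing values.
null_count = int(null_mask.sum())

# A Boolean mask can be used directly as an index to keep the present values.
present = example3[example3.notnull()]

print(null_mask.tolist())
print(null_count)
print(present.tolist())
```

Counting nulls this way is a quick check to run before deciding whether to drop or replace them.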
--- a/2-Working-With-Data/08-data-preparation/notebook.ipynb
+++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb
@@ -79,7 +79,7 @@
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
-        "outputId": "968f9fb0-6cb7-4985-c64b-b332c086bdbf"
+        "outputId": "4641a412-8abb-4e2f-d1ec-ff9b5004e361"
      },
      "source": [
        "iris_df.shape"
@@ -126,7 +126,7 @@
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
-        "outputId": "ffad1c9f-06b4-49d9-b409-5e4cc1b9f19b"
+        "outputId": "0f9c41ea-d480-4245-d7e2-56d514ac7724"
      },
      "source": [
        "iris_df.columns"
@@ -174,7 +174,7 @@
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
-        "outputId": "325edd04-3809-4d71-b6c3-94c65b162882"
+        "outputId": "94d5e48a-746c-4e58-b08f-c63b377a61b1"
      },
      "source": [
        "iris_df.info()"
@@ -226,16 +226,16 @@
      "cell_type": "code",
      "metadata": {
        "id": "tWV-CMstFIRA",
-        "outputId": "7c5cd72f-51d8-474c-966b-d2fbbdb7b7fc",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 297
-        }
+        },
+        "outputId": "b01322a1-4296-4ad0-f990-6e0dcba668f6"
      },
      "source": [
        "iris_df.describe()"
      ],
-      "execution_count": 8,
+      "execution_count": 5,
      "outputs": [
        {
          "output_type": "execute_result",
@@ -339,7 +339,7 @@
            ]
          },
          "metadata": {},
-          "execution_count": 8
+          "execution_count": 5
        }
      ]
    },
@@ -369,16 +369,16 @@
      "metadata": {
        "trusted": false,
        "id": "DZMJZh0OgRrw",
-        "outputId": "c12ac408-abdb-48a5-ca3f-93b02f963b2f",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
-        }
+        },
+        "outputId": "14b1e3cd-54ac-47dc-f7b2-231d51d93741"
      },
      "source": [
        "iris_df.head()"
      ],
-      "execution_count": 5,
+      "execution_count": 6,
      "outputs": [
        {
          "output_type": "execute_result",
@@ -458,7 +458,7 @@
            ]
          },
          "metadata": {},
-          "execution_count": 5
+          "execution_count": 6
        }
      ]
    },
@@ -492,7 +492,7 @@
      "source": [
        "# Hint: Consult the documentation by using iris_df.head?"
      ],
-      "execution_count": 6,
+      "execution_count": 7,
      "outputs": []
    },
    {
@@ -510,16 +510,16 @@
      "metadata": {
        "trusted": false,
        "id": "heanjfGWgRr2",
-        "outputId": "2930cf87-bfeb-4ddc-8be1-53d0e57a06b3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
-        }
+        },
+        "outputId": "d4e22b38-ba5d-4dd1-bbd2-b9cd9ad7b150"
      },
      "source": [
        "iris_df.tail()"
      ],
-      "execution_count": 7,
+      "execution_count": 8,
      "outputs": [
        {
          "output_type": "execute_result",
@@ -599,7 +599,7 @@
            ]
          },
          "metadata": {},
-          "execution_count": 7
+          "execution_count": 8
        }
      ]
    },
@@ -619,14 +619,18 @@
    {
      "cell_type": "markdown",
      "metadata": {
-        "id": "BvnoojWsgRr4"
+        "id": "TvurZyLSDxq_"
      },
      "source": [
-        "## Dealing with missing data\n",
+        "### Missing Data\n",
+        "Let us dive into missing data. Missing data occurs when no value is stored in one or more columns of a record.\n",
+        "\n",
+        "Let us take an example: say someone is conscious about their weight and doesn't fill in the weight field in a survey. The weight value for that person will then be missing.\n",
+        "\n",
+        "Most real-world datasets contain missing values.\n",
        "\n",
-        "> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.\n",
+        "**How pandas handles missing data**\n",
        "\n",
-        "Most of the time the datasets you want to use (of have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.\n",
        "\n",
        "Pandas handles missing values in two ways. The first you've seen before in previous sections: `NaN`, or Not a Number. This is actually a special value that is part of the IEEE floating-point specification and it is only used to indicate missing floating-point values.\n",
        "\n",
@@ -649,7 +653,11 @@
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "QIoNdY4ngRr7"
+        "id": "QIoNdY4ngRr7",
+        "outputId": "e2ea93a4-b967-4319-904b-85479c36b169",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
      },
      "source": [
        "import numpy as np\n",
@@ -657,8 +665,19 @@
        "example1 = np.array([2, None, 6, 8])\n",
        "example1"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "array([2, None, 6, 8], dtype=object)"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@@ -675,13 +694,31 @@
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "gWbx-KB9gRr8"
+        "id": "gWbx-KB9gRr8",
+        "outputId": "ff2a899b-5419-4a5c-b054-bc1e6ab906c5",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 292
+        }
      },
      "source": [
        "example1.sum()"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 10,
+      "outputs": [
+        {
+          "output_type": "error",
+          "ename": "TypeError",
+          "evalue": "ignored",
+          "traceback": [
+            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+            "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
+            "\u001b[0;32m<ipython-input-10-ce9901ad18bd>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+            "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m     45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m     46\u001b[0m          initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m     \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n",
+            "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'"
+          ]
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@@ -707,25 +744,55 @@
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "rcFYfMG9gRr9"
+        "id": "rcFYfMG9gRr9",
+        "outputId": "a452b675-2131-47a7-ff38-2b4d6e923d50",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
      },
      "source": [
        "np.nan + 1"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 11,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "nan"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "BW3zQD2-gRr-"
+        "id": "BW3zQD2-gRr-",
+        "outputId": "6956b57f-8ae7-4880-cc1d-0cf54edfe6ee",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
      },
      "source": [
        "np.nan * 0"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 12,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "nan"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 12
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@@ -740,14 +807,29 @@
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "LCInVgSSgRr_"
+        "id": "LCInVgSSgRr_",
+        "outputId": "57ad3201-3958-48c6-924b-d46b61d4aeba",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
      },
      "source": [
        "example2 = np.array([2, np.nan, 6, 8]) \n",
        "example2.sum(), example2.min(), example2.max()"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 13,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "(nan, nan, nan)"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 13
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@@ -768,7 +850,7 @@
      "source": [
        "# What happens if you add np.nan and None together?\n"
      ],
-      "execution_count": null,
+      "execution_count": 14,
      "outputs": []
    },
    {
@@ -795,14 +877,32 @@
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "Nji-KGdNgRsA"
+        "id": "Nji-KGdNgRsA",
+        "outputId": "8dbdf129-cd8b-40b5-96ba-21a7f3fa0044",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
      },
      "source": [
        "int_series = pd.Series([1, 2, 3], dtype=int)\n",
        "int_series"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 15,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "0    1\n",
+              "1    2\n",
+              "2    3\n",
+              "dtype: int64"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 15
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@@ -825,7 +925,7 @@
        "# How does that element show up in the Series?\n",
        "# What is the dtype of the Series?\n"
      ],
-      "execution_count": null,
+      "execution_count": 16,
      "outputs": []
    },
    {
@@ -851,6 +951,8 @@
      },
      "source": [
        "### Detecting null values\n",
+        "\n",
+        "Now that we have understood the importance of missing values, we need to detect them in our dataset before dealing with them.\n",
        "Both `isnull()` and `notnull()` are your primary methods for detecting null data. Both return Boolean masks over your data."
      ]
    },
@@ -864,20 +966,39 @@
      "source": [
        "example3 = pd.Series([0, np.nan, '', None])"
      ],
-      "execution_count": null,
+      "execution_count": 17,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "1XdaJJ7PgRsC"
+        "id": "1XdaJJ7PgRsC",
+        "outputId": "1fd6c6af-19e0-4568-e837-985d571604f4",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
      },
      "source": [
        "example3.isnull()"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 18,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "0    False\n",
+              "1     True\n",
+              "2    False\n",
+              "3     True\n",
+              "dtype: bool"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 18
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@@ -887,7 +1008,35 @@
      "source": [
        "Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's nevertheless a perfectly good integer and pandas treats it as such. `''` is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.\n",
        "\n",
-        "Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks  directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values."
+        "Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values.\n",
+        "\n",
+        "If we want the total number of missing values, we can sum the Boolean mask produced by the `isnull()` method, because each `True` counts as 1."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "JCcQVoPkHDUv",
+        "outputId": "c0002689-f529-4e3e-c73b-41ac513c59d3",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
+      },
+      "source": [
+        "example3.isnull().sum()"
+      ],
+      "execution_count": 19,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "2"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 19
+        }
      ]
    },
    {
@@ -910,7 +1059,7 @@
        "# Try running example3[example3.notnull()].\n",
        "# Before you do so, what do you expect to see?\n"
      ],
-      "execution_count": null,
+      "execution_count": 20,
      "outputs": []
    },
    {
@@ -919,7 +1068,31 @@
        "id": "D_jWN7mHgRsD"
      },
      "source": [
-        "**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data."
+        "**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in DataFrames: they show the results and the index of those results, which will help you enormously as you wrestle with your data."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "BvnoojWsgRr4"
+      },
+      "source": [
+        "### Dealing with missing data\n",
+        "\n",
+        "> **Learning goal:** By the end of this subsection, you should know how and when to replace or remove null values from DataFrames.\n",
+        "\n",
+        "Most machine learning models can't handle missing data on their own. So, before passing the data into a model, we need to deal with these missing values.\n",
+        "\n",
+        "How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.\n",
+        "\n",
+        "There are primarily two ways of dealing with missing data:\n",
+        "\n",
+        "\n",
+        "1.   Drop the row containing the missing value\n",
+        "2.   Replace the missing value with some other value\n",
+        "\n",
+        "We will discuss both of these methods and their pros and cons in detail.\n",
+        "\n"
      ]
    },
    {