Added DataFrame.describe() and elaborated on some of the existing explanations.
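
As a rough illustration of what the new `describe()` cell does, here is a minimal sketch. It assumes pandas is installed and uses only the five rows that `iris_df.head()` prints in this diff; the hand-built `sample_df` is a hypothetical stand-in, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for iris_df: only the five rows that
# iris_df.head() prints in this diff, typed in by hand.
sample_df = pd.DataFrame(
    {
        "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
        "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6],
        "petal length (cm)": [1.4, 1.4, 1.3, 1.5, 1.4],
        "petal width (cm)": [0.2, 0.2, 0.2, 0.2, 0.2],
    }
)

# describe() summarizes each numeric column: count, mean, std,
# min, the 25%/50%/75% quantiles, and max.
summary = sample_df.describe()
print(summary)

# Individual statistics can be read back by row label and column name.
print(summary.loc["50%", "sepal length (cm)"])  # median of the column
```

Run against the full 150-row `iris_df` instead of this sample, the same call should produce the summary table shown in the diff below.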

3 years ago · 943172fd55
parent 503468f6ea
commit 943172fd55
1 changed file with 350 additions and 22 deletions
--- a/2-Working-With-Data/08-data-preparation/notebook.ipynb
+++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb
@ -76,10 +76,10 @@
      "cell_type": "code",
      "metadata": {
        "id": "LOe5jQohhulf",
-        "outputId": "9cf67a6a-5779-453b-b2ed-58f4f1aab507",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "968f9fb0-6cb7-4985-c64b-b332c086bdbf"
      },
      "source": [
        "iris_df.shape"
@ -123,15 +123,15 @@
      "cell_type": "code",
      "metadata": {
        "id": "YPGh_ziji-CY",
-        "outputId": "ca186194-a126-4348-f58e-aab7ebc8f7b7",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "ffad1c9f-06b4-49d9-b409-5e4cc1b9f19b"
      },
      "source": [
        "iris_df.columns"
      ],
-      "execution_count": 4,
+      "execution_count": 3,
      "outputs": [
        {
          "output_type": "execute_result",
@ -143,7 +143,7 @@
            ]
          },
          "metadata": {},
-          "execution_count": 4
+          "execution_count": 3
        }
      ]
    },
@ -163,7 +163,7 @@
      },
      "source": [
        "### `DataFrame.info`\n",
-        "Let's take a look at this dataset to see what we have:"
+        "The amount of data (given by the `shape` attribute) and the names of the features or columns (given by the `columns` attribute) tell us something about the dataset. Now we want to dive deeper into it. The `DataFrame.info()` method is quite useful for this."
      ]
    },
    {
@ -171,15 +171,15 @@
      "metadata": {
        "trusted": false,
        "id": "dHHRyG0_gRrt",
-        "outputId": "ca9de335-9e65-486a-d1e2-3e73d060c701",
        "colab": {
          "base_uri": "https://localhost:8080/"
-        }
+        },
+        "outputId": "325edd04-3809-4d71-b6c3-94c65b162882"
      },
      "source": [
        "iris_df.info()"
      ],
-      "execution_count": 3,
+      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
@ -206,7 +206,150 @@
        "id": "1XgVMpvigRru"
      },
      "source": [
-        "From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers."
+        "From here, we can make a few observations:\n",
+        "1. The data type of each column: In this dataset, all of the data is stored as 64-bit floating-point numbers.\n",
+        "2. The number of non-null values: Dealing with null values is an important step in data preparation. It will be dealt with later in the notebook."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "IYlyxbpWFEF4"
+      },
+      "source": [
+        "### `DataFrame.describe()`\n",
+        "Suppose our dataset contains a lot of numerical data. Univariate statistics such as the mean, median, and quartiles can be computed for each column individually. The `DataFrame.describe()` method provides a statistical summary of the numerical columns of a dataset.\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "tWV-CMstFIRA",
+        "outputId": "7c5cd72f-51d8-474c-966b-d2fbbdb7b7fc",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 297
+        }
+      },
+      "source": [
+        "iris_df.describe()"
+      ],
+      "execution_count": 8,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>sepal length (cm)</th>\n",
+              "      <th>sepal width (cm)</th>\n",
+              "      <th>petal length (cm)</th>\n",
+              "      <th>petal width (cm)</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>count</th>\n",
+              "      <td>150.000000</td>\n",
+              "      <td>150.000000</td>\n",
+              "      <td>150.000000</td>\n",
+              "      <td>150.000000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>mean</th>\n",
+              "      <td>5.843333</td>\n",
+              "      <td>3.057333</td>\n",
+              "      <td>3.758000</td>\n",
+              "      <td>1.199333</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>std</th>\n",
+              "      <td>0.828066</td>\n",
+              "      <td>0.435866</td>\n",
+              "      <td>1.765298</td>\n",
+              "      <td>0.762238</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>min</th>\n",
+              "      <td>4.300000</td>\n",
+              "      <td>2.000000</td>\n",
+              "      <td>1.000000</td>\n",
+              "      <td>0.100000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>25%</th>\n",
+              "      <td>5.100000</td>\n",
+              "      <td>2.800000</td>\n",
+              "      <td>1.600000</td>\n",
+              "      <td>0.300000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>50%</th>\n",
+              "      <td>5.800000</td>\n",
+              "      <td>3.000000</td>\n",
+              "      <td>4.350000</td>\n",
+              "      <td>1.300000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>75%</th>\n",
+              "      <td>6.400000</td>\n",
+              "      <td>3.300000</td>\n",
+              "      <td>5.100000</td>\n",
+              "      <td>1.800000</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>max</th>\n",
+              "      <td>7.900000</td>\n",
+              "      <td>4.400000</td>\n",
+              "      <td>6.900000</td>\n",
+              "      <td>2.500000</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
+              "count         150.000000        150.000000         150.000000        150.000000\n",
+              "mean            5.843333          3.057333           3.758000          1.199333\n",
+              "std             0.828066          0.435866           1.765298          0.762238\n",
+              "min             4.300000          2.000000           1.000000          0.100000\n",
+              "25%             5.100000          2.800000           1.600000          0.300000\n",
+              "50%             5.800000          3.000000           4.350000          1.300000\n",
+              "75%             6.400000          3.300000           5.100000          1.800000\n",
+              "max             7.900000          4.400000           6.900000          2.500000"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 8
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "zjjtW5hPGMuM"
+      },
+      "source": [
+        "The output above shows the number of non-null data points (`count`), the mean, the standard deviation, the minimum, the lower quartile (25%), the median (50%), the upper quartile (75%), and the maximum value of each column."
      ]
    },
    {
@ -216,20 +359,117 @@
      },
      "source": [
        "### `DataFrame.head`\n",
-        "Next, let's see what the first few rows of our `DataFrame` look like:"
+        "With the functions and attributes above, we have a top-level view of the dataset. We know how many data points and features there are, the data type of each feature, and the number of non-null values for each feature.\n",
+        "\n",
+        "Now it's time to look at the data itself. Let's see what the first few rows (the first few data points) of our `DataFrame` look like:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "DZMJZh0OgRrw"
+        "id": "DZMJZh0OgRrw",
+        "outputId": "c12ac408-abdb-48a5-ca3f-93b02f963b2f",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 204
+        }
      },
      "source": [
        "iris_df.head()"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 5,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>sepal length (cm)</th>\n",
+              "      <th>sepal width (cm)</th>\n",
+              "      <th>petal length (cm)</th>\n",
+              "      <th>petal width (cm)</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>5.1</td>\n",
+              "      <td>3.5</td>\n",
+              "      <td>1.4</td>\n",
+              "      <td>0.2</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>4.9</td>\n",
+              "      <td>3.0</td>\n",
+              "      <td>1.4</td>\n",
+              "      <td>0.2</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>4.7</td>\n",
+              "      <td>3.2</td>\n",
+              "      <td>1.3</td>\n",
+              "      <td>0.2</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>4.6</td>\n",
+              "      <td>3.1</td>\n",
+              "      <td>1.5</td>\n",
+              "      <td>0.2</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>5.0</td>\n",
+              "      <td>3.6</td>\n",
+              "      <td>1.4</td>\n",
+              "      <td>0.2</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
+              "0                5.1               3.5                1.4               0.2\n",
+              "1                4.9               3.0                1.4               0.2\n",
+              "2                4.7               3.2                1.3               0.2\n",
+              "3                4.6               3.1                1.5               0.2\n",
+              "4                5.0               3.6                1.4               0.2"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 5
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "EBHEimZuEFQK"
+      },
+      "source": [
+        "In the output, we can see five entries of the dataset. The index on the left (0 through 4) confirms that these are the first five rows."
+      ]
    },
    {
      "cell_type": "markdown",
@ -239,7 +479,7 @@
      "source": [
        "### Exercise:\n",
        "\n",
-        "By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?"
+        "As the example above shows, `DataFrame.head` returns the first five rows of a `DataFrame` by default. In the code cell below, can you figure out a way to display more than five rows?"
      ]
    },
    {
@ -252,7 +492,7 @@
      "source": [
        "# Hint: Consult the documentation by using iris_df.head?"
      ],
-      "execution_count": null,
+      "execution_count": 6,
      "outputs": []
    },
    {
@ -262,20 +502,106 @@
      },
      "source": [
        "### `DataFrame.tail`\n",
-        "The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:"
+        "Another way of looking at the data is from the end (instead of the beginning). The flip side of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": false,
-        "id": "heanjfGWgRr2"
+        "id": "heanjfGWgRr2",
+        "outputId": "2930cf87-bfeb-4ddc-8be1-53d0e57a06b3",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 204
+        }
      },
      "source": [
        "iris_df.tail()"
      ],
-      "execution_count": null,
-      "outputs": []
+      "execution_count": 7,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>sepal length (cm)</th>\n",
+              "      <th>sepal width (cm)</th>\n",
+              "      <th>petal length (cm)</th>\n",
+              "      <th>petal width (cm)</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>145</th>\n",
+              "      <td>6.7</td>\n",
+              "      <td>3.0</td>\n",
+              "      <td>5.2</td>\n",
+              "      <td>2.3</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>146</th>\n",
+              "      <td>6.3</td>\n",
+              "      <td>2.5</td>\n",
+              "      <td>5.0</td>\n",
+              "      <td>1.9</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>147</th>\n",
+              "      <td>6.5</td>\n",
+              "      <td>3.0</td>\n",
+              "      <td>5.2</td>\n",
+              "      <td>2.0</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>148</th>\n",
+              "      <td>6.2</td>\n",
+              "      <td>3.4</td>\n",
+              "      <td>5.4</td>\n",
+              "      <td>2.3</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>149</th>\n",
+              "      <td>5.9</td>\n",
+              "      <td>3.0</td>\n",
+              "      <td>5.1</td>\n",
+              "      <td>1.8</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
+              "145                6.7               3.0                5.2               2.3\n",
+              "146                6.3               2.5                5.0               1.9\n",
+              "147                6.5               3.0                5.2               2.0\n",
+              "148                6.2               3.4                5.4               2.3\n",
+              "149                5.9               3.0                5.1               1.8"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ]
    },
    {
      "cell_type": "markdown",
@ -285,6 +611,8 @@
      "source": [
        "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets. \n",
        "\n",
+        "All the functions and attributes shown above, with the help of code examples, give us a feel for the data.\n",
+        "\n",
        "> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with."
      ]
    },