From 943172fd55d7d9b08d8bee906086cf43402041af Mon Sep 17 00:00:00 2001
From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com>
Date: Mon, 4 Oct 2021 21:50:02 +0530
Subject: [PATCH] Added DataFrame.describe() and elaborated on some of the
existing explanations.
---
.../08-data-preparation/notebook.ipynb | 372 ++++++++++++++++--
1 file changed, 350 insertions(+), 22 deletions(-)
diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb
index b1c1d7a..c6ca05d 100644
--- a/2-Working-With-Data/08-data-preparation/notebook.ipynb
+++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb
@@ -76,10 +76,10 @@
"cell_type": "code",
"metadata": {
"id": "LOe5jQohhulf",
- "outputId": "9cf67a6a-5779-453b-b2ed-58f4f1aab507",
"colab": {
"base_uri": "https://localhost:8080/"
- }
+ },
+ "outputId": "968f9fb0-6cb7-4985-c64b-b332c086bdbf"
},
"source": [
"iris_df.shape"
@@ -123,15 +123,15 @@
"cell_type": "code",
"metadata": {
"id": "YPGh_ziji-CY",
- "outputId": "ca186194-a126-4348-f58e-aab7ebc8f7b7",
"colab": {
"base_uri": "https://localhost:8080/"
- }
+ },
+ "outputId": "ffad1c9f-06b4-49d9-b409-5e4cc1b9f19b"
},
"source": [
"iris_df.columns"
],
- "execution_count": 4,
+ "execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
@@ -143,7 +143,7 @@
]
},
"metadata": {},
- "execution_count": 4
+ "execution_count": 3
}
]
},
@@ -163,7 +163,7 @@
},
"source": [
"### `DataFrame.info`\n",
- "Let's take a look at this dataset to see what we have:"
+ "The amount of data (given by the `shape` attribute) and the names of the features or columns (given by the `columns` attribute) tell us something about the dataset. Next, we want to dive deeper into it. The `DataFrame.info()` method is quite useful for this."
]
},
{
@@ -171,15 +171,15 @@
"metadata": {
"trusted": false,
"id": "dHHRyG0_gRrt",
- "outputId": "ca9de335-9e65-486a-d1e2-3e73d060c701",
"colab": {
"base_uri": "https://localhost:8080/"
- }
+ },
+ "outputId": "325edd04-3809-4d71-b6c3-94c65b162882"
},
"source": [
"iris_df.info()"
],
- "execution_count": 3,
+ "execution_count": 4,
"outputs": [
{
"output_type": "stream",
@@ -206,7 +206,150 @@
"id": "1XgVMpvigRru"
},
"source": [
- "From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers."
+ "From this output, we can make a few observations:\n",
+ "1. The data type of each column: in this dataset, all of the data is stored as 64-bit floating-point numbers.\n",
+ "2. The number of non-null values in each column: dealing with null values is an important step in data preparation, and it is covered later in the notebook."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IYlyxbpWFEF4"
+ },
+ "source": [
+ "### `DataFrame.describe`\n",
+ "Say we have a lot of numerical data in our dataset. Univariate statistics such as the mean, median, and quartiles can be computed for each column individually. The `DataFrame.describe()` method provides a statistical summary of the numerical columns of a dataset.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tWV-CMstFIRA",
+ "outputId": "7c5cd72f-51d8-474c-966b-d2fbbdb7b7fc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 297
+ }
+ },
+ "source": [
+ "iris_df.describe()"
+ ],
+ "execution_count": 8,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
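+ "<div>\n",
+ "<style scoped>\n",
+ "    .dataframe tbody tr th:only-of-type {\n",
+ "        vertical-align: middle;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe tbody tr th {\n",
+ "        vertical-align: top;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe thead th {\n",
+ "        text-align: right;\n",
+ "    }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>sepal length (cm)</th>\n",
+ "      <th>sepal width (cm)</th>\n",
+ "      <th>petal length (cm)</th>\n",
+ "      <th>petal width (cm)</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>count</th>\n",
+ "      <td>150.000000</td>\n",
+ "      <td>150.000000</td>\n",
+ "      <td>150.000000</td>\n",
+ "      <td>150.000000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>mean</th>\n",
+ "      <td>5.843333</td>\n",
+ "      <td>3.057333</td>\n",
+ "      <td>3.758000</td>\n",
+ "      <td>1.199333</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>std</th>\n",
+ "      <td>0.828066</td>\n",
+ "      <td>0.435866</td>\n",
+ "      <td>1.765298</td>\n",
+ "      <td>0.762238</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>min</th>\n",
+ "      <td>4.300000</td>\n",
+ "      <td>2.000000</td>\n",
+ "      <td>1.000000</td>\n",
+ "      <td>0.100000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>25%</th>\n",
+ "      <td>5.100000</td>\n",
+ "      <td>2.800000</td>\n",
+ "      <td>1.600000</td>\n",
+ "      <td>0.300000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>50%</th>\n",
+ "      <td>5.800000</td>\n",
+ "      <td>3.000000</td>\n",
+ "      <td>4.350000</td>\n",
+ "      <td>1.300000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>75%</th>\n",
+ "      <td>6.400000</td>\n",
+ "      <td>3.300000</td>\n",
+ "      <td>5.100000</td>\n",
+ "      <td>1.800000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>max</th>\n",
+ "      <td>7.900000</td>\n",
+ "      <td>4.400000</td>\n",
+ "      <td>6.900000</td>\n",
+ "      <td>2.500000</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"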
+ ],
+ "text/plain": [
+ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
+ "count 150.000000 150.000000 150.000000 150.000000\n",
+ "mean 5.843333 3.057333 3.758000 1.199333\n",
+ "std 0.828066 0.435866 1.765298 0.762238\n",
+ "min 4.300000 2.000000 1.000000 0.100000\n",
+ "25% 5.100000 2.800000 1.600000 0.300000\n",
+ "50% 5.800000 3.000000 4.350000 1.300000\n",
+ "75% 6.400000 3.300000 5.100000 1.800000\n",
+ "max 7.900000 4.400000 6.900000 2.500000"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 8
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zjjtW5hPGMuM"
+ },
+ "source": [
+ "The output above shows the number of non-null data points (count), mean, standard deviation, minimum, lower quartile (25%), median (50%), upper quartile (75%), and maximum value of each column."
]
},
{
@@ -216,20 +359,117 @@
},
"source": [
"### `DataFrame.head`\n",
- "Next, let's see what the first few rows of our `DataFrame` look like:"
+ "With the functions and attributes above, we have a top-level view of the dataset. We know how many data points and features there are, the data type of each feature, and the number of non-null values for each feature.\n",
+ "\n",
+ "Now it's time to look at the data itself. Let's see what the first few rows (the first few data points) of our `DataFrame` look like:"
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false,
- "id": "DZMJZh0OgRrw"
+ "id": "DZMJZh0OgRrw",
+ "outputId": "c12ac408-abdb-48a5-ca3f-93b02f963b2f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 204
+ }
},
"source": [
"iris_df.head()"
],
- "execution_count": null,
- "outputs": []
+ "execution_count": 5,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
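+ "<div>\n",
+ "<style scoped>\n",
+ "    .dataframe tbody tr th:only-of-type {\n",
+ "        vertical-align: middle;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe tbody tr th {\n",
+ "        vertical-align: top;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe thead th {\n",
+ "        text-align: right;\n",
+ "    }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>sepal length (cm)</th>\n",
+ "      <th>sepal width (cm)</th>\n",
+ "      <th>petal length (cm)</th>\n",
+ "      <th>petal width (cm)</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>5.1</td>\n",
+ "      <td>3.5</td>\n",
+ "      <td>1.4</td>\n",
+ "      <td>0.2</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>4.9</td>\n",
+ "      <td>3.0</td>\n",
+ "      <td>1.4</td>\n",
+ "      <td>0.2</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>4.7</td>\n",
+ "      <td>3.2</td>\n",
+ "      <td>1.3</td>\n",
+ "      <td>0.2</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>4.6</td>\n",
+ "      <td>3.1</td>\n",
+ "      <td>1.5</td>\n",
+ "      <td>0.2</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>5.0</td>\n",
+ "      <td>3.6</td>\n",
+ "      <td>1.4</td>\n",
+ "      <td>0.2</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"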
+ ],
+ "text/plain": [
+ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
+ "0 5.1 3.5 1.4 0.2\n",
+ "1 4.9 3.0 1.4 0.2\n",
+ "2 4.7 3.2 1.3 0.2\n",
+ "3 4.6 3.1 1.5 0.2\n",
+ "4 5.0 3.6 1.4 0.2"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 5
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EBHEimZuEFQK"
+ },
+ "source": [
+ "In the output, we can see five entries of the dataset. The index on the left (0 to 4) shows that these are the first five rows."
+ ]
},
{
"cell_type": "markdown",
@@ -239,7 +479,7 @@
"source": [
"### Exercise:\n",
"\n",
- "By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?"
+ "As the example above shows, by default `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out a way to display more than five rows?"
]
},
{
@@ -252,7 +492,7 @@
"source": [
"# Hint: Consult the documentation by using iris_df.head?"
],
- "execution_count": null,
+ "execution_count": 6,
"outputs": []
},
{
@@ -262,20 +502,106 @@
},
"source": [
"### `DataFrame.tail`\n",
- "The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:"
+ "We can also look at the data from the end instead of the beginning. The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:"
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false,
- "id": "heanjfGWgRr2"
+ "id": "heanjfGWgRr2",
+ "outputId": "2930cf87-bfeb-4ddc-8be1-53d0e57a06b3",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 204
+ }
},
"source": [
"iris_df.tail()"
],
- "execution_count": null,
- "outputs": []
+ "execution_count": 7,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
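+ "<div>\n",
+ "<style scoped>\n",
+ "    .dataframe tbody tr th:only-of-type {\n",
+ "        vertical-align: middle;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe tbody tr th {\n",
+ "        vertical-align: top;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe thead th {\n",
+ "        text-align: right;\n",
+ "    }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>sepal length (cm)</th>\n",
+ "      <th>sepal width (cm)</th>\n",
+ "      <th>petal length (cm)</th>\n",
+ "      <th>petal width (cm)</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>145</th>\n",
+ "      <td>6.7</td>\n",
+ "      <td>3.0</td>\n",
+ "      <td>5.2</td>\n",
+ "      <td>2.3</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>146</th>\n",
+ "      <td>6.3</td>\n",
+ "      <td>2.5</td>\n",
+ "      <td>5.0</td>\n",
+ "      <td>1.9</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>147</th>\n",
+ "      <td>6.5</td>\n",
+ "      <td>3.0</td>\n",
+ "      <td>5.2</td>\n",
+ "      <td>2.0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>148</th>\n",
+ "      <td>6.2</td>\n",
+ "      <td>3.4</td>\n",
+ "      <td>5.4</td>\n",
+ "      <td>2.3</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>149</th>\n",
+ "      <td>5.9</td>\n",
+ "      <td>3.0</td>\n",
+ "      <td>5.1</td>\n",
+ "      <td>1.8</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"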
+ ],
+ "text/plain": [
+ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
+ "145 6.7 3.0 5.2 2.3\n",
+ "146 6.3 2.5 5.0 1.9\n",
+ "147 6.5 3.0 5.2 2.0\n",
+ "148 6.2 3.4 5.4 2.3\n",
+ "149 5.9 3.0 5.1 1.8"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 7
+ }
+ ]
},
{
"cell_type": "markdown",
@@ -283,7 +609,9 @@
"id": "31kBWfyLgRr3"
},
"source": [
- "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.\n",
+ "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets. \n",
+ "\n",
+ "All the functions and attributes shown above, along with their code examples, help us get a feel for the data.\n",
"\n",
"> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with."
]