From 943172fd55d7d9b08d8bee906086cf43402041af Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Mon, 4 Oct 2021 21:50:02 +0530 Subject: [PATCH] Added DataFrame.describe() and elaborated on some of the existing explanations. --- .../08-data-preparation/notebook.ipynb | 372 ++++++++++++++++-- 1 file changed, 350 insertions(+), 22 deletions(-) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index b1c1d7a..c6ca05d 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -76,10 +76,10 @@ "cell_type": "code", "metadata": { "id": "LOe5jQohhulf", - "outputId": "9cf67a6a-5779-453b-b2ed-58f4f1aab507", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "968f9fb0-6cb7-4985-c64b-b332c086bdbf" }, "source": [ "iris_df.shape" @@ -123,15 +123,15 @@ "cell_type": "code", "metadata": { "id": "YPGh_ziji-CY", - "outputId": "ca186194-a126-4348-f58e-aab7ebc8f7b7", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "ffad1c9f-06b4-49d9-b409-5e4cc1b9f19b" }, "source": [ "iris_df.columns" ], - "execution_count": 4, + "execution_count": 3, "outputs": [ { "output_type": "execute_result", @@ -143,7 +143,7 @@ ] }, "metadata": {}, - "execution_count": 4 + "execution_count": 3 } ] }, @@ -163,7 +163,7 @@ }, "source": [ "### `DataFrame.info`\n", - "Let's take a look at this dataset to see what we have:" + "The amount of data (given by the `shape` attribute) and the names of the features or columns (given by the `columns` attribute) tell us something about the dataset. Now we want to dive deeper into the dataset. The `DataFrame.info()` function is quite useful for this. 
" ] }, { @@ -171,15 +171,15 @@ "metadata": { "trusted": false, "id": "dHHRyG0_gRrt", - "outputId": "ca9de335-9e65-486a-d1e2-3e73d060c701", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "325edd04-3809-4d71-b6c3-94c65b162882" }, "source": [ "iris_df.info()" ], - "execution_count": 3, + "execution_count": 4, "outputs": [ { "output_type": "stream", @@ -206,7 +206,150 @@ "id": "1XgVMpvigRru" }, "source": [ - "From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers." + "From here, we get to can make a few observations:\n", + "1. The DataType of each column: In this dataset, all of the data is stored as 64-bit floating-point numbers.\n", + "2. Number of Non-Null values: Dealing with null values is an important step in data preparation. It will be dealt with later in the notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IYlyxbpWFEF4" + }, + "source": [ + "### DataFrame.describe()\n", + "Say we have a lot of numerical data in our dataset. Univariate statistical calculations such as the mean, median, quartiles etc. can be done on each of the columns individually. The `DataFrame.describe()` function provides us with a statistical summary of the numerical columns of a dataset.\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tWV-CMstFIRA", + "outputId": "7c5cd72f-51d8-474c-966b-d2fbbdb7b7fc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "iris_df.describe()" + ], + "execution_count": 8, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", + "count 150.000000 150.000000 150.000000 150.000000\n", + "mean 5.843333 3.057333 3.758000 1.199333\n", + "std 0.828066 0.435866 1.765298 0.762238\n", + "min 4.300000 2.000000 1.000000 0.100000\n", + "25% 5.100000 2.800000 1.600000 0.300000\n", + "50% 5.800000 3.000000 4.350000 1.300000\n", + "75% 6.400000 3.300000 5.100000 1.800000\n", + "max 7.900000 4.400000 6.900000 2.500000" + ] + }, + "metadata": {}, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zjjtW5hPGMuM" + }, + "source": [ + "The output above shows the total number of data points, mean, standard deviation, minimum, lower quartile(25%), median(50%), upper quartile(75%) and the maximum value of each column." ] }, { @@ -216,20 +359,117 @@ }, "source": [ "### `DataFrame.head`\n", - "Next, let's see what the first few rows of our `DataFrame` look like:" + "With all the above functions and attributes, we have got a top level view of the dataset. We know how many data points are there, how many features are there, the data type of each feature and the number of non-null values for each feature.\n", + "\n", + "Now its time to look at the data itself. Let's see what the first few rows(the first few datapoints) of our `DataFrame` look like:" ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "DZMJZh0OgRrw" + "id": "DZMJZh0OgRrw", + "outputId": "c12ac408-abdb-48a5-ca3f-93b02f963b2f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } }, "source": [ "iris_df.head()" ], - "execution_count": null, - "outputs": [] + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", + "0 5.1 3.5 1.4 0.2\n", + "1 4.9 3.0 1.4 0.2\n", + "2 4.7 3.2 1.3 0.2\n", + "3 4.6 3.1 1.5 0.2\n", + "4 5.0 3.6 1.4 0.2" + ] + }, + "metadata": {}, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EBHEimZuEFQK" + }, + "source": [ + "As the output here, we can see five(5) entries of the dataset. If we look at the index at the left, we find out that these are the first five rows." + ] }, { "cell_type": "markdown", @@ -239,7 +479,7 @@ "source": [ "### Exercise:\n", "\n", - "By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?" + "From the example given above, it is clear that, by default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out a way to display more than five rows?" ] }, { @@ -252,7 +492,7 @@ "source": [ "# Hint: Consult the documentation by using iris_df.head?" ], - "execution_count": null, + "execution_count": 6, "outputs": [] }, { @@ -262,20 +502,106 @@ }, "source": [ "### `DataFrame.tail`\n", - "The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:" + "Another way of looking at the data can be from the end(instead of the beginning). The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:" ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "heanjfGWgRr2" + "id": "heanjfGWgRr2", + "outputId": "2930cf87-bfeb-4ddc-8be1-53d0e57a06b3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } }, "source": [ "iris_df.tail()" ], - "execution_count": null, - "outputs": [] + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", + "145 6.7 3.0 5.2 2.3\n", + "146 6.3 2.5 5.0 1.9\n", + "147 6.5 3.0 5.2 2.0\n", + "148 6.2 3.4 5.4 2.3\n", + "149 5.9 3.0 5.1 1.8" + ] + }, + "metadata": {}, + "execution_count": 7 + } + ] }, { "cell_type": "markdown", @@ -283,7 +609,9 @@ "id": "31kBWfyLgRr3" }, "source": [ - "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.\n", + "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets. \n", + "\n", + "All the functions and attributes shown above with the help of code examples, help us get a look and feel of the data. \n", "\n", "> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with." ]