Added DataFrame.describe() and elaborated on some of the existing explanations.

pull/166/head
Nirmalya Misra 3 years ago
parent 503468f6ea
commit 943172fd55

@ -76,10 +76,10 @@
"cell_type": "code",
"metadata": {
"id": "LOe5jQohhulf",
"outputId": "9cf67a6a-5779-453b-b2ed-58f4f1aab507",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputId": "968f9fb0-6cb7-4985-c64b-b332c086bdbf"
},
"source": [
"iris_df.shape"
@ -123,15 +123,15 @@
"cell_type": "code",
"metadata": {
"id": "YPGh_ziji-CY",
"outputId": "ca186194-a126-4348-f58e-aab7ebc8f7b7",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputId": "ffad1c9f-06b4-49d9-b409-5e4cc1b9f19b"
},
"source": [
"iris_df.columns"
],
"execution_count": 4,
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
@ -143,7 +143,7 @@
]
},
"metadata": {},
"execution_count": 4
"execution_count": 3
}
]
},
@ -163,7 +163,7 @@
},
"source": [
"### `DataFrame.info`\n",
"Let's take a look at this dataset to see what we have:"
"The amount of data(given by the `shape` attribute) and the name of the features or columns(given by the `columns` attribute) tell us something about the dataset. Now, we would want to dive deeper into the dataset. The `DataFrame.info()` function is quite useful for this. "
]
},
{
@ -171,15 +171,15 @@
"metadata": {
"trusted": false,
"id": "dHHRyG0_gRrt",
"outputId": "ca9de335-9e65-486a-d1e2-3e73d060c701",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputId": "325edd04-3809-4d71-b6c3-94c65b162882"
},
"source": [
"iris_df.info()"
],
"execution_count": 3,
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
@ -206,7 +206,150 @@
"id": "1XgVMpvigRru"
},
"source": [
"From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers."
"From here, we get to can make a few observations:\n",
"1. The DataType of each column: In this dataset, all of the data is stored as 64-bit floating-point numbers.\n",
"2. Number of Non-Null values: Dealing with null values is an important step in data preparation. It will be dealt with later in the notebook."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IYlyxbpWFEF4"
},
"source": [
"### DataFrame.describe()\n",
"Say we have a lot of numerical data in our dataset. Univariate statistical calculations such as the mean, median, quartiles etc. can be done on each of the columns individually. The `DataFrame.describe()` function provides us with a statistical summary of the numerical columns of a dataset.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "tWV-CMstFIRA",
"outputId": "7c5cd72f-51d8-474c-966b-d2fbbdb7b7fc",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 297
}
},
"source": [
"iris_df.describe()"
],
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>5.843333</td>\n",
" <td>3.057333</td>\n",
" <td>3.758000</td>\n",
" <td>1.199333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.828066</td>\n",
" <td>0.435866</td>\n",
" <td>1.765298</td>\n",
" <td>0.762238</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>4.300000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.100000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>5.100000</td>\n",
" <td>2.800000</td>\n",
" <td>1.600000</td>\n",
" <td>0.300000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>5.800000</td>\n",
" <td>3.000000</td>\n",
" <td>4.350000</td>\n",
" <td>1.300000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>6.400000</td>\n",
" <td>3.300000</td>\n",
" <td>5.100000</td>\n",
" <td>1.800000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>7.900000</td>\n",
" <td>4.400000</td>\n",
" <td>6.900000</td>\n",
" <td>2.500000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"count 150.000000 150.000000 150.000000 150.000000\n",
"mean 5.843333 3.057333 3.758000 1.199333\n",
"std 0.828066 0.435866 1.765298 0.762238\n",
"min 4.300000 2.000000 1.000000 0.100000\n",
"25% 5.100000 2.800000 1.600000 0.300000\n",
"50% 5.800000 3.000000 4.350000 1.300000\n",
"75% 6.400000 3.300000 5.100000 1.800000\n",
"max 7.900000 4.400000 6.900000 2.500000"
]
},
"metadata": {},
"execution_count": 8
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zjjtW5hPGMuM"
},
"source": [
"The output above shows the total number of data points, mean, standard deviation, minimum, lower quartile(25%), median(50%), upper quartile(75%) and the maximum value of each column."
]
},
{
@ -216,20 +359,117 @@
},
"source": [
"### `DataFrame.head`\n",
"Next, let's see what the first few rows of our `DataFrame` look like:"
"With all the above functions and attributes, we have got a top level view of the dataset. We know how many data points are there, how many features are there, the data type of each feature and the number of non-null values for each feature.\n",
"\n",
"Now its time to look at the data itself. Let's see what the first few rows(the first few datapoints) of our `DataFrame` look like:"
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false,
"id": "DZMJZh0OgRrw"
"id": "DZMJZh0OgRrw",
"outputId": "c12ac408-abdb-48a5-ca3f-93b02f963b2f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
}
},
"source": [
"iris_df.head()"
],
"execution_count": null,
"outputs": []
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"0 5.1 3.5 1.4 0.2\n",
"1 4.9 3.0 1.4 0.2\n",
"2 4.7 3.2 1.3 0.2\n",
"3 4.6 3.1 1.5 0.2\n",
"4 5.0 3.6 1.4 0.2"
]
},
"metadata": {},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EBHEimZuEFQK"
},
"source": [
"As the output here, we can see five(5) entries of the dataset. If we look at the index at the left, we find out that these are the first five rows."
]
},
{
"cell_type": "markdown",
@ -239,7 +479,7 @@
"source": [
"### Exercise:\n",
"\n",
"By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?"
"From the example given above, it is clear that, by default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out a way to display more than five rows?"
]
},
{
@ -252,7 +492,7 @@
"source": [
"# Hint: Consult the documentation by using iris_df.head?"
],
"execution_count": null,
"execution_count": 6,
"outputs": []
},
{
@ -262,20 +502,106 @@
},
"source": [
"### `DataFrame.tail`\n",
"The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:"
"Another way of looking at the data can be from the end(instead of the beginning). The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:"
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false,
"id": "heanjfGWgRr2"
"id": "heanjfGWgRr2",
"outputId": "2930cf87-bfeb-4ddc-8be1-53d0e57a06b3",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
}
},
"source": [
"iris_df.tail()"
],
"execution_count": null,
"outputs": []
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>6.7</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>6.3</td>\n",
" <td>2.5</td>\n",
" <td>5.0</td>\n",
" <td>1.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>148</th>\n",
" <td>6.2</td>\n",
" <td>3.4</td>\n",
" <td>5.4</td>\n",
" <td>2.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>149</th>\n",
" <td>5.9</td>\n",
" <td>3.0</td>\n",
" <td>5.1</td>\n",
" <td>1.8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"145 6.7 3.0 5.2 2.3\n",
"146 6.3 2.5 5.0 1.9\n",
"147 6.5 3.0 5.2 2.0\n",
"148 6.2 3.4 5.4 2.3\n",
"149 5.9 3.0 5.1 1.8"
]
},
"metadata": {},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
@ -283,7 +609,9 @@
"id": "31kBWfyLgRr3"
},
"source": [
"In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.\n",
"In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets. \n",
"\n",
"All the functions and attributes shown above with the help of code examples, help us get a look and feel of the data. \n",
"\n",
"> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with."
]

Loading…
Cancel
Save