{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Data Preparation\n",
"\n",
"[Original Notebook source from *Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n",
"\n",
"## Exploring `DataFrame` information\n",
"\n",
"> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.\n",
"\n",
"Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. \n",
"\n",
"In order to explore our `DataFramme`, we will import the Python `scikit-learn` library and use an iconic dataset that every data scientist has seen hundreds of times: British biologist Ronald Fisher's **Iris data set** used in his 1936 paper \"*The use of multiple measurements in taxonomic problems*\":"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 1,
"source": [
"import pandas as pd\n",
"from sklearn.datasets import load_iris\n",
"\n",
"iris = load_iris()\n",
"iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"### `DataFrame.info`\r\n",
"Let's take a look at this dataset to see what we have:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 2,
"source": [
"iris_df.info()"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 150 entries, 0 to 149\n",
"Data columns (total 4 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 sepal length (cm) 150 non-null float64\n",
" 1 sepal width (cm) 150 non-null float64\n",
" 2 petal length (cm) 150 non-null float64\n",
" 3 petal width (cm) 150 non-null float64\n",
"dtypes: float64(4)\n",
"memory usage: 4.8 KB\n"
]
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### `DataFrame.head`\r\n",
"Next, let's see what the first few rows of our `DataFrame` look like:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 3,
"source": [
"iris_df.head()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"0 5.1 3.5 1.4 0.2\n",
"1 4.9 3.0 1.4 0.2\n",
"2 4.7 3.2 1.3 0.2\n",
"3 4.6 3.1 1.5 0.2\n",
"4 5.0 3.6 1.4 0.2"
]
},
"metadata": {},
"execution_count": 3
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"### Exercise:\r\n",
"\r\n",
"By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 4,
"source": [
"# Hint: Consult the documentation by using iris_df.head?"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"### `DataFrame.tail`\r\n",
"The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"source": [
"iris_df.tail()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>6.7</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>6.3</td>\n",
" <td>2.5</td>\n",
" <td>5.0</td>\n",
" <td>1.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>148</th>\n",
" <td>6.2</td>\n",
" <td>3.4</td>\n",
" <td>5.4</td>\n",
" <td>2.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>149</th>\n",
" <td>5.9</td>\n",
" <td>3.0</td>\n",
" <td>5.1</td>\n",
" <td>1.8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"145 6.7 3.0 5.2 2.3\n",
"146 6.3 2.5 5.0 1.9\n",
"147 6.5 3.0 5.2 2.0\n",
"148 6.2 3.4 5.4 2.3\n",
"149 5.9 3.0 5.1 1.8"
]
},
"metadata": {},
"execution_count": 5
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.\r\n",
"\r\n",
"> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with."
],
"metadata": {}
},
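{
"cell_type": "markdown",
"source": [
"One convenient way to examine both ends of a `DataFrame` at once is to concatenate the output of `head` and `tail`; a minimal sketch (the choice of three rows from each end is arbitrary):"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Peek at the first and last few rows in a single view\n",
"pd.concat([iris_df.head(3), iris_df.tail(3)])"
],
"outputs": [],
"metadata": {
"trusted": false
}
},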
{
"cell_type": "markdown",
"source": [
"## Dealing with missing data\r\n",
"\r\n",
"> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.\r\n",
"\r\n",
"Most of the time the datasets you want to use (of have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.\r\n",
"\r\n",
"Pandas handles missing values in two ways. The first you've seen before in previous sections: `NaN`, or Not a Number. This is a actually a special value that is part of the IEEE floating-point specification and it is only used to indicate missing floating-point values.\r\n",
"\r\n",
"For missing values apart from floats, pandas uses the Python `None` object. While it might seem confusing that you will encounter two different kinds of values that say essentially the same thing, there are sound programmatic reasons for this design choice and, in practice, going this route enables pandas to deliver a good compromise for the vast majority of cases. Notwithstanding this, both `None` and `NaN` carry restrictions that you need to be mindful of with regards to how they can be used."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### `None`: non-float missing data\r\n",
"Because `None` comes from Python, it cannot be used in NumPy and pandas arrays that are not of data type `'object'`. Remember, NumPy arrays (and the data structures in pandas) can contain only one type of data. This is what gives them their tremendous power for large-scale data and computational work, but it also limits their flexibility. Such arrays have to upcast to the “lowest common denominator,” the data type that will encompass everything in the array. When `None` is in the array, it means you are working with Python objects.\r\n",
"\r\n",
"To see this in action, consider the following example array (note the `dtype` for it):"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 6,
"source": [
"import numpy as np\n",
"\n",
"example1 = np.array([2, None, 6, 8])\n",
"example1"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([2, None, 6, 8], dtype=object)"
]
},
"metadata": {},
"execution_count": 6
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"The reality of upcast data types carries two side effects with it. First, operations will be carried out at the level of interpreted Python code rather than compiled NumPy code. Essentially, this means that any operations involving `Series` or `DataFrames` with `None` in them will be slower. While you would probably not notice this performance hit, for large datasets it might become an issue.\n",
"\n",
"The second side effect stems from the first. Because `None` essentially drags `Series` or `DataFrame`s back into the world of vanilla Python, using NumPy/pandas aggregations like `sum()` or `min()` on arrays that contain a ``None`` value will generally produce an error:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 7,
"source": [
"example1.sum()"
],
"outputs": [
{
"output_type": "error",
"ename": "TypeError",
"evalue": "unsupported operand type(s) for +: 'int' and 'NoneType'",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-7-ce9901ad18bd>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/anaconda3/lib/python3.8/site-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n",
"\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'"
]
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them."
],
"metadata": {}
},
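{
"cell_type": "markdown",
"source": [
"One possible workaround (a minimal sketch) is to convert the data to a pandas `Series`, coercing it to a numeric dtype so that `None` becomes the `NaN` value discussed next, which aggregations can then skip:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Coerce the object array to a numeric Series; None becomes NaN,\n",
"# which pandas aggregations skip by default\n",
"pd.to_numeric(pd.Series(example1)).sum()"
],
"outputs": [],
"metadata": {
"trusted": false
}
},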
{
"cell_type": "markdown",
"source": [
"### `NaN`: missing float values\n",
"\n",
"In contrast to `None`, NumPy (and therefore pandas) supports `NaN` for its fast, vectorized operations and ufuncs. The bad news is that any arithmetic performed on `NaN` always results in `NaN`. For example:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 8,
"source": [
"np.nan + 1"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"nan"
]
},
"metadata": {},
"execution_count": 8
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "code",
"execution_count": 9,
"source": [
"np.nan * 0"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"nan"
]
},
"metadata": {},
"execution_count": 9
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"The good news: aggregations run on arrays with `NaN` in them don't pop errors. The bad news: the results are not uniformly useful:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 10,
"source": [
"example2 = np.array([2, np.nan, 6, 8]) \n",
"example2.sum(), example2.min(), example2.max()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(nan, nan, nan)"
]
},
"metadata": {},
"execution_count": 10
}
],
"metadata": {
"trusted": false
}
},
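{
"cell_type": "markdown",
"source": [
"If you need aggregates that simply ignore missing values at the NumPy level, NumPy provides `NaN`-aware counterparts such as `np.nansum`, `np.nanmin`, and `np.nanmax`; a minimal sketch:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# NaN-aware aggregations skip NaN values instead of propagating them\n",
"np.nansum(example2), np.nanmin(example2), np.nanmax(example2)"
],
"outputs": [],
"metadata": {
"trusted": false
}
},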
{
"cell_type": "markdown",
"source": [
"### Exercise:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 11,
"source": [
"# What happens if you add np.nan and None together?\n"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"Remember: `NaN` is just for missing floating-point values; there is no `NaN` equivalent for integers, strings, or Booleans."
],
"metadata": {}
},
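{
"cell_type": "markdown",
"source": [
"To see this restriction in action, here is a minimal sketch showing that introducing `NaN` into otherwise-integer data forces the container to a floating-point dtype:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Both NumPy and pandas upcast integer data to float64 once NaN appears\n",
"np.array([1, 2, np.nan]).dtype, pd.Series([1, 2, np.nan]).dtype"
],
"outputs": [],
"metadata": {
"trusted": false
}
},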
{
"cell_type": "markdown",
"source": [
"### `NaN` and `None`: null values in pandas\n",
"\n",
"Even though `NaN` and `None` can behave somewhat differently, pandas is nevertheless built to handle them interchangeably. To see what we mean, consider a `Series` of integers:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 12,
"source": [
"int_series = pd.Series([1, 2, 3], dtype=int)\n",
"int_series"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 1\n",
"1 2\n",
"2 3\n",
"dtype: int64"
]
},
"metadata": {},
"execution_count": 12
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"### Exercise:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 13,
"source": [
"# Now set an element of int_series equal to None.\n",
"# How does that element show up in the Series?\n",
"# What is the dtype of the Series?\n"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"In the process of upcasting data types to establish data homogeneity in `Seires` and `DataFrame`s, pandas will willingly switch missing values between `None` and `NaN`. Because of this design feature, it can be helpful to think of `None` and `NaN` as two different flavors of \"null\" in pandas. Indeed, some of the core methods you will use to deal with missing values in pandas reflect this idea in their names:\n",
"\n",
"- `isnull()`: Generates a Boolean mask indicating missing values\n",
"- `notnull()`: Opposite of `isnull()`\n",
"- `dropna()`: Returns a filtered version of the data\n",
"- `fillna()`: Returns a copy of the data with missing values filled or imputed\n",
"\n",
"These are important methods to master and get comfortable with, so let's go over them each in some depth."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### Detecting null values\n",
"Both `isnull()` and `notnull()` are your primary methods for detecting null data. Both return Boolean masks over your data."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 14,
"source": [
"example3 = pd.Series([0, np.nan, '', None])"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "code",
"execution_count": 15,
"source": [
"example3.isnull()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 False\n",
"1 True\n",
"2 False\n",
"3 True\n",
"dtype: bool"
]
},
"metadata": {},
"execution_count": 15
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's nevertheless a perfectly good integer and pandas treats it as such. `''` is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.\n",
"\n",
"Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### Exercise:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 16,
"source": [
"# Try running example3[example3.notnull()].\n",
"# Before you do so, what do you expect to see?\n"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data."
],
"metadata": {}
},
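{
"cell_type": "markdown",
"source": [
"In practice, a common first step is to count how many values are missing in each column by summing the Boolean mask that `isnull()` returns. A minimal sketch, using a small hypothetical `DataFrame` built just for illustration:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# A small hypothetical DataFrame just for illustration\n",
"tmp = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})\n",
"# True counts as 1, so summing the mask counts the nulls per column\n",
"tmp.isnull().sum()"
],
"outputs": [],
"metadata": {
"trusted": false
}
},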
{
"cell_type": "markdown",
"source": [
"### Dropping null values\n",
"\n",
"Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) To see this in action, let's return to `example3`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 17,
"source": [
"example3 = example3.dropna()\n",
"example3"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 0\n",
"2 \n",
"dtype: object"
]
},
"metadata": {},
"execution_count": 17
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"Note that this should look like your output from `example3[example3.notnull()]`. The difference here is that, rather than just indexing on the masked values, `dropna` has removed those missing values from the `Series` `example3`.\n",
"\n",
"Because `DataFrame`s have two dimensions, they afford more options for dropping data."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 18,
"source": [
"example4 = pd.DataFrame([[1, np.nan, 7], \n",
" [2, 5, 8], \n",
" [np.nan, 6, 9]])\n",
"example4"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 NaN 7\n",
"1 2.0 5.0 8\n",
"2 NaN 6.0 9"
]
},
"metadata": {},
"execution_count": 18
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"(Did you notice that pandas upcast two of the columns to floats to accommodate the `NaN`s?)\n",
"\n",
"You cannot drop a single value from a `DataFrame`, so you have to drop full rows or columns. Depending on what you are doing, you might want to do one or the other, and so pandas gives you options for both. Because in data science, columns generally represent variables and rows represent observations, you are more likely to drop rows of data; the default setting for `dropna()` is to drop all rows that contain any null values:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 19,
"source": [
"example4.dropna()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"1 2.0 5.0 8"
]
},
"metadata": {},
"execution_count": 19
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"If necessary, you can drop NA values from columns. Use `axis=1` to do so:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 20,
"source": [
"example4.dropna(axis='columns')"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 2\n",
"0 7\n",
"1 8\n",
"2 9"
]
},
"metadata": {},
"execution_count": 20
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those setting in `dropna` with the `how` and `thresh` parameters.\n",
"\n",
"By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 21,
"source": [
"example4[3] = np.nan\n",
"example4"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1.0 NaN 7 NaN\n",
"1 2.0 5.0 8 NaN\n",
"2 NaN 6.0 9 NaN"
]
},
"metadata": {},
"execution_count": 21
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"### Exercise:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 22,
"source": [
"# How might you go about dropping just column 3?\n",
"# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 23,
"source": [
"example4.dropna(axis='rows', thresh=3)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"1 2.0 5.0 8 NaN"
]
},
"metadata": {},
"execution_count": 23
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"Here, the first and last row have been dropped, because they contain only two non-null values."
],
"metadata": {}
},
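{
"cell_type": "markdown",
"source": [
"The same logic works column-wise; a minimal sketch that keeps only the columns of `example4` with at least three non-null values:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Only column 2 has three non-null values, so it is the only one kept\n",
"example4.dropna(axis='columns', thresh=3)"
],
"outputs": [],
"metadata": {
"trusted": false
}
},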
{
"cell_type": "markdown",
"source": [
"### Filling null values\n",
"\n",
"Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 24,
"source": [
"example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n",
"example5"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"a 1.0\n",
"b NaN\n",
"c 2.0\n",
"d NaN\n",
"e 3.0\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 24
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"You can fill all of the null entries with a single value, such as `0`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 25,
"source": [
"example5.fillna(0)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"a 1.0\n",
"b 0.0\n",
"c 2.0\n",
"d 0.0\n",
"e 3.0\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 25
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"### Exercise:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 26,
"source": [
"# What happens if you try to fill null values with a string, like ''?\n"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"You can **forward-fill** null values, which is to use the last valid value to fill a null:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 27,
"source": [
"example5.fillna(method='ffill')"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"a 1.0\n",
"b 1.0\n",
"c 2.0\n",
"d 2.0\n",
"e 3.0\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 27
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"You can also **back-fill** to propagate the next valid value backward to fill a null:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 28,
"source": [
"example5.fillna(method='bfill')"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"a 1.0\n",
"b 2.0\n",
"c 2.0\n",
"d 3.0\n",
"e 3.0\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 28
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"As you might guess, this works the same with `DataFrame`s, but you can also specify an `axis` along which to fill null values:"
],
"metadata": {
"collapsed": true
}
},
{
"cell_type": "code",
"execution_count": 29,
"source": [
"example4"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1.0 NaN 7 NaN\n",
"1 2.0 5.0 8 NaN\n",
"2 NaN 6.0 9 NaN"
]
},
"metadata": {},
"execution_count": 29
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "code",
"execution_count": 30,
"source": [
"example4.fillna(method='ffill', axis=1)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>7.0</td>\n",
" <td>7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1.0 1.0 7.0 7.0\n",
"1 2.0 5.0 8.0 8.0\n",
"2 NaN 6.0 9.0 9.0"
]
},
"metadata": {},
"execution_count": 30
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"Notice that when a previous value is not available for forward-filling, the null value remains."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### Exercise:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 31,
"source": [
"# What output does example4.fillna(method='bfill', axis=1) produce?\n",
"# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n",
"# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n"
],
"outputs": [],
"metadata": {
"collapsed": true,
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values with the average of all of the values in the `DataFrame`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 32,
"source": [
"example4.fillna(example4.mean())"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>5.5</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.5</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1.0 5.5 7 NaN\n",
"1 2.0 5.0 8 NaN\n",
"2 1.5 6.0 9 NaN"
]
},
"metadata": {},
"execution_count": 32
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"Notice that column 3 is still valueless: the default direction is to fill values row-wise.\n",
"\n",
"> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets."
],
"metadata": {}
},
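{
"cell_type": "markdown",
"source": [
"As noted above, filling with `example4.mean()` leaves column 3 untouched because that column's mean is itself `NaN`. One possible workaround (a minimal sketch) is to fill every hole with a single scalar computed over all of the non-null values in the `DataFrame`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# stack() drops the NaN values and flattens the DataFrame into a Series,\n",
"# so its mean is the average of every non-null value in example4\n",
"example4.fillna(example4.stack().mean())"
],
"outputs": [],
"metadata": {
"trusted": false
}
},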
{
"cell_type": "markdown",
"source": [
"## Removing duplicate data\n",
"\n",
"> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.\n",
"\n",
"In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### Identifying duplicates: `duplicated`\n",
"\n",
"You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an ealier one. Let's create another example `DataFrame` to see this in action."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 33,
"source": [
"example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n",
" 'numbers': [1, 2, 1, 3, 3]})\n",
"example6"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letters</th>\n",
" <th>numbers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>B</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letters numbers\n",
"0 A 1\n",
"1 B 2\n",
"2 A 1\n",
"3 B 3\n",
"4 B 3"
]
},
"metadata": {},
"execution_count": 33
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "code",
"execution_count": 34,
"source": [
"example6.duplicated()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"4 True\n",
"dtype: bool"
]
},
"metadata": {},
"execution_count": 34
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"### Dropping duplicates: `drop_duplicates`\n",
"`drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are `False`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 35,
"source": [
"example6.drop_duplicates()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letters</th>\n",
" <th>numbers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letters numbers\n",
"0 A 1\n",
"1 B 2\n",
"3 B 3"
]
},
"metadata": {},
"execution_count": 35
}
],
"metadata": {
"trusted": false
}
},
{
"cell_type": "markdown",
"source": [
"Both `duplicated` and `drop_duplicates` default to consider all columnsm but you can specify that they examine only a subset of columns in your `DataFrame`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 36,
"source": [
"example6.drop_duplicates(['letters'])"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letters</th>\n",
" <th>numbers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letters numbers\n",
"0 A 1\n",
"1 B 2"
]
},
"metadata": {},
"execution_count": 36
}
],
"metadata": {
"trusted": false
}
},
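{
"cell_type": "markdown",
"source": [
"If you care about which of the duplicated rows survives, `drop_duplicates` also accepts a `keep` parameter: `'first'` (the default), `'last'`, or `False` to drop every duplicated row; a minimal sketch:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Keep the last occurrence of each letter instead of the first\n",
"example6.drop_duplicates(['letters'], keep='last')"
],
"outputs": [],
"metadata": {
"trusted": false
}
},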
{
"cell_type": "markdown",
"source": [
"> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you inaccurate results!"
],
"metadata": {}
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.8 64-bit ('base': conda)"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"name": "python",
"file_extension": ".py",
"version": "3.8.8",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"version": 3,
"name": "ipython"
}
},
"interpreter": {
"hash": "ac36fb7022a775f2750f61e1a6104d2d5a9eb3fb9bd004b80f1c771537b93945"
}
},
"nbformat": 4,
"nbformat_minor": 1
}