Data-Science-For-Beginners/translations/en/2-Working-With-Data/08-data-preparation/notebook.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rQ8UhzFpgRra"
   },
   "source": [
    "# Data Preparation\n",
    "\n",
    "[Original Notebook source from *Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n",
    "\n",
    "## Exploring `DataFrame` information\n",
    "\n",
    "> **Learning goal:** By the end of this subsection, you should feel confident in finding general information about the data stored in pandas DataFrames.\n",
    "\n",
    "Once your data is loaded into pandas, it will most likely be in a `DataFrame`. But if your `DataFrame` contains 60,000 rows and 400 columns, where do you even start to understand what you're working with? Luckily, pandas offers some handy tools to quickly get an overview of a `DataFrame` as well as a glimpse of its first and last few rows.\n",
    "\n",
    "To explore this functionality, we will import the Python scikit-learn library and use a classic dataset that every data scientist has encountered countless times: British biologist Ronald Fisher's *Iris* dataset, featured in his 1936 paper \"The use of multiple measurements in taxonomic problems\":\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "id": "hB1RofhdgRrp",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.datasets import load_iris\n",
    "\n",
    "iris = load_iris()\n",
    "iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AGA0A_Y8hMdz"
   },
   "source": [
    "### `DataFrame.shape`\n",
    "We have loaded the Iris Dataset into the variable `iris_df`. Before exploring the data, it’s helpful to know how many data points we have and the overall dimensions of the dataset. Understanding the size of the data can provide valuable context for analysis.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "LOe5jQohhulf",
    "outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(150, 4)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "smE7AGzOhxk2"
   },
   "source": [
    "So, we are working with 150 rows and 4 columns of data. Each row corresponds to one data point, and each column represents a specific feature related to the data frame. Essentially, there are 150 data points, each with 4 features.\n",
    "\n",
    "`shape` in this context is an attribute of the dataframe, not a function, which is why it doesn't have parentheses at the end.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d3AZKs0PinGP"
   },
   "source": [
    "### `DataFrame.columns`\n",
    "Now let's dive into the 4 columns of data. What does each of them actually represent? The `columns` attribute provides the names of the columns in the dataframe.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "YPGh_ziji-CY",
    "outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n",
       "       'petal width (cm)'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TsobcU_VjCC_"
   },
   "source": [
    "As we can see, there are four(4) columns. The `columns` attribute tells us the name of the columns and basically nothing else. This attribute assumes importance when we want to identify the features a dataset contains.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2UTlvkjmgRrs"
   },
   "source": [
    "### `DataFrame.info`\n",
    "The size of the dataset (provided by the `shape` attribute) and the names of the features or columns (provided by the `columns` attribute) give us some initial insights into the dataset. Next, we might want to explore the dataset in more detail. The `DataFrame.info()` function is very helpful for this purpose.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "dHHRyG0_gRrt",
    "outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5",
    "trusted": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 150 entries, 0 to 149\n",
      "Data columns (total 4 columns):\n",
      " #   Column             Non-Null Count  Dtype  \n",
      "---  ------             --------------  -----  \n",
      " 0   sepal length (cm)  150 non-null    float64\n",
      " 1   sepal width (cm)   150 non-null    float64\n",
      " 2   petal length (cm)  150 non-null    float64\n",
      " 3   petal width (cm)   150 non-null    float64\n",
      "dtypes: float64(4)\n",
      "memory usage: 4.8 KB\n"
     ]
    }
   ],
   "source": [
    "iris_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1XgVMpvigRru"
   },
   "source": [
    "From here, we can make a few observations:\n",
    "1. The data type of each column: In this dataset, all the data is stored as 64-bit floating-point numbers.\n",
    "2. Number of non-null values: Handling null values is a crucial step in data preparation. This will be addressed later in the notebook.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IYlyxbpWFEF4"
   },
   "source": [
    "### DataFrame.describe()\n",
    "Imagine we have a dataset with plenty of numerical data. We can perform univariate statistical calculations like mean, median, quartiles, etc., on each column separately. The `DataFrame.describe()` function gives us a statistical summary of the numerical columns in the dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 297
    },
    "id": "tWV-CMstFIRA",
    "outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal length (cm)</th>\n",
       "      <th>sepal width (cm)</th>\n",
       "      <th>petal length (cm)</th>\n",
       "      <th>petal width (cm)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>150.000000</td>\n",
       "      <td>150.000000</td>\n",
       "      <td>150.000000</td>\n",
       "      <td>150.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>5.843333</td>\n",
       "      <td>3.057333</td>\n",
       "      <td>3.758000</td>\n",
       "      <td>1.199333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.828066</td>\n",
       "      <td>0.435866</td>\n",
       "      <td>1.765298</td>\n",
       "      <td>0.762238</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>4.300000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.100000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>5.100000</td>\n",
       "      <td>2.800000</td>\n",
       "      <td>1.600000</td>\n",
       "      <td>0.300000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>5.800000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>4.350000</td>\n",
       "      <td>1.300000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>6.400000</td>\n",
       "      <td>3.300000</td>\n",
       "      <td>5.100000</td>\n",
       "      <td>1.800000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>7.900000</td>\n",
       "      <td>4.400000</td>\n",
       "      <td>6.900000</td>\n",
       "      <td>2.500000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
       "count         150.000000        150.000000         150.000000        150.000000\n",
       "mean            5.843333          3.057333           3.758000          1.199333\n",
       "std             0.828066          0.435866           1.765298          0.762238\n",
       "min             4.300000          2.000000           1.000000          0.100000\n",
       "25%             5.100000          2.800000           1.600000          0.300000\n",
       "50%             5.800000          3.000000           4.350000          1.300000\n",
       "75%             6.400000          3.300000           5.100000          1.800000\n",
       "max             7.900000          4.400000           6.900000          2.500000"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zjjtW5hPGMuM"
   },
   "source": [
    "The output above shows the total number of data points, mean, standard deviation, minimum, lower quartile (25%), median (50%), upper quartile (75%), and the maximum value of each column.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-lviAu99gRrv"
   },
   "source": [
    "### `DataFrame.head`\n",
    "With all the functions and attributes mentioned earlier, we now have a high-level overview of the dataset. We know the total number of data points, the number of features, the data type of each feature, and the count of non-null values for each feature.\n",
    "\n",
    "Now it's time to examine the data itself. Let's take a look at the first few rows (the initial data points) of our `DataFrame`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "DZMJZh0OgRrw",
    "outputId": "d9393ee5-c106-4797-f815-218f17160e00",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal length (cm)</th>\n",
       "      <th>sepal width (cm)</th>\n",
       "      <th>petal length (cm)</th>\n",
       "      <th>petal width (cm)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
       "0                5.1               3.5                1.4               0.2\n",
       "1                4.9               3.0                1.4               0.2\n",
       "2                4.7               3.2                1.3               0.2\n",
       "3                4.6               3.1                1.5               0.2\n",
       "4                5.0               3.6                1.4               0.2"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EBHEimZuEFQK"
   },
   "source": [
    "As the output here, we can see five(5) entries of the dataset. If we look at the index at the left, we find out that these are the first five rows.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oj7GkrTdgRry"
   },
   "source": [
    "### Exercise:\n",
    "\n",
    "From the example given above, it is clear that, by default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out a way to display more than five rows?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true,
    "id": "EKRmRFFegRrz",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "# Hint: Consult the documentation by using iris_df.head?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BJ_cpZqNgRr1"
   },
   "source": [
    "### `DataFrame.tail`\n",
    "Another way to view the data is from the end (instead of the beginning). The counterpart to `DataFrame.head` is `DataFrame.tail`, which retrieves the last five rows of a `DataFrame`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 0
    },
    "id": "heanjfGWgRr2",
    "outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal length (cm)</th>\n",
       "      <th>sepal width (cm)</th>\n",
       "      <th>petal length (cm)</th>\n",
       "      <th>petal width (cm)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>6.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>6.3</td>\n",
       "      <td>2.5</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>148</th>\n",
       "      <td>6.2</td>\n",
       "      <td>3.4</td>\n",
       "      <td>5.4</td>\n",
       "      <td>2.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>5.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.1</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
       "145                6.7               3.0                5.2               2.3\n",
       "146                6.3               2.5                5.0               1.9\n",
       "147                6.5               3.0                5.2               2.0\n",
       "148                6.2               3.4                5.4               2.3\n",
       "149                5.9               3.0                5.1               1.8"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "31kBWfyLgRr3"
   },
   "source": [
    "In practice, it is helpful to quickly inspect the first few rows or the last few rows of a `DataFrame`, especially when searching for anomalies in ordered datasets.\n",
    "\n",
    "All the functions and attributes demonstrated above through code examples assist in providing an overview of the data.\n",
    "\n",
    "> **Takeaway:** Simply examining the metadata of a DataFrame or the first and last few values can give you a quick understanding of the size, structure, and content of the data you are working with.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TvurZyLSDxq_"
   },
   "source": [
    "### Missing Data\n",
    "Let’s explore the concept of missing data. Missing data occurs when certain columns have no value stored.\n",
    "\n",
    "For example, imagine someone is very conscious about their weight and chooses not to fill out the weight field in a survey. In this case, the weight value for that individual will be missing.\n",
    "\n",
    "In real-world datasets, missing values are quite common.\n",
    "\n",
    "**How Pandas Handles Missing Data**\n",
    "\n",
    "Pandas deals with missing values in two ways. The first method, which you’ve encountered in earlier sections, is `NaN` (Not a Number). This is a special value defined by the IEEE floating-point specification and is specifically used to represent missing floating-point values.\n",
    "\n",
    "For missing values that are not floats, pandas uses Python’s `None` object. While it might seem confusing to encounter two different types of values that essentially indicate the same thing, there are solid programming reasons behind this design choice. In practice, this approach allows pandas to strike a good balance for most use cases. However, it’s important to note that both `None` and `NaN` come with certain limitations regarding how they can be used.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lOHqUlZFgRr5"
   },
   "source": [
    "### `None`: non-float missing data\n",
    "Since `None` originates from Python, it cannot be used in NumPy and pandas arrays unless their data type is `'object'`. Keep in mind that NumPy arrays (and pandas data structures) are designed to hold only one type of data. This characteristic is what makes them incredibly powerful for handling large-scale data and performing computations, but it also restricts their flexibility. These arrays must convert to the \"lowest common denominator,\" which is the data type capable of accommodating all elements in the array. If `None` is present in the array, it indicates that you are working with Python objects.\n",
    "\n",
    "To illustrate this, take a look at the following example array (pay attention to its `dtype`):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "QIoNdY4ngRr7",
    "outputId": "92779f18-62f4-4a03-eca2-e9a101604336",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2, None, 6, 8], dtype=object)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "example1 = np.array([2, None, 6, 8])\n",
    "example1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pdlgPNbhgRr7"
   },
   "source": [
    "The reality of upcasting data types comes with two consequences. First, operations will be executed at the level of Python's interpreted code rather than NumPy's compiled code. In practice, this means that any operations involving `Series` or `DataFrames` containing `None` will be slower. While this performance impact might not be noticeable in smaller datasets, it could become problematic for larger ones.\n",
    "\n",
    "The second consequence is a direct result of the first. Since `None` essentially pulls `Series` or `DataFrames` back into the realm of standard Python, using NumPy/pandas aggregation functions like `sum()` or `min()` on arrays that include a `None` value will typically result in an error:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 292
    },
    "id": "gWbx-KB9gRr8",
    "outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50",
    "trusted": false
   },
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "ignored",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-10-ce9901ad18bd>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m     45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m     46\u001b[0m          initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m     \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n",
      "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'"
     ]
    }
   ],
   "source": [
    "example1.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LcEwO8UogRr9"
   },
   "source": [
    "**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pWvVHvETgRr9"
   },
   "source": [
    "### `NaN`: missing float values\n",
    "\n",
    "Unlike `None`, NumPy (and consequently pandas) supports `NaN` for its efficient, vectorized operations and ufuncs. The downside is that any arithmetic operation involving `NaN` will always yield `NaN`. For example:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "rcFYfMG9gRr9",
    "outputId": "699e81b7-5c11-4b46-df1d-06071768690f",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.nan + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "BW3zQD2-gRr-",
    "outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.nan * 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fU5IPRcCgRr-"
   },
   "source": [
    "The good news: aggregations run on arrays with `NaN` in them don't pop errors. The bad news: the results are not uniformly useful:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "LCInVgSSgRr_",
    "outputId": "fa06495a-0930-4867-87c5-6023031ea8b5",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(nan, nan, nan)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example2 = np.array([2, np.nan, 6, 8]) \n",
    "example2.sum(), example2.min(), example2.max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nhlnNJT7gRr_"
   },
   "source": [
    "### Exercise:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true,
    "id": "yan3QRaOgRr_",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "# What happens if you add np.nan and None together?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_iDvIRC8gRsA"
   },
   "source": [
    "Remember: `NaN` is just for missing floating-point values; there is no `NaN` equivalent for integers, strings, or Booleans.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kj6EKdsAgRsA"
   },
   "source": [
    "### `NaN` and `None`: null values in pandas\n",
    "\n",
    "Although `NaN` and `None` can act slightly differently, pandas is designed to treat them as equivalent. To understand this better, let's look at a `Series` of integers:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Nji-KGdNgRsA",
    "outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    2\n",
       "2    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "int_series = pd.Series([1, 2, 3], dtype=int)\n",
    "int_series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WklCzqb8gRsB"
   },
   "source": [
    "### Exercise:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true,
    "id": "Cy-gqX5-gRsB",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "# Now set an element of int_series equal to None.\n",
    "# How does that element show up in the Series?\n",
    "# What is the dtype of the Series?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WjMQwltNgRsB"
   },
   "source": [
    "When converting data types to ensure uniformity in `Series` and `DataFrame`s, pandas will seamlessly interchange missing values between `None` and `NaN`. Due to this design, it can be useful to think of `None` and `NaN` as two distinct types of \"null\" in pandas. In fact, some of the key methods for handling missing values in pandas reflect this concept in their names:\n",
    "\n",
    "- `isnull()`: Creates a Boolean mask to identify missing values\n",
    "- `notnull()`: The inverse of `isnull()`\n",
    "- `dropna()`: Produces a filtered version of the data\n",
    "- `fillna()`: Generates a copy of the data with missing values replaced or imputed\n",
    "\n",
    "These methods are essential to learn and become proficient with, so let's explore each one in detail.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Yh5ifd9FgRsB"
   },
   "source": [
    "### Detecting null values\n",
    "\n",
    "Now that we understand the significance of missing values, we need to identify them in our dataset before addressing them.  \n",
    "Both `isnull()` and `notnull()` are your main tools for detecting null data. Both methods return Boolean masks for your data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true,
    "id": "e-vFp5lvgRsC",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "example3 = pd.Series([0, np.nan, '', None])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "1XdaJJ7PgRsC",
    "outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1     True\n",
       "2    False\n",
       "3     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example3.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PaSZ0SQygRsC"
   },
   "source": [
    "Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's still a perfectly valid integer, and pandas treats it as such. `''` is a bit more nuanced. Although we used it in Section 1 to represent an empty string value, it is still a string object and not considered a null value by pandas.\n",
    "\n",
    "Now, let's approach this differently and use these methods in a way that's closer to how you'll apply them in practice. You can use Boolean masks directly as an index for a ``Series`` or ``DataFrame``, which can be helpful when working with specific missing (or present) values.\n",
    "\n",
    "If we want the total count of missing values, we can simply sum up the mask generated by the `isnull()` method.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "JCcQVoPkHDUv",
    "outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example3.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PlBqEo3mgRsC"
   },
   "source": [
    "### Exercise:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true,
    "id": "ggDVf5uygRsD",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "# Try running example3[example3.notnull()].\n",
    "# Before you do so, what do you expect to see?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "D_jWN7mHgRsD"
   },
   "source": [
    "**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in DataFrames: they show the results and the index of those results, which will help you enormously as you wrestle with your data.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BvnoojWsgRr4"
   },
   "source": [
    "### Dealing with missing data\n",
    "\n",
    "> **Learning goal:** By the end of this subsection, you should understand how and when to replace or remove null values from DataFrames.\n",
    "\n",
    "Machine Learning models cannot process missing data directly. Therefore, before feeding the data into the model, we need to address these missing values.\n",
    "\n",
    "The way missing data is handled involves subtle tradeoffs and can impact your final analysis as well as real-world outcomes.\n",
    "\n",
    "There are two main approaches to dealing with missing data:\n",
    "\n",
    "1.   Remove the row containing the missing value\n",
    "2.   Replace the missing value with another value\n",
    "\n",
    "We will explore both methods in detail, along with their advantages and disadvantages.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3VaYC1TvgRsD"
   },
   "source": [
    "### Dropping null values\n",
    "\n",
    "The amount of data we provide to our model directly impacts its performance. Dropping null values means reducing the number of data points, which in turn decreases the size of the dataset. Therefore, it is recommended to drop rows with null values when the dataset is sufficiently large.\n",
    "\n",
    "Another scenario could be when a specific row or column contains a significant amount of missing values. In such cases, they can be dropped since they wouldn't contribute much to our analysis due to the lack of sufficient data in that row or column.\n",
    "\n",
    "Beyond detecting missing values, pandas offers a convenient way to remove null values from `Series` and `DataFrame`s. To see this in practice, let's revisit `example3`. The `DataFrame.dropna()` function is useful for removing rows with null values.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "7uIvS097gRsD",
    "outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "2     \n",
       "dtype: object"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example3 = example3.dropna()\n",
    "example3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hil2cr64gRsD"
   },
   "source": [
    "Note that this should look like your output from `example3[example3.notnull()]`. The difference here is that, instead of simply indexing the masked values, `dropna` has eliminated those missing values from the `Series` `example3`.\n",
    "\n",
    "Since DataFrames are two-dimensional, they offer more possibilities for removing data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 142
    },
    "id": "an-l74sPgRsE",
    "outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.0</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "0  1.0  NaN  7\n",
       "1  2.0  5.0  8\n",
       "2  NaN  6.0  9"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example4 = pd.DataFrame([[1,      np.nan, 7], \n",
    "                         [2,      5,      8], \n",
    "                         [np.nan, 6,      9]])\n",
    "example4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "66wwdHZrgRsE"
   },
   "source": [
    "(Did you notice that pandas converted two of the columns to floats to handle the `NaN`s?)\n",
    "\n",
    "You cannot remove a single value from a `DataFrame`; instead, you have to remove entire rows or columns. Depending on your specific task, you might prefer one approach over the other, and pandas provides options for both. Since, in data science, columns typically represent variables and rows represent observations, it's more common to remove rows of data. The default behavior of `dropna()` is to remove all rows that contain any null values:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 80
    },
    "id": "jAVU24RXgRsE",
    "outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "1  2.0  5.0  8"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example4.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TrQRBuTDgRsE"
   },
   "source": [
    "If necessary, you can drop NA values from columns. Use `axis=1` to do so:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 142
    },
    "id": "GrBhxu9GgRsE",
    "outputId": "ff4001f3-2e61-4509-d60e-0093d1068437",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   2\n",
       "0  7\n",
       "1  8\n",
       "2  9"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example4.dropna(axis='columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KWXiKTfMgRsF"
   },
   "source": [
    "Notice that this can remove a significant amount of data that you might want to retain, especially in smaller datasets. What if you only want to drop rows or columns that contain several or even all null values? You can configure this behavior in `dropna` using the `how` and `thresh` parameters.\n",
    "\n",
    "By default, `how='any'` (if you'd like to verify this or explore other parameters of the method, run `example4.dropna?` in a code cell). Alternatively, you can set `how='all'` to drop only rows or columns that contain entirely null values. Let's expand our example `DataFrame` to observe this in action in the next exercise.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 142
    },
    "id": "Bcf_JWTsgRsF",
    "outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.0</td>\n",
       "      <td>9</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2   3\n",
       "0  1.0  NaN  7 NaN\n",
       "1  2.0  5.0  8 NaN\n",
       "2  NaN  6.0  9 NaN"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example4[3] = np.nan\n",
    "example4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pNZer7q9JPNC"
   },
   "source": [
    "> Key takeaways:  \n",
    "1. Removing null values is advisable only when the dataset is sufficiently large.  \n",
    "2. Entire rows or columns can be removed if the majority of their data is missing.  \n",
    "3. The `DataFrame.dropna(axis=)` method is useful for eliminating null values. The `axis` parameter determines whether rows or columns are removed.  \n",
    "4. The `how` parameter can also be utilized. By default, it is set to `any`, meaning it removes rows/columns containing any null values. It can be changed to `all` to specify that only rows/columns where all values are null should be removed.  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oXXSfQFHgRsF"
   },
   "source": [
    "### Exercise:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true,
    "id": "ExUwQRxpgRsF",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "# How might you go about dropping just column 3?\n",
    "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "38kwAihWgRsG"
   },
   "source": [
    "The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 80
    },
    "id": "M9dCNMaagRsG",
    "outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2   3\n",
       "1  2.0  5.0  8 NaN"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example4.dropna(axis='rows', thresh=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fmSFnzZegRsG"
   },
   "source": [
    "Here, the first and last row have been dropped, because they contain only two non-null values.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mCcxLGyUgRsG"
   },
   "source": [
    "### Filling null values\n",
    "\n",
    "Sometimes, it makes sense to replace missing values with ones that could be considered valid. There are several techniques to handle null values. The first involves using Domain Knowledge (expertise in the subject area related to the dataset) to estimate the missing values.\n",
    "\n",
    "You could use `isnull` to fill these values directly, but this can be tedious, especially if there are many values to replace. Since this is a common task in data science, pandas offers the `fillna` method, which creates a copy of the `Series` or `DataFrame` with the missing values replaced by a value of your choice. Let's create another example `Series` to see how this works in practice.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CE8S7louLezV"
   },
   "source": [
    "### Categorical Data (Non-numeric)\n",
    "First, let's look at non-numeric data. In datasets, there are columns containing categorical data, such as Gender, True or False, etc.\n",
    "\n",
    "In most cases, missing values in these columns are replaced with the `mode` of the column. For example, if we have 100 data points, where 90 indicate True, 8 indicate False, and 2 are missing, we can fill the 2 missing values with True, based on the overall distribution in the column.\n",
    "\n",
    "Additionally, domain knowledge can be applied here. Let's consider an example of filling missing values using the mode.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "MY5faq4yLdpQ",
    "outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0   1      2\n",
       "0  1   2   True\n",
       "1  3   4   None\n",
       "2  5   6  False\n",
       "3  7   8   True\n",
       "4  9  10   True"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n",
    "                               [3,4,None],\n",
    "                               [5,6,\"False\"],\n",
    "                               [7,8,\"True\"],\n",
    "                               [9,10,\"True\"]])\n",
    "\n",
    "fill_with_mode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MLAoMQOfNPlA"
   },
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "WKy-9Y2tN5jv",
    "outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True     3\n",
       "False    1\n",
       "Name: 2, dtype: int64"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill_with_mode[2].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6iNz_zG_OKrx"
   },
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "id": "TxPKteRvNPOs"
   },
   "outputs": [],
   "source": [
    "fill_with_mode[2].fillna('True',inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "tvas7c9_OPWE",
    "outputId": "ec3c8e44-d644-475e-9e22-c65101965850"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0   1      2\n",
       "0  1   2   True\n",
       "1  3   4   True\n",
       "2  5   6  False\n",
       "3  7   8   True\n",
       "4  9  10   True"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill_with_mode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SktitLxxOR16"
   },
   "source": [
    "As we can see, the null value has been replaced. Needless to say, we could have written anything in place of `'True'` and it would have got substituted.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "heYe1I0dOmQ_"
   },
   "source": [
    "### Numeric Data\n",
    "Now, let's talk about numeric data. Here, there are two common methods for handling missing values:\n",
    "\n",
    "1. Replace with the median of the row\n",
    "2. Replace with the mean of the row\n",
    "\n",
    "We use the median when the data is skewed and contains outliers. This is because the median is less affected by outliers.\n",
    "\n",
    "When the data is normalized, we can use the mean, as in such cases, the mean and median are usually very close.\n",
    "\n",
    "First, let's take a column that follows a normal distribution and fill the missing values with the mean of the column.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "09HM_2feOj5Y",
    "outputId": "7e309013-9acb-411c-9b06-4de795bbeeff"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-2.0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0  1  2\n",
       "0 -2.0  0  1\n",
       "1 -1.0  2  3\n",
       "2  NaN  4  5\n",
       "3  1.0  6  7\n",
       "4  2.0  8  9"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill_with_mean = pd.DataFrame([[-2,0,1],\n",
    "                               [-1,2,3],\n",
    "                               [np.nan,4,5],\n",
    "                               [1,6,7],\n",
    "                               [2,8,9]])\n",
    "\n",
    "fill_with_mean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ka7-wNfzSxbx"
   },
   "source": [
    "The mean of the column is\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "XYtYEf5BSxFL",
    "outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(fill_with_mean[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oBSRGxKRS39K"
   },
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "FzncQLmuS5jh",
    "outputId": "00f74fff-01f4-4024-c261-796f50f01d2e"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-2.0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0  1  2\n",
       "0 -2.0  0  1\n",
       "1 -1.0  2  3\n",
       "2  0.0  4  5\n",
       "3  1.0  6  7\n",
       "4  2.0  8  9"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n",
    "fill_with_mean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CwpVFCrPTC5z"
   },
   "source": [
    "As we can see, the missing value has been replaced with its mean.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jIvF13a1i00Z"
   },
   "source": [
    "Now let us try another dataframe, and this time we will replace the None values with the median of the column.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "DA59Bqo3jBYZ",
    "outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-2</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>6.0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>8.0</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0    1  2\n",
       "0 -2  0.0  1\n",
       "1 -1  2.0  3\n",
       "2  0  NaN  5\n",
       "3  1  6.0  7\n",
       "4  2  8.0  9"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill_with_median = pd.DataFrame([[-2,0,1],\n",
    "                               [-1,2,3],\n",
    "                               [0,np.nan,5],\n",
    "                               [1,6,7],\n",
    "                               [2,8,9]])\n",
    "\n",
    "fill_with_median"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mM1GpXYmjHnc"
   },
   "source": [
    "The median of the second column is\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "uiDy5v3xjHHX",
    "outputId": "564b6b74-2004-4486-90d4-b39330a64b88"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4.0"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill_with_median[1].median()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z9PLF75Jj_1s"
   },
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "lFKbOxCMkBbg",
    "outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-2</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>6.0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>8.0</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0    1  2\n",
       "0 -2  0.0  1\n",
       "1 -1  2.0  3\n",
       "2  0  4.0  5\n",
       "3  1  6.0  7\n",
       "4  2  8.0  9"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n",
    "fill_with_median"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8JtQ53GSkKWC"
   },
   "source": [
    "As we can see, the NaN value has been replaced by the median of the column\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "0ybtWLDdgRsG",
    "outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    1.0\n",
       "b    NaN\n",
       "c    2.0\n",
       "d    NaN\n",
       "e    3.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n",
    "example5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yrsigxRggRsH"
   },
   "source": [
    "You can fill all of the null entries with a single value, such as `0`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "KXMIPsQdgRsH",
    "outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    1.0\n",
       "b    0.0\n",
       "c    2.0\n",
       "d    0.0\n",
       "e    3.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example5.fillna(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RRlI5f_hkfKe"
   },
   "source": [
    "> Key takeaways:\n",
    "1. Missing values should be filled in when there is limited data or a clear strategy for handling the missing information.\n",
    "2. Domain expertise can help approximate and fill in missing values effectively.\n",
    "3. For categorical data, missing values are often replaced with the most frequent value (mode) of the column.\n",
    "4. For numerical data, missing values are typically filled with the mean (for normalized datasets) or the median of the column.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FI9MmqFJgRsH"
   },
   "source": [
    "### Exercise:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true,
    "id": "af-ezpXdgRsH",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "# What happens if you try to fill null values with a string, like ''?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kq3hw1kLgRsI"
   },
   "source": [
    "You can **forward-fill** null values, which is to use the last valid value to fill a null:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "vO3BuNrggRsI",
    "outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    1.0\n",
       "b    1.0\n",
       "c    2.0\n",
       "d    2.0\n",
       "e    3.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example5.fillna(method='ffill')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nDXeYuHzgRsI"
   },
   "source": [
    "You can also **back-fill** to propagate the next valid value backward to fill a null:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4M5onHcEgRsI",
    "outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    1.0\n",
       "b    2.0\n",
       "c    2.0\n",
       "d    3.0\n",
       "e    3.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example5.fillna(method='bfill')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "id": "MbBzTom5gRsI"
   },
   "source": [
    "As you might guess, this works the same with DataFrames, but you can also specify an `axis` along which to fill null values:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 142
    },
    "id": "aRpIvo4ZgRsI",
    "outputId": "905a980a-a808-4eca-d0ba-224bd7d85955",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.0</td>\n",
       "      <td>9</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2   3\n",
       "0  1.0  NaN  7 NaN\n",
       "1  2.0  5.0  8 NaN\n",
       "2  NaN  6.0  9 NaN"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 142
    },
    "id": "VM1qtACAgRsI",
    "outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2    3\n",
       "0  1.0  1.0  7.0  7.0\n",
       "1  2.0  5.0  8.0  8.0\n",
       "2  NaN  6.0  9.0  9.0"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example4.fillna(method='ffill', axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZeMc-I1EgRsI"
   },
   "source": [
    "Notice that when a previous value is not available for forward-filling, the null value remains.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eeAoOU0RgRsJ"
   },
   "source": [
    "### Exercise:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": true,
    "id": "e8S-CjW8gRsJ",
    "trusted": false
   },
   "outputs": [],
   "source": [
    "# What output does example4.fillna(method='bfill', axis=1) produce?\n",
    "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n",
    "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YHgy0lIrgRsJ"
   },
   "source": [
    "You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values with the average of all of the values in the `DataFrame`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 142
    },
    "id": "OtYVErEygRsJ",
    "outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>5.5</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.5</td>\n",
       "      <td>6.0</td>\n",
       "      <td>9</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2   3\n",
       "0  1.0  5.5  7 NaN\n",
       "1  2.0  5.0  8 NaN\n",
       "2  1.5  6.0  9 NaN"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example4.fillna(example4.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zpMvCkLSgRsJ"
   },
   "source": [
    "Notice that column 3 is still empty: the default approach is to fill values row by row.\n",
    "\n",
    "> **Key point:** There are various methods to address missing values in your datasets. The approach you choose (whether to remove them, replace them, or decide how to replace them) should depend on the specific characteristics of the data. The more you work with and analyze datasets, the better you'll become at handling missing values effectively.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bauDnESIl9FH"
   },
   "source": [
    "### Encoding Categorical Data\n",
    "\n",
    "Machine learning models work exclusively with numbers and any type of numeric data. They cannot differentiate between \"Yes\" and \"No,\" but they can distinguish between 0 and 1. Therefore, after addressing missing values, we need to convert categorical data into a numeric format so the model can interpret it.\n",
    "\n",
    "There are two methods for encoding. We will discuss them below.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uDq9SxB7mu5i"
   },
   "source": [
    "**LABEL ENCODING**\n",
    "\n",
    "Label encoding involves converting each category into a numerical value. For instance, imagine we have a dataset of airline passengers with a column indicating their class, such as ['business class', 'economy class', 'first class']. After applying label encoding, this would be transformed into [0, 1, 2]. Let's look at an example using code. Since we will be learning `scikit-learn` in the upcoming notebooks, we won't use it here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 235
    },
    "id": "1vGz7uZyoWHL",
    "outputId": "9e252855-d193-4103-a54d-028ea7787b34"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>business class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>first class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30</td>\n",
       "      <td>economy class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>economy class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50</td>\n",
       "      <td>economy class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>60</td>\n",
       "      <td>business class</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID           class\n",
       "0  10  business class\n",
       "1  20     first class\n",
       "2  30   economy class\n",
       "3  40   economy class\n",
       "4  50   economy class\n",
       "5  60  business class"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label = pd.DataFrame([\n",
    "                      [10,'business class'],\n",
    "                      [20,'first class'],\n",
    "                      [30, 'economy class'],\n",
    "                      [40, 'economy class'],\n",
    "                      [50, 'economy class'],\n",
    "                      [60, 'business class']\n",
    "],columns=['ID','class'])\n",
    "label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IDHnkwTYov-h"
   },
   "source": [
    "To perform label encoding on the 1st column, we have to first describe a mapping from each class to a number, before replacing\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 235
    },
    "id": "ZC5URJG3o1ES",
    "outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>60</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID  class\n",
       "0  10      0\n",
       "1  20      2\n",
       "2  30      1\n",
       "3  40      1\n",
       "4  50      1\n",
       "5  60      0"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class_labels = {'business class':0,'economy class':1,'first class':2}\n",
    "label['class'] = label['class'].replace(class_labels)\n",
    "label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ftnF-TyapOPt"
   },
   "source": [
    "As we can see, the output aligns with our expectations. So, when should we use label encoding? Label encoding is applied in one or both of the following situations:\n",
    "1. When there are a large number of categories\n",
    "2. When the categories have a specific order.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eQPAPVwsqWT7"
   },
   "source": [
    "**ONE HOT ENCODING**\n",
    "\n",
    "Another type of encoding is One Hot Encoding. In this method, each category in the column is represented as a separate column, and each data point is assigned a 0 or a 1 depending on whether it belongs to that category. This means that if there are n distinct categories, n columns will be added to the dataframe.\n",
    "\n",
    "For example, let’s consider the same airplane class example. The categories were: ['business class', 'economy class', 'first class']. If we apply one hot encoding, the following three columns will be added to the dataset: ['class_business class', 'class_economy class', 'class_first class'].\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 235
    },
    "id": "ZM0eVh0ArKUL",
    "outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>business class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>first class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30</td>\n",
       "      <td>economy class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>economy class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50</td>\n",
       "      <td>economy class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>60</td>\n",
       "      <td>business class</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID           class\n",
       "0  10  business class\n",
       "1  20     first class\n",
       "2  30   economy class\n",
       "3  40   economy class\n",
       "4  50   economy class\n",
       "5  60  business class"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "one_hot = pd.DataFrame([\n",
    "                      [10,'business class'],\n",
    "                      [20,'first class'],\n",
    "                      [30, 'economy class'],\n",
    "                      [40, 'economy class'],\n",
    "                      [50, 'economy class'],\n",
    "                      [60, 'business class']\n",
    "],columns=['ID','class'])\n",
    "one_hot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aVnZ7paDrWmb"
   },
   "source": [
    "Let us perform one hot encoding on the 1st column\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "id": "RUPxf7egrYKr"
   },
   "outputs": [],
   "source": [
    "one_hot_data = pd.get_dummies(one_hot,columns=['class'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 235
    },
    "id": "TM37pHsFr4ge",
    "outputId": "7be15f53-79b2-447a-979c-822658339a9e"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>class_business class</th>\n",
       "      <th>class_economy class</th>\n",
       "      <th>class_first class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>30</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>60</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID  class_business class  class_economy class  class_first class\n",
       "0  10                     1                    0                  0\n",
       "1  20                     0                    0                  1\n",
       "2  30                     0                    1                  0\n",
       "3  40                     0                    1                  0\n",
       "4  50                     0                    1                  0\n",
       "5  60                     1                    0                  0"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "one_hot_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_zXRLOjXujdA"
   },
   "source": [
    "Each one hot encoded column contains 0 or 1, which specifies whether that category exists for that datapoint.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bDnC4NQOu0qr"
   },
   "source": [
    "When is one hot encoding used? One hot encoding is applied in one or both of the following situations:\n",
    "\n",
    "1. When the number of categories and the dataset size are relatively small.\n",
    "2. When the categories do not have any specific order.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XnUmci_4uvyu"
   },
   "source": [
    "> Key Takeaways:\n",
    "1. Encoding is used to transform non-numeric data into numeric data.\n",
    "2. There are two types of encoding: Label encoding and One Hot encoding, and either can be applied depending on the dataset's requirements.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "K8UXOJYRgRsJ"
   },
   "source": [
    "## Removing duplicate data\n",
    "\n",
    "> **Learning goal:** By the end of this subsection, you should feel confident identifying and removing duplicate values from DataFrames.\n",
    "\n",
    "Along with missing data, duplicated data is a common issue in real-world datasets. Luckily, pandas offers a straightforward way to detect and eliminate duplicate entries.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qrEG-Wa0gRsJ"
   },
   "source": [
    "### Identifying duplicates: `duplicated`\n",
    "\n",
    "You can quickly identify duplicate values using the `duplicated` method in pandas. This method provides a Boolean mask that shows whether an entry in a `DataFrame` is a duplicate of a previous one. Let's create another example `DataFrame` to demonstrate this.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "ZLu6FEnZgRsJ",
    "outputId": "376512d1-d842-4db1-aea3-71052aeeecaf",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letters</th>\n",
       "      <th>numbers</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letters  numbers\n",
       "0       A        1\n",
       "1       B        2\n",
       "2       A        1\n",
       "3       B        3\n",
       "4       B        3"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n",
    "                         'numbers': [1, 2, 1, 3, 3]})\n",
    "example6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "cIduB5oBgRsK",
    "outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2     True\n",
       "3    False\n",
       "4     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example6.duplicated()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0eDRJD4SgRsK"
   },
   "source": [
    "### Dropping duplicates: `drop_duplicates`\n",
    "`drop_duplicates` returns a copy of the data where all `duplicated` values are `False`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 142
    },
    "id": "w_YPpqIqgRsK",
    "outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letters</th>\n",
       "      <th>numbers</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letters  numbers\n",
       "0       A        1\n",
       "1       B        2\n",
       "3       B        3"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example6.drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "69AqoCZAgRsK"
   },
   "source": [
    "Both `duplicated` and `drop_duplicates` default to consider all columns but you can specify that they examine only a subset of columns in your `DataFrame`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 111
    },
    "id": "BILjDs67gRsK",
    "outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3",
    "trusted": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letters</th>\n",
       "      <th>numbers</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letters  numbers\n",
       "0       A        1\n",
       "1       B        2"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example6.drop_duplicates(['letters'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GvX4og1EgRsL"
   },
   "source": [
    "**Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you inaccurate results!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**Disclaimer**:  \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "colab": {
   "name": "notebook.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  },
  "coopTranslator": {
   "original_hash": "8533b3a2230311943339963fc7f04c21",
   "translation_date": "2025-09-03T20:41:17+00:00",
   "source_file": "2-Working-With-Data/08-data-preparation/notebook.ipynb",
   "language_code": "en"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}