From 503468f6ea6ba12e59a63b05165e94209c2d14e4 Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Mon, 4 Oct 2021 09:56:58 +0530 Subject: [PATCH 1/8] Added DataFrame.shape and DataFrame.columns --- .../08-data-preparation/notebook.ipynb | 965 +++++++++++------- 1 file changed, 605 insertions(+), 360 deletions(-) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index e45a5cb5..b1c1d7a0 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -1,318 +1,510 @@ { + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "name": "python3", + "display_name": "Python 3", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "3.5.4", + "pygments_lexer": "ipython3", + "codemirror_mode": { + "version": 3, + "name": "ipython" + } + }, + "colab": { + "name": "notebook.ipynb", + "provenance": [] + } + }, "cells": [ { "cell_type": "markdown", + "metadata": { + "id": "rQ8UhzFpgRra" + }, "source": [ - "# Data Preparation\r\n", - "\r\n", - "[Original Notebook source from *Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\r\n", - "\r\n", - "## Exploring `DataFrame` information\r\n", - "\r\n", - "> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.\r\n", - "\r\n", - "Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. 
However, if the data set in your `DataFrame` has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? Fortunately, pandas provides some convenient tools to quickly look at overall information about a `DataFrame` in addition to the first few and last few rows.\r\n", - "\r\n", + "# Data Preparation\n", + "\n", + "[Original Notebook source from *Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n", + "\n", + "## Exploring `DataFrame` information\n", + "\n", + "> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.\n", + "\n", + "Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. However, if the data set in your `DataFrame` has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? 
Fortunately, pandas provides some convenient tools to quickly look at overall information about a `DataFrame` in addition to the first few and last few rows.\n", + "\n", "In order to explore this functionality, we will import the Python scikit-learn library and use an iconic dataset that every data scientist has seen hundreds of times: British biologist Ronald Fisher's *Iris* data set used in his 1936 paper \"The use of multiple measurements in taxonomic problems\":" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "collapsed": true, + "trusted": false, + "id": "hB1RofhdgRrp" + }, "source": [ - "import pandas as pd\r\n", - "from sklearn.datasets import load_iris\r\n", - "\r\n", - "iris = load_iris()\r\n", + "import pandas as pd\n", + "from sklearn.datasets import load_iris\n", + "\n", + "iris = load_iris()\n", "iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])" ], - "outputs": [], + "execution_count": 1, + "outputs": [] + }, + { + "cell_type": "markdown", "metadata": { - "collapsed": true, - "trusted": false - } + "id": "AGA0A_Y8hMdz" + }, + "source": [ + "### `DataFrame.shape`\n", + "We have loaded the Iris dataset into the variable `iris_df`. Before diving into the data, it is valuable to know how many datapoints we have and the overall size of the dataset. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LOe5jQohhulf", + "outputId": "9cf67a6a-5779-453b-b2ed-58f4f1aab507", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "iris_df.shape" + ], + "execution_count": 2, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(150, 4)" + ] + }, + "metadata": {}, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "smE7AGzOhxk2" + }, + "source": [ + "So, we are dealing with 150 rows and 4 columns of data. 
Each row represents one datapoint and each column represents a single feature of the dataset. In other words, there are 150 datapoints containing 4 features each.\n", + "\n", + "`shape` is an attribute of the `DataFrame`, not a method, which is why it is not followed by a pair of parentheses. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d3AZKs0PinGP" + }, + "source": [ + "### `DataFrame.columns`\n", + "Let us now look at the 4 columns of data. What exactly does each of them represent? The `columns` attribute gives us the names of the columns in the `DataFrame`. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YPGh_ziji-CY", + "outputId": "ca186194-a126-4348-f58e-aab7ebc8f7b7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "iris_df.columns" + ], + "execution_count": 4, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n", + " 'petal width (cm)'],\n", + " dtype='object')" + ] + }, + "metadata": {}, + "execution_count": 4 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsobcU_VjCC_" + }, + "source": [ + "As we can see, there are four columns. The `columns` attribute tells us the names of the columns and nothing else. It becomes important when we want to identify the features a dataset contains." 
+ ] }, { "cell_type": "markdown", + "metadata": { + "id": "2UTlvkjmgRrs" + }, "source": [ - "### `DataFrame.info`\r\n", + "### `DataFrame.info`\n", "Let's take a look at this dataset to see what we have:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "dHHRyG0_gRrt", + "outputId": "ca9de335-9e65-486a-d1e2-3e73d060c701", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, "source": [ "iris_df.info()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "RangeIndex: 150 entries, 0 to 149\n", + "Data columns (total 4 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 sepal length (cm) 150 non-null float64\n", + " 1 sepal width (cm) 150 non-null float64\n", + " 2 petal length (cm) 150 non-null float64\n", + " 3 petal width (cm) 150 non-null float64\n", + "dtypes: float64(4)\n", + "memory usage: 4.8 KB\n" + ] + } + ] }, { "cell_type": "markdown", + "metadata": { + "id": "1XgVMpvigRru" + }, "source": [ "From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers." 
- ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "-lviAu99gRrv" + }, "source": [ - "### `DataFrame.head`\r\n", + "### `DataFrame.head`\n", "Next, let's see what the first few rows of our `DataFrame` look like:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "DZMJZh0OgRrw" + }, "source": [ "iris_df.head()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "oj7GkrTdgRry" + }, "source": [ - "### Exercise:\r\n", - "\r\n", + "### Exercise:\n", + "\n", "By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "collapsed": true, + "trusted": false, + "id": "EKRmRFFegRrz" + }, "source": [ "# Hint: Consult the documentation by using iris_df.head?" 
], - "outputs": [], - "metadata": { - "collapsed": true, - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "BJ_cpZqNgRr1" + }, "source": [ - "### `DataFrame.tail`\r\n", + "### `DataFrame.tail`\n", "The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "heanjfGWgRr2" + }, "source": [ "iris_df.tail()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "31kBWfyLgRr3" + }, "source": [ - "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.\r\n", - "\r\n", + "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.\n", + "\n", "> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "BvnoojWsgRr4" + }, "source": [ - "## Dealing with missing data\r\n", - "\r\n", - "> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.\r\n", - "\r\n", - "Most of the time the datasets you want to use (of have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.\r\n", - "\r\n", - "Pandas handles missing values in two ways. 
The first you've seen before in previous sections: `NaN`, or Not a Number. This is a actually a special value that is part of the IEEE floating-point specification and it is only used to indicate missing floating-point values.\r\n", - "\r\n", + "## Dealing with missing data\n", + "\n", + "> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.\n", + "\n", + "Most of the time the datasets you want to use (or have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.\n", + "\n", + "Pandas handles missing values in two ways. The first you've seen before in previous sections: `NaN`, or Not a Number. This is actually a special value that is part of the IEEE floating-point specification and it is only used to indicate missing floating-point values.\n", + "\n", "For missing values apart from floats, pandas uses the Python `None` object. While it might seem confusing that you will encounter two different kinds of values that say essentially the same thing, there are sound programmatic reasons for this design choice and, in practice, going this route enables pandas to deliver a good compromise for the vast majority of cases. Notwithstanding this, both `None` and `NaN` carry restrictions that you need to be mindful of with regards to how they can be used." 
Such arrays have to upcast to the “lowest common denominator,” the data type that will encompass everything in the array. When `None` is in the array, it means you are working with Python objects.\r\n", - "\r\n", + "### `None`: non-float missing data\n", + "Because `None` comes from Python, it cannot be used in NumPy and pandas arrays that are not of data type `'object'`. Remember, NumPy arrays (and the data structures in pandas) can contain only one type of data. This is what gives them their tremendous power for large-scale data and computational work, but it also limits their flexibility. Such arrays have to upcast to the “lowest common denominator,” the data type that will encompass everything in the array. When `None` is in the array, it means you are working with Python objects.\n", + "\n", "To see this in action, consider the following example array (note the `dtype` for it):" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "QIoNdY4ngRr7" + }, "source": [ - "import numpy as np\r\n", - "\r\n", - "example1 = np.array([2, None, 6, 8])\r\n", + "import numpy as np\n", + "\n", + "example1 = np.array([2, None, 6, 8])\n", "example1" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "pdlgPNbhgRr7" + }, "source": [ "The reality of upcast data types carries two side effects with it. First, operations will be carried out at the level of interpreted Python code rather than compiled NumPy code. Essentially, this means that any operations involving `Series` or `DataFrames` with `None` in them will be slower. While you would probably not notice this performance hit, for large datasets it might become an issue.\n", "\n", "The second side effect stems from the first. 
Because `None` essentially drags `Series` or `DataFrame`s back into the world of vanilla Python, using NumPy/pandas aggregations like `sum()` or `min()` on arrays that contain a ``None`` value will generally produce an error:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "gWbx-KB9gRr8" + }, "source": [ "example1.sum()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "LcEwO8UogRr9" + }, "source": [ "**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "pWvVHvETgRr9" + }, "source": [ "### `NaN`: missing float values\n", "\n", "In contrast to `None`, NumPy (and therefore pandas) supports `NaN` for its fast, vectorized operations and ufuncs. The bad news is that any arithmetic performed on `NaN` always results in `NaN`. For example:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "rcFYfMG9gRr9" + }, "source": [ "np.nan + 1" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "BW3zQD2-gRr-" + }, "source": [ "np.nan * 0" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "fU5IPRcCgRr-" + }, "source": [ "The good news: aggregations run on arrays with `NaN` in them don't pop errors. 
The bad news: the results are not uniformly useful:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "LCInVgSSgRr_" + }, "source": [ - "example2 = np.array([2, np.nan, 6, 8]) \r\n", + "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "nhlnNJT7gRr_" + }, "source": [ "### Exercise:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, - "source": [ - "# What happens if you add np.nan and None together?\r\n" - ], - "outputs": [], "metadata": { "collapsed": true, - "trusted": false - } + "trusted": false, + "id": "yan3QRaOgRr_" + }, + "source": [ + "# What happens if you add np.nan and None together?\n" + ], + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "_iDvIRC8gRsA" + }, "source": [ "Remember: `NaN` is just for missing floating-point values; there is no `NaN` equivalent for integers, strings, or Booleans." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "kj6EKdsAgRsA" + }, "source": [ "### `NaN` and `None`: null values in pandas\n", "\n", "Even though `NaN` and `None` can behave somewhat differently, pandas is nevertheless built to handle them interchangeably. 
To see what we mean, consider a `Series` of integers:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "Nji-KGdNgRsA" + }, "source": [ - "int_series = pd.Series([1, 2, 3], dtype=int)\r\n", + "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "WklCzqb8gRsB" + }, "source": [ "### Exercise:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, - "source": [ - "# Now set an element of int_series equal to None.\r\n", - "# How does that element show up in the Series?\r\n", - "# What is the dtype of the Series?\r\n" - ], - "outputs": [], "metadata": { "collapsed": true, - "trusted": false - } + "trusted": false, + "id": "Cy-gqX5-gRsB" + }, + "source": [ + "# Now set an element of int_series equal to None.\n", + "# How does that element show up in the Series?\n", + "# What is the dtype of the Series?\n" + ], + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "WjMQwltNgRsB" + }, "source": [ "In the process of upcasting data types to establish data homogeneity in `Seires` and `DataFrame`s, pandas will willingly switch missing values between `None` and `NaN`. Because of this design feature, it can be helpful to think of `None` and `NaN` as two different flavors of \"null\" in pandas. Indeed, some of the core methods you will use to deal with missing values in pandas reflect this idea in their names:\n", "\n", @@ -322,513 +514,566 @@ "- `fillna()`: Returns a copy of the data with missing values filled or imputed\n", "\n", "These are important methods to master and get comfortable with, so let's go over them each in some depth." 
- ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "Yh5ifd9FgRsB" + }, "source": [ "### Detecting null values\n", "Both `isnull()` and `notnull()` are your primary methods for detecting null data. Both return Boolean masks over your data." - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "collapsed": true, + "trusted": false, + "id": "e-vFp5lvgRsC" + }, "source": [ "example3 = pd.Series([0, np.nan, '', None])" ], - "outputs": [], - "metadata": { - "collapsed": true, - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "1XdaJJ7PgRsC" + }, "source": [ "example3.isnull()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "PaSZ0SQygRsC" + }, "source": [ "Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's nevertheless a perfectly good integer and pandas treats it as such. `''` is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.\n", "\n", "Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values." 
- ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "PlBqEo3mgRsC" + }, "source": [ "### Exercise:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, - "source": [ - "# Try running example3[example3.notnull()].\r\n", - "# Before you do so, what do you expect to see?\r\n" - ], - "outputs": [], "metadata": { "collapsed": true, - "trusted": false - } + "trusted": false, + "id": "ggDVf5uygRsD" + }, + "source": [ + "# Try running example3[example3.notnull()].\n", + "# Before you do so, what do you expect to see?\n" + ], + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "D_jWN7mHgRsD" + }, "source": [ "**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "3VaYC1TvgRsD" + }, "source": [ "### Dropping null values\n", "\n", "Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) To see this in action, let's return to `example3`:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "7uIvS097gRsD" + }, "source": [ - "example3 = example3.dropna()\r\n", + "example3 = example3.dropna()\n", "example3" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "hil2cr64gRsD" + }, "source": [ "Note that this should look like your output from `example3[example3.notnull()]`. 
The difference here is that, rather than just indexing on the masked values, `dropna` has removed those missing values from the `Series` `example3`.\n", "\n", "Because `DataFrame`s have two dimensions, they afford more options for dropping data." - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "an-l74sPgRsE" + }, "source": [ - "example4 = pd.DataFrame([[1, np.nan, 7], \r\n", - " [2, 5, 8], \r\n", - " [np.nan, 6, 9]])\r\n", + "example4 = pd.DataFrame([[1, np.nan, 7], \n", + " [2, 5, 8], \n", + " [np.nan, 6, 9]])\n", "example4" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "66wwdHZrgRsE" + }, "source": [ "(Did you notice that pandas upcast two of the columns to floats to accommodate the `NaN`s?)\n", "\n", "You cannot drop a single value from a `DataFrame`, so you have to drop full rows or columns. Depending on what you are doing, you might want to do one or the other, and so pandas gives you options for both. Because in data science, columns generally represent variables and rows represent observations, you are more likely to drop rows of data; the default setting for `dropna()` is to drop all rows that contain any null values:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "jAVU24RXgRsE" + }, "source": [ "example4.dropna()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "TrQRBuTDgRsE" + }, "source": [ "If necessary, you can drop NA values from columns. 
Use `axis=1` to do so:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "GrBhxu9GgRsE" + }, "source": [ "example4.dropna(axis='columns')" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "KWXiKTfMgRsF" + }, "source": [ "Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those setting in `dropna` with the `how` and `thresh` parameters.\n", "\n", "By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action." 
- ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "Bcf_JWTsgRsF" + }, "source": [ - "example4[3] = np.nan\r\n", + "example4[3] = np.nan\n", "example4" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "oXXSfQFHgRsF" + }, "source": [ "### Exercise:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, - "source": [ - "# How might you go about dropping just column 3?\r\n", - "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\r\n" - ], - "outputs": [], "metadata": { "collapsed": true, - "trusted": false - } + "trusted": false, + "id": "ExUwQRxpgRsF" + }, + "source": [ + "# How might you go about dropping just column 3?\n", + "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" + ], + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "38kwAihWgRsG" + }, "source": [ "The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "M9dCNMaagRsG" + }, "source": [ "example4.dropna(axis='rows', thresh=3)" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "fmSFnzZegRsG" + }, "source": [ "Here, the first and last row have been dropped, because they contain only two non-null values." 
- ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "mCcxLGyUgRsG" + }, "source": [ "### Filling null values\n", "\n", "Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice." - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "0ybtWLDdgRsG" + }, "source": [ - "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\r\n", + "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "example5" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "yrsigxRggRsH" + }, "source": [ "You can fill all of the null entries with a single value, such as `0`:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "KXMIPsQdgRsH" + }, "source": [ "example5.fillna(0)" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "FI9MmqFJgRsH" + }, "source": [ "### Exercise:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, - "source": [ - "# What happens if you try to fill null values with a string, like ''?\r\n" - ], - "outputs": [], "metadata": { "collapsed": true, - "trusted": false - } + "trusted": false, + "id": "af-ezpXdgRsH" + }, + "source": [ + "# What happens if you try to fill null values with a string, like ''?\n" + 
], + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "kq3hw1kLgRsI" + }, "source": [ "You can **forward-fill** null values, which is to use the last valid value to fill a null:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "vO3BuNrggRsI" + }, "source": [ "example5.fillna(method='ffill')" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "nDXeYuHzgRsI" + }, "source": [ "You can also **back-fill** to propagate the next valid value backward to fill a null:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "4M5onHcEgRsI" + }, "source": [ "example5.fillna(method='bfill')" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "collapsed": true, + "id": "MbBzTom5gRsI" + }, "source": [ "As you might guess, this works the same with `DataFrame`s, but you can also specify an `axis` along which to fill null values:" - ], - "metadata": { - "collapsed": true - } + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "aRpIvo4ZgRsI" + }, "source": [ "example4" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "VM1qtACAgRsI" + }, "source": [ "example4.fillna(method='ffill', axis=1)" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "ZeMc-I1EgRsI" + }, "source": [ "Notice that when a previous value is not available for forward-filling, the null value remains." 
- ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "eeAoOU0RgRsJ" + }, "source": [ "### Exercise:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, - "source": [ - "# What output does example4.fillna(method='bfill', axis=1) produce?\r\n", - "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\r\n", - "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\r\n" - ], - "outputs": [], "metadata": { "collapsed": true, - "trusted": false - } + "trusted": false, + "id": "e8S-CjW8gRsJ" + }, + "source": [ + "# What output does example4.fillna(method='bfill', axis=1) produce?\n", + "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", + "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" + ], + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "YHgy0lIrgRsJ" + }, "source": [ "You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values with the average of all of the values in the `DataFrame`:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "OtYVErEygRsJ" + }, "source": [ "example4.fillna(example4.mean())" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "zpMvCkLSgRsJ" + }, "source": [ "Notice that column 3 is still valueless: the default direction is to fill values row-wise.\n", "\n", "> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. 
You will develop a better sense of how to deal with missing values the more you handle and interact with datasets." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "K8UXOJYRgRsJ" + }, "source": [ "## Removing duplicate data\n", "\n", "> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.\n", "\n", "In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries." - ], - "metadata": {} + ] }, { "cell_type": "markdown", + "metadata": { + "id": "qrEG-Wa0gRsJ" + }, "source": [ "### Identifying duplicates: `duplicated`\n", "\n", "You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an earlier one. Let's create another example `DataFrame` to see this in action."
- ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "ZLu6FEnZgRsJ" + }, "source": [ - "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\r\n", - " 'numbers': [1, 2, 1, 3, 3]})\r\n", + "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", + " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "cIduB5oBgRsK" + }, "source": [ "example6.duplicated()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "0eDRJD4SgRsK" + }, "source": [ "### Dropping duplicates: `drop_duplicates`\n", "`drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are `False`:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "w_YPpqIqgRsK" + }, "source": [ "example6.drop_duplicates()" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "69AqoCZAgRsK" + }, "source": [ "Both `duplicated` and `drop_duplicates` default to considering all columns, but you can specify that they examine only a subset of columns in your `DataFrame`:" - ], - "metadata": {} + ] }, { "cell_type": "code", - "execution_count": null, + "metadata": { + "trusted": false, + "id": "BILjDs67gRsK" + }, "source": [ "example6.drop_duplicates(['letters'])" ], - "outputs": [], - "metadata": { - "trusted": false - } + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", + "metadata": { + "id": "GvX4og1EgRsL" + }, "source": [ "> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project.
Duplicate data can change the results of your analyses and give you inaccurate results!" - ], - "metadata": {} + ] } - ], - "metadata": { - "anaconda-cloud": {}, - "kernelspec": { - "name": "python3", - "display_name": "Python 3", - "language": "python" - }, - "language_info": { - "mimetype": "text/x-python", - "nbconvert_exporter": "python", - "name": "python", - "file_extension": ".py", - "version": "3.5.4", - "pygments_lexer": "ipython3", - "codemirror_mode": { - "version": 3, - "name": "ipython" - } - } - }, - "nbformat": 4, - "nbformat_minor": 1 + ] } \ No newline at end of file From 943172fd55d7d9b08d8bee906086cf43402041af Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Mon, 4 Oct 2021 21:50:02 +0530 Subject: [PATCH 2/8] Added DataFrame.describe() and elaborated on some of the existing explanations. --- .../08-data-preparation/notebook.ipynb | 372 ++++++++++++++++-- 1 file changed, 350 insertions(+), 22 deletions(-) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index b1c1d7a0..c6ca05dc 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -76,10 +76,10 @@ "cell_type": "code", "metadata": { "id": "LOe5jQohhulf", - "outputId": "9cf67a6a-5779-453b-b2ed-58f4f1aab507", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "968f9fb0-6cb7-4985-c64b-b332c086bdbf" }, "source": [ "iris_df.shape" @@ -123,15 +123,15 @@ "cell_type": "code", "metadata": { "id": "YPGh_ziji-CY", - "outputId": "ca186194-a126-4348-f58e-aab7ebc8f7b7", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "ffad1c9f-06b4-49d9-b409-5e4cc1b9f19b" }, "source": [ "iris_df.columns" ], - "execution_count": 4, + "execution_count": 3, "outputs": [ { "output_type": "execute_result", @@ -143,7 +143,7 @@ ] }, "metadata": {}, - "execution_count": 4 + "execution_count": 
3 } ] }, @@ -163,7 +163,7 @@ }, "source": [ "### `DataFrame.info`\n", - "Let's take a look at this dataset to see what we have:" + "The amount of data (given by the `shape` attribute) and the names of the features or columns (given by the `columns` attribute) tell us something about the dataset. Now we want to dive deeper into the dataset. The `DataFrame.info()` method is quite useful for this. " ] }, { @@ -171,15 +171,15 @@ "metadata": { "trusted": false, "id": "dHHRyG0_gRrt", - "outputId": "ca9de335-9e65-486a-d1e2-3e73d060c701", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "325edd04-3809-4d71-b6c3-94c65b162882" }, "source": [ "iris_df.info()" ], - "execution_count": 3, + "execution_count": 4, "outputs": [ { "output_type": "stream", @@ -206,7 +206,150 @@ "id": "1XgVMpvigRru" }, "source": [ - "From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers." + "From here, we can make a few observations:\n", + "1. The data type of each column: In this dataset, all of the data is stored as 64-bit floating-point numbers.\n", + "2. Number of non-null values: Dealing with null values is an important step in data preparation. It will be dealt with later in the notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IYlyxbpWFEF4" + }, + "source": [ + "### DataFrame.describe()\n", + "Say we have a lot of numerical data in our dataset. Univariate statistical calculations such as the mean, median, quartiles etc. can be done on each of the columns individually.
The `DataFrame.describe()` function provides us with a statistical summary of the numerical columns of a dataset.\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tWV-CMstFIRA", + "outputId": "7c5cd72f-51d8-474c-966b-d2fbbdb7b7fc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "iris_df.describe()" + ], + "execution_count": 8, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", + "count 150.000000 150.000000 150.000000 150.000000\n", + "mean 5.843333 3.057333 3.758000 1.199333\n", + "std 0.828066 0.435866 1.765298 0.762238\n", + "min 4.300000 2.000000 1.000000 0.100000\n", + "25% 5.100000 2.800000 1.600000 0.300000\n", + "50% 5.800000 3.000000 4.350000 1.300000\n", + "75% 6.400000 3.300000 5.100000 1.800000\n", + "max 7.900000 4.400000 6.900000 2.500000" + ] + }, + "metadata": {}, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zjjtW5hPGMuM" + }, + "source": [ + "The output above shows the total number of data points, mean, standard deviation, minimum, lower quartile(25%), median(50%), upper quartile(75%) and the maximum value of each column." ] }, { @@ -216,20 +359,117 @@ }, "source": [ "### `DataFrame.head`\n", - "Next, let's see what the first few rows of our `DataFrame` look like:" + "With all the above functions and attributes, we have got a top level view of the dataset. We know how many data points are there, how many features are there, the data type of each feature and the number of non-null values for each feature.\n", + "\n", + "Now its time to look at the data itself. Let's see what the first few rows(the first few datapoints) of our `DataFrame` look like:" ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "DZMJZh0OgRrw" + "id": "DZMJZh0OgRrw", + "outputId": "c12ac408-abdb-48a5-ca3f-93b02f963b2f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } }, "source": [ "iris_df.head()" ], - "execution_count": null, - "outputs": [] + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", + "0 5.1 3.5 1.4 0.2\n", + "1 4.9 3.0 1.4 0.2\n", + "2 4.7 3.2 1.3 0.2\n", + "3 4.6 3.1 1.5 0.2\n", + "4 5.0 3.6 1.4 0.2" + ] + }, + "metadata": {}, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EBHEimZuEFQK" + }, + "source": [ + "As the output here, we can see five(5) entries of the dataset. If we look at the index at the left, we find out that these are the first five rows." + ] }, { "cell_type": "markdown", @@ -239,7 +479,7 @@ "source": [ "### Exercise:\n", "\n", - "By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?" + "From the example given above, it is clear that, by default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out a way to display more than five rows?" ] }, { @@ -252,7 +492,7 @@ "source": [ "# Hint: Consult the documentation by using iris_df.head?" ], - "execution_count": null, + "execution_count": 6, "outputs": [] }, { @@ -262,20 +502,106 @@ }, "source": [ "### `DataFrame.tail`\n", - "The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:" + "Another way of looking at the data can be from the end(instead of the beginning). The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:" ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "heanjfGWgRr2" + "id": "heanjfGWgRr2", + "outputId": "2930cf87-bfeb-4ddc-8be1-53d0e57a06b3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } }, "source": [ "iris_df.tail()" ], - "execution_count": null, - "outputs": [] + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", + "145 6.7 3.0 5.2 2.3\n", + "146 6.3 2.5 5.0 1.9\n", + "147 6.5 3.0 5.2 2.0\n", + "148 6.2 3.4 5.4 2.3\n", + "149 5.9 3.0 5.1 1.8" + ] + }, + "metadata": {}, + "execution_count": 7 + } + ] }, { "cell_type": "markdown", @@ -283,7 +609,9 @@ "id": "31kBWfyLgRr3" }, "source": [ - "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.\n", + "In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets. \n", + "\n", + "All the functions and attributes shown above with the help of code examples, help us get a look and feel of the data. \n", "\n", "> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with." 
] From fb023030737854eb77cf29734c69cf8d80c6a470 Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Tue, 5 Oct 2021 11:53:41 +0530 Subject: [PATCH 3/8] Added Missing data , detecting null values codes and enhanced explanations --- .../08-data-preparation/notebook.ipynb | 267 +++++++++++++++--- 1 file changed, 220 insertions(+), 47 deletions(-) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index c6ca05dc..4d047f3e 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -79,7 +79,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "968f9fb0-6cb7-4985-c64b-b332c086bdbf" + "outputId": "4641a412-8abb-4e2f-d1ec-ff9b5004e361" }, "source": [ "iris_df.shape" @@ -126,7 +126,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "ffad1c9f-06b4-49d9-b409-5e4cc1b9f19b" + "outputId": "0f9c41ea-d480-4245-d7e2-56d514ac7724" }, "source": [ "iris_df.columns" @@ -174,7 +174,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "325edd04-3809-4d71-b6c3-94c65b162882" + "outputId": "94d5e48a-746c-4e58-b08f-c63b377a61b1" }, "source": [ "iris_df.info()" @@ -226,16 +226,16 @@ "cell_type": "code", "metadata": { "id": "tWV-CMstFIRA", - "outputId": "7c5cd72f-51d8-474c-966b-d2fbbdb7b7fc", "colab": { "base_uri": "https://localhost:8080/", "height": 297 - } + }, + "outputId": "b01322a1-4296-4ad0-f990-6e0dcba668f6" }, "source": [ "iris_df.describe()" ], - "execution_count": 8, + "execution_count": 5, "outputs": [ { "output_type": "execute_result", @@ -339,7 +339,7 @@ ] }, "metadata": {}, - "execution_count": 8 + "execution_count": 5 } ] }, @@ -369,16 +369,16 @@ "metadata": { "trusted": false, "id": "DZMJZh0OgRrw", - "outputId": "c12ac408-abdb-48a5-ca3f-93b02f963b2f", "colab": { "base_uri": "https://localhost:8080/", "height": 204 - } + }, + 
"outputId": "14b1e3cd-54ac-47dc-f7b2-231d51d93741" }, "source": [ "iris_df.head()" ], - "execution_count": 5, + "execution_count": 6, "outputs": [ { "output_type": "execute_result", @@ -458,7 +458,7 @@ ] }, "metadata": {}, - "execution_count": 5 + "execution_count": 6 } ] }, @@ -492,7 +492,7 @@ "source": [ "# Hint: Consult the documentation by using iris_df.head?" ], - "execution_count": 6, + "execution_count": 7, "outputs": [] }, { @@ -510,16 +510,16 @@ "metadata": { "trusted": false, "id": "heanjfGWgRr2", - "outputId": "2930cf87-bfeb-4ddc-8be1-53d0e57a06b3", "colab": { "base_uri": "https://localhost:8080/", "height": 204 - } + }, + "outputId": "d4e22b38-ba5d-4dd1-bbd2-b9cd9ad7b150" }, "source": [ "iris_df.tail()" ], - "execution_count": 7, + "execution_count": 8, "outputs": [ { "output_type": "execute_result", @@ -599,7 +599,7 @@ ] }, "metadata": {}, - "execution_count": 7 + "execution_count": 8 } ] }, @@ -619,14 +619,18 @@ { "cell_type": "markdown", "metadata": { - "id": "BvnoojWsgRr4" + "id": "TvurZyLSDxq_" }, "source": [ - "## Dealing with missing data\n", + "### Missing Data\n", + "Let us dive into missing data. Missing data occurs, when no value is sotred in some of the columns. \n", + "\n", + "Let us take an example: say someone is concious about his/her weight and doesn't fill the weight field in a survey. Then, the weight value for that certain person will be missing. \n", + "\n", + "Most of the time, in real world datasets, missing values occur.\n", "\n", - "> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.\n", + "**How Pandas Handles missing data**\n", "\n", - "Most of the time the datasets you want to use (of have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.\n", "\n", "Pandas handles missing values in two ways. 
The first you've seen before in previous sections: `NaN`, or Not a Number. This is a actually a special value that is part of the IEEE floating-point specification and it is only used to indicate missing floating-point values.\n", "\n", @@ -649,7 +653,11 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "QIoNdY4ngRr7" + "id": "QIoNdY4ngRr7", + "outputId": "e2ea93a4-b967-4319-904b-85479c36b169", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "source": [ "import numpy as np\n", @@ -657,8 +665,19 @@ "example1 = np.array([2, None, 6, 8])\n", "example1" ], - "execution_count": null, - "outputs": [] + "execution_count": 9, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([2, None, 6, 8], dtype=object)" + ] + }, + "metadata": {}, + "execution_count": 9 + } + ] }, { "cell_type": "markdown", @@ -675,13 +694,31 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "gWbx-KB9gRr8" + "id": "gWbx-KB9gRr8", + "outputId": "ff2a899b-5419-4a5c-b054-bc1e6ab906c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 292 + } }, "source": [ "example1.sum()" ], - "execution_count": null, - "outputs": [] + "execution_count": 10, + "outputs": [ + { + "output_type": "error", + "ename": "TypeError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, 
dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", + "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" + ] + } + ] }, { "cell_type": "markdown", @@ -707,25 +744,55 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "rcFYfMG9gRr9" + "id": "rcFYfMG9gRr9", + "outputId": "a452b675-2131-47a7-ff38-2b4d6e923d50", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "source": [ "np.nan + 1" ], - "execution_count": null, - "outputs": [] + "execution_count": 11, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "nan" + ] + }, + "metadata": {}, + "execution_count": 11 + } + ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "BW3zQD2-gRr-" + "id": "BW3zQD2-gRr-", + "outputId": "6956b57f-8ae7-4880-cc1d-0cf54edfe6ee", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "source": [ "np.nan * 0" ], - "execution_count": null, - "outputs": [] + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "nan" + ] + }, + "metadata": {}, + "execution_count": 12 + } + ] }, { "cell_type": "markdown", @@ -740,14 +807,29 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "LCInVgSSgRr_" + "id": "LCInVgSSgRr_", + "outputId": 
"57ad3201-3958-48c6-924b-d46b61d4aeba", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ], - "execution_count": null, - "outputs": [] + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(nan, nan, nan)" + ] + }, + "metadata": {}, + "execution_count": 13 + } + ] }, { "cell_type": "markdown", @@ -768,7 +850,7 @@ "source": [ "# What happens if you add np.nan and None together?\n" ], - "execution_count": null, + "execution_count": 14, "outputs": [] }, { @@ -795,14 +877,32 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "Nji-KGdNgRsA" + "id": "Nji-KGdNgRsA", + "outputId": "8dbdf129-cd8b-40b5-96ba-21a7f3fa0044", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ], - "execution_count": null, - "outputs": [] + "execution_count": 15, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 1\n", + "1 2\n", + "2 3\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 15 + } + ] }, { "cell_type": "markdown", @@ -825,7 +925,7 @@ "# How does that element show up in the Series?\n", "# What is the dtype of the Series?\n" ], - "execution_count": null, + "execution_count": 16, "outputs": [] }, { @@ -851,6 +951,8 @@ }, "source": [ "### Detecting null values\n", + "\n", + "Now that we have understood the importance of missing values, we need to detect them in our dataset, before dealing with them.\n", "Both `isnull()` and `notnull()` are your primary methods for detecting null data. Both return Boolean masks over your data." 
] }, @@ -864,20 +966,39 @@ "source": [ "example3 = pd.Series([0, np.nan, '', None])" ], - "execution_count": null, + "execution_count": 17, "outputs": [] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "1XdaJJ7PgRsC" + "id": "1XdaJJ7PgRsC", + "outputId": "1fd6c6af-19e0-4568-e837-985d571604f4", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "source": [ "example3.isnull()" ], - "execution_count": null, - "outputs": [] + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 False\n", + "1 True\n", + "2 False\n", + "3 True\n", + "dtype: bool" + ] + }, + "metadata": {}, + "execution_count": 18 + } + ] }, { "cell_type": "markdown", @@ -887,7 +1008,35 @@ "source": [ "Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's nevertheless a perfectly good integer and pandas treats it as such. `''` is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.\n", "\n", - "Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values." + "Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values.\n", + "\n", + "If we want the total number of missing values, we can just do a sum over the mask produced by the `isnull()` method." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JCcQVoPkHDUv", + "outputId": "c0002689-f529-4e3e-c73b-41ac513c59d3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "example3.isnull().sum()" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "2" + ] + }, + "metadata": {}, + "execution_count": 19 + } ] }, { @@ -910,7 +1059,7 @@ "# Try running example3[example3.notnull()].\n", "# Before you do so, what do you expect to see?\n" ], - "execution_count": null, + "execution_count": 20, "outputs": [] }, { @@ -919,7 +1068,31 @@ "id": "D_jWN7mHgRsD" }, "source": [ - "**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data." + "**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in DataFrames: they show the results and the index of those results, which will help you enormously as you wrestle with your data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BvnoojWsgRr4" + }, + "source": [ + "### Dealing with missing data\n", + "\n", + "> **Learning goal:** By the end of this subsection, you should know how and when to replace or remove null values from DataFrames.\n", + "\n", + "Most machine learning models can't handle missing data on their own. So, before passing the data into the model, we need to deal with these missing values.\n", + "\n", + "How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.\n", + "\n", + "There are primarily two ways of dealing with missing data:\n", + "\n", + "\n", + "1. Drop the row containing the missing value\n", + "2.
Replace the missing value with some other value\n", + "\n", + "We will discuss both these methods and their pros and cons in details.\n", + "\n" ] }, { From d58a5f67e2cfba8378e9b1611efa152bf813d563 Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Tue, 5 Oct 2021 12:06:59 +0530 Subject: [PATCH 4/8] Enhanced dropna --- .../08-data-preparation/notebook.ipynb | 417 ++++++++++++++++-- 1 file changed, 377 insertions(+), 40 deletions(-) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index 4d047f3e..ac9bab82 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -654,10 +654,10 @@ "metadata": { "trusted": false, "id": "QIoNdY4ngRr7", - "outputId": "e2ea93a4-b967-4319-904b-85479c36b169", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "e2ea93a4-b967-4319-904b-85479c36b169" }, "source": [ "import numpy as np\n", @@ -695,11 +695,11 @@ "metadata": { "trusted": false, "id": "gWbx-KB9gRr8", - "outputId": "ff2a899b-5419-4a5c-b054-bc1e6ab906c5", "colab": { "base_uri": "https://localhost:8080/", "height": 292 - } + }, + "outputId": "ff2a899b-5419-4a5c-b054-bc1e6ab906c5" }, "source": [ "example1.sum()" @@ -745,10 +745,10 @@ "metadata": { "trusted": false, "id": "rcFYfMG9gRr9", - "outputId": "a452b675-2131-47a7-ff38-2b4d6e923d50", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "a452b675-2131-47a7-ff38-2b4d6e923d50" }, "source": [ "np.nan + 1" @@ -772,10 +772,10 @@ "metadata": { "trusted": false, "id": "BW3zQD2-gRr-", - "outputId": "6956b57f-8ae7-4880-cc1d-0cf54edfe6ee", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "6956b57f-8ae7-4880-cc1d-0cf54edfe6ee" }, "source": [ "np.nan * 0" @@ -808,10 +808,10 @@ "metadata": { "trusted": false, "id": "LCInVgSSgRr_", - "outputId": "57ad3201-3958-48c6-924b-d46b61d4aeba", 
"colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "57ad3201-3958-48c6-924b-d46b61d4aeba" }, "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", @@ -878,10 +878,10 @@ "metadata": { "trusted": false, "id": "Nji-KGdNgRsA", - "outputId": "8dbdf129-cd8b-40b5-96ba-21a7f3fa0044", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "8dbdf129-cd8b-40b5-96ba-21a7f3fa0044" }, "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", @@ -974,10 +974,10 @@ "metadata": { "trusted": false, "id": "1XdaJJ7PgRsC", - "outputId": "1fd6c6af-19e0-4568-e837-985d571604f4", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "1fd6c6af-19e0-4568-e837-985d571604f4" }, "source": [ "example3.isnull()" @@ -1016,11 +1016,11 @@ { "cell_type": "code", "metadata": { - "id": "JCcQVoPkHDUv", - "outputId": "c0002689-f529-4e3e-c73b-41ac513c59d3", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "id": "JCcQVoPkHDUv", + "outputId": "c0002689-f529-4e3e-c73b-41ac513c59d3" }, "source": [ "example3.isnull().sum()" @@ -1103,21 +1103,42 @@ "source": [ "### Dropping null values\n", "\n", - "Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) To see this in action, let's return to `example3`:" + "The amount of data we pass on to our model has a direct effect on its performance. Dropping null values means that we are reducing the number of datapoints, and hence reducing the size of the dataset. So, it is advisable to drop rows with null values when the dataset is quite large.\n", + "\n", + "Another instance maybe that a certain row or column has a lot of missing values. 
Then, they may be dropped because they wouldn't add much value to our analysis as most of the data is missing for that row/column.\n", + "\n", + "Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. To see this in action, let's return to `example3`. The `dropna()` method removes the entries with null values. " ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "7uIvS097gRsD" + "id": "7uIvS097gRsD", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "3d2d43e7-99ca-45ca-adc4-cef2c737e5bf" }, "source": [ "example3 = example3.dropna()\n", "example3" ], - "execution_count": null, - "outputs": [] + "execution_count": 21, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 0\n", + "2 \n", + "dtype: object" + ] + }, + "metadata": {}, + "execution_count": 21 + } + ] }, { "cell_type": "markdown", @@ -1127,14 +1148,19 @@ "source": [ "Note that this should look like your output from `example3[example3.notnull()]`. The difference here is that, rather than just indexing on the masked values, `dropna` has removed those missing values from the `Series` `example3`.\n", "\n", - "Because `DataFrame`s have two dimensions, they afford more options for dropping data." + "Because DataFrames have two dimensions, they afford more options for dropping data." ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "an-l74sPgRsE" + "id": "an-l74sPgRsE", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 142 + }, + "outputId": "961427aa-9bce-445b-d230-61d02bc16c92" }, "source": [ "example4 = pd.DataFrame([[1, np.nan, 7], \n", @@ -1142,8 +1168,69 @@ " [np.nan, 6, 9]])\n", "example4" ], - "execution_count": null, - "outputs": [] + "execution_count": 22, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
01.0NaN7
12.05.08
2NaN6.09
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 1.0 NaN 7\n", + "1 2.0 5.0 8\n", + "2 NaN 6.0 9" + ] + }, + "metadata": {}, + "execution_count": 22 + } + ] }, { "cell_type": "markdown", @@ -1160,13 +1247,65 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "jAVU24RXgRsE" + "id": "jAVU24RXgRsE", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 80 + }, + "outputId": "aaeac6bc-ca6f-4eda-de0c-119e0c50ba83" }, "source": [ "example4.dropna()" ], - "execution_count": null, - "outputs": [] + "execution_count": 23, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
12.05.08
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "1 2.0 5.0 8" + ] + }, + "metadata": {}, + "execution_count": 23 + } + ] }, { "cell_type": "markdown", @@ -1181,13 +1320,71 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "GrBhxu9GgRsE" + "id": "GrBhxu9GgRsE", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 142 + }, + "outputId": "89fee273-d71b-4400-9484-b4bf93b69ee5" }, "source": [ "example4.dropna(axis='columns')" ], - "execution_count": null, - "outputs": [] + "execution_count": 24, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
2
07
18
29
\n", + "
" + ], + "text/plain": [ + " 2\n", + "0 7\n", + "1 8\n", + "2 9" + ] + }, + "metadata": {}, + "execution_count": 24 + } + ] }, { "cell_type": "markdown", @@ -1197,21 +1394,104 @@ "source": [ "Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those setting in `dropna` with the `how` and `thresh` parameters.\n", "\n", - "By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action." + "By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action in the next exercise." ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "Bcf_JWTsgRsF" + "id": "Bcf_JWTsgRsF", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 142 + }, + "outputId": "07e8f4eb-18c8-4e5d-9317-6a9a3db38b73" }, "source": [ "example4[3] = np.nan\n", "example4" ], - "execution_count": null, - "outputs": [] + "execution_count": 25, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 1.0 NaN 7 NaN\n", + "1 2.0 5.0 8 NaN\n", + "2 NaN 6.0 9 NaN" + ] + }, + "metadata": {}, + "execution_count": 25 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pNZer7q9JPNC" + }, + "source": [ + "> Key takeaways: \n", + "1. Dropping null values is a good idea only if the dataset is large enough.\n", + "2. Full rows or columns can be dropped if they have most of their data missing.\n", + "3. The `DataFrame.dropna(axis=)` method helps in dropping null values. The `axis` argument signifies whether rows are to be dropped or columns. \n", + "4. The `how` argument can also be used. By default it is set to `any`. So, it drops only those rows/columns which contain any null values. It can be set to `all` to specify that we will drop only those rows/columns where all values are null." + ] }, { "cell_type": "markdown", @@ -1233,7 +1513,7 @@ "# How might you go about dropping just column 3?\n", "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ], - "execution_count": null, + "execution_count": 26, "outputs": [] }, { @@ -1249,13 +1529,67 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "M9dCNMaagRsG" + "id": "M9dCNMaagRsG", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 80 + }, + "outputId": "b2c00415-95a6-4a5c-e3f9-781ff5cc8625" }, "source": [ "example4.dropna(axis='rows', thresh=3)" ], - "execution_count": null, - "outputs": [] + "execution_count": 27, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
12.05.08NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "1 2.0 5.0 8 NaN" + ] + }, + "metadata": {}, + "execution_count": 27 + } + ] }, { "cell_type": "markdown", @@ -1274,7 +1608,10 @@ "source": [ "### Filling null values\n", "\n", - "Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice." + "It sometimes makes sense to fill in missing values with ones which could be valid. There are a few techniques to fill null values. The first is using Domain Knowledge(knowledge of the subject on which the dataset is based) to somehow approximate the missing values. \n", + "\n", + "\n", + "You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice." 
] }, { From 87ef4f0875c32b8eaaabcf5a904e3cedb10825c1 Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Tue, 5 Oct 2021 12:28:48 +0530 Subject: [PATCH 5/8] fillna for Categorical columns added --- .../08-data-preparation/notebook.ipynb | 294 ++++++++++++++++++ 1 file changed, 294 insertions(+) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index ac9bab82..3e8ae01e 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -1614,6 +1614,300 @@ "You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice." ] }, + { + "cell_type": "markdown", + "metadata": { + "id": "CE8S7louLezV" + }, + "source": [ + "First let us consider non-numeric data. In datasets, we have columns with categorical data, e.g., Gender, True or False, etc.\n", + "\n", + "In most of these cases, we replace missing values with the `mode` of the column. Say, we have 100 data points and 90 have said True, 8 have said False and 2 have not answered. Then, we can fill the 2 with True, considering the full column. \n", + "\n", + "Again, we can use domain knowledge here. Let us consider an example of filling with the mode." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MY5faq4yLdpQ", + "outputId": "c3838b07-0d15-471e-8dad-370de91d4bdc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", + " [3,4,None],\n", + " [5,6,\"False\"],\n", + " [7,8,\"True\"],\n", + " [9,10,\"True\"]])\n", + "\n", + "fill_with_mode" + ], + "execution_count": 28, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
012True
134None
256False
378True
4910True
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 1 2 True\n", + "1 3 4 None\n", + "2 5 6 False\n", + "3 7 8 True\n", + "4 9 10 True" + ] + }, + "metadata": {}, + "execution_count": 28 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MLAoMQOfNPlA" + }, + "source": [ + "Now, let's first find the mode before filling the `None` value with the mode." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WKy-9Y2tN5jv", + "outputId": "41f5064e-502d-4aec-dc2d-86f885068b4f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "fill_with_mode[2].value_counts()" + ], + "execution_count": 29, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "True 3\n", + "False 1\n", + "Name: 2, dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 29 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6iNz_zG_OKrx" + }, + "source": [ + "So, we will replace `None` with `'True'`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TxPKteRvNPOs" + }, + "source": [ + "fill_with_mode[2].fillna('True',inplace=True)" + ], + "execution_count": 30, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tvas7c9_OPWE", + "outputId": "7282c4f7-0e59-4398-b4f2-5919baf61164", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "fill_with_mode" + ], + "execution_count": 31, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
012True
134True
256False
378True
4910True
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 1 2 True\n", + "1 3 4 True\n", + "2 5 6 False\n", + "3 7 8 True\n", + "4 9 10 True" + ] + }, + "metadata": {}, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SktitLxxOR16" + }, + "source": [ + "As we can see, the null value has been replaced. Needless to say, we could have written anything in place of `'True'` and it would have been substituted." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "heYe1I0dOmQ_" + }, + "source": [ + "Now, coming to numeric data. Here, we have two common ways of replacing missing values:\n", + "\n", + "1. Replace with the median of the column\n", + "2. Replace with the mean of the column \n", + "\n", + "We replace with the median in case of skewed data with outliers. This is because the median is robust to outliers.\n", + "\n", + "When the data is normally distributed, we can use the mean, as in that case the mean and median would be pretty close." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "09HM_2feOj5Y" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, { "cell_type": "code", "metadata": { From f5bef705d7480b57f61f4f8e63db49e9cd2e276c Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Tue, 5 Oct 2021 12:43:18 +0530 Subject: [PATCH 6/8] fillna with mean added --- .../08-data-preparation/notebook.ipynb | 261 +++++++++++++++++- 1 file changed, 247 insertions(+), 14 deletions(-) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index 3e8ae01e..b5a6bac0 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -1630,12 +1630,12 @@ { "cell_type": "code", "metadata": { - "id": "MY5faq4yLdpQ", - "outputId": "c3838b07-0d15-471e-8dad-370de91d4bdc", "colab": { "base_uri": "https://localhost:8080/", 
"height": 204 - } + }, + "id": 
"MY5faq4yLdpQ", + "outputId": "c3838b07-0d15-471e-8dad-370de91d4bdc" }, "source": [ "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", @@ -1736,11 +1736,11 @@ { "cell_type": "code", "metadata": { - "id": "WKy-9Y2tN5jv", - "outputId": "41f5064e-502d-4aec-dc2d-86f885068b4f", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "id": "WKy-9Y2tN5jv", + "outputId": "41f5064e-502d-4aec-dc2d-86f885068b4f" }, "source": [ "fill_with_mode[2].value_counts()" @@ -1784,12 +1784,12 @@ { "cell_type": "code", "metadata": { - "id": "tvas7c9_OPWE", - "outputId": "7282c4f7-0e59-4398-b4f2-5919baf61164", "colab": { "base_uri": "https://localhost:8080/", "height": 204 - } + }, + "id": "tvas7c9_OPWE", + "outputId": "7282c4f7-0e59-4398-b4f2-5919baf61164" }, "source": [ "fill_with_mode" @@ -1894,19 +1894,252 @@ "\n", "We replace with the median in case of skewed data with outliers. This is because the median is robust to outliers.\n", "\n", - "When the data is normally distributed, we can use the mean, as in that case the mean and median would be pretty close." + "When the data is normally distributed, we can use the mean, as in that case the mean and median would be pretty close.\n", + "\n", + "First, let us take a column which is normally distributed and let us fill the missing value with the mean of the column. " ] }, { "cell_type": "code", "metadata": { - "id": "09HM_2feOj5Y" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + }, + "id": "09HM_2feOj5Y", + "outputId": "ade42fec-dc40-45d0-e22c-974849ea8664" }, "source": [ - "" + "fill_with_mean = pd.DataFrame([[-2,0,1],\n", + " [-1,2,3],\n", + " [np.nan,4,5],\n", + " [1,6,7],\n", + " [2,8,9]])\n", + "\n", + "fill_with_mean" ], - "execution_count": null, - "outputs": [] + "execution_count": 33, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
0-2.001
1-1.023
2NaN45
31.067
42.089
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 -2.0 0 1\n", + "1 -1.0 2 3\n", + "2 NaN 4 5\n", + "3 1.0 6 7\n", + "4 2.0 8 9" + ] + }, + "metadata": {}, + "execution_count": 33 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ka7-wNfzSxbx" + }, + "source": [ + "The mean of the column is" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XYtYEf5BSxFL", + "outputId": "1e79aeea-6baf-4572-dcd1-23e5ec742036", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.mean(fill_with_mean[0])" + ], + "execution_count": 34, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "execution_count": 34 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oBSRGxKRS39K" + }, + "source": [ + "Filling with mean" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FzncQLmuS5jh", + "outputId": "75f33b25-e6b3-41bb-8049-1ed2e085efe2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n", + "fill_with_mean" + ], + "execution_count": 35, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
0-2.001
1-1.023
20.045
31.067
42.089
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 -2.0 0 1\n", + "1 -1.0 2 3\n", + "2 0.0 4 5\n", + "3 1.0 6 7\n", + "4 2.0 8 9" + ] + }, + "metadata": {}, + "execution_count": 35 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CwpVFCrPTC5z" + }, + "source": [ + "As we can see, the missing value has been replaced with its mean." + ] }, { "cell_type": "code", From a60aa03c487119f1ba64ab97bfc560ec1d87fe93 Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Wed, 6 Oct 2021 00:06:56 +0530 Subject: [PATCH 7/8] Added Encoding --- .../08-data-preparation/notebook.ipynb | 1710 ++++++++++++++--- 1 file changed, 1492 insertions(+), 218 deletions(-) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index b5a6bac0..71b076e8 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -79,7 +79,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "4641a412-8abb-4e2f-d1ec-ff9b5004e361" + "outputId": "70e0d7dd-fb30-45c4-a5af-7dc85cd89342" }, "source": [ "iris_df.shape" @@ -126,7 +126,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "0f9c41ea-d480-4245-d7e2-56d514ac7724" + "outputId": "85e6ab39-174f-4dc7-fee6-a18f3ba14a7d" }, "source": [ "iris_df.columns" @@ -174,7 +174,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "94d5e48a-746c-4e58-b08f-c63b377a61b1" + "outputId": "2a2bb81a-257c-4410-f826-99402b75ce14" }, "source": [ "iris_df.info()" @@ -230,7 +230,7 @@ "base_uri": "https://localhost:8080/", "height": 297 }, - "outputId": "b01322a1-4296-4ad0-f990-6e0dcba668f6" + "outputId": "e5015299-163f-42c7-aaa1-9bc3a67788bf" }, "source": [ "iris_df.describe()" @@ -373,7 +373,7 @@ "base_uri": "https://localhost:8080/", "height": 204 }, - "outputId": "14b1e3cd-54ac-47dc-f7b2-231d51d93741" + "outputId": 
"5ff975df-45f0-4efd-f884-2580909c6e67" }, "source": [ "iris_df.head()" @@ -492,7 +492,7 @@ "source": [ "# Hint: Consult the documentation by using iris_df.head?" ], - "execution_count": 7, + "execution_count": null, "outputs": [] }, { @@ -514,12 +514,12 @@ "base_uri": "https://localhost:8080/", "height": 204 }, - "outputId": "d4e22b38-ba5d-4dd1-bbd2-b9cd9ad7b150" + "outputId": "1726a2e0-82d7-4491-8dbc-637f28a11d26" }, "source": [ "iris_df.tail()" ], - "execution_count": 8, + "execution_count": 7, "outputs": [ { "output_type": "execute_result", @@ -599,7 +599,7 @@ ] }, "metadata": {}, - "execution_count": 8 + "execution_count": 7 } ] }, @@ -657,7 +657,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "e2ea93a4-b967-4319-904b-85479c36b169" + "outputId": "20e2d43a-2053-4037-c736-8ec2c28b67e5" }, "source": [ "import numpy as np\n", @@ -665,7 +665,7 @@ "example1 = np.array([2, None, 6, 8])\n", "example1" ], - "execution_count": 9, + "execution_count": 8, "outputs": [ { "output_type": "execute_result", @@ -675,7 +675,7 @@ ] }, "metadata": {}, - "execution_count": 9 + "execution_count": 8 } ] }, @@ -699,12 +699,12 @@ "base_uri": "https://localhost:8080/", "height": 292 }, - "outputId": "ff2a899b-5419-4a5c-b054-bc1e6ab906c5" + "outputId": "ab3b1799-504f-480d-851b-85b19f62d8b7" }, "source": [ "example1.sum()" ], - "execution_count": 10, + "execution_count": 9, "outputs": [ { "output_type": "error", @@ -713,7 +713,7 @@ "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 
1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" ] @@ -748,12 +748,12 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "a452b675-2131-47a7-ff38-2b4d6e923d50" + "outputId": "3744a812-6daf-472e-e933-388c722ab2b4" }, "source": [ "np.nan + 1" ], - "execution_count": 11, + "execution_count": 10, "outputs": [ { "output_type": "execute_result", @@ -763,7 +763,7 @@ ] }, "metadata": {}, - "execution_count": 11 + "execution_count": 10 } ] }, @@ -775,12 +775,12 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "6956b57f-8ae7-4880-cc1d-0cf54edfe6ee" + "outputId": "4a304a47-c5a0-4814-92b0-c4b5ab193358" }, "source": [ "np.nan * 0" ], - "execution_count": 12, + "execution_count": 11, "outputs": [ { "output_type": "execute_result", @@ -790,7 +790,7 @@ ] }, "metadata": {}, - 
"execution_count": 12 + "execution_count": 11 } ] }, @@ -811,13 +811,13 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "57ad3201-3958-48c6-924b-d46b61d4aeba" + "outputId": "a41b57bf-1c2a-4219-9ee5-0a1a1499e74d" }, "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ], - "execution_count": 13, + "execution_count": 12, "outputs": [ { "output_type": "execute_result", @@ -827,7 +827,7 @@ ] }, "metadata": {}, - "execution_count": 13 + "execution_count": 12 } ] }, @@ -850,7 +850,7 @@ "source": [ "# What happens if you add np.nan and None together?\n" ], - "execution_count": 14, + "execution_count": 13, "outputs": [] }, { @@ -881,13 +881,13 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "8dbdf129-cd8b-40b5-96ba-21a7f3fa0044" + "outputId": "5f3389e0-4b54-4d6b-a305-a269df869235" }, "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ], - "execution_count": 15, + "execution_count": 14, "outputs": [ { "output_type": "execute_result", @@ -900,7 +900,7 @@ ] }, "metadata": {}, - "execution_count": 15 + "execution_count": 14 } ] }, @@ -925,7 +925,7 @@ "# How does that element show up in the Series?\n", "# What is the dtype of the Series?\n" ], - "execution_count": 16, + "execution_count": 15, "outputs": [] }, { @@ -966,7 +966,7 @@ "source": [ "example3 = pd.Series([0, np.nan, '', None])" ], - "execution_count": 17, + "execution_count": 16, "outputs": [] }, { @@ -977,12 +977,12 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "1fd6c6af-19e0-4568-e837-985d571604f4" + "outputId": "88a14e60-392a-42ad-d767-a4055580f523" }, "source": [ "example3.isnull()" ], - "execution_count": 18, + "execution_count": 17, "outputs": [ { "output_type": "execute_result", @@ -996,7 +996,7 @@ ] }, "metadata": {}, - "execution_count": 18 + "execution_count": 17 } ] }, @@ -1020,12 +1020,12 @@ "base_uri": "https://localhost:8080/" }, "id": "JCcQVoPkHDUv", - "outputId": 
"c0002689-f529-4e3e-c73b-41ac513c59d3" + "outputId": "042418f0-981b-4c5e-cdf8-c42912f7e4fe" }, "source": [ "example3.isnull().sum()" ], - "execution_count": 19, + "execution_count": 18, "outputs": [ { "output_type": "execute_result", @@ -1035,7 +1035,7 @@ ] }, "metadata": {}, - "execution_count": 19 + "execution_count": 18 } ] }, @@ -1059,7 +1059,7 @@ "# Try running example3[example3.notnull()].\n", "# Before you do so, what do you expect to see?\n" ], - "execution_count": 20, + "execution_count": 19, "outputs": [] }, { @@ -1118,13 +1118,13 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "3d2d43e7-99ca-45ca-adc4-cef2c737e5bf" + "outputId": "782b0526-a1bb-4757-ac1f-a16267d9eb4f" }, "source": [ "example3 = example3.dropna()\n", "example3" ], - "execution_count": 21, + "execution_count": 20, "outputs": [ { "output_type": "execute_result", @@ -1136,7 +1136,7 @@ ] }, "metadata": {}, - "execution_count": 21 + "execution_count": 20 } ] }, @@ -1160,7 +1160,7 @@ "base_uri": "https://localhost:8080/", "height": 142 }, - "outputId": "961427aa-9bce-445b-d230-61d02bc16c92" + "outputId": "3d19e787-896d-4ba4-8662-811d2e191d3b" }, "source": [ "example4 = pd.DataFrame([[1, np.nan, 7], \n", @@ -1168,7 +1168,7 @@ " [np.nan, 6, 9]])\n", "example4" ], - "execution_count": 22, + "execution_count": 21, "outputs": [ { "output_type": "execute_result", @@ -1228,7 +1228,7 @@ ] }, "metadata": {}, - "execution_count": 22 + "execution_count": 21 } ] }, @@ -1252,12 +1252,12 @@ "base_uri": "https://localhost:8080/", "height": 80 }, - "outputId": "aaeac6bc-ca6f-4eda-de0c-119e0c50ba83" + "outputId": "6bdb7658-8a64-401f-d2b2-bd0f8bc17325" }, "source": [ "example4.dropna()" ], - "execution_count": 23, + "execution_count": 22, "outputs": [ { "output_type": "execute_result", @@ -1303,7 +1303,7 @@ ] }, "metadata": {}, - "execution_count": 23 + "execution_count": 22 } ] }, @@ -1325,12 +1325,12 @@ "base_uri": "https://localhost:8080/", "height": 142 }, - "outputId": 
"89fee273-d71b-4400-9484-b4bf93b69ee5" + "outputId": "0071a8bb-9fe5-4ed5-a3af-d0209485515a" }, "source": [ "example4.dropna(axis='columns')" ], - "execution_count": 24, + "execution_count": 23, "outputs": [ { "output_type": "execute_result", @@ -1382,7 +1382,7 @@ ] }, "metadata": {}, - "execution_count": 24 + "execution_count": 23 } ] }, @@ -1406,13 +1406,13 @@ "base_uri": "https://localhost:8080/", "height": 142 }, - "outputId": "07e8f4eb-18c8-4e5d-9317-6a9a3db38b73" + "outputId": "a26b5362-0d17-49c2-d902-10832f9bf9a0" }, "source": [ "example4[3] = np.nan\n", "example4" ], - "execution_count": 25, + "execution_count": 24, "outputs": [ { "output_type": "execute_result", @@ -1476,7 +1476,7 @@ ] }, "metadata": {}, - "execution_count": 25 + "execution_count": 24 } ] }, @@ -1513,7 +1513,7 @@ "# How might you go about dropping just column 3?\n", "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ], - "execution_count": 26, + "execution_count": 25, "outputs": [] }, { @@ -1534,12 +1534,12 @@ "base_uri": "https://localhost:8080/", "height": 80 }, - "outputId": "b2c00415-95a6-4a5c-e3f9-781ff5cc8625" + "outputId": "ee2d3a60-a694-4a11-ef37-28d00a8d956c" }, "source": [ "example4.dropna(axis='rows', thresh=3)" ], - "execution_count": 27, + "execution_count": 26, "outputs": [ { "output_type": "execute_result", @@ -1587,7 +1587,7 @@ ] }, "metadata": {}, - "execution_count": 27 + "execution_count": 26 } ] }, @@ -1620,6 +1620,7 @@ "id": "CE8S7louLezV" }, "source": [ + "### Categorical Data(Non-numeric)\n", "First let us consider non-numeric data. In datasets, we have columns with categorical data. Eg. Gender, True or False etc.\n", "\n", "In most of these cases, we replace missing values with the `mode` of the column. Say, we have 100 data points and 90 have said True, 8 have said False and 2 have not filled. Then, we can will the 2 with True, considering the full column. 
\n", @@ -1635,7 +1636,7 @@ "height": 204 }, "id": "MY5faq4yLdpQ", - "outputId": "c3838b07-0d15-471e-8dad-370de91d4bdc" + "outputId": "49350e22-4ee9-43c1-9d6c-e5f837b24ae8" }, "source": [ "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", @@ -1646,7 +1647,7 @@ "\n", "fill_with_mode" ], - "execution_count": 28, + "execution_count": 27, "outputs": [ { "output_type": "execute_result", @@ -1720,7 +1721,7 @@ ] }, "metadata": {}, - "execution_count": 28 + "execution_count": 27 } ] }, @@ -1740,12 +1741,12 @@ "base_uri": "https://localhost:8080/" }, "id": "WKy-9Y2tN5jv", - "outputId": "41f5064e-502d-4aec-dc2d-86f885068b4f" + "outputId": "d0c045f2-218c-45aa-951c-f3feed98510a" }, "source": [ "fill_with_mode[2].value_counts()" ], - "execution_count": 29, + "execution_count": 28, "outputs": [ { "output_type": "execute_result", @@ -1757,7 +1758,7 @@ ] }, "metadata": {}, - "execution_count": 29 + "execution_count": 28 } ] }, @@ -1778,7 +1779,7 @@ "source": [ "fill_with_mode[2].fillna('True',inplace=True)" ], - "execution_count": 30, + "execution_count": 29, "outputs": [] }, { @@ -1789,12 +1790,12 @@ "height": 204 }, "id": "tvas7c9_OPWE", - "outputId": "7282c4f7-0e59-4398-b4f2-5919baf61164" + "outputId": "c45890f5-8c76-4a3c-87f0-b831c2199750" }, "source": [ "fill_with_mode" ], - "execution_count": 31, + "execution_count": 30, "outputs": [ { "output_type": "execute_result", @@ -1868,7 +1869,7 @@ ] }, "metadata": {}, - "execution_count": 31 + "execution_count": 30 } ] }, @@ -1887,6 +1888,7 @@ "id": "heYe1I0dOmQ_" }, "source": [ + "### Numeric Data\n", "Now, coming to numeric data. Here, we have two common ways of replacing missing values:\n", "\n", "1. 
Replace with the median of the column\n", @@ -1907,7 +1909,7 @@ "height": 204 }, "id": "09HM_2feOj5Y", - "outputId": "ade42fec-dc40-45d0-e22c-974849ea8664" + "outputId": "44330273-5709-4af9-99c7-7a3a8e28c7b0" }, "source": [ "fill_with_mean = pd.DataFrame([[-2,0,1],\n", @@ -1918,7 +1920,7 @@ "\n", "fill_with_mean" ], - "execution_count": 33, + "execution_count": 31, "outputs": [ { "output_type": "execute_result", @@ -1992,7 +1994,7 @@ ] }, "metadata": {}, - "execution_count": 33 + "execution_count": 31 } ] }, @@ -2009,15 +2011,15 @@ "cell_type": "code", "metadata": { "id": "XYtYEf5BSxFL", - "outputId": "1e79aeea-6baf-4572-dcd1-23e5ec742036", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "7240075c-c3a7-4ac3-e08d-be6d60573d38" }, "source": [ "np.mean(fill_with_mean[0])" ], - "execution_count": 34, + "execution_count": 32, "outputs": [ { "output_type": "execute_result", @@ -2027,7 +2029,7 @@ ] }, "metadata": {}, - "execution_count": 34 + "execution_count": 32 } ] }, @@ -2044,17 +2046,17 @@ "cell_type": "code", "metadata": { "id": "FzncQLmuS5jh", - "outputId": "75f33b25-e6b3-41bb-8049-1ed2e085efe2", "colab": { "base_uri": "https://localhost:8080/", "height": 204 - } + }, + "outputId": "733bfa87-b099-4c11-db2e-1dea88b977ac" }, "source": [ "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n", "fill_with_mean" ], - "execution_count": 35, + "execution_count": 33, "outputs": [ { "output_type": "execute_result", @@ -2128,7 +2130,7 @@ ] }, "metadata": {}, - "execution_count": 35 + "execution_count": 33 } ] }, @@ -2141,201 +2143,1261 @@ "As we can see, the missing value has been replaced with its mean." 
] }, - { - "cell_type": "code", - "metadata": { - "trusted": false, - "id": "0ybtWLDdgRsG" - }, - "source": [ - "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", - "example5" - ], - "execution_count": null, - "outputs": [] - }, { "cell_type": "markdown", "metadata": { - "id": "yrsigxRggRsH" + "id": "jIvF13a1i00Z" }, "source": [ - "You can fill all of the null entries with a single value, such as `0`:" + "Now let us try another dataframe, and this time we will replace the missing (`NaN`) value with the median of its column." ] }, { "cell_type": "code", "metadata": { - "trusted": false, - "id": "KXMIPsQdgRsH" + "id": "DA59Bqo3jBYZ", + "outputId": "4338adf5-081c-46ce-aca1-85bcaebf9838", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } }, "source": [ - "example5.fillna(0)" + "fill_with_median = pd.DataFrame([[-2,0,1],\n", + " [-1,2,3],\n", + " [0,np.nan,5],\n", + " [1,6,7],\n", + " [2,8,9]])\n", + "\n", + "fill_with_median" ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FI9MmqFJgRsH" - }, - "source": [ - "### Exercise:" + "execution_count": 39, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
0-20.01
1-12.03
20NaN5
316.07
428.09
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 -2 0.0 1\n", + "1 -1 2.0 3\n", + "2 0 NaN 5\n", + "3 1 6.0 7\n", + "4 2 8.0 9" + ] + }, + "metadata": {}, + "execution_count": 39 + } ] }, - { - "cell_type": "code", - "metadata": { - "collapsed": true, - "trusted": false, - "id": "af-ezpXdgRsH" - }, - "source": [ - "# What happens if you try to fill null values with a string, like ''?\n" - ], - "execution_count": null, - "outputs": [] - }, { "cell_type": "markdown", "metadata": { - "id": "kq3hw1kLgRsI" + "id": "mM1GpXYmjHnc" }, "source": [ - "You can **forward-fill** null values, which is to use the last valid value to fill a null:" + "The median of the second column is" ] }, { "cell_type": "code", "metadata": { - "trusted": false, - "id": "vO3BuNrggRsI" + "id": "uiDy5v3xjHHX", + "outputId": "2028aa4b-8bec-4b76-ea2f-fcaa7b362e9d", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "source": [ - "example5.fillna(method='ffill')" + "fill_with_median[1].median()" ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nDXeYuHzgRsI" - }, - "source": [ - "You can also **back-fill** to propagate the next valid value backward to fill a null:" + "execution_count": 40, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "4.0" + ] + }, + "metadata": {}, + "execution_count": 40 + } ] }, - { - "cell_type": "code", - "metadata": { - "trusted": false, - "id": "4M5onHcEgRsI" - }, - "source": [ - "example5.fillna(method='bfill')" - ], - "execution_count": null, - "outputs": [] - }, { "cell_type": "markdown", "metadata": { - "collapsed": true, - "id": "MbBzTom5gRsI" + "id": "z9PLF75Jj_1s" }, "source": [ - "As you might guess, this works the same with `DataFrame`s, but you can also specify an `axis` along which to fill null values:" + "Filling with median" ] }, { "cell_type": "code", "metadata": { - "trusted": false, - "id": "aRpIvo4ZgRsI" - }, - "source": [ - "example4" - ], - "execution_count": 
null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "trusted": false, - "id": "VM1qtACAgRsI" + "id": "lFKbOxCMkBbg", + "outputId": "61bf2b0e-c68d-4b54-9724-f496c8c2ea94", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } }, "source": [ - "example4.fillna(method='ffill', axis=1)" + "fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n", + "fill_with_median" ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZeMc-I1EgRsI" + "execution_count": 41, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
0-20.01
1-12.03
204.05
316.07
428.09
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 -2 0.0 1\n", + "1 -1 2.0 3\n", + "2 0 4.0 5\n", + "3 1 6.0 7\n", + "4 2 8.0 9" + ] + }, + "metadata": {}, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8JtQ53GSkKWC" + }, + "source": [ + "As we can see, the NaN value has been replaced by the median of the column" + ] + }, + { + "cell_type": "code", + "metadata": { + "trusted": false, + "id": "0ybtWLDdgRsG", + "outputId": "ee2e547a-bf98-40a5-ddc4-b11357efb898", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", + "example5" + ], + "execution_count": 42, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "a 1.0\n", + "b NaN\n", + "c 2.0\n", + "d NaN\n", + "e 3.0\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 42 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yrsigxRggRsH" + }, + "source": [ + "You can fill all of the null entries with a single value, such as `0`:" + ] + }, + { + "cell_type": "code", + "metadata": { + "trusted": false, + "id": "KXMIPsQdgRsH", + "outputId": "f88a0095-9742-4f1e-fdf4-43fc14cbc4c0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "example5.fillna(0)" + ], + "execution_count": 43, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "a 1.0\n", + "b 0.0\n", + "c 2.0\n", + "d 0.0\n", + "e 3.0\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 43 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RRlI5f_hkfKe" + }, + "source": [ + "> Key takeaways:\n", + "1. Filling in missing values should be done when either there is less data or there is a strategy to fill in the missing data.\n", + "2. Domain knowledge can be used to fill in missing values by approximating them.\n", + "3. 
For categorical data, missing values are most often replaced with the mode of the column.\n", + "4. For numeric data, missing values are usually filled in with the mean (for normalized datasets) or the median of the column." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FI9MmqFJgRsH" + }, + "source": [ + "### Exercise:" + ] + }, + { + "cell_type": "code", + "metadata": { + "collapsed": true, + "trusted": false, + "id": "af-ezpXdgRsH" + }, + "source": [ + "# What happens if you try to fill null values with a string, like ''?\n" + ], + "execution_count": 44, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kq3hw1kLgRsI" + }, + "source": [ + "You can **forward-fill** null values, which is to use the last valid value to fill a null:" + ] + }, + { + "cell_type": "code", + "metadata": { + "trusted": false, + "id": "vO3BuNrggRsI", + "outputId": "aff7d7de-20b9-42bf-fe06-932677314b37", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "example5.fillna(method='ffill')" + ], + "execution_count": 45, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "a 1.0\n", + "b 1.0\n", + "c 2.0\n", + "d 2.0\n", + "e 3.0\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 45 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nDXeYuHzgRsI" + }, + "source": [ + "You can also **back-fill** to propagate the next valid value backward to fill a null:" + ] + }, + { + "cell_type": "code", + "metadata": { + "trusted": false, + "id": "4M5onHcEgRsI", + "outputId": "c20c283d-76d7-4f75-c443-5c55fbdb3541", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "example5.fillna(method='bfill')" + ], + "execution_count": 46, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "a 1.0\n", + "b 2.0\n", + "c 2.0\n", + "d 3.0\n", + "e 3.0\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 
46 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true, + "id": "MbBzTom5gRsI" + }, + "source": [ + "As you might guess, this works the same with DataFrames, but you can also specify an `axis` along which to fill null values:" + ] + }, + { + "cell_type": "code", + "metadata": { + "trusted": false, + "id": "aRpIvo4ZgRsI", + "outputId": "ea9c5e3d-a23d-4314-cff4-e5a0e46043d1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 142 + } + }, + "source": [ + "example4" + ], + "execution_count": 47, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 1.0 NaN 7 NaN\n", + "1 2.0 5.0 8 NaN\n", + "2 NaN 6.0 9 NaN" + ] + }, + "metadata": {}, + "execution_count": 47 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "trusted": false, + "id": "VM1qtACAgRsI", + "outputId": "2cd3360a-ac87-41fb-d362-9d8c981f573f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 142 + } + }, + "source": [ + "example4.fillna(method='ffill', axis=1)" + ], + "execution_count": 48, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
01.01.07.07.0
12.05.08.08.0
2NaN6.09.09.0
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 1.0 1.0 7.0 7.0\n", + "1 2.0 5.0 8.0 8.0\n", + "2 NaN 6.0 9.0 9.0" + ] + }, + "metadata": {}, + "execution_count": 48 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZeMc-I1EgRsI" + }, + "source": [ + "Notice that when a previous value is not available for forward-filling, the null value remains." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eeAoOU0RgRsJ" + }, + "source": [ + "### Exercise:" + ] + }, + { + "cell_type": "code", + "metadata": { + "collapsed": true, + "trusted": false, + "id": "e8S-CjW8gRsJ" + }, + "source": [ + "# What output does example4.fillna(method='bfill', axis=1) produce?\n", + "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", + "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" + ], + "execution_count": 49, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YHgy0lIrgRsJ" + }, + "source": [ + "You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values with the average of all of the values in the `DataFrame`:" + ] + }, + { + "cell_type": "code", + "metadata": { + "trusted": false, + "id": "OtYVErEygRsJ", + "outputId": "ad5f4520-cf88-4e3e-fa16-54bda5efa417", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 142 + } + }, + "source": [ + "example4.fillna(example4.mean())" + ], + "execution_count": 50, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
01.05.57NaN
12.05.08NaN
21.56.09NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 1.0 5.5 7 NaN\n", + "1 2.0 5.0 8 NaN\n", + "2 1.5 6.0 9 NaN" + ] + }, + "metadata": {}, + "execution_count": 50 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zpMvCkLSgRsJ" + }, + "source": [ + "Notice that column 3 is still valueless: the default direction is to fill values row-wise.\n", + "\n", + "> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bauDnESIl9FH" + }, + "source": [ + "### Encoding Categorical Data\n", + "\n", + "Machine learning models only deal with numbers and any form of numeric data. It won't be able to tell the difference between a Yes and a No, but it would be able to distinguish between 0 and 1. So, after filling in the missing values, we need to do encode the categorical data to some numeric form for the model to understand.\n", + "\n", + "Encoding can be done in two ways. We will be discussing them next.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uDq9SxB7mu5i" + }, + "source": [ + "**LABEL ENCODING**\n", + "\n", + "\n", + "Label encoding is basically converting each category to a number. For example, say we have a dataset of airline passengers and there is a column containing their class among the following ['business class', 'economy class','first class']. If Label encoding is done on this, this would be transformed to [0,1,2]. Let us see an example via code. As we would be learning `scikit-learn` in the upcoming notebooks, we won't use it here." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1vGz7uZyoWHL", + "outputId": "5003c8cd-ff07-4399-a5b2-621b45184511", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 235 + } + }, + "source": [ + "label = pd.DataFrame([\n", + " [10,'business class'],\n", + " [20,'first class'],\n", + " [30, 'economy class'],\n", + " [40, 'economy class'],\n", + " [50, 'economy class'],\n", + " [60, 'business class']\n", + "],columns=['ID','class'])\n", + "label" + ], + "execution_count": 70, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", + "
" + ], + "text/plain": [ + " ID class\n", + "0 10 business class\n", + "1 20 first class\n", + "2 30 economy class\n", + "3 40 economy class\n", + "4 50 economy class\n", + "5 60 business class" + ] + }, + "metadata": {}, + "execution_count": 70 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IDHnkwTYov-h" }, "source": [ - "Notice that when a previous value is not available for forward-filling, the null value remains." + "To perform label encoding on the 1st column, we have to first describe a mapping from each class to a number, before replacing" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZC5URJG3o1ES", + "outputId": "c75465b2-169e-417c-8769-680aaf1cd268", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 235 + } + }, + "source": [ + "class_labels = {'business class':0,'economy class':1,'first class':2}\n", + "label['class'] = label['class'].replace(class_labels)\n", + "label" + ], + "execution_count": 71, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IDclass
0100
1202
2301
3401
4501
5600
\n", + "
" + ], + "text/plain": [ + " ID class\n", + "0 10 0\n", + "1 20 2\n", + "2 30 1\n", + "3 40 1\n", + "4 50 1\n", + "5 60 0" + ] + }, + "metadata": {}, + "execution_count": 71 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ftnF-TyapOPt" + }, + "source": [ + "As we can see, the output matches what we thought would happen. So, when do we use label encoding? Label encoding is used in either or both of the following cases :\n", + "1. When the number of categories is large\n", + "2. When the categories are in order. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eQPAPVwsqWT7" + }, + "source": [ + "**ONE HOT ENCODING**\n", + "\n", + "Another type of encoding is One Hot Encoding. In this type of encoding, each category of the column gets added as a separate column and each datapoint will get a 0 or a 1 based on whether it contains that category. So, if there are n different categories, n columns will be appended to the dataframe.\n", + "\n", + "For example, let us take the same aeroplane class example. The categories were: ['business class', 'economy class','first class'] . So, if we perform one hot encoding, the following three columns will be added to the dataset: ['class_business class','class_economy class','class_first class']." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZM0eVh0ArKUL", + "outputId": "cba4258f-a6c3-45e0-dd69-32b73b2cd735", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 235 + } + }, + "source": [ + "one_hot = pd.DataFrame([\n", + " [10,'business class'],\n", + " [20,'first class'],\n", + " [30, 'economy class'],\n", + " [40, 'economy class'],\n", + " [50, 'economy class'],\n", + " [60, 'business class']\n", + "],columns=['ID','class'])\n", + "one_hot" + ], + "execution_count": 67, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IDclass
010business class
120first class
230economy class
340economy class
450economy class
560business class
\n", + "
" + ], + "text/plain": [ + " ID class\n", + "0 10 business class\n", + "1 20 first class\n", + "2 30 economy class\n", + "3 40 economy class\n", + "4 50 economy class\n", + "5 60 business class" + ] + }, + "metadata": {}, + "execution_count": 67 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aVnZ7paDrWmb" + }, + "source": [ + "Let us perform one hot encoding on the 1st column" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RUPxf7egrYKr" + }, + "source": [ + "one_hot_data = pd.get_dummies(one_hot,columns=['class'])" + ], + "execution_count": 68, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TM37pHsFr4ge", + "outputId": "4f9cdbec-5ea6-4613-b14f-5b8b66b85894", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 235 + } + }, + "source": [ + "one_hot_data" + ], + "execution_count": 69, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IDclass_business classclass_economy classclass_first class
010100
120001
230010
340010
450010
560100
\n", + "
" + ], + "text/plain": [ + " ID class_business class class_economy class class_first class\n", + "0 10 1 0 0\n", + "1 20 0 0 1\n", + "2 30 0 1 0\n", + "3 40 0 1 0\n", + "4 50 0 1 0\n", + "5 60 1 0 0" + ] + }, + "metadata": {}, + "execution_count": 69 + } ] }, { "cell_type": "markdown", "metadata": { - "id": "eeAoOU0RgRsJ" + "id": "_zXRLOjXujdA" }, "source": [ - "### Exercise:" + "Each one hot encoded column contains 0 or 1, which specifies whether that category exists for that datapoint." ] }, - { - "cell_type": "code", - "metadata": { - "collapsed": true, - "trusted": false, - "id": "e8S-CjW8gRsJ" - }, - "source": [ - "# What output does example4.fillna(method='bfill', axis=1) produce?\n", - "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", - "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" - ], - "execution_count": null, - "outputs": [] - }, { "cell_type": "markdown", "metadata": { - "id": "YHgy0lIrgRsJ" + "id": "bDnC4NQOu0qr" }, "source": [ - "You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values with the average of all of the values in the `DataFrame`:" + "When do we use one hot encoding? One hot encoding is used in either or both of the following cases :\n", + "\n", + "1. When the number of categories and the size of the dataset is smaller.\n", + "2. When the categories follow no particular order." ] }, - { - "cell_type": "code", - "metadata": { - "trusted": false, - "id": "OtYVErEygRsJ" - }, - "source": [ - "example4.fillna(example4.mean())" - ], - "execution_count": null, - "outputs": [] - }, { "cell_type": "markdown", "metadata": { - "id": "zpMvCkLSgRsJ" + "id": "XnUmci_4uvyu" }, "source": [ - "Notice that column 3 is still valueless: the default direction is to fill values row-wise.\n", - "\n", - "> **Takeaway:** There are multiple ways to deal with missing values in your datasets. 
The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets." + "> Key Takeaways:\n", + "1. Encoding is done to convert non-numeric data to numeric data.\n", + "2. There are two types of encoding: label encoding and one hot encoding. The choice between them depends on the number of categories and on whether the categories are ordered." ] }, { @@ -2366,27 +3428,121 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "ZLu6FEnZgRsJ" + "id": "ZLu6FEnZgRsJ", + "outputId": "d62ede23-a8ba-412b-f666-6fc1a43af424", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } }, "source": [ "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ], - "execution_count": null, - "outputs": [] + "execution_count": 72, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
lettersnumbers
0A1
1B2
2A1
3B3
4B3
\n", + "
" + ], + "text/plain": [ + " letters numbers\n", + "0 A 1\n", + "1 B 2\n", + "2 A 1\n", + "3 B 3\n", + "4 B 3" + ] + }, + "metadata": {}, + "execution_count": 72 + } + ] }, { "cell_type": "code", "metadata": { "trusted": false, - "id": "cIduB5oBgRsK" + "id": "cIduB5oBgRsK", + "outputId": "061ff212-4cba-4f49-ae20-a7bde21b54a3", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "source": [ "example6.duplicated()" ], - "execution_count": null, - "outputs": [] + "execution_count": 73, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 True\n", + "3 False\n", + "4 True\n", + "dtype: bool" + ] + }, + "metadata": {}, + "execution_count": 73 + } + ] }, { "cell_type": "markdown", @@ -2402,13 +3558,75 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "w_YPpqIqgRsK" + "id": "w_YPpqIqgRsK", + "outputId": "5081cf87-9e65-493f-c867-c73f3833b775", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 142 + } }, "source": [ "example6.drop_duplicates()" ], - "execution_count": null, - "outputs": [] + "execution_count": 74, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
lettersnumbers
0A1
1B2
3B3
\n", + "
" + ], + "text/plain": [ + " letters numbers\n", + "0 A 1\n", + "1 B 2\n", + "3 B 3" + ] + }, + "metadata": {}, + "execution_count": 74 + } + ] }, { "cell_type": "markdown", @@ -2423,13 +3641,69 @@ "cell_type": "code", "metadata": { "trusted": false, - "id": "BILjDs67gRsK" + "id": "BILjDs67gRsK", + "outputId": "1087142d-5a36-4667-8b70-45824de07d64", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 111 + } }, "source": [ "example6.drop_duplicates(['letters'])" ], - "execution_count": null, - "outputs": [] + "execution_count": 75, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
lettersnumbers
0A1
1B2
\n", + "
" + ], + "text/plain": [ + " letters numbers\n", + "0 A 1\n", + "1 B 2" + ] + }, + "metadata": {}, + "execution_count": 75 + } + ] }, { "cell_type": "markdown", From e30b7208ab3c3259975015b5d53bdcbddb03577d Mon Sep 17 00:00:00 2001 From: Nirmalya Misra <39618712+nirmalya8@users.noreply.github.com> Date: Wed, 6 Oct 2021 01:26:55 +0530 Subject: [PATCH 8/8] Resolving conflicts --- .../08-data-preparation/notebook.ipynb | 302 +++++++++--------- 1 file changed, 151 insertions(+), 151 deletions(-) diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index 71b076e8..ed5b0256 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -79,7 +79,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "70e0d7dd-fb30-45c4-a5af-7dc85cd89342" + "outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9" }, "source": [ "iris_df.shape" @@ -126,7 +126,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "85e6ab39-174f-4dc7-fee6-a18f3ba14a7d" + "outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0" }, "source": [ "iris_df.columns" @@ -174,7 +174,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "2a2bb81a-257c-4410-f826-99402b75ce14" + "outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5" }, "source": [ "iris_df.info()" @@ -230,7 +230,7 @@ "base_uri": "https://localhost:8080/", "height": 297 }, - "outputId": "e5015299-163f-42c7-aaa1-9bc3a67788bf" + "outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289" }, "source": [ "iris_df.describe()" @@ -373,7 +373,7 @@ "base_uri": "https://localhost:8080/", "height": 204 }, - "outputId": "5ff975df-45f0-4efd-f884-2580909c6e67" + "outputId": "d9393ee5-c106-4797-f815-218f17160e00" }, "source": [ "iris_df.head()" @@ -492,7 +492,7 @@ "source": [ "# Hint: Consult the documentation by using iris_df.head?" 
], - "execution_count": null, + "execution_count": 7, "outputs": [] }, { @@ -512,14 +512,14 @@ "id": "heanjfGWgRr2", "colab": { "base_uri": "https://localhost:8080/", - "height": 204 + "height": 0 }, - "outputId": "1726a2e0-82d7-4491-8dbc-637f28a11d26" + "outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3" }, "source": [ "iris_df.tail()" ], - "execution_count": 7, + "execution_count": 8, "outputs": [ { "output_type": "execute_result", @@ -599,7 +599,7 @@ ] }, "metadata": {}, - "execution_count": 7 + "execution_count": 8 } ] }, @@ -657,7 +657,7 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "20e2d43a-2053-4037-c736-8ec2c28b67e5" + "outputId": "92779f18-62f4-4a03-eca2-e9a101604336" }, "source": [ "import numpy as np\n", @@ -665,7 +665,7 @@ "example1 = np.array([2, None, 6, 8])\n", "example1" ], - "execution_count": 8, + "execution_count": 9, "outputs": [ { "output_type": "execute_result", @@ -675,7 +675,7 @@ ] }, "metadata": {}, - "execution_count": 8 + "execution_count": 9 } ] }, @@ -699,12 +699,12 @@ "base_uri": "https://localhost:8080/", "height": 292 }, - "outputId": "ab3b1799-504f-480d-851b-85b19f62d8b7" + "outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50" }, "source": [ "example1.sum()" ], - "execution_count": 9, + "execution_count": 10, "outputs": [ { "output_type": "error", @@ -713,7 +713,7 @@ "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m 
\u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" ] @@ -748,12 +748,12 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "3744a812-6daf-472e-e933-388c722ab2b4" + "outputId": "699e81b7-5c11-4b46-df1d-06071768690f" }, "source": [ "np.nan + 1" ], - "execution_count": 10, + "execution_count": 11, "outputs": [ { "output_type": "execute_result", @@ -763,7 +763,7 @@ ] }, "metadata": {}, - "execution_count": 10 + "execution_count": 11 } ] }, @@ -775,12 +775,12 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "4a304a47-c5a0-4814-92b0-c4b5ab193358" + "outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0" }, "source": [ "np.nan * 0" ], - "execution_count": 11, + "execution_count": 12, "outputs": [ { "output_type": "execute_result", @@ -790,7 +790,7 @@ ] }, "metadata": {}, - "execution_count": 
11 + "execution_count": 12 } ] }, @@ -811,13 +811,13 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "a41b57bf-1c2a-4219-9ee5-0a1a1499e74d" + "outputId": "fa06495a-0930-4867-87c5-6023031ea8b5" }, "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ], - "execution_count": 12, + "execution_count": 13, "outputs": [ { "output_type": "execute_result", @@ -827,7 +827,7 @@ ] }, "metadata": {}, - "execution_count": 12 + "execution_count": 13 } ] }, @@ -850,7 +850,7 @@ "source": [ "# What happens if you add np.nan and None together?\n" ], - "execution_count": 13, + "execution_count": 14, "outputs": [] }, { @@ -881,13 +881,13 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "5f3389e0-4b54-4d6b-a305-a269df869235" + "outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822" }, "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ], - "execution_count": 14, + "execution_count": 15, "outputs": [ { "output_type": "execute_result", @@ -900,7 +900,7 @@ ] }, "metadata": {}, - "execution_count": 14 + "execution_count": 15 } ] }, @@ -925,7 +925,7 @@ "# How does that element show up in the Series?\n", "# What is the dtype of the Series?\n" ], - "execution_count": 15, + "execution_count": 16, "outputs": [] }, { @@ -966,7 +966,7 @@ "source": [ "example3 = pd.Series([0, np.nan, '', None])" ], - "execution_count": 16, + "execution_count": 17, "outputs": [] }, { @@ -977,12 +977,12 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "88a14e60-392a-42ad-d767-a4055580f523" + "outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0" }, "source": [ "example3.isnull()" ], - "execution_count": 17, + "execution_count": 18, "outputs": [ { "output_type": "execute_result", @@ -996,7 +996,7 @@ ] }, "metadata": {}, - "execution_count": 17 + "execution_count": 18 } ] }, @@ -1020,12 +1020,12 @@ "base_uri": "https://localhost:8080/" }, "id": "JCcQVoPkHDUv", - "outputId": 
"042418f0-981b-4c5e-cdf8-c42912f7e4fe" + "outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d" }, "source": [ "example3.isnull().sum()" ], - "execution_count": 18, + "execution_count": 19, "outputs": [ { "output_type": "execute_result", @@ -1035,7 +1035,7 @@ ] }, "metadata": {}, - "execution_count": 18 + "execution_count": 19 } ] }, @@ -1059,7 +1059,7 @@ "# Try running example3[example3.notnull()].\n", "# Before you do so, what do you expect to see?\n" ], - "execution_count": 19, + "execution_count": 20, "outputs": [] }, { @@ -1118,13 +1118,13 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "782b0526-a1bb-4757-ac1f-a16267d9eb4f" + "outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218" }, "source": [ "example3 = example3.dropna()\n", "example3" ], - "execution_count": 20, + "execution_count": 21, "outputs": [ { "output_type": "execute_result", @@ -1136,7 +1136,7 @@ ] }, "metadata": {}, - "execution_count": 20 + "execution_count": 21 } ] }, @@ -1160,7 +1160,7 @@ "base_uri": "https://localhost:8080/", "height": 142 }, - "outputId": "3d19e787-896d-4ba4-8662-811d2e191d3b" + "outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab" }, "source": [ "example4 = pd.DataFrame([[1, np.nan, 7], \n", @@ -1168,7 +1168,7 @@ " [np.nan, 6, 9]])\n", "example4" ], - "execution_count": 21, + "execution_count": 22, "outputs": [ { "output_type": "execute_result", @@ -1228,7 +1228,7 @@ ] }, "metadata": {}, - "execution_count": 21 + "execution_count": 22 } ] }, @@ -1252,12 +1252,12 @@ "base_uri": "https://localhost:8080/", "height": 80 }, - "outputId": "6bdb7658-8a64-401f-d2b2-bd0f8bc17325" + "outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80" }, "source": [ "example4.dropna()" ], - "execution_count": 22, + "execution_count": 23, "outputs": [ { "output_type": "execute_result", @@ -1303,7 +1303,7 @@ ] }, "metadata": {}, - "execution_count": 22 + "execution_count": 23 } ] }, @@ -1325,12 +1325,12 @@ "base_uri": "https://localhost:8080/", "height": 142 }, - "outputId": 
"0071a8bb-9fe5-4ed5-a3af-d0209485515a" + "outputId": "ff4001f3-2e61-4509-d60e-0093d1068437" }, "source": [ "example4.dropna(axis='columns')" ], - "execution_count": 23, + "execution_count": 24, "outputs": [ { "output_type": "execute_result", @@ -1382,7 +1382,7 @@ ] }, "metadata": {}, - "execution_count": 23 + "execution_count": 24 } ] }, @@ -1406,13 +1406,13 @@ "base_uri": "https://localhost:8080/", "height": 142 }, - "outputId": "a26b5362-0d17-49c2-d902-10832f9bf9a0" + "outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1" }, "source": [ "example4[3] = np.nan\n", "example4" ], - "execution_count": 24, + "execution_count": 25, "outputs": [ { "output_type": "execute_result", @@ -1476,7 +1476,7 @@ ] }, "metadata": {}, - "execution_count": 24 + "execution_count": 25 } ] }, @@ -1513,7 +1513,7 @@ "# How might you go about dropping just column 3?\n", "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ], - "execution_count": 25, + "execution_count": 26, "outputs": [] }, { @@ -1534,12 +1534,12 @@ "base_uri": "https://localhost:8080/", "height": 80 }, - "outputId": "ee2d3a60-a694-4a11-ef37-28d00a8d956c" + "outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2" }, "source": [ "example4.dropna(axis='rows', thresh=3)" ], - "execution_count": 26, + "execution_count": 27, "outputs": [ { "output_type": "execute_result", @@ -1587,7 +1587,7 @@ ] }, "metadata": {}, - "execution_count": 26 + "execution_count": 27 } ] }, @@ -1636,7 +1636,7 @@ "height": 204 }, "id": "MY5faq4yLdpQ", - "outputId": "49350e22-4ee9-43c1-9d6c-e5f837b24ae8" + "outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a" }, "source": [ "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", @@ -1647,7 +1647,7 @@ "\n", "fill_with_mode" ], - "execution_count": 27, + "execution_count": 28, "outputs": [ { "output_type": "execute_result", @@ -1721,7 +1721,7 @@ ] }, "metadata": {}, - "execution_count": 27 + "execution_count": 28 } ] }, @@ -1741,12 +1741,12 @@ "base_uri": 
"https://localhost:8080/" }, "id": "WKy-9Y2tN5jv", - "outputId": "d0c045f2-218c-45aa-951c-f3feed98510a" + "outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf" }, "source": [ "fill_with_mode[2].value_counts()" ], - "execution_count": 28, + "execution_count": 29, "outputs": [ { "output_type": "execute_result", @@ -1758,7 +1758,7 @@ ] }, "metadata": {}, - "execution_count": 28 + "execution_count": 29 } ] }, @@ -1779,7 +1779,7 @@ "source": [ "fill_with_mode[2].fillna('True',inplace=True)" ], - "execution_count": 29, + "execution_count": 30, "outputs": [] }, { @@ -1790,12 +1790,12 @@ "height": 204 }, "id": "tvas7c9_OPWE", - "outputId": "c45890f5-8c76-4a3c-87f0-b831c2199750" + "outputId": "ec3c8e44-d644-475e-9e22-c65101965850" }, "source": [ "fill_with_mode" ], - "execution_count": 30, + "execution_count": 31, "outputs": [ { "output_type": "execute_result", @@ -1869,7 +1869,7 @@ ] }, "metadata": {}, - "execution_count": 30 + "execution_count": 31 } ] }, @@ -1909,7 +1909,7 @@ "height": 204 }, "id": "09HM_2feOj5Y", - "outputId": "44330273-5709-4af9-99c7-7a3a8e28c7b0" + "outputId": "7e309013-9acb-411c-9b06-4de795bbeeff" }, "source": [ "fill_with_mean = pd.DataFrame([[-2,0,1],\n", @@ -1920,7 +1920,7 @@ "\n", "fill_with_mean" ], - "execution_count": 31, + "execution_count": 32, "outputs": [ { "output_type": "execute_result", @@ -1994,7 +1994,7 @@ ] }, "metadata": {}, - "execution_count": 31 + "execution_count": 32 } ] }, @@ -2014,12 +2014,12 @@ "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "7240075c-c3a7-4ac3-e08d-be6d60573d38" + "outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70" }, "source": [ "np.mean(fill_with_mean[0])" ], - "execution_count": 32, + "execution_count": 33, "outputs": [ { "output_type": "execute_result", @@ -2029,7 +2029,7 @@ ] }, "metadata": {}, - "execution_count": 32 + "execution_count": 33 } ] }, @@ -2050,13 +2050,13 @@ "base_uri": "https://localhost:8080/", "height": 204 }, - "outputId": "733bfa87-b099-4c11-db2e-1dea88b977ac" + 
"outputId": "00f74fff-01f4-4024-c261-796f50f01d2e" }, "source": [ "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n", "fill_with_mean" ], - "execution_count": 33, + "execution_count": 34, "outputs": [ { "output_type": "execute_result", @@ -2130,7 +2130,7 @@ ] }, "metadata": {}, - "execution_count": 33 + "execution_count": 34 } ] }, @@ -2156,11 +2156,11 @@ "cell_type": "code", "metadata": { "id": "DA59Bqo3jBYZ", - "outputId": "4338adf5-081c-46ce-aca1-85bcaebf9838", "colab": { "base_uri": "https://localhost:8080/", "height": 204 - } + }, + "outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32" }, "source": [ "fill_with_median = pd.DataFrame([[-2,0,1],\n", @@ -2171,7 +2171,7 @@ "\n", "fill_with_median" ], - "execution_count": 39, + "execution_count": 35, "outputs": [ { "output_type": "execute_result", @@ -2245,7 +2245,7 @@ ] }, "metadata": {}, - "execution_count": 39 + "execution_count": 35 } ] }, @@ -2262,15 +2262,15 @@ "cell_type": "code", "metadata": { "id": "uiDy5v3xjHHX", - "outputId": "2028aa4b-8bec-4b76-ea2f-fcaa7b362e9d", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "564b6b74-2004-4486-90d4-b39330a64b88" }, "source": [ "fill_with_median[1].median()" ], - "execution_count": 40, + "execution_count": 36, "outputs": [ { "output_type": "execute_result", @@ -2280,7 +2280,7 @@ ] }, "metadata": {}, - "execution_count": 40 + "execution_count": 36 } ] }, @@ -2297,17 +2297,17 @@ "cell_type": "code", "metadata": { "id": "lFKbOxCMkBbg", - "outputId": "61bf2b0e-c68d-4b54-9724-f496c8c2ea94", "colab": { "base_uri": "https://localhost:8080/", "height": 204 - } + }, + "outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4" }, "source": [ "fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n", "fill_with_median" ], - "execution_count": 41, + "execution_count": 37, "outputs": [ { "output_type": "execute_result", @@ -2381,7 +2381,7 @@ ] }, "metadata": {}, - "execution_count": 41 + "execution_count": 37 } ] }, @@ -2399,16 
+2399,16 @@ "metadata": { "trusted": false, "id": "0ybtWLDdgRsG", - "outputId": "ee2e547a-bf98-40a5-ddc4-b11357efb898", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d" }, "source": [ "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "example5" ], - "execution_count": 42, + "execution_count": 38, "outputs": [ { "output_type": "execute_result", @@ -2423,7 +2423,7 @@ ] }, "metadata": {}, - "execution_count": 42 + "execution_count": 38 } ] }, @@ -2441,15 +2441,15 @@ "metadata": { "trusted": false, "id": "KXMIPsQdgRsH", - "outputId": "f88a0095-9742-4f1e-fdf4-43fc14cbc4c0", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d" }, "source": [ "example5.fillna(0)" ], - "execution_count": 43, + "execution_count": 39, "outputs": [ { "output_type": "execute_result", @@ -2464,7 +2464,7 @@ ] }, "metadata": {}, - "execution_count": 43 + "execution_count": 39 } ] }, @@ -2500,7 +2500,7 @@ "source": [ "# What happens if you try to fill null values with a string, like ''?\n" ], - "execution_count": 44, + "execution_count": 40, "outputs": [] }, { @@ -2517,15 +2517,15 @@ "metadata": { "trusted": false, "id": "vO3BuNrggRsI", - "outputId": "aff7d7de-20b9-42bf-fe06-932677314b37", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196" }, "source": [ "example5.fillna(method='ffill')" ], - "execution_count": 45, + "execution_count": 41, "outputs": [ { "output_type": "execute_result", @@ -2540,7 +2540,7 @@ ] }, "metadata": {}, - "execution_count": 45 + "execution_count": 41 } ] }, @@ -2558,15 +2558,15 @@ "metadata": { "trusted": false, "id": "4M5onHcEgRsI", - "outputId": "c20c283d-76d7-4f75-c443-5c55fbdb3541", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe" }, "source": [ "example5.fillna(method='bfill')" ], - "execution_count": 46, + 
"execution_count": 42, "outputs": [ { "output_type": "execute_result", @@ -2581,7 +2581,7 @@ ] }, "metadata": {}, - "execution_count": 46 + "execution_count": 42 } ] }, @@ -2600,16 +2600,16 @@ "metadata": { "trusted": false, "id": "aRpIvo4ZgRsI", - "outputId": "ea9c5e3d-a23d-4314-cff4-e5a0e46043d1", "colab": { "base_uri": "https://localhost:8080/", "height": 142 - } + }, + "outputId": "905a980a-a808-4eca-d0ba-224bd7d85955" }, "source": [ "example4" ], - "execution_count": 47, + "execution_count": 43, "outputs": [ { "output_type": "execute_result", @@ -2673,7 +2673,7 @@ ] }, "metadata": {}, - "execution_count": 47 + "execution_count": 43 } ] }, @@ -2682,16 +2682,16 @@ "metadata": { "trusted": false, "id": "VM1qtACAgRsI", - "outputId": "2cd3360a-ac87-41fb-d362-9d8c981f573f", "colab": { "base_uri": "https://localhost:8080/", "height": 142 - } + }, + "outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade" }, "source": [ "example4.fillna(method='ffill', axis=1)" ], - "execution_count": 48, + "execution_count": 44, "outputs": [ { "output_type": "execute_result", @@ -2755,7 +2755,7 @@ ] }, "metadata": {}, - "execution_count": 48 + "execution_count": 44 } ] }, @@ -2789,7 +2789,7 @@ "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" ], - "execution_count": 49, + "execution_count": 45, "outputs": [] }, { @@ -2806,16 +2806,16 @@ "metadata": { "trusted": false, "id": "OtYVErEygRsJ", - "outputId": "ad5f4520-cf88-4e3e-fa16-54bda5efa417", "colab": { "base_uri": "https://localhost:8080/", "height": 142 - } + }, + "outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73" }, "source": [ "example4.fillna(example4.mean())" ], - "execution_count": 50, + "execution_count": 46, "outputs": [ { "output_type": "execute_result", @@ -2879,7 +2879,7 @@ ] }, "metadata": {}, - "execution_count": 50 + "execution_count": 46 } ] }, @@ -2923,11 +2923,11 @@ "cell_type": 
"code", "metadata": { "id": "1vGz7uZyoWHL", - "outputId": "5003c8cd-ff07-4399-a5b2-621b45184511", "colab": { "base_uri": "https://localhost:8080/", "height": 235 - } + }, + "outputId": "9e252855-d193-4103-a54d-028ea7787b34" }, "source": [ "label = pd.DataFrame([\n", @@ -2940,7 +2940,7 @@ "],columns=['ID','class'])\n", "label" ], - "execution_count": 70, + "execution_count": 47, "outputs": [ { "output_type": "execute_result", @@ -3014,7 +3014,7 @@ ] }, "metadata": {}, - "execution_count": 70 + "execution_count": 47 } ] }, @@ -3031,18 +3031,18 @@ "cell_type": "code", "metadata": { "id": "ZC5URJG3o1ES", - "outputId": "c75465b2-169e-417c-8769-680aaf1cd268", "colab": { "base_uri": "https://localhost:8080/", "height": 235 - } + }, + "outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437" }, "source": [ "class_labels = {'business class':0,'economy class':1,'first class':2}\n", "label['class'] = label['class'].replace(class_labels)\n", "label" ], - "execution_count": 71, + "execution_count": 48, "outputs": [ { "output_type": "execute_result", @@ -3116,7 +3116,7 @@ ] }, "metadata": {}, - "execution_count": 71 + "execution_count": 48 } ] }, @@ -3148,11 +3148,11 @@ "cell_type": "code", "metadata": { "id": "ZM0eVh0ArKUL", - "outputId": "cba4258f-a6c3-45e0-dd69-32b73b2cd735", "colab": { "base_uri": "https://localhost:8080/", "height": 235 - } + }, + "outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b" }, "source": [ "one_hot = pd.DataFrame([\n", @@ -3165,7 +3165,7 @@ "],columns=['ID','class'])\n", "one_hot" ], - "execution_count": 67, + "execution_count": 49, "outputs": [ { "output_type": "execute_result", @@ -3239,7 +3239,7 @@ ] }, "metadata": {}, - "execution_count": 67 + "execution_count": 49 } ] }, @@ -3260,23 +3260,23 @@ "source": [ "one_hot_data = pd.get_dummies(one_hot,columns=['class'])" ], - "execution_count": 68, + "execution_count": 50, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "TM37pHsFr4ge", - "outputId": "4f9cdbec-5ea6-4613-b14f-5b8b66b85894", "colab": 
{ "base_uri": "https://localhost:8080/", "height": 235 - } + }, + "outputId": "7be15f53-79b2-447a-979c-822658339a9e" }, "source": [ "one_hot_data" ], - "execution_count": 69, + "execution_count": 51, "outputs": [ { "output_type": "execute_result", @@ -3364,7 +3364,7 @@ ] }, "metadata": {}, - "execution_count": 69 + "execution_count": 51 } ] }, @@ -3429,18 +3429,18 @@ "metadata": { "trusted": false, "id": "ZLu6FEnZgRsJ", - "outputId": "d62ede23-a8ba-412b-f666-6fc1a43af424", "colab": { "base_uri": "https://localhost:8080/", "height": 204 - } + }, + "outputId": "376512d1-d842-4db1-aea3-71052aeeecaf" }, "source": [ "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ], - "execution_count": 72, + "execution_count": 52, "outputs": [ { "output_type": "execute_result", @@ -3508,7 +3508,7 @@ ] }, "metadata": {}, - "execution_count": 72 + "execution_count": 52 } ] }, @@ -3517,15 +3517,15 @@ "metadata": { "trusted": false, "id": "cIduB5oBgRsK", - "outputId": "061ff212-4cba-4f49-ae20-a7bde21b54a3", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2" }, "source": [ "example6.duplicated()" ], - "execution_count": 73, + "execution_count": 53, "outputs": [ { "output_type": "execute_result", @@ -3540,7 +3540,7 @@ ] }, "metadata": {}, - "execution_count": 73 + "execution_count": 53 } ] }, @@ -3559,16 +3559,16 @@ "metadata": { "trusted": false, "id": "w_YPpqIqgRsK", - "outputId": "5081cf87-9e65-493f-c867-c73f3833b775", "colab": { "base_uri": "https://localhost:8080/", "height": 142 - } + }, + "outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea" }, "source": [ "example6.drop_duplicates()" ], - "execution_count": 74, + "execution_count": 54, "outputs": [ { "output_type": "execute_result", @@ -3624,7 +3624,7 @@ ] }, "metadata": {}, - "execution_count": 74 + "execution_count": 54 } ] }, @@ -3642,16 +3642,16 @@ "metadata": { "trusted": false, "id": "BILjDs67gRsK", - 
"outputId": "1087142d-5a36-4667-8b70-45824de07d64", "colab": { "base_uri": "https://localhost:8080/", "height": 111 - } + }, + "outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3" }, "source": [ "example6.drop_duplicates(['letters'])" ], - "execution_count": 75, + "execution_count": 55, "outputs": [ { "output_type": "execute_result", @@ -3701,7 +3701,7 @@ ] }, "metadata": {}, - "execution_count": 75 + "execution_count": 55 } ] },