From c89a0fc36358116b867e7bfedb59f1fbf3210956 Mon Sep 17 00:00:00 2001
From: INDRASHIS PAUL
Date: Mon, 4 Oct 2021 22:34:07 +0530
Subject: [PATCH 1/7] Enhance README with notebook content

---
 .../08-data-preparation/README.md | 103 ++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/2-Working-With-Data/08-data-preparation/README.md b/2-Working-With-Data/08-data-preparation/README.md
index e1de8260..5bb6c3d8 100644
--- a/2-Working-With-Data/08-data-preparation/README.md
+++ b/2-Working-With-Data/08-data-preparation/README.md
@@ -28,6 +28,109 @@ Depending on its source, raw data may contain some inconsistencies that will cau

- **Missing Data**: Missing data can cause inaccuracies as well as weak or biased results. Sometimes these can be resolved by a "reload" of the data, by filling in the missing values computationally with code in a language like Python, or simply by removing the value and its corresponding data. There are numerous reasons why data may be missing, and the actions taken to resolve missing values can depend on how and why the data went missing in the first place.

## Exploring DataFrame information
> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.

Once you have loaded your data into pandas, it will more likely than not be in a DataFrame (refer to the previous [lesson](https://github.com/IndraP24/Data-Science-For-Beginners/tree/main/2-Working-With-Data/07-python#dataframe) for detailed overview). However, if the data set in your DataFrame has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? Fortunately, [pandas](https://pandas.pydata.org/) provides some convenient tools to quickly look at overall information about a DataFrame, in addition to the first few and last few rows.

In order to explore this functionality, we will import the Python scikit-learn library and use an iconic dataset: the **Iris data set**.
```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
```
| |sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|
|----------------------------------------|-----------------|----------------|-----------------|----------------|
|0 |5.1 |3.5 |1.4 |0.2 |
|1 |4.9 |3.0 |1.4 |0.2 |
|2 |4.7 |3.2 |1.3 |0.2 |
|3 |4.6 |3.1 |1.5 |0.2 |
|4 |5.0 |3.6 |1.4 |0.2 |

- **DataFrame.info**: To start off, the `info()` method is used to print a summary of the content present in a `DataFrame`. Let's take a look at this dataset to see what we have:
```python
iris_df.info()
```
```
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB
```
From this, we know that the *Iris* dataset has 150 entries in four columns, with no null entries. All of the data is stored as 64-bit floating-point numbers.

- **DataFrame.head()**: Next, to check the actual content of the `DataFrame`, we use the `head()` method.
Let's see what the first few rows of our `iris_df` look like:
```python
iris_df.head()
```
```
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
```
- **DataFrame.tail()**: Conversely, to check the last few rows of the `DataFrame`, we use the `tail()` method:
```python
iris_df.tail()
```
```
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8
```
> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with.

## Dealing with Missing Data
> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.

Most of the time, the datasets you want to use (or have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.

Pandas handles missing values in two ways. The first you've seen before in previous sections: `NaN`, or Not a Number. This is actually a special value that is part of the IEEE floating-point specification, and it is only used to indicate missing floating-point values.

For missing values apart from floats, pandas uses the Python `None` object. While it might seem confusing that you will encounter two different kinds of values that say essentially the same thing, there are sound programmatic reasons for this design choice and, in practice, going this route enables pandas to deliver a good compromise for the vast majority of cases. Notwithstanding this, both `None` and `NaN` carry restrictions that you need to be mindful of with regard to how they can be used.

Check out more about `NaN` and `None` in the [notebook](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb)!

- **Detecting null values**: In `pandas`, the `isnull()` and `notnull()` methods are your primary methods for detecting null data. Both return Boolean masks over your data. We will be using `numpy` for `NaN` values:
```python
import numpy as np

example1 = pd.Series([0, np.nan, '', None])
example1.isnull()
```
```
0    False
1     True
2    False
3     True
dtype: bool
```
Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's nevertheless a perfectly good integer, and pandas treats it as such. `''` is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.

Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values, as the short example below shows.

> **Takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data.
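For example, here is a minimal sketch of that masking pattern, reusing the `example1` Series defined above (the outputs shown are what pandas typically prints for this data):
```python
# Index with the isnull() mask to keep only the missing entries
example1[example1.isnull()]
```
```
1     NaN
3    None
dtype: object
```
```python
# Index with the notnull() mask to keep only the non-null entries
example1[example1.notnull()]
```
```
0    0
2     
dtype: object
```
Note how indexing with the mask returns both the surviving values and their original index labels, which is what makes this pattern handy for isolating problem rows.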

- **Dropping null values**: Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than to deal with them in other ways.) To see this in action, let's return to `example1`:
```python
example1 = example1.dropna()
example1
```


## 🚀 Challenge

From 52452e59ea521a9c208d143d7ce047c7e33ad362 Mon Sep 17 00:00:00 2001
From: INDRASHIS PAUL
Date: Tue, 5 Oct 2021 11:07:14 +0530
Subject: [PATCH 2/7] Add the contents from the notebook to README

1. Copied almost all content (both code and explanations) from the notebook to the README in a proper format.
2. Left out extra portions and exercises in the notebook for the readers to try out.

---
 .../08-data-preparation/README.md | 187 +++++++++++++++++-
 1 file changed, 184 insertions(+), 3 deletions(-)

diff --git a/2-Working-With-Data/08-data-preparation/README.md b/2-Working-With-Data/08-data-preparation/README.md
index 5bb6c3d8..58c2e528 100644
--- a/2-Working-With-Data/08-data-preparation/README.md
+++ b/2-Working-With-Data/08-data-preparation/README.md
@@ -33,7 +33,8 @@ Depending on its source, raw data may contain some inconsistencies that will cau

Once you have loaded your data into pandas, it will more likely than not be in a DataFrame (refer to the previous [lesson](https://github.com/IndraP24/Data-Science-For-Beginners/tree/main/2-Working-With-Data/07-python#dataframe) for detailed overview). However, if the data set in your DataFrame has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? Fortunately, [pandas](https://pandas.pydata.org/) provides some convenient tools to quickly look at overall information about a DataFrame in addition to the first few and last few rows.

In order to explore this functionality, we will import the Python scikit-learn library and use an iconic dataset: the **Iris data set**.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
```
@@ -122,19 +123,199 @@ Look closely at the output. Does any of it surprise you? While `0` is an arithme

Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values.

> **Takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data.

- **Dropping null values**: Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than to deal with them in other ways.)
To see this in action, let's return to `example1`:
```python
example1 = example1.dropna()
example1
```
```
0    0
2     
dtype: object
```
Note that this should look like your output from `example1[example1.notnull()]`. The difference here is that, rather than just indexing on the masked values, `dropna` has removed those missing values from the `Series` `example1`.

Because `DataFrame`s have two dimensions, they afford more options for dropping data.

```python
example2 = pd.DataFrame([[1, np.nan, 7],
                         [2, 5, 8],
                         [np.nan, 6, 9]])
example2
```
| | 0 | 1 | 2 |
|------|---|---|---|
|0 |1.0|NaN|7 |
|1 |2.0|5.0|8 |
|2 |NaN|6.0|9 |

(Did you notice that pandas upcast two of the columns to floats to accommodate the `NaN`s?)

You cannot drop a single value from a `DataFrame`, so you have to drop full rows or columns. Depending on what you are doing, you might want to do one or the other, and so pandas gives you options for both. Because in data science, columns generally represent variables and rows represent observations, you are more likely to drop rows of data; the default setting for `dropna()` is to drop all rows that contain any null values:

```python
example2.dropna()
```
```
     0    1  2
1  2.0  5.0  8
```
If necessary, you can drop NA values from columns. Use `axis=1` (or, equivalently, `axis='columns'`) to do so:
```python
example2.dropna(axis='columns')
```
```
   2
0  7
1  8
2  9
```
Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You can specify those settings in `dropna` with the `how` and `thresh` parameters.

By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example2.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action.

```python
example2[3] = np.nan
example2
```
| |0 |1 |2 |3 |
|------|---|---|---|---|
|0 |1.0|NaN|7 |NaN|
|1 |2.0|5.0|8 |NaN|
|2 |NaN|6.0|9 |NaN|

The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept:
```python
example2.dropna(axis='rows', thresh=3)
```
```
     0    1  2   3
1  2.0  5.0  8 NaN
```
Here, the first and last rows have been dropped because they contain only two non-null values.

- **Filling null values**: Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use the `isnull` mask to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with a value of your choosing. Let's create another example `Series` to see how this works in practice.
```python
example3 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example3
```
```
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
```
You can fill all of the null entries with a single value, such as `0`:
```python
example3.fillna(0)
```
```
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
```
You can **forward-fill** null values, which is to use the last valid value to fill a null:
```python
example3.fillna(method='ffill')
```
```
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
```
You can also **back-fill** to propagate the next valid value backward to fill a null:
```python
example3.fillna(method='bfill')
```
```
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
```
As you might guess, this works the same with `DataFrame`s, but you can also specify an `axis` along which to fill null values. Taking the previously used `example2` again:
```python
example2.fillna(method='ffill', axis=1)
```
```
     0    1    2    3
0  1.0  1.0  7.0  7.0
1  2.0  5.0  8.0  8.0
2  NaN  6.0  9.0  9.0
```
Notice that when a previous value is not available for forward-filling, the null value remains.

> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.

## Removing duplicate data

> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.

In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, `pandas` provides an easy means of detecting and removing duplicate entries.

- **Identifying duplicates: `duplicated`**: You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an earlier one. Let's create another example `DataFrame` to see this in action.
```python
example4 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})
example4
```
| |letters|numbers|
|------|-------|-------|
|0 |A |1 |
|1 |B |2 |
|2 |A |1 |
|3 |B |3 |
|4 |B |3 |

```python
example4.duplicated()
```
```
0    False
1    False
2     True
3    False
4     True
dtype: bool
```
- **Dropping duplicates: `drop_duplicates`**: `drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are `False`:
```python
example4.drop_duplicates()
```
```
  letters  numbers
0       A        1
1       B        2
3       B        3
```
Both `duplicated` and `drop_duplicates` default to considering all columns, but you can specify that they examine only a subset of columns in your `DataFrame`:
```python
example4.drop_duplicates(['letters'])
```
```
  letters  numbers
0       A        1
1       B        2
```
> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you inaccurate results!

## 🚀 Challenge

-Give the exercises in the [notebook](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb) a try!
+All of the discussed materials are provided as a [Jupyter Notebook](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb).
Additionally, there are exercises present after each section; give them a try!

## [Post-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/15)

From 083760cdf7416cafbd7dfb56f74c0aeadfb3e6d0 Mon Sep 17 00:00:00 2001
From: IndraP24
Date: Tue, 5 Oct 2021 11:13:08 +0530
Subject: [PATCH 3/7] Update the notebook with outputs to be used as reference for the README

---
 .../08-data-preparation/notebook.ipynb | 1220 +++++++++++++++--
 1 file changed, 1114 insertions(+), 106 deletions(-)

diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb
index e45a5cb5..93c37c35 100644
--- a/2-Working-With-Data/08-data-preparation/notebook.ipynb
+++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb
@@ -3,28 +3,28 @@
 {
  "cell_type": "markdown",
  "source": [
   "# Data Preparation\n",
   "\n",
   "[Original Notebook source from *Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n",
   "\n",
   "## Exploring `DataFrame` information\n",
   "\n",
   "> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.\n",
   "\n",
   "Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. 
\n", + "\n", + "In order to explore our `DataFramme`, we will import the Python `scikit-learn` library and use an iconic dataset that every data scientist has seen hundreds of times: British biologist Ronald Fisher's **Iris data set** used in his 1936 paper \"*The use of multiple measurements in taxonomic problems*\":" ], "metadata": {} }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "source": [ - "import pandas as pd\r\n", - "from sklearn.datasets import load_iris\r\n", - "\r\n", - "iris = load_iris()\r\n", + "import pandas as pd\n", + "from sklearn.datasets import load_iris\n", + "\n", + "iris = load_iris()\n", "iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])" ], "outputs": [], @@ -43,11 +43,29 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "source": [ "iris_df.info()" ], - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "RangeIndex: 150 entries, 0 to 149\n", + "Data columns (total 4 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 sepal length (cm) 150 non-null float64\n", + " 1 sepal width (cm) 150 non-null float64\n", + " 2 petal length (cm) 150 non-null float64\n", + " 3 petal width (cm) 150 non-null float64\n", + "dtypes: float64(4)\n", + "memory usage: 4.8 KB\n" + ] + } + ], "metadata": { "trusted": false } @@ -69,11 +87,92 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "source": [ "iris_df.head()" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", + "0 5.1 3.5 1.4 0.2\n", + "1 4.9 3.0 1.4 0.2\n", + "2 4.7 3.2 1.3 0.2\n", + "3 4.6 3.1 1.5 0.2\n", + "4 5.0 3.6 1.4 0.2" + ] + }, + "metadata": {}, + "execution_count": 3 + } + ], "metadata": { "trusted": false } @@ -89,7 +188,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "source": [ "# Hint: Consult the documentation by using iris_df.head?" ], @@ -109,11 +208,92 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "source": [ "iris_df.tail()" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
1456.73.05.22.3
1466.32.55.01.9
1476.53.05.22.0
1486.23.45.42.3
1495.93.05.11.8
\n", + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", + "145 6.7 3.0 5.2 2.3\n", + "146 6.3 2.5 5.0 1.9\n", + "147 6.5 3.0 5.2 2.0\n", + "148 6.2 3.4 5.4 2.3\n", + "149 5.9 3.0 5.1 1.8" + ] + }, + "metadata": {}, + "execution_count": 5 + } + ], "metadata": { "trusted": false } @@ -154,14 +334,25 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "source": [ - "import numpy as np\r\n", - "\r\n", - "example1 = np.array([2, None, 6, 8])\r\n", + "import numpy as np\n", + "\n", + "example1 = np.array([2, None, 6, 8])\n", "example1" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([2, None, 6, 8], dtype=object)" + ] + }, + "metadata": {}, + "execution_count": 6 + } + ], "metadata": { "trusted": false } @@ -177,11 +368,24 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "source": [ "example1.sum()" ], - "outputs": [], + "outputs": [ + { + "output_type": "error", + "ename": "TypeError", + "evalue": "unsupported operand type(s) for +: 'int' and 'NoneType'", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/anaconda3/lib/python3.8/site-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", + "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" + ] + } + ], "metadata": { "trusted": false } @@ -204,22 +408,44 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "source": [ "np.nan + 1" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "nan" + ] + }, + "metadata": {}, + "execution_count": 8 + } + ], "metadata": { "trusted": false } }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "source": [ "np.nan * 0" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "nan" + ] + }, + "metadata": {}, + "execution_count": 9 + } + ], "metadata": { "trusted": false } @@ -233,12 +459,23 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "source": [ - "example2 = np.array([2, np.nan, 6, 8]) \r\n", + "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), 
example2.min(), example2.max()" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(nan, nan, nan)" + ] + }, + "metadata": {}, + "execution_count": 10 + } + ], "metadata": { "trusted": false } @@ -252,9 +489,9 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "source": [ - "# What happens if you add np.nan and None together?\r\n" + "# What happens if you add np.nan and None together?\n" ], "outputs": [], "metadata": { @@ -280,12 +517,26 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "source": [ - "int_series = pd.Series([1, 2, 3], dtype=int)\r\n", + "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 1\n", + "1 2\n", + "2 3\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 12 + } + ], "metadata": { "trusted": false } @@ -299,11 +550,11 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "source": [ - "# Now set an element of int_series equal to None.\r\n", - "# How does that element show up in the Series?\r\n", - "# What is the dtype of the Series?\r\n" + "# Now set an element of int_series equal to None.\n", + "# How does that element show up in the Series?\n", + "# What is the dtype of the Series?\n" ], "outputs": [], "metadata": { @@ -335,7 +586,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "source": [ "example3 = pd.Series([0, np.nan, '', None])" ], @@ -347,11 +598,26 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "source": [ "example3.isnull()" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 False\n", + "1 True\n", + "2 False\n", + "3 True\n", + "dtype: bool" + ] + }, + "metadata": {}, + "execution_count": 15 + } + ], "metadata": { "trusted": false } @@ -374,10 +640,10 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 16, "source": [ - "# Try running example3[example3.notnull()].\r\n", - "# Before you do so, what do you expect to see?\r\n" + "# Try running example3[example3.notnull()].\n", + "# Before you do so, what do you expect to see?\n" ], "outputs": [], "metadata": { @@ -403,12 +669,25 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 17, "source": [ - "example3 = example3.dropna()\r\n", + "example3 = example3.dropna()\n", "example3" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 0\n", + "2 \n", + "dtype: object" + ] + }, + "metadata": {}, + "execution_count": 17 + } + ], "metadata": { "trusted": false } @@ -424,14 +703,75 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 18, "source": [ - "example4 = pd.DataFrame([[1, np.nan, 7], \r\n", - " [2, 5, 8], \r\n", - " [np.nan, 6, 9]])\r\n", + "example4 = pd.DataFrame([[1, np.nan, 7], \n", + " [2, 5, 8], \n", + " [np.nan, 6, 9]])\n", "example4" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
01.0NaN7
12.05.08
2NaN6.09
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 1.0 NaN 7\n", + "1 2.0 5.0 8\n", + "2 NaN 6.0 9" + ] + }, + "metadata": {}, + "execution_count": 18 + } + ], "metadata": { "trusted": false } @@ -447,11 +787,58 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "source": [ "example4.dropna()" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012
12.05.08
\n", + "
" + ], + "text/plain": [ + " 0 1 2\n", + "1 2.0 5.0 8" + ] + }, + "metadata": {}, + "execution_count": 19 + } + ], "metadata": { "trusted": false } @@ -465,11 +852,64 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "source": [ "example4.dropna(axis='columns')" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
2
07
18
29
\n", + "
" + ], + "text/plain": [ + " 2\n", + "0 7\n", + "1 8\n", + "2 9" + ] + }, + "metadata": {}, + "execution_count": 20 + } + ], "metadata": { "trusted": false } @@ -485,12 +925,77 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 21, "source": [ - "example4[3] = np.nan\r\n", + "example4[3] = np.nan\n", "example4" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 1.0 NaN 7 NaN\n", + "1 2.0 5.0 8 NaN\n", + "2 NaN 6.0 9 NaN" + ] + }, + "metadata": {}, + "execution_count": 21 + } + ], "metadata": { "trusted": false } @@ -504,10 +1009,10 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 22, "source": [ - "# How might you go about dropping just column 3?\r\n", - "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\r\n" + "# How might you go about dropping just column 3?\n", + "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ], "outputs": [], "metadata": { @@ -524,11 +1029,60 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 23, "source": [ "example4.dropna(axis='rows', thresh=3)" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
12.05.08NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "1 2.0 5.0 8 NaN" + ] + }, + "metadata": {}, + "execution_count": 23 + } + ], "metadata": { "trusted": false } @@ -551,12 +1105,28 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 24, "source": [ - "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\r\n", + "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "example5" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "a 1.0\n", + "b NaN\n", + "c 2.0\n", + "d NaN\n", + "e 3.0\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 24 + } + ], "metadata": { "trusted": false } @@ -570,11 +1140,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "source": [ "example5.fillna(0)" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "a 1.0\n", + "b 0.0\n", + "c 2.0\n", + "d 0.0\n", + "e 3.0\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 25 + } + ], "metadata": { "trusted": false } @@ -588,9 +1174,9 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 26, "source": [ - "# What happens if you try to fill null values with a string, like ''?\r\n" + "# What happens if you try to fill null values with a string, like ''?\n" ], "outputs": [], "metadata": { @@ -607,11 +1193,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 27, "source": [ "example5.fillna(method='ffill')" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "a 1.0\n", + "b 1.0\n", + "c 2.0\n", + "d 2.0\n", + "e 3.0\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 27 + } + ], "metadata": { "trusted": false } @@ -625,11 +1227,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 28, "source": [ "example5.fillna(method='bfill')" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "a 1.0\n", + "b 2.0\n", + "c 2.0\n", + "d 3.0\n", + "e 3.0\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 28 + } + ], "metadata": { "trusted": false } @@ -645,22 +1263,152 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 29, "source": [ "example4" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
01.0NaN7NaN
12.05.08NaN
2NaN6.09NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 1.0 NaN 7 NaN\n", + "1 2.0 5.0 8 NaN\n", + "2 NaN 6.0 9 NaN" + ] + }, + "metadata": {}, + "execution_count": 29 + } + ], "metadata": { "trusted": false } }, { "cell_type": "code", - "execution_count": null, + "execution_count": 30, "source": [ "example4.fillna(method='ffill', axis=1)" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
01.01.07.07.0
12.05.08.08.0
2NaN6.09.09.0
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 1.0 1.0 7.0 7.0\n", + "1 2.0 5.0 8.0 8.0\n", + "2 NaN 6.0 9.0 9.0" + ] + }, + "metadata": {}, + "execution_count": 30 + } + ], "metadata": { "trusted": false } @@ -681,11 +1429,11 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 31, "source": [ - "# What output does example4.fillna(method='bfill', axis=1) produce?\r\n", - "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\r\n", - "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\r\n" + "# What output does example4.fillna(method='bfill', axis=1) produce?\n", + "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", + "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" ], "outputs": [], "metadata": { @@ -702,11 +1450,76 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 32, "source": [ "example4.fillna(example4.mean())" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
01.05.57NaN
12.05.08NaN
21.56.09NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 1.0 5.5 7 NaN\n", + "1 2.0 5.0 8 NaN\n", + "2 1.5 6.0 9 NaN" + ] + }, + "metadata": {}, + "execution_count": 32 + } + ], "metadata": { "trusted": false } @@ -742,24 +1555,109 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 33, "source": [ - "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\r\n", - " 'numbers': [1, 2, 1, 3, 3]})\r\n", + "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", + " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
lettersnumbers
0A1
1B2
2A1
3B3
4B3
\n", + "
" + ], + "text/plain": [ + " letters numbers\n", + "0 A 1\n", + "1 B 2\n", + "2 A 1\n", + "3 B 3\n", + "4 B 3" + ] + }, + "metadata": {}, + "execution_count": 33 + } + ], "metadata": { "trusted": false } }, { "cell_type": "code", - "execution_count": null, + "execution_count": 34, "source": [ "example6.duplicated()" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 True\n", + "3 False\n", + "4 True\n", + "dtype: bool" + ] + }, + "metadata": {}, + "execution_count": 34 + } + ], "metadata": { "trusted": false } @@ -774,11 +1672,68 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 35, "source": [ "example6.drop_duplicates()" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
lettersnumbers
0A1
1B2
3B3
\n", + "
" + ], + "text/plain": [ + " letters numbers\n", + "0 A 1\n", + "1 B 2\n", + "3 B 3" + ] + }, + "metadata": {}, + "execution_count": 35 + } + ], "metadata": { "trusted": false } @@ -792,11 +1747,62 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 36, "source": [ "example6.drop_duplicates(['letters'])" ], - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
lettersnumbers
0A1
1B2
\n", + "
" + ], + "text/plain": [ + " letters numbers\n", + "0 A 1\n", + "1 B 2" + ] + }, + "metadata": {}, + "execution_count": 36 + } + ], "metadata": { "trusted": false } @@ -813,20 +1819,22 @@ "anaconda-cloud": {}, "kernelspec": { "name": "python3", - "display_name": "Python 3", - "language": "python" + "display_name": "Python 3.8.8 64-bit ('base': conda)" }, "language_info": { "mimetype": "text/x-python", "nbconvert_exporter": "python", "name": "python", "file_extension": ".py", - "version": "3.5.4", + "version": "3.8.8", "pygments_lexer": "ipython3", "codemirror_mode": { "version": 3, "name": "ipython" } + }, + "interpreter": { + "hash": "ac36fb7022a775f2750f61e1a6104d2d5a9eb3fb9bd004b80f1c771537b93945" } }, "nbformat": 4, From 7f53507c79cdec5feecbc935147e7890da117eb1 Mon Sep 17 00:00:00 2001 From: Lateefah Bello <2019cinnamon@gmail.com> Date: Tue, 5 Oct 2021 12:02:03 +0100 Subject: [PATCH 4/7] fixed spelling error --- 1-Introduction/01-defining-data-science/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/1-Introduction/01-defining-data-science/README.md b/1-Introduction/01-defining-data-science/README.md index ccfe6ef7..aa92a9c9 100644 --- a/1-Introduction/01-defining-data-science/README.md +++ b/1-Introduction/01-defining-data-science/README.md @@ -33,7 +33,7 @@ This definition highlights the following important aspects of data science: > Another important aspect of Data Science is that it studies how data can be gathered, stored and operated upon using computers. While statistics gives us mathematical foundations, data science applies mathematical concepts to actually draw insights from data. One of the ways (attributed to [Jim Gray](https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist))) to look at the data science is to consider it to be a separate paradigm of science: -* **Empyrical**, in which we rely mostly on observations and results of experiments +* **Empirical**, in which we rely mostly on observations and results of experiments * **Theoretical**, where new concepts emerge from existing scientific knowledge * **Computational**, where we discover new principles based on some computational experiments * **Data-Driven**, based on discovering relationships and patterns in the data From 1ac3f9c104ef7d6c945ea3277d3e13e730dfb035 Mon Sep 17 00:00:00 2001 From: INDRASHIS PAUL Date: Tue, 5 Oct 2021 19:18:47 +0530 Subject: [PATCH 5/7] Update previous lesson link --- 2-Working-With-Data/08-data-preparation/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/2-Working-With-Data/08-data-preparation/README.md b/2-Working-With-Data/08-data-preparation/README.md index 58c2e528..29534354 100644 --- a/2-Working-With-Data/08-data-preparation/README.md +++ b/2-Working-With-Data/08-data-preparation/README.md @@ -31,7 +31,7 @@ Depending on its source, raw data may contain some inconsistencies that will cau ## Exploring DataFrame information > **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames. -Once you have loaded your data into pandas, it will more likely than not be in a DataFrame(refer to the previous [lesson](https://github.com/IndraP24/Data-Science-For-Beginners/tree/main/2-Working-With-Data/07-python#dataframe) for detailed overview). However, if the data set in your DataFrame has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? 
Fortunately, [pandas](https://pandas.pydata.org/) provides some convenient tools to quickly look at overall information about a DataFrame in addition to the first few and last few rows.

In order to explore this functionality, we will import the Python scikit-learn library and use an iconic dataset: the **Iris data set**.

From 8481565c0a0415b676c000bf5362f05c5a76781c Mon Sep 17 00:00:00 2001
From: Dmitri Soshnikov
Date: Wed, 6 Oct 2021 18:46:28 +0300
Subject: [PATCH 6/7] Correct the link to Intro to DS video

---
 1-Introduction/01-defining-data-science/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/1-Introduction/01-defining-data-science/README.md b/1-Introduction/01-defining-data-science/README.md
index aa92a9c9..bedbb1e7 100644
--- a/1-Introduction/01-defining-data-science/README.md
+++ b/1-Introduction/01-defining-data-science/README.md
@@ -6,7 +6,7 @@

---

[![Defining Data Science Video](images/video-def-ds.png)](https://youtu.be/beZ7Mb_oz9I)

## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/0)

From 9f414d6975a8d8ecca59ae67c4706b7408999deb Mon Sep 17 00:00:00 2001
From: Dmitri Soshnikov
Date: Wed, 6 Oct 2021 18:48:12 +0300
Subject: [PATCH 7/7] Correct link to intro to DS video on home page

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 7b2a9e8e..4843d1f8 100644
--- a/README.md
+++ b/README.md
@@ -64,7 +64,7 @@ In addition, a low-stakes quiz before a class sets the intention of the student

| Lesson Number | Topic | Lesson Grouping | Learning Objectives | Linked Lesson | Author |
| :-----------: | :----------------------------------------: | :--------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------: | :----: |
| 01 | Defining Data Science | [Introduction](1-Introduction/README.md) | Learn the basic concepts behind data science and how it's related to artificial intelligence, machine learning, and big data. | [lesson](1-Introduction/01-defining-data-science/README.md) [video](https://youtu.be/beZ7Mb_oz9I) | [Dmitry](http://soshnikov.com) |
| 02 | Data Science Ethics | [Introduction](1-Introduction/README.md) | Data Ethics Concepts, Challenges & Frameworks. 
| [lesson](1-Introduction/02-ethics/README.md) | [Nitya](https://twitter.com/nitya) | | 03 | Defining Data | [Introduction](1-Introduction/README.md) | How data is classified and its common sources. | [lesson](1-Introduction/03-defining-data/README.md) | [Jasmine](https://www.twitter.com/paladique) | | 04 | Introduction to Statistics & Probability | [Introduction](1-Introduction/README.md) | The mathematical techniques of probability and statistics to understand data. | [lesson](1-Introduction/04-stats-and-probability/README.md) [video](https://youtu.be/Z5Zy85g4Yjw) | [Dmitry](http://soshnikov.com) |