notebook and lesson updates

4 years ago · b9033ac80b
parent 1fdfd7fb50
commit b9033ac80b
3 changed files with 128 additions and 10 deletions
--- a/4-Data-Science-Lifecycle/15-analyzing/README.md
+++ b/4-Data-Science-Lifecycle/15-analyzing/README.md
@@ -10,7 +10,7 @@

 Analyzing in the data lifecycle confirms that the data can answer the proposed questions or solve a particular problem. This step can also focus on confirming that a model correctly addresses these questions and problems. This lesson focuses on Exploratory Data Analysis (EDA), a set of techniques for defining features and relationships within the data that can also be used to prepare the data for modeling. 

- We'll be using an example dataset from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv/version/1) to show how this can be applied with Python's Pandas library. This dataset contains a count of some common words found in emails, the sources of these emails are anonymous. Use the [notebook](notebook.ipynb) in this directory to follow along.
+ We'll be using an example dataset from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv/version/1) to show how this can be applied with Python and the Pandas library. This dataset contains counts of some common words found in emails; the sources of these emails are anonymous. Use the [notebook](notebook.ipynb) in this directory to follow along.

 ## Exploratory Data Analysis

@@ -31,13 +31,13 @@ Exploring everything in a large dataset can be very time consuming and a task th
 Pandas has the [`sample()` function in its library]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) where you can pass an argument of how many random samples you’d like to receive and use. 
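
As a rough illustration of the sampling described above, the sketch below draws random rows from a small made-up DataFrame. The `email_df` here is a hypothetical stand-in for the lesson's dataset, and `random_state` is only set so the draw is repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the email dataset: word counts per email.
email_df = pd.DataFrame({
    "Email No.": [f"Email {i}" for i in range(1, 21)],
    "the": [i % 5 for i in range(20)],
    "to": [(i * 3) % 7 for i in range(20)],
})

# Draw 5 random rows; random_state makes the draw repeatable.
subset = email_df.sample(n=5, random_state=0)

# Alternatively, draw a fraction of the rows (here 25% of 20 rows).
quarter = email_df.sample(frac=0.25, random_state=0)

print(subset)
```

Passing `n` asks for a fixed number of rows, while `frac` asks for a share of the DataFrame; only one of the two can be given in a single call.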

 General querying of the data can help you answer some general questions and theories you may have. In contrast to sampling, queries allow you to have control and focus on specific parts of the data you have questions about. 
-The [`query() `function]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library allows you to select the columns and receive simple answers about the data in the form of `True` or `False`.
+The [`query()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library lets you filter rows with a boolean expression over the columns and answer simple questions about the data from the rows returned.
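
A minimal sketch of the kind of query the notebook runs, using a tiny made-up DataFrame in place of the real dataset (the values and the `threshold` variable are assumptions for illustration only). It also shows the equivalent boolean-mask form for comparison.

```python
import pandas as pd

# Hypothetical stand-in for the email dataset.
email_df = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
    "the": [0, 8, 4, 2],
    "to": [0, 13, 5, 1],
})

# Keep only the rows where "to" occurs more often than "the".
more_to = email_df.query("the < to")

# The same filter written as a boolean mask.
same_rows = email_df[email_df["the"] < email_df["to"]]

# Local Python variables can be referenced inside the expression with "@".
threshold = 3
frequent_the = email_df.query("the > @threshold")

print(more_to)
```

The result is a DataFrame containing the matching rows with all of their columns, so follow-up questions (how many rows matched, what their other counts look like) can be answered directly from it.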

 ## Exploring with Visualizations
 You don’t have to wait until the data is thoroughly cleaned and analyzed to start creating visualizations. In fact, having a visual representation while exploring can help identify patterns, relationships, and problems in the data. Furthermore, visualizations provide a means of communication with those who are not involved with managing the data and can be an opportunity to share and clarify additional questions that were not addressed in the capture stage. Refer to the [section on Visualizations](3-Data-Visualization) to learn more about some popular ways to explore visually.

 ## Exploring to identify inconsistencies
-All the topics in this lesson can help identify missing or inconsistent values, but Pandas provides functions to check for some of these. [isna() or isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) can check for missing values. One important piece of exploring for these values within your data is to explore why they ended up that way in the first place. This can help you decide on what [actions to take to resolve them]().
+All the topics in this lesson can help identify missing or inconsistent values, but Pandas provides functions to check for some of these. [isna() or isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) can check for missing values. An important part of exploring these values is finding out why they ended up that way in the first place. This can help you decide on what [actions to take to resolve them](2-Working-With-Data/08-data-preparation/notebook.ipynb).
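
One possible way to check for missing values is sketched below. The DataFrame and its gaps are invented for illustration; the real dataset may not contain any missing values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few deliberately missing counts.
email_df = pd.DataFrame({
    "the": [0, 8, np.nan, 2],
    "to": [0, np.nan, np.nan, 1],
})

# Boolean DataFrame: True wherever a value is missing.
missing_mask = email_df.isna()

# Number of missing values in each column.
missing_per_column = missing_mask.sum()

# Rows that contain at least one missing value.
rows_with_missing = email_df[missing_mask.any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

`isnull()` is an alias of `isna()`, so either name gives the same mask. Looking at the affected rows, rather than only the counts, is often a useful starting point for working out why the values are missing.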


 ## 🚀 Challenge
--- a/4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb
+++ b/4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb
@@ -2,12 +2,15 @@
 "cells": [
  {
   "cell_type": "markdown",
-   "source": [],
+   "source": [
+    "# Analyzing Data\r\n",
+    "Examples of the Pandas functions mentioned in the [lesson](README.md)."
+   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
   "source": [
    "import pandas as pd\r\n",
    "import glob\r\n",
@@ -21,17 +24,132 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
   "source": [
    "# Using Describe on the email dataset\r\n",
    "print(email_df.describe())"
   ],
-   "outputs": [],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "              the          to         ect         and         for          of  \\\n",
+      "count  406.000000  406.000000  406.000000  406.000000  406.000000  406.000000   \n",
+      "mean     7.022167    6.519704    4.948276    3.059113    3.502463    2.662562   \n",
+      "std     10.945522    9.801907    9.293820    6.267806    4.901372    5.443939   \n",
+      "min      0.000000    0.000000    1.000000    0.000000    0.000000    0.000000   \n",
+      "25%      1.000000    1.000000    1.000000    0.000000    1.000000    0.000000   \n",
+      "50%      3.000000    3.000000    2.000000    1.000000    2.000000    1.000000   \n",
+      "75%      9.000000    7.750000    4.000000    3.000000    4.750000    3.000000   \n",
+      "max     99.000000   88.000000   79.000000   69.000000   39.000000   57.000000   \n",
+      "\n",
+      "                a         you          in          on          is        this  \\\n",
+      "count  406.000000  406.000000  406.000000  406.000000  406.000000  406.000000   \n",
+      "mean    57.017241    2.394089   10.817734   11.591133    5.901478    1.485222   \n",
+      "std     78.868243    4.067015   19.050972   16.407175    8.793103    2.912473   \n",
+      "min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   \n",
+      "25%     15.000000    0.000000    1.250000    3.000000    1.000000    0.000000   \n",
+      "50%     29.000000    1.000000    5.000000    6.000000    3.000000    0.000000   \n",
+      "75%     61.000000    3.000000   12.000000   13.000000    7.000000    2.000000   \n",
+      "max    843.000000   31.000000  223.000000  125.000000   61.000000   24.000000   \n",
+      "\n",
+      "                i          be        that        will  \n",
+      "count  406.000000  406.000000  406.000000  406.000000  \n",
+      "mean    47.155172    2.950739    1.034483    0.955665  \n",
+      "std     71.043009    4.297865    1.904846    2.042271  \n",
+      "min      0.000000    0.000000    0.000000    0.000000  \n",
+      "25%     11.000000    1.000000    0.000000    0.000000  \n",
+      "50%     24.000000    1.000000    0.000000    0.000000  \n",
+      "75%     50.750000    3.000000    1.000000    1.000000  \n",
+      "max    754.000000   40.000000   14.000000   24.000000  \n"
+     ]
+    }
+   ],
   "metadata": {}
  },
  {
-   "cell_type": "markdown",
-   "source": [],
+   "cell_type": "code",
+   "execution_count": 5,
+   "source": [
+    "# Sampling 10 emails\r\n",
+    "print(email_df.sample(10))"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "      Email No.  the  to  ect  and  for  of    a  you  in  on  is  this    i  \\\n",
+      "150   Email 151    0   1    2    0    3   0   15    0   0   5   0     0    7   \n",
+      "380  Email 5147    0   3    2    0    0   0    7    0   1   1   0     0    3   \n",
+      "19     Email 20    3   4   11    0    4   2   32    1   1   3   9     5   25   \n",
+      "300   Email 301    2   1    1    0    1   1   15    2   2   3   2     0    8   \n",
+      "307   Email 308    0   0    1    0    0   0    1    0   1   0   0     0    2   \n",
+      "167   Email 168    2   2    2    1    5   1   24    2   5   6   4     0   30   \n",
+      "320   Email 321   10  12    4    6    8   6  187    5  26  28  23     2  171   \n",
+      "61     Email 62    0   1    1    0    4   1   15    4   4   3   3     0   19   \n",
+      "26     Email 27    5   4    1    1    4   4   51    0   8   6   6     2   44   \n",
+      "73     Email 74    0   0    1    0    0   0    7    0   4   3   0     0    6   \n",
+      "\n",
+      "     be  that  will  \n",
+      "150   1     0     0  \n",
+      "380   0     0     0  \n",
+      "19    3     0     1  \n",
+      "300   0     0     0  \n",
+      "307   0     0     0  \n",
+      "167   2     0     0  \n",
+      "320   5     1     1  \n",
+      "61    2     0     0  \n",
+      "26    6     0     0  \n",
+      "73    0     0     0  \n"
+     ]
+    }
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "source": [
+    "# Returns rows where there are more occurrences of \"to\" than \"the\"\r\n",
+    "print(email_df.query('the < to'))"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "      Email No.  the  to  ect  and  for  of    a  you  in  on  is  this   i  \\\n",
+      "1       Email 2    8  13   24    6    6   2  102    1  18  21  13     0  61   \n",
+      "3       Email 4    0   5   22    0    5   1   51    2   1   5   9     2  16   \n",
+      "5       Email 6    4   5    1    4    2   3   45    1  16  12   8     1  52   \n",
+      "7       Email 8    0   2    2    3    1   2   21    6   2   6   2     0  28   \n",
+      "13     Email 14    4   5    7    1    5   1   37    1   8   8   6     1  43   \n",
+      "..          ...  ...  ..  ...  ...  ...  ..  ...  ...  ..  ..  ..   ...  ..   \n",
+      "390  Email 5157    4  13    1    0    3   1   48    2   8  26   9     1  45   \n",
+      "393  Email 5160    2  13    1    0    2   1   38    2   7  24   6     1  34   \n",
+      "396  Email 5163    2   3    1    2    1   2   32    0   7   3   2     0  26   \n",
+      "404  Email 5171    2   7    1    0    2   1   28    2   8  11   7     1  39   \n",
+      "405  Email 5172   22  24    5    1    6   5  148    8  23  13   5     4  99   \n",
+      "\n",
+      "     be  that  will  \n",
+      "1     4     2     0  \n",
+      "3     2     0     0  \n",
+      "5     2     0     0  \n",
+      "7     1     0     1  \n",
+      "13    1     0     1  \n",
+      "..   ..   ...   ...  \n",
+      "390   1     0     0  \n",
+      "393   1     0     0  \n",
+      "396   3     0     0  \n",
+      "404   1     0     0  \n",
+      "405   6     4     1  \n",
+      "\n",
+      "[169 rows x 17 columns]\n"
+     ]
+    }
+   ],
   "metadata": {}
  }
 ],