beginning data preparation lesson

pull/117/head
Jasmine 3 years ago
parent 0063017de7
commit ea4b9c40cc

@ -8,6 +8,15 @@
[Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/14)
Depending on its source, raw data may contain inconsistencies that cause challenges in analysis and modeling. In other words, this data can be categorized as “dirty” and will need to be cleaned up. This lesson focuses on techniques for cleaning and transforming data to handle missing, inaccurate, or incomplete values. The topics covered in this lesson use Python and the Pandas library and are demonstrated in the notebook within this directory.
## The importance of cleaning data
- Ease of use and reuse: When data is properly organized and normalized, it's easier to search, use, and share with others.
- Consistency: Data science often requires working with more than one dataset, where datasets from different sources need to be joined together. Making sure that each individual dataset follows a common standard ensures that the data is still useful when it is all merged into one dataset.
- Model accuracy: Data that has been cleaned improves the accuracy of models that rely on it.
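The consistency point above can be sketched with Pandas. This is a minimal illustration, not the lesson's own notebook code: the two DataFrames, their column names, and the `standardize` helper are all hypothetical, and it assumes the only differences between the sources are column-name casing, stray whitespace, and capitalization.

```python
import pandas as pd

# Two hypothetical sources that record the same cities with different conventions.
sales = pd.DataFrame({"City": [" seattle", "Boston ", "AUSTIN"], "sales": [120, 95, 143]})
offices = pd.DataFrame({"city": ["Seattle", "Boston", "Austin"], "staff": [30, 12, 18]})


def standardize(df):
    """Return a copy with lower-case column names and a trimmed, title-cased city column."""
    out = df.rename(columns=str.lower)
    out["city"] = out["city"].str.strip().str.title()
    return out


# After standardizing both sides, the join key matches on every row.
merged = standardize(sales).merge(standardize(offices), on="city", how="inner")
print(merged)
```

Without the standardization step, the key columns would not line up (`City` versus `city`), and even with matching names the values `" seattle"` and `"Seattle"` would not be treated as equal.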
## 🚀 Challenge

@ -10,7 +10,7 @@
The analysis stage of the data lifecycle confirms that the data can answer the proposed questions or solve a particular problem. This step can also focus on confirming that a model is correctly addressing these questions and problems. This lesson focuses on Exploratory Data Analysis (EDA), a set of techniques for defining features and relationships within the data that can also be used to prepare the data for modeling.
We'll be using an example dataset from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv/version/1) to show how this can be applied with Python's Pandas library. This dataset contains counts of some common words found in emails; the sources of these emails are anonymous. Use the [notebook](notebook.ipynb) in this directory to follow along.
## Exploratory Data Analysis
@ -30,26 +30,18 @@ In the few of the lessons, we have used Pandas to provide some descriptive stati
Exploring everything in a large dataset can be very time consuming and is a task that's usually left to a computer. However, sampling is a helpful tool for understanding what's in the dataset and what it represents. With a sample, you can apply probability and statistics to come to some general conclusions about your data. While there's no defined rule on how much data you should sample, it's important to note that the more data you sample, the more precise a generalization you can make about the data.
Pandas has the [`sample()` function in its library](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), where you can pass an argument for how many random samples you'd like to receive and use.
[example]
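As a minimal sketch of sampling, the block below draws random rows from a small, made-up DataFrame. The `emails` frame and its columns are hypothetical stand-ins for the word-count dataset, not its real contents; `n`, `frac`, and `random_state` are parameters of `DataFrame.sample()`.

```python
import pandas as pd

# Hypothetical stand-in for the email dataset: word counts per email.
emails = pd.DataFrame({
    "email_id": range(1, 11),
    "the": [3, 0, 5, 1, 2, 8, 0, 4, 6, 1],
    "free": [0, 2, 0, 0, 1, 0, 3, 0, 0, 0],
})

# Draw 4 rows at random; random_state makes the draw repeatable.
subset = emails.sample(n=4, random_state=0)

# Alternatively, sample a fraction of the rows instead of a fixed count.
fraction = emails.sample(frac=0.5, random_state=0)

# Compare a statistic of the sample with the same statistic of the full data.
print(subset)
print("sample mean:", subset["the"].mean(), "full mean:", emails["the"].mean())
```

By default the rows are drawn without replacement, so no row appears twice in a sample.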
General querying of the data can help you answer questions and test theories you may have. In contrast to sampling, queries give you control and let you focus on the specific parts of the data you have questions about.
The [`query()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library lets you filter the data with a boolean expression over its columns. The expression is evaluated as `True` or `False` for each row, and only the rows where it is `True` are returned.
[example]
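A minimal sketch of querying follows, again on a hypothetical `emails` frame rather than the real dataset. The expression passed to `query()` is evaluated per row as true or false, and the rows where it is true are returned; the `@` prefix refers to a Python variable from the surrounding code.

```python
import pandas as pd

# Hypothetical word counts per email.
emails = pd.DataFrame({
    "email_id": [1, 2, 3, 4, 5],
    "free": [0, 2, 0, 3, 1],
    "meeting": [4, 0, 1, 0, 0],
})

# Keep only the rows where the whole expression is true.
no_meeting = emails.query("free >= 1 and meeting == 0")

# Reference a Python variable inside the expression with @.
threshold = 2
frequent = emails.query("free >= @threshold")

print(no_meeting)
print(frequent)
```

The same filter can be written with boolean indexing, for example `emails[(emails["free"] >= 1) & (emails["meeting"] == 0)]`; `query()` is mainly a more readable way to express it.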
## Exploring with Visualizations
You don't have to wait until the data is thoroughly cleaned and analyzed to start creating visualizations. In fact, having a visual representation while exploring can help identify patterns, relationships, and problems in the data. Furthermore, visualizations provide a means of communication with those who are not involved with managing the data and can be an opportunity to share and clarify additional questions that were not addressed in the capture stage. Refer to the [section on Visualizations](3-Data-Visualization) to learn more about some popular ways to explore visually.
[example]
## Exploring to identify inconsistencies
All the topics in this lesson can help identify missing or inconsistent values, but Pandas also provides functions to check for some of these. [`isna()` or `isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) can check for missing values. An important part of exploring these values is finding out why they ended up that way in the first place. This can help you decide on what [actions to take to resolve them]().
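As a minimal sketch of checking for missing values, the block below uses `isna()` on a hypothetical frame with a few gaps; the data is made up for illustration. `isnull()` is an alias of `isna()` and gives the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical word counts with some values missing.
emails = pd.DataFrame({
    "email_id": [1, 2, 3, 4],
    "the": [3, np.nan, 5, 1],
    "free": [0, 2, np.nan, np.nan],
})

# Boolean frame with the same shape as emails: True where a value is missing.
missing_mask = emails.isna()

# Number of missing values in each column.
missing_per_column = missing_mask.sum()

# Rows that have at least one missing value, for closer inspection.
rows_with_missing = emails[missing_mask.any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Looking at the affected rows, rather than only the counts, is often what reveals why the values are missing.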
## 🚀 Challenge
## Post-Lecture Quiz
[Post-lecture quiz]()
