Working with Data: Data Preparation
Data Preparation - Sketchnote by @nitya
Pre-Lecture Quiz
Depending on its source, raw data may contain inconsistencies that make analysis and modeling harder. In other words, this data can be categorized as “dirty” and needs to be cleaned up. This lesson focuses on techniques for cleaning and transforming data to handle missing, inaccurate, or incomplete values. The topics in this lesson use Python and the Pandas library and are demonstrated in the notebook in this directory.
The importance of cleaning data
- Ease of use and reuse: When data is properly organized and normalized, it’s easier to search, use, and share with others.
- Consistency: Data science often requires working with more than one dataset, where datasets from different sources need to be joined together. Making sure that each individual dataset follows a common standard ensures that the data is still useful when the datasets are merged into one.
- Model accuracy: Data that has been cleaned improves the accuracy of models that rely on it.
Common cleaning goals and strategies
- Exploring a dataset: Data exploration, which is covered in a later lesson, can help you discover data that needs to be cleaned up. Visually observing values within a dataset can set expectations of what the rest of it will look like, or give an idea of the problems that need to be resolved. Exploration can involve basic querying, visualizations, and sampling.
- Formatting: Depending on the source, data can have inconsistencies in how it’s presented. This can cause problems when searching for and representing a value: it appears in the dataset but is not properly represented in visualizations or query results. Common formatting problems involve whitespace, dates, and data types. Resolving formatting issues is typically up to the people who are using the data. For example, standards for how dates and numbers are presented can differ by country.
- Duplications: Data that has more than one occurrence can produce inaccurate results and usually should be removed. This is a common occurrence when joining two or more datasets together. However, there are instances where duplicates in joined datasets contain pieces that provide additional information and may need to be preserved.
- Missing Data: Missing data can cause inaccuracies as well as weak or biased results. Sometimes these can be resolved by a "reload" of the data, by filling in the missing values with computation and code such as Python, or simply by removing the value and its corresponding data. There are numerous reasons why data may be missing, and the actions taken to resolve missing values can depend on how and why they went missing in the first place.
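The strategies above can be sketched with a few Pandas calls. This is a minimal illustration rather than the notebook's own code: the DataFrame, its column names (`name`, `signup`, `score`), and its values are invented for the example, and choosing the column mean as the fill value is only one of several reasonable options.

```python
import pandas as pd

# Hypothetical toy data: the columns and values are invented for illustration.
raw = pd.DataFrame(
    {
        "name": [" Ana", "Ben ", "Ben ", "Cara", "Dev"],
        "signup": ["2021-01-05", "2021-02-10", "2021-02-10", "not recorded", "2021-03-15"],
        "score": ["10", "20", "20", None, "40"],
    }
)

# Exploring: look at the first rows and count missing values per column.
print(raw.head())
missing_before = raw.isnull().sum()

# Formatting: strip stray whitespace and convert text to proper types.
# errors="coerce" turns values that cannot be parsed into NaT / NaN
# instead of raising an exception.
clean = raw.copy()
clean["name"] = clean["name"].str.strip()
clean["signup"] = pd.to_datetime(clean["signup"], format="%Y-%m-%d", errors="coerce")
clean["score"] = pd.to_numeric(clean["score"], errors="coerce")

# Duplications: drop rows that exactly repeat an earlier row.
deduped = clean.drop_duplicates().reset_index(drop=True)

# Missing data, option 1: fill missing scores with the column mean.
filled = deduped.copy()
filled["score"] = filled["score"].fillna(filled["score"].mean())

# Missing data, option 2: drop every row that still has a missing value.
dropped = deduped.dropna()

print(deduped)
print(filled)
print(dropped)
```

Whether to fill or drop depends on why the values are missing. Filling keeps the row but introduces an estimated value, while dropping discards the row along with whatever valid data it held.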
🚀 Challenge
Give the exercises in the notebook a try!
Post-Lecture Quiz
Review & Self Study
There are many ways to discover and approach preparing your data for analysis and modeling. Cleaning the data is an important step that is best learned "hands on". Try these challenges from Kaggle to explore techniques that this lesson didn't cover.