The Data Science Lifecycle: Analyzing

 Sketchnote by (@sketchthedocs)
Data Science Lifecycle: Analyzing - Sketchnote by @nitya

Pre-Lecture Quiz

The "Analyzing" phase in the data lifecycle ensures that the data can address the questions posed or solve a specific problem. This step may also involve verifying that a model is effectively tackling these questions and issues. This lesson focuses on Exploratory Data Analysis (EDA), which includes techniques for identifying features and relationships within the data and preparing it for modeling.

We'll use an example dataset from Kaggle to demonstrate how this can be done using Python and the Pandas library. The dataset contains counts of common words found in emails, with the sources of these emails anonymized. Use the notebook in this directory to follow along.

Exploratory Data Analysis

The "Capture" phase of the lifecycle involves acquiring data as well as defining the problems and questions at hand. But how can we confirm that the data will support the desired outcomes? A data scientist might ask the following questions when acquiring data:

  • Do I have enough data to solve this problem?
  • Is the data of sufficient quality for this problem?
  • If new insights emerge from this data, should we consider revising or redefining the goals?

Exploratory Data Analysis is the process of familiarizing yourself with the data. It can help answer these questions and identify challenges associated with the dataset. Let's explore some techniques used to achieve this.

Data Profiling, Descriptive Statistics, and Pandas

How can we determine if we have enough data to solve the problem? Data profiling provides a summary and general overview of the dataset using descriptive statistics techniques. Data profiling helps us understand what is available, while descriptive statistics help us understand how much is available.

In previous lessons, we used Pandas to generate descriptive statistics with the describe() method. For numerical columns it reports the count, mean, standard deviation, minimum and maximum values, and quartiles. Reviewing this summary can help you evaluate whether you have sufficient data or need more.
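As a minimal sketch of profiling plus descriptive statistics, the snippet below builds a small made-up table of word counts. The column names and values are hypothetical and are not taken from the Kaggle dataset. It then inspects the table's shape, column types, and summary.

```python
import pandas as pd

# Hypothetical toy data: each row is one email, each column a word count.
emails = pd.DataFrame(
    {
        "the": [3, 0, 7, 2, 5],
        "free": [0, 4, 1, 0, 2],
        "meeting": [1, 0, 0, 2, 1],
    }
)

# Profiling: how many rows and columns there are, and what type each column holds.
n_rows, n_cols = emails.shape
print(f"{n_rows} rows, {n_cols} columns")
print(emails.dtypes)

# Descriptive statistics: count, mean, std, min, quartiles, and max per column.
summary = emails.describe()
print(summary)
```

A count that is lower than the number of rows for some column is an early hint that values are missing there.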

Sampling and Querying

Exploring every detail in a large dataset can be time-consuming and is often left to computers. However, sampling is a useful technique for gaining a better understanding of the data and what it represents. With a sample, you can apply probability and statistics to draw general conclusions about the dataset. While there's no strict rule for how much data to sample, it's important to note that larger random samples generally lead to more accurate generalizations about the data.

Pandas includes the sample() method, which returns a specified number (or fraction) of randomly selected rows.
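A short sketch of random sampling, assuming a made-up table with a single hypothetical column named "free". The random_state argument is optional and is set here only so that the draw is repeatable.

```python
import pandas as pd

# Hypothetical toy data: 100 emails with a made-up count for the word "free".
emails = pd.DataFrame({"free": [i % 5 for i in range(100)]})

# Draw 10 random rows; random_state makes the draw repeatable.
subset = emails.sample(n=10, random_state=0)

# Alternatively, draw a fraction of the rows (here 20%).
fraction = emails.sample(frac=0.2, random_state=0)

# Compare a statistic on the sample with the same statistic on the full data.
print("sample mean:", subset["free"].mean())
print("full mean:  ", emails["free"].mean())
```

Rerunning with a different n shows how the sample mean tends to move closer to the full mean as the sample grows.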

General querying of the data can help answer specific questions or test theories you may have. Unlike sampling, queries allow you to focus on particular parts of the dataset that are relevant to your questions. The query() method in the Pandas library filters rows using a boolean expression that refers to columns by name, so you can retrieve just the rows that answer a specific question about the data.
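The sketch below shows one way to phrase a question as a query. The table, its column names (email_id, free, meeting), and the thresholds are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: word counts for five emails.
emails = pd.DataFrame(
    {
        "email_id": [1, 2, 3, 4, 5],
        "free": [0, 4, 1, 0, 2],
        "meeting": [1, 0, 0, 2, 1],
    }
)

# Keep only rows where "free" appears at least twice and "meeting" never appears.
suspicious = emails.query("free >= 2 and meeting == 0")

# Local Python variables can be referenced inside the expression with an @ prefix.
threshold = 1
frequent = emails.query("free > @threshold")

# To narrow down the columns as well, index the filtered result.
print(suspicious[["email_id", "free"]])
print(frequent)
```

The same filters can be written with boolean indexing, for example emails[emails["free"] > threshold]; query() is mainly a more readable way to express them.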

Exploring with Visualizations

You don't need to wait until the data is fully cleaned and analyzed to start creating visualizations. In fact, visual representations during exploration can help identify patterns, relationships, and issues within the data. Additionally, visualizations provide a way to communicate findings to stakeholders who may not be directly involved in data management. This can also be an opportunity to address additional questions that weren't considered during the "Capture" phase. Refer to the section on Visualizations to learn more about popular methods for visual exploration.
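As one possible illustration, a quick histogram can reveal an unusual value before any cleaning is done. This sketch assumes Matplotlib is installed, which this lesson does not state, and uses a made-up column named "free".

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: counts of the word "free" in ten emails.
emails = pd.DataFrame({"free": [0, 4, 1, 0, 2, 0, 1, 9, 0, 3]})

# A histogram of raw, uncleaned values: most emails sit near zero,
# while the single value of 9 stands apart and may deserve a closer look.
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(emails["free"], bins=5)
ax.set_xlabel('occurrences of "free" per email')
ax.set_ylabel("number of emails")

print(counts)  # number of emails that fall into each bin
```

In a notebook, the backend line can be dropped and the figure renders inline; in a script, fig.savefig() writes it to a file.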

Exploring to Identify Inconsistencies

The techniques covered in this lesson can help identify missing or inconsistent values, but Pandas also provides specific functions for this purpose. The isna() method, or its alias isnull(), can be used to check for missing values. An important aspect of exploring these values is understanding why they are missing in the first place. This insight can guide you in deciding what actions to take to address them.
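A minimal sketch of checking for missing values, again on a made-up table with hypothetical column names. It counts the gaps per column and then pulls out the affected rows so you can look for a reason behind them.

```python
import pandas as pd

nan = float("nan")  # stand-in for a missing value

# Hypothetical toy data with a few gaps.
emails = pd.DataFrame(
    {
        "free": [0, nan, 1, 0, 2],
        "meeting": [1, 0, nan, nan, 1],
    }
)

# isna() returns a same-shaped table of booleans; summing counts the True values per column.
missing_per_column = emails.isna().sum()
print(missing_per_column)

# Rows with at least one missing value, for a closer look at why they are missing.
incomplete_rows = emails[emails.isna().any(axis=1)]
print(incomplete_rows)
```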

Post-Lecture Quiz

Assignment

Exploring for answers


Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.