The Data Science Lifecycle: Analyzing

Data Science Lifecycle: Analyzing - Sketchnote by @nitya

Pre-Lecture Quiz

The "Analyzing" phase in the data lifecycle ensures that the data can answer the proposed questions or solve a specific problem. This step may also involve verifying that a model is effectively addressing these questions and problems. This lesson focuses on Exploratory Data Analysis (EDA), which includes techniques for identifying features and relationships within the data and preparing it for modeling.

We'll use an example dataset from Kaggle to demonstrate how this can be done using Python and the Pandas library. This dataset contains counts of common words found in emails, with the sources of these emails anonymized. Use the notebook in this directory to follow along.

Exploratory Data Analysis

The "Capture" phase of the lifecycle involves acquiring data and defining the problems and questions at hand. But how can we confirm that the data will support the desired outcomes?
Remember, a data scientist might ask the following questions when acquiring data:

  • Do I have enough data to solve this problem?
  • Is the data of sufficient quality for this problem?
  • If I uncover new insights from this data, should we consider revising or redefining the goals?

Exploratory Data Analysis is the process of familiarizing yourself with the data. It can help answer these questions and identify challenges associated with the dataset. Let's explore some techniques used to achieve this.

Data Profiling, Descriptive Statistics, and Pandas

How can we determine if we have enough data to solve the problem? Data profiling summarizes and gathers general information about the dataset using descriptive statistics. Data profiling helps us understand what is available, while descriptive statistics help us understand how much is available.

In previous lessons, we used Pandas to generate descriptive statistics with the describe() function. For numerical columns, this function reports the count of non-missing values, the minimum and maximum, the mean, the standard deviation, and the quartiles. Reading this summary alongside the number of rows and columns can help you evaluate whether you have enough data or need more.
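As a minimal sketch of this kind of profiling, the snippet below builds a tiny stand-in DataFrame and summarizes it. The `emails` frame, its column names, and its values are made up for illustration and are not the Kaggle dataset itself; only `shape` and `describe()` are the Pandas features being shown.

```python
import pandas as pd

# Hypothetical stand-in for the email word-count data:
# each row is one email, each column counts one word.
emails = pd.DataFrame({
    "the": [0, 8, 0, 0, 7, 4],
    "to": [0, 13, 0, 5, 6, 5],
    "free": [0, 2, 0, 1, 0, 3],
})

# shape is (rows, columns): a first check on how much data is available.
n_rows, n_cols = emails.shape
print(f"{n_rows} rows, {n_cols} columns")

# describe() reports count, mean, std, min, quartiles, and max
# for every numeric column.
summary = emails.describe()
print(summary)
```

With the real dataset you would load the file into a DataFrame first, then run the same two steps.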

Sampling and Querying

Exploring an entire large dataset can be time-consuming and is often left to computers. However, sampling is a useful tool for understanding the data. It allows you to gain insights into what the dataset contains and represents. With a sample, you can apply probability and statistics to draw general conclusions about the data. While there's no strict rule on how much data to sample, it's important to note that, in general, the more data you sample, the more accurate your generalizations will be.

Pandas includes the sample() function, which returns a random selection of rows. You can specify either the number of rows you want or the fraction of the dataset to retrieve.
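A short sketch of sampling, again on a made-up DataFrame rather than the real dataset (the `emails` frame and the `email_id` column are assumptions for illustration). The `n`, `frac`, and `random_state` parameters are part of `DataFrame.sample()`.

```python
import pandas as pd

# Hypothetical data: ten emails with a count of the word "free".
emails = pd.DataFrame({
    "email_id": range(1, 11),
    "free": [0, 2, 0, 1, 0, 3, 0, 0, 5, 1],
})

# n sets how many rows to draw; random_state makes the draw repeatable.
subset = emails.sample(n=4, random_state=0)

# frac draws a proportion of the rows instead of a fixed count.
half = emails.sample(frac=0.5, random_state=0)

print(subset)
print(len(half), "rows in the 50% sample")
```

By default the rows are drawn without replacement, so no email appears twice in a sample.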

General querying of the data can help answer specific questions or test theories. Unlike sampling, queries allow you to focus on particular parts of the data that you're curious about. The query() function in the Pandas library filters rows using a condition written in terms of column values, so you can retrieve just the rows that answer a specific question about the data.
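The following sketch shows one way to use query() on a small made-up DataFrame. The `emails` frame, its columns, and the `threshold` variable are hypothetical; the expression syntax and the `@` prefix for Python variables are standard `DataFrame.query()` behavior.

```python
import pandas as pd

# Hypothetical data: word counts for five emails.
emails = pd.DataFrame({
    "email_id": [1, 2, 3, 4, 5],
    "free": [0, 2, 0, 4, 1],
    "meeting": [3, 0, 1, 0, 2],
})

# Keep only the rows where the condition is true.
frequent_free = emails.query("free >= 2")

# Conditions can be combined, and @ refers to a Python variable.
threshold = 1
mixed = emails.query("free >= @threshold and meeting > 0")

# Choosing columns is a separate step from filtering rows.
ids_and_counts = frequent_free[["email_id", "free"]]

print(frequent_free)
print(mixed)
```

Here `frequent_free` holds emails 2 and 4, and `mixed` holds only email 5, the one email that mentions both words.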

Exploring with Visualizations

You don't need to wait until the data is fully cleaned and analyzed to start creating visualizations. In fact, visualizations during the exploration phase can help identify patterns, relationships, and issues in the data. Additionally, visualizations are a great way to communicate findings to stakeholders who may not be directly involved in managing the data. They can also help surface new questions that weren't addressed during the "Capture" phase. Refer to the section on Visualizations to learn more about popular methods for visual exploration.

Exploring to Identify Inconsistencies

The techniques covered in this lesson can help identify missing or inconsistent values, but Pandas also provides specific functions for this purpose. The isna() function, and its alias isnull(), flags each value as missing or not. A key part of exploring these values is understanding why they are missing in the first place. This can guide you in deciding what actions to take to address them.
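As a rough illustration, the snippet below counts missing values per column and pulls out the affected rows. The `emails` frame and its gaps are invented for the example; `isna()`, `sum()`, and `any(axis=1)` are the Pandas calls being demonstrated.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing word counts.
emails = pd.DataFrame({
    "email_id": [1, 2, 3, 4],
    "free": [0, np.nan, 2, np.nan],
    "meeting": [3, 0, np.nan, 1],
})

# isna() returns True/False per cell; summing counts the True values per column.
missing_per_column = emails.isna().sum()

# any(axis=1) marks rows that have at least one missing value.
rows_with_missing = emails[emails.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Looking at the affected rows side by side is often a useful first step toward working out why the values are missing.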

Post-Lecture Quiz

Assignment

Exploring for answers


Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.