🌐 Update translations via Co-op Translator

pull/653/head
leestott committed by GitHub
parent 53b64e066d
commit 7d2708a767

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "356d12cffc3125db133a2d27b827a745",
"translation_date": "2025-08-31T11:10:03+00:00",
"original_hash": "1228edf3572afca7d7cdcd938b6b4984",
"translation_date": "2025-09-05T07:46:25+00:00",
"source_file": "1-Introduction/03-defining-data/README.md",
"language_code": "en"
}
@ -22,18 +22,18 @@ This lesson focuses on identifying and classifying data based on its characteris
## How Data is Described
### Raw Data
Raw data refers to data in its original state, directly from its source, without any analysis or organization. To make sense of a dataset, it needs to be organized into a format that can be understood by humans and the technology used for further analysis. The structure of a dataset describes how it is organized and can be classified as structured, unstructured, or semi-structured. These classifications depend on the source but ultimately fall into one of these three categories.
Raw data refers to data in its original state, directly from its source, without any analysis or organization. To make sense of a dataset, it needs to be organized into a format that can be understood by humans and the technology used for further analysis. The structure of a dataset describes how it is organized and can be classified as structured, unstructured, or semi-structured. These structures vary depending on the source but generally fall into one of these three categories.
### Quantitative Data
Quantitative data consists of numerical observations within a dataset that can typically be analyzed, measured, and used mathematically. Examples of quantitative data include a country's population, a person's height, or a company's quarterly earnings. With further analysis, quantitative data can be used to identify seasonal trends in the Air Quality Index (AQI) or estimate the likelihood of rush hour traffic on a typical workday.
### Qualitative Data
Qualitative data, also known as categorical data, cannot be measured objectively like quantitative data. It often consists of subjective information that captures the quality of something, such as a product or process. Sometimes, qualitative data is numerical but not typically used mathematically, like phone numbers or timestamps. Examples of qualitative data include video comments, the make and model of a car, or your closest friends' favorite color. Qualitative data can be used to understand which products consumers prefer or to identify popular keywords in job application resumes.
Qualitative data, also known as categorical data, cannot be measured objectively like quantitative data. It often consists of subjective information that captures the quality of something, such as a product or process. Sometimes, qualitative data is numerical but not typically used mathematically, such as phone numbers or timestamps. Examples of qualitative data include video comments, the make and model of a car, or your closest friends' favorite color. Qualitative data can be used to understand which products consumers prefer or identify popular keywords in job application resumes.
### Structured Data
Structured data is organized into rows and columns, where each row has the same set of columns. Columns represent specific types of values and are identified by names describing what the values represent, while rows contain the actual data. Columns often have rules or restrictions to ensure the values accurately represent the column. For example, imagine a spreadsheet of customers where each row must include a phone number, and the phone numbers cannot contain alphabetical characters. Rules might be applied to ensure the phone number column is never empty and only contains numbers.
Structured data is organized into rows and columns, where each row has the same set of columns. Columns represent specific types of values and are identified by names describing what the values represent, while rows contain the actual data. Columns often have rules or restrictions to ensure the values accurately represent the column. For example, imagine a spreadsheet of customers where each row must include a phone number, and phone numbers cannot contain alphabetical characters. Rules might be applied to ensure the phone number column is never empty and only contains numbers.
One advantage of structured data is that it can be organized in a way that allows it to relate to other structured data. However, because the data is designed to follow a specific structure, making changes to its overall organization can require significant effort. For instance, adding an email column to the customer spreadsheet that cannot be empty would require figuring out how to populate this column for existing rows.
One advantage of structured data is that it can be organized in a way that relates to other structured data. However, because structured data is designed to follow a specific organization, making changes to its structure can require significant effort. For instance, adding an email column to the customer spreadsheet that cannot be empty would require figuring out how to populate this column for existing rows.
Examples of structured data: spreadsheets, relational databases, phone numbers, bank statements.
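To make the idea of column rules concrete, here is a minimal sketch in Python (pandas assumed; the `customers` table, its column names, and its values are purely illustrative) of how the phone-number constraints described above could be checked:

```python
import pandas as pd

# Hypothetical customer table; names and values are illustrative only.
customers = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "phone": ["5551234", "5559876", "555-0000"],
})

# Rule 1: the phone column is never empty.
assert customers["phone"].notna().all(), "phone must not be empty"

# Rule 2: phone numbers contain only digits (no alphabetical or other characters).
violations = customers[~customers["phone"].str.fullmatch(r"\d+")]
print(violations)   # here the row with "555-0000" breaks the digits-only rule
```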
@ -43,15 +43,15 @@ Unstructured data cannot typically be organized into rows or columns and lacks a
Examples of unstructured data: text files, text messages, video files.
### Semi-structured Data
Semi-structured data combines features of both structured and unstructured data. It doesn't typically conform to rows and columns but is organized in a way that is considered structured and may follow a fixed format or set of rules. The structure can vary between sources, ranging from a well-defined hierarchy to something more flexible that allows for easy integration of new information. Metadata provides indicators for how the data is organized and stored, with various names depending on the type of data. Common names for metadata include tags, elements, entities, and attributes. For example, a typical email message includes a subject, body, and recipients, and can be organized by sender or date.
Semi-structured data combines features of both structured and unstructured data. It doesn't typically conform to rows and columns but is organized in a way that is considered structured and may follow a fixed format or set of rules. The structure can vary between sources, ranging from a well-defined hierarchy to something more flexible that allows easy integration of new information. Metadata helps determine how the data is organized and stored, with various names depending on the type of data. Common names for metadata include tags, elements, entities, and attributes. For example, a typical email message includes a subject, body, and recipients, and can be organized by sender or date.
Examples of semi-structured data: HTML, CSV files, JavaScript Object Notation (JSON).
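As a small illustration of semi-structured data, the email example above could be represented in JSON, where fields such as subject, sender, and recipients act as metadata around free-form body text (a sketch with made-up values):

```python
import json

# Hypothetical email message: structured fields (metadata) wrap unstructured body text.
email = {
    "subject": "Quarterly report",
    "sender": "ana@example.com",
    "recipients": ["li@example.com", "sam@example.com"],
    "date": "2025-01-15",
    "body": "Hi team, the report is attached. Let me know if anything is unclear.",
}

print(json.dumps(email, indent=2))
```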
## Sources of Data
A data source refers to the original location where the data was generated or "lives," and it varies based on how and when it was collected. Data generated by its user(s) is known as primary data, while secondary data comes from a source that has collected data for general use. For example, scientists collecting observations in a rainforest would be considered primary data, and if they share it with other scientists, it becomes secondary data for those users.
A data source refers to the original location where the data was generated or resides, and it varies based on how and when it was collected. Data generated by its user(s) is known as primary data, while secondary data comes from a source that has collected data for general use. For example, scientists collecting observations in a rainforest would be considered primary data, and if they share it with others, it becomes secondary data for those users.
Databases are a common data source and rely on a database management system to host and maintain the data. Users explore the data using commands called queries. Files can also serve as data sources, including audio, image, and video files, as well as spreadsheets like Excel. The internet is another common location for hosting data, where both databases and files can be found. Application programming interfaces (APIs) allow programmers to create ways to share data with external users over the internet, while web scraping extracts data from web pages. The [lessons in Working with Data](../../../../../../../../../2-Working-With-Data) focus on how to use various data sources.
Databases are a common data source and rely on a database management system to host and maintain the data. Users explore the data using commands called queries. Files can also serve as data sources, including audio, image, and video files, as well as spreadsheets like Excel. The internet is another common location for hosting data, where both databases and files can be found. Application programming interfaces (APIs) allow programmers to share data with external users over the internet, while web scraping extracts data from web pages. The [lessons in Working with Data](../../../../../../../../../2-Working-With-Data) focus on how to use various data sources.
## Conclusion
@ -69,7 +69,7 @@ Kaggle is an excellent source of open datasets. Use the [dataset search tool](ht
- Is the data quantitative or qualitative?
- Is the data structured, unstructured, or semi-structured?
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/5)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b706a07cfa87ba091cbb91e0aa775600",
"translation_date": "2025-08-31T11:08:08+00:00",
"original_hash": "8bbb3fa0d4ad61384a3b4b5f7560226f",
"translation_date": "2025-09-05T07:45:26+00:00",
"source_file": "1-Introduction/04-stats-and-probability/README.md",
"language_code": "en"
}
@ -13,7 +13,7 @@ CO_OP_TRANSLATOR_METADATA:
|:---:|
| Statistics and Probability - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Statistics and Probability Theory are two closely related branches of Mathematics that are highly relevant to Data Science. While it is possible to work with data without a deep understanding of mathematics, it is still beneficial to grasp some basic concepts. Here, we provide a brief introduction to help you get started.
Statistics and Probability Theory are two closely related branches of Mathematics that are highly relevant to Data Science. While you can work with data without a deep understanding of mathematics, it's still beneficial to grasp some fundamental concepts. This introduction will help you get started.
[![Intro Video](../../../../1-Introduction/04-stats-and-probability/images/video-prob-and-stats.png)](https://youtu.be/Z5Zy85g4Yjw)
@ -21,43 +21,43 @@ Statistics and Probability Theory are two closely related branches of Mathematic
## Probability and Random Variables
**Probability** is a number between 0 and 1 that indicates how likely an **event** is to occur. It is calculated as the number of favorable outcomes (leading to the event) divided by the total number of possible outcomes, assuming all outcomes are equally likely. For example, when rolling a die, the probability of getting an even number is 3/6 = 0.5.
**Probability** is a value between 0 and 1 that represents how likely an **event** is to occur. It is calculated as the number of favorable outcomes (leading to the event) divided by the total number of possible outcomes, assuming all outcomes are equally likely. For instance, when rolling a die, the probability of getting an even number is 3/6 = 0.5.
When discussing events, we use **random variables**. For instance, the random variable representing the number rolled on a die can take values from 1 to 6. This set of numbers (1 to 6) is called the **sample space**. We can calculate the probability of a random variable taking a specific value, such as P(X=3)=1/6.
When discussing events, we use **random variables**. For example, the random variable representing the number rolled on a die can take values from 1 to 6. This set of numbers (1 to 6) is called the **sample space**. We can calculate the probability of a random variable taking a specific value, such as P(X=3)=1/6.
The random variable in the example above is called **discrete** because its sample space is countable, meaning it consists of distinct values that can be listed. In other cases, the sample space might be a range of real numbers or the entire set of real numbers. Such variables are called **continuous**. A good example is the time a bus arrives.
The random variable in this example is called **discrete** because its sample space consists of countable values that can be listed. In other cases, the sample space might be a range of real numbers or the entire set of real numbers. Such variables are called **continuous**. A good example is the time a bus arrives.
## Probability Distribution
For discrete random variables, it is straightforward to describe the probability of each event using a function P(X). For every value *s* in the sample space *S*, the function assigns a number between 0 and 1, such that the sum of all P(X=s) values for all events equals 1.
For discrete random variables, it's straightforward to describe the probability of each event using a function P(X). For every value *s* in the sample space *S*, the function assigns a number between 0 and 1, ensuring that the sum of all P(X=s) values equals 1.
The most well-known discrete distribution is the **uniform distribution**, where the sample space consists of N elements, each with an equal probability of 1/N.
The most common discrete distribution is the **uniform distribution**, where the sample space contains N elements, each with an equal probability of 1/N.
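As a quick sanity check of these definitions, a die roll can be simulated with NumPy (a sketch; the library and the fixed random seed are assumptions, not part of the lesson):

```python
import numpy as np

# Simulate a fair six-sided die: a uniform distribution over the sample space {1,...,6}.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)

# Each value should occur with probability close to 1/6 ≈ 0.167.
for value in range(1, 7):
    print(value, round(np.mean(rolls == value), 3))

# P(even) should be close to 3/6 = 0.5.
print("P(even) ≈", round(np.mean(rolls % 2 == 0), 3))
```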
Describing the probability distribution of a continuous variable, which may take values from an interval [a, b] or the entire set of real numbers, is more complex. Consider the example of bus arrival times. The probability of the bus arriving at an exact time *t* is actually 0!
Describing the probability distribution of a continuous variable, such as values within an interval [a,b] or the entire set of real numbers, is more complex. Consider the bus arrival time example. The probability of the bus arriving at an exact time *t* is actually 0!
> Now you know that events with 0 probability can and do happen—every time the bus arrives, for instance!
> Now you know that events with 0 probability can still happen—and quite often! For example, every time the bus arrives!
Instead, we talk about the probability of a variable falling within a specific interval, e.g., P(t<sub>1</sub>≤X<t<sub>2</sub>). In this case, the probability distribution is described by a **probability density function** p(x), such that:
Instead, we discuss the probability of a variable falling within a range of values, such as P(t<sub>1</sub>≤X<t<sub>2</sub>). In this case, the probability distribution is described by a **probability density function** p(x), such that:
![P(t_1\le X<t_2)=\int_{t_1}^{t_2}p(x)dx](../../../../1-Introduction/04-stats-and-probability/images/probability-density.png)
The continuous counterpart of the uniform distribution is called the **continuous uniform distribution**, which is defined over a finite interval. The probability that the value X falls within an interval of length l is proportional to l and can reach up to 1.
A continuous version of the uniform distribution is called **continuous uniform distribution**, defined over a finite interval. The probability of X falling within an interval of length l is proportional to l and reaches up to 1.
Another important distribution is the **normal distribution**, which we will explore in more detail later.
Another key distribution is the **normal distribution**, which we'll explore in more detail later.
## Mean, Variance, and Standard Deviation
Suppose we draw a sequence of n samples from a random variable X: x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>. The **mean** (or **arithmetic average**) of the sequence is calculated as (x<sub>1</sub>+x<sub>2</sub>+...+x<sub>n</sub>)/n. As the sample size increases (n→∞), the mean approaches the **expectation** of the distribution, denoted as **E**(x).
Suppose we take a sequence of n samples from a random variable X: x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>. The **mean** (or **arithmetic average**) of the sequence is calculated as (x<sub>1</sub>+x<sub>2</sub>+...+x<sub>n</sub>)/n. As the sample size increases (n→∞), the mean approaches the **expectation** of the distribution, denoted as **E**(x).
> It can be shown that for any discrete distribution with values {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>N</sub>} and corresponding probabilities p<sub>1</sub>, p<sub>2</sub>, ..., p<sub>N</sub>, the expectation is given by E(X)=x<sub>1</sub>p<sub>1</sub>+x<sub>2</sub>p<sub>2</sub>+...+x<sub>N</sub>p<sub>N</sub>.
> For any discrete distribution with values {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>N</sub>} and corresponding probabilities p<sub>1</sub>, p<sub>2</sub>, ..., p<sub>N</sub>, the expectation is given by E(X)=x<sub>1</sub>p<sub>1</sub>+x<sub>2</sub>p<sub>2</sub>+...+x<sub>N</sub>p<sub>N</sub>.
To measure how spread out the values are, we calculate the variance σ<sup>2</sup> = ∑(x<sub>i</sub> - μ)<sup>2</sup>/n, where μ is the mean of the sequence. The square root of the variance, σ, is called the **standard deviation**.
To measure how spread out the values are, we calculate the variance σ<sup>2</sup> = ∑(x<sub>i</sub> - μ)<sup>2</sup>/n, where μ is the mean of the sequence. The square root of the variance, σ, is called the **standard deviation**.
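A minimal sketch of these formulas in NumPy, using a short made-up sample (the values are illustrative only):

```python
import numpy as np

x = np.array([180.0, 215.0, 210.0, 210.0, 188.0])   # a small illustrative sample

mu = x.mean()                          # (x1 + x2 + ... + xn) / n
variance = ((x - mu) ** 2).mean()      # σ² = Σ(xᵢ - μ)² / n
sigma = np.sqrt(variance)              # standard deviation, same as x.std()

print(mu, variance, sigma)
```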
## Mode, Median, and Quartiles
Sometimes, the mean does not adequately represent the "typical" value of the data, especially when there are extreme outliers. In such cases, the **median**—the value that divides the data into two equal halves—can be a better indicator.
Sometimes, the mean doesn't accurately represent the "typical" value of the data, especially when there are extreme values that skew the average. A better indicator might be the **median**, the value that divides the data into two equal halves—one half below and the other above.
To further understand the data distribution, we use **quartiles**:
To further understand data distribution, we use **quartiles**:
* The first quartile (Q1) is the value below which 25% of the data falls.
* The third quartile (Q3) is the value below which 75% of the data falls.
@ -68,49 +68,48 @@ The relationship between the median and quartiles can be visualized using a **bo
We also calculate the **inter-quartile range** (IQR=Q3-Q1) and identify **outliers**—values outside the range [Q1-1.5*IQR, Q3+1.5*IQR].
For small, finite distributions, the **mode**—the most frequently occurring value—can be a good "typical" value. This is especially useful for categorical data, such as colors. For example, if two groups of people strongly prefer red and blue, the mean of their preferences (if coded numerically) might fall in the orange-green range, which doesn't represent either group's preference. The mode, however, would correctly identify the most popular colors.
For small, finite distributions, the most frequent value is a good "typical" value, called the **mode**. Mode is often used for categorical data, such as colors. For instance, if two groups of people strongly prefer red and blue, the mean color might fall somewhere in the orange-green spectrum, which doesn't reflect either group's preference. However, the mode would correctly identify the most popular colors. If two colors are equally popular, the sample is called **multimodal**.
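Here is a short sketch, under the same assumptions as above (NumPy, illustrative values), of how the median, quartiles, IQR, outlier bounds, and mode can be computed:

```python
import numpy as np

x = np.array([160, 176, 180, 180, 180, 185, 185, 188, 189, 200, 210, 215, 231])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Outliers: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# Mode: the most frequently occurring value
values, counts = np.unique(x, return_counts=True)
mode = values[counts.argmax()]

print(median, q1, q3, iqr, outliers, mode)
```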
## Real-world Data
When analyzing real-world data, the values are often not random variables in the strict sense, as they are not the result of experiments with unknown outcomes. For example, consider the heights, weights, and ages of a baseball team. These values are not truly random, but we can still apply the same mathematical concepts. For instance, the sequence of players' weights can be treated as samples from a random variable. Below is a sequence of weights from actual Major League Baseball players, taken from [this dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights) (only the first 20 values are shown):
When analyzing real-world data, it often doesn't behave like random variables in the strict sense, as we aren't conducting experiments with unknown outcomes. For example, consider a baseball team and their physical attributes like height, weight, and age. These numbers aren't truly random, but we can still apply mathematical concepts. For instance, a sequence of players' weights can be treated as values drawn from a random variable. Below is a sequence of weights from actual baseball players in [Major League Baseball](http://mlb.mlb.com/index.jsp), sourced from [this dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights) (only the first 20 values are shown):
```
[180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0, 188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]
```
> **Note**: For an example of working with this dataset, check out the [accompanying notebook](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb). There are also challenges throughout this lesson that you can complete by adding code to the notebook. If you're unsure how to work with data, don't worry—we'll revisit this topic later. If you don't know how to run code in Jupyter Notebook, see [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
> **Note**: To see an example of working with this dataset, check out the [accompanying notebook](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb). There are also challenges throughout this lesson that you can complete by adding code to the notebook. If you're unsure how to work with data, don't worry—we'll revisit data manipulation using Python later. If you don't know how to run code in Jupyter Notebook, refer to [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
Here is a box plot showing the mean, median, and quartiles for the data:
![Weight Box Plot](../../../../1-Introduction/04-stats-and-probability/images/weight-boxplot.png)
Since the dataset includes player **roles**, we can create a box plot by role to see how the values differ across roles. This time, we'll consider height:
Since the dataset includes information about different player **roles**, we can create box plots by role to see how parameters vary across roles. This time, we'll consider height:
![Box plot by role](../../../../1-Introduction/04-stats-and-probability/images/boxplot_byrole.png)
This diagram suggests that, on average, first basemen are taller than second basemen. Later in this lesson, we'll learn how to formally test this hypothesis and determine whether the data is statistically significant.
This diagram suggests that, on average, first basemen are taller than second basemen. Later in this lesson, we'll learn how to formally test this hypothesis and demonstrate statistical significance.
> When working with real-world data, we assume that all data points are samples drawn from some probability distribution. This assumption allows us to apply machine learning techniques and build predictive models.
> When working with real-world data, we assume that all data points are samples drawn from a probability distribution. This assumption allows us to apply machine learning techniques and build predictive models.
To visualize the data distribution, we can create a **histogram**. The X-axis represents weight intervals (or **bins**), and the Y-axis shows the frequency of values within each interval.
To visualize the distribution of the data, we can create a **histogram**. The X-axis represents weight intervals (or **bins**), while the Y-axis shows the frequency of samples within each interval.
![Histogram of real-world data](../../../../1-Introduction/04-stats-and-probability/images/weight-histogram.png)
From this histogram, we see that most values cluster around a certain mean weight, with fewer values as we move further from the mean. This indicates that extreme weights are less likely. The variance shows how much the weights deviate from the mean.
From this histogram, you can see that most values cluster around a certain mean weight, with fewer values appearing as we move further from the mean. This indicates that extreme weights are less likely. The variance shows how much weights deviate from the mean.
> If we analyzed weights from a different population (e.g., university students), the distribution might differ in mean and variance, but the overall shape would remain similar. However, a model trained on baseball players might perform poorly on students due to differences in the underlying distribution.
> If we analyze weights of people outside the baseball league, the distribution might differ. However, the general shape of the distribution would remain similar, with changes in mean and variance. Training a model on baseball players might yield inaccurate results when applied to university students, as the underlying distribution differs.
## Normal Distribution
The weight distribution we observed is typical of many real-world measurements, which often follow a similar pattern but with different means and variances. This pattern is called the **normal distribution**, and it plays a crucial role in statistics.
To simulate random weights for potential baseball players, we can use the normal distribution. Given the mean weight `mean` and standard deviation `std`, we can generate 1000 weight samples as follows:
The weight distribution we observed above is very common, and many real-world measurements follow a similar pattern, albeit with different means and variances. This pattern is called the **normal distribution**, which plays a crucial role in statistics.
Using a normal distribution is a valid way to generate random weights for potential baseball players. Once we know the mean weight `mean` and standard deviation `std`, we can generate 1000 weight samples as follows:
```python
samples = np.random.normal(mean,std,1000)
```
If we plot a histogram of the generated samples, it will resemble the earlier histogram. By increasing the number of samples and bins, we can create a graph that closely approximates the ideal normal distribution:
If we plot a histogram of the generated samples, it will resemble the earlier example. Increasing the number of samples and bins will produce a graph closer to the ideal normal distribution:
![Normal Distribution with mean=0 and std.dev=1](../../../../1-Introduction/04-stats-and-probability/images/normal-histogram.png)
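A histogram like the one above can be reproduced roughly as follows (a sketch; matplotlib is assumed, and `mean` and `std` are set to the values named in the figure title):

```python
import numpy as np
import matplotlib.pyplot as plt

mean, std = 0.0, 1.0                       # values assumed for illustration
samples = np.random.normal(mean, std, 100_000)

plt.hist(samples, bins=100, density=True)  # more samples and bins → closer to the ideal curve
plt.xlabel("value")
plt.ylabel("density")
plt.show()
```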
@ -118,33 +117,33 @@ If we plot a histogram of the generated samples, it will resemble the earlier hi
## Confidence Intervals
When analyzing baseball players' weights, we assume there is a **random variable W** representing the ideal probability distribution of all players' weights (the **population**). Our dataset represents a subset of players, or a **sample**. A key question is whether we can determine the population's distribution parameters, such as mean and variance.
When discussing baseball players' weights, we assume there is a **random variable W** representing the ideal probability distribution of weights for all players (the **population**). Our sequence of weights represents a subset of players, called a **sample**. A key question is whether we can determine the parameters of W's distribution, such as the population mean and variance.
The simplest approach is to calculate the sample's mean and variance. However, the sample may not perfectly represent the population. This is where **confidence intervals** come into play.
The simplest approach is to calculate the sample's mean and variance. However, the sample might not perfectly represent the population. This is where **confidence intervals** come into play.
> **Confidence interval** is the estimation of the true mean of the population based on our sample, with a certain probability (or **level of confidence**) of being accurate.
Suppose we have a sample X<sub>1</sub>, ..., X<sub>n</sub> from our distribution. Each time we draw a sample from our distribution, we would end up with a different mean value μ. Thus, μ can be considered a random variable. A **confidence interval** with confidence p is a pair of values (L<sub>p</sub>,R<sub>p</sub>), such that **P**(L<sub>p</sub>≤μ≤R<sub>p</sub>) = p, i.e., the probability of the measured mean value falling within the interval equals p.
Suppose we have a sample X<sub>1</sub>, ..., X<sub>n</sub> from our distribution. Each time we draw a sample from our distribution, we would end up with a different mean value μ. Thus, μ can be considered a random variable. A **confidence interval** with confidence p is a pair of values (L<sub>p</sub>, R<sub>p</sub>), such that **P**(L<sub>p</sub>≤μ≤R<sub>p</sub>) = p, meaning the probability of the measured mean value falling within the interval equals p.
It goes beyond our brief introduction to discuss in detail how these confidence intervals are calculated. More details can be found [on Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval). In short, we define the distribution of the computed sample mean relative to the true mean of the population, which is called the **Student's t-distribution**.
It goes beyond the scope of this short introduction to discuss in detail how these confidence intervals are calculated. More details can be found [on Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval). In short, we define the distribution of the computed sample mean relative to the true mean of the population, which is called the **Student's t-distribution**.
> **Interesting fact**: The Student's t-distribution is named after mathematician William Sealy Gosset, who published his paper under the pseudonym "Student." He worked at the Guinness brewery, and, according to one version, his employer did not want the general public to know they were using statistical tests to determine the quality of raw materials.
> **Interesting fact**: The Student's t-distribution is named after mathematician William Sealy Gosset, who published his paper under the pseudonym "Student." He worked at the Guinness brewery, and, according to one version, his employer did not want the general public to know they were using statistical tests to determine the quality of raw materials.
If we want to estimate the mean μ of our population with confidence p, we need to take the *(1-p)/2-th percentile* of a Student's t-distribution A, which can either be taken from tables or computed using built-in functions in statistical software (e.g., Python, R, etc.). Then the interval for μ would be given by X±A*D/√n, where X is the obtained mean of the sample, and D is the standard deviation.
If we want to estimate the mean μ of our population with confidence p, we need to take the *(1-p)/2-th percentile* of a Student's t-distribution A, which can either be taken from tables or computed using built-in functions in statistical software (e.g., Python, R, etc.). Then the interval for μ would be given by X ± A * D / √n, where X is the obtained mean of the sample, and D is the standard deviation.
> **Note**: We are also omitting the discussion of an important concept called [degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)), which is relevant to the Student's t-distribution. You can refer to more comprehensive books on statistics to understand this concept in depth.
> **Note**: We also omit the discussion of an important concept called [degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)), which is relevant to the Student's t-distribution. You can refer to more comprehensive books on statistics to understand this concept in depth.
An example of calculating confidence intervals for weights and heights is provided in the [accompanying notebooks](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb).
| p | Weight mean |
|------|---------------|
| 0.85 | 201.73±0.94 |
| 0.90 | 201.73±1.08 |
| 0.95 | 201.73±1.28 |
| 0.85 | 201.73 ± 0.94 |
| 0.90 | 201.73 ± 1.08 |
| 0.95 | 201.73 ± 1.28 |
Notice that the higher the confidence probability, the wider the confidence interval.
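A rough sketch of the X ± A·D/√n computation using SciPy's Student's t-distribution (the `weights` array is assumed to hold the sample, as in the notebook; the numbers will differ from the table above when only the 20 values shown earlier are used):

```python
import numpy as np
from scipy import stats

# Assume `weights` holds the sample of player weights (here: the 20 values shown earlier).
weights = np.array([180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0,
                    188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0])

p = 0.95
n = len(weights)
X = weights.mean()                      # sample mean
D = weights.std(ddof=1)                 # sample standard deviation
A = stats.t.ppf((1 + p) / 2, df=n - 1)  # percentile of the Student's t-distribution

print(f"{X:.2f} ± {A * D / np.sqrt(n):.2f}  (confidence {p})")
```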
## Hypothesis Testing
In our baseball players dataset, there are different player roles, which can be summarized below (refer to the [accompanying notebook](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb) to see how this table was calculated):
In our baseball players dataset, there are different player roles, summarized below (refer to the [accompanying notebook](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb) to see how this table is calculated):
| Role | Height | Weight | Count |
|-------------------|------------|------------|-------|
@ -158,29 +157,29 @@ In our baseball players dataset, there are different player roles, which can be
| Starting_Pitcher | 74.719457 | 205.163636 | 221 |
| Third_Baseman | 73.044444 | 200.955556 | 45 |
We can observe that the mean height of first basemen is greater than that of second basemen. Thus, we might be tempted to conclude that **first basemen are taller than second basemen**.
We can observe that the mean height of first basemen is higher than that of second basemen. Thus, we might be tempted to conclude that **first basemen are taller than second basemen**.
> This statement is called **a hypothesis**, because we do not know whether the fact is actually true or not.
However, it is not always obvious whether we can make this conclusion. From the discussion above, we know that each mean has an associated confidence interval, and thus this difference could just be a statistical error. We need a more formal way to test our hypothesis.
However, it is not always clear whether we can make this conclusion. From the discussion above, we know that each mean has an associated confidence interval, and this difference could simply be a statistical error. We need a more formal way to test our hypothesis.
Let's compute confidence intervals separately for the heights of first and second basemen:
| Confidence | First Basemen | Second Basemen |
|------------|-----------------|-----------------|
| 0.85 | 73.62..74.38 | 71.04..71.69 |
| 0.90 | 73.56..74.44 | 70.99..71.73 |
| 0.95 | 73.47..74.53 | 70.92..71.81 |
| 0.85 | 73.62..74.38 | 71.04..71.69 |
| 0.90 | 73.56..74.44 | 70.99..71.73 |
| 0.95 | 73.47..74.53 | 70.92..71.81 |
We can see that under no confidence level do the intervals overlap. This proves our hypothesis that first basemen are taller than second basemen.
We can see that under no confidence level do the intervals overlap. This supports our hypothesis that first basemen are taller than second basemen.
More formally, the problem we are solving is to determine if **two probability distributions are the same**, or at least have the same parameters. Depending on the distribution, we need to use different tests for this. If we know that our distributions are normal, we can apply the **[Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)**.
More formally, the problem we are solving is to determine whether **two probability distributions are the same**, or at least have the same parameters. Depending on the distribution, different tests are required. If we know our distributions are normal, we can apply the **[Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)**.
In the Student's t-test, we compute the so-called **t-value**, which indicates the difference between means, taking into account the variance. It has been shown that the t-value follows the **Student's t-distribution**, which allows us to get the threshold value for a given confidence level **p** (this can be computed or looked up in numerical tables). We then compare the t-value to this threshold to accept or reject the hypothesis.
In the Student's t-test, we compute the so-called **t-value**, which indicates the difference between means while accounting for variance. It has been demonstrated that the t-value follows the **Student's t-distribution**, which allows us to find the threshold value for a given confidence level **p** (this can be computed or looked up in numerical tables). We then compare the t-value to this threshold to accept or reject the hypothesis.
In Python, we can use the **SciPy** package, which includes the `ttest_ind` function (along with many other useful statistical functions!). This function computes the t-value for us and also performs the reverse lookup of the confidence p-value, so we can simply look at the confidence to draw a conclusion.
In Python, we can use the **SciPy** package, which includes the `ttest_ind` function (along with many other useful statistical functions!). This function computes the t-value for us and also performs the reverse lookup of the confidence p-value, allowing us to simply examine the confidence level to draw conclusions.
For example, our comparison between the heights of first and second basemen gives us the following results:
For example, our comparison between the heights of first and second basemen yields the following results:
```python
from scipy.stats import ttest_ind
@ -191,44 +190,44 @@ print(f"T-value = {tval[0]:.2f}\nP-value: {pval[0]}")
T-value = 7.65
P-value: 9.137321189738925e-12
```
In our case, the p-value is very low, meaning there is strong evidence supporting that first basemen are taller.
In this case, the p-value is very low, indicating strong evidence that first basemen are taller.
There are also other types of hypotheses we might want to test, for example:
* To prove that a given sample follows a specific distribution. In our case, we assumed that heights are normally distributed, but this needs formal statistical verification.
* To prove that the mean value of a sample corresponds to some predefined value.
* To compare the means of multiple samples (e.g., differences in happiness levels among different age groups).
There are also other types of hypotheses we might want to test, such as:
* Proving that a given sample follows a specific distribution. In our case, we assumed that heights are normally distributed, but this requires formal statistical verification.
* Proving that the mean value of a sample corresponds to a predefined value.
* Comparing the means of multiple samples (e.g., differences in happiness levels across different age groups).
## Law of Large Numbers and Central Limit Theorem
One of the reasons why the normal distribution is so important is the **central limit theorem**. Suppose we have a large sample of N independent values X<sub>1</sub>, ..., X<sub>N</sub>, sampled from any distribution with mean μ and variance σ<sup>2</sup>. Then, for sufficiently large N (in other words, as N→∞), the sample mean (Σ<sub>i</sub>X<sub>i</sub>)/N will be approximately normally distributed, with mean μ and variance σ<sup>2</sup>/N.
One reason why the normal distribution is so important is the **central limit theorem**. Suppose we have a large sample of N independent values X<sub>1</sub>, ..., X<sub>N</sub>, drawn from any distribution with mean μ and variance σ<sup>2</sup>. Then, for sufficiently large N (in other words, as N→∞), the sample mean (Σ<sub>i</sub>X<sub>i</sub>)/N will be approximately normally distributed, with mean μ and variance σ<sup>2</sup>/N.
> Another way to interpret the central limit theorem is to say that regardless of the original distribution, when you compute the mean of a sum of random variable values, you end up with a normal distribution.
From the central limit theorem, it also follows that as N→∞, the probability of the sample mean falling within any fixed margin of μ becomes 1. This is known as **the law of large numbers**.
From the central limit theorem, it also follows that as N→∞, the probability of the sample mean falling within any fixed margin of μ approaches 1. This is known as **the law of large numbers**.
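A small simulation sketch of the central limit theorem (NumPy and matplotlib assumed; the exponential distribution is just one convenient non-normal choice):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 50                                   # size of each sample
means = rng.exponential(scale=1.0, size=(10_000, N)).mean(axis=1)

# Even though the underlying distribution is skewed, the sample means
# cluster around μ = 1 in an approximately normal (bell-shaped) histogram.
plt.hist(means, bins=60, density=True)
plt.title("Distribution of sample means, N = 50")
plt.show()
```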
## Covariance and Correlation
One of the tasks in Data Science is finding relationships between data. We say that two sequences **correlate** when they exhibit similar behavior at the same time, i.e., they either rise/fall simultaneously, or one sequence rises when the other falls and vice versa. In other words, there seems to be some relationship between the two sequences.
One of the tasks in Data Science is identifying relationships between data. We say two sequences **correlate** when they exhibit similar behavior at the same time, i.e., they either rise/fall simultaneously, or one rises while the other falls, and vice versa. In other words, there appears to be some relationship between the two sequences.
> Correlation does not necessarily indicate a causal relationship between two sequences; sometimes both variables can depend on an external cause, or it can be purely by chance that the two sequences correlate. However, strong mathematical correlation is a good indication that two variables are somehow connected.
> Correlation does not necessarily imply a causal relationship between two sequences; sometimes both variables depend on an external cause, or the correlation may occur purely by chance. However, strong mathematical correlation is a good indication that two variables are somehow connected.
Mathematically, the main concept that shows the relationship between two random variables is **covariance**, which is computed as: Cov(X,Y) = **E**\[(X-**E**(X))(Y-**E**(Y))\]. We compute the deviation of both variables from their mean values, and then take the product of those deviations. If both variables deviate together, the product will always be positive, resulting in positive covariance. If both variables deviate out-of-sync (i.e., one falls below average when the other rises above average), we will always get negative numbers, resulting in negative covariance. If the deviations are independent, they will sum to roughly zero.
Mathematically, the main concept that shows the relationship between two random variables is **covariance**, computed as: Cov(X, Y) = **E**\[(X - **E**(X))(Y - **E**(Y))\]. We calculate the deviation of both variables from their mean values and then take the product of those deviations. If both variables deviate together, the product will always be positive, resulting in positive covariance. If the variables deviate out of sync (i.e., one falls below average while the other rises above average), the product will always be negative, resulting in negative covariance. If the deviations are independent, they will sum to roughly zero.
The absolute value of covariance does not tell us much about the strength of the correlation, as it depends on the magnitude of the actual values. To normalize it, we can divide covariance by the standard deviation of both variables to get **correlation**. The advantage of correlation is that it is always in the range [-1,1], where 1 indicates strong positive correlation, -1 indicates strong negative correlation, and 0 indicates no correlation at all (variables are independent).
The absolute value of covariance does not provide much insight into the strength of the correlation, as it depends on the magnitude of the values. To normalize it, we divide covariance by the standard deviation of both variables to obtain **correlation**. The advantage of correlation is that it always falls within the range [-1, 1], where 1 indicates strong positive correlation, -1 indicates strong negative correlation, and 0 indicates no correlation (variables are independent).
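To make the formulas concrete, here is a sketch that computes covariance and correlation by hand and compares the result with `np.corrcoef` (the `heights` and `weights` arrays here are synthetic stand-ins, not the lesson's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(74, 2, 500)                # synthetic heights
weights = 3 * heights + rng.normal(0, 15, 500)  # synthetic, correlated weights

cov = np.mean((weights - weights.mean()) * (heights - heights.mean()))   # Cov(X,Y)
corr = cov / (weights.std() * heights.std())                              # normalize to [-1, 1]

print(cov, corr)
print(np.corrcoef(weights, heights)[0, 1])      # should match `corr`
```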
**Example**: We can compute the correlation between the weights and heights of baseball players from the dataset mentioned above:
**Example**: We can compute the correlation between the weights and heights of baseball players from the dataset mentioned earlier:
```python
print(np.corrcoef(weights,heights))
```
As a result, we get a **correlation matrix** like this one:
This results in a **correlation matrix** like the following:
```
array([[1. , 0.52959196],
[0.52959196, 1. ]])
```
> A correlation matrix C can be computed for any number of input sequences S<sub>1</sub>, ..., S<sub>n</sub>. The value of C<sub>ij</sub> is the correlation between S<sub>i</sub> and S<sub>j</sub>, and diagonal elements are always 1 (which represents the self-correlation of S<sub>i</sub>).
> A correlation matrix C can be computed for any number of input sequences S<sub>1</sub>, ..., S<sub>n</sub>. The value of C<sub>ij</sub> represents the correlation between S<sub>i</sub> and S<sub>j</sub>, and the diagonal elements are always 1 (self-correlation of S<sub>i</sub>).
In our case, the value 0.53 indicates that there is some correlation between a person's weight and height. We can also create a scatter plot of one value against the other to visualize the relationship:
In our case, the value 0.53 indicates some correlation between a person's weight and height. We can also create a scatter plot of one variable against the other to visualize the relationship:
![Relationship between weight and height](../../../../1-Introduction/04-stats-and-probability/images/weight-height-relationship.png)
@ -241,23 +240,23 @@ In this section, we have learned:
* Basic statistical properties of data, such as mean, variance, mode, and quartiles.
* Different distributions of random variables, including the normal distribution.
* How to find correlations between different properties.
* How to use mathematical and statistical tools to prove hypotheses.
* How to compute confidence intervals for random variables given a data sample.
* How to use mathematical and statistical tools to test hypotheses.
* How to compute confidence intervals for random variables based on data samples.
While this is not an exhaustive list of topics in probability and statistics, it should provide a solid foundation for this course.
While this is not an exhaustive list of topics within probability and statistics, it should provide a solid foundation for this course.
## 🚀 Challenge
Use the sample code in the notebook to test the following hypotheses:
Use the sample code in the notebook to test other hypotheses:
1. First basemen are older than second basemen.
2. First basemen are taller than third basemen.
3. Shortstops are taller than second basemen.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/7)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
Probability and statistics is such a broad topic that it deserves its own course. If you are interested in diving deeper into the theory, you may want to explore the following resources:
Probability and statistics is a broad topic that deserves its own course. If you want to explore the theory further, consider reading the following books:
1. [Carlos Fernandez-Granda](https://cims.nyu.edu/~cfgranda/) from New York University has excellent lecture notes: [Probability and Statistics for Data Science](https://cims.nyu.edu/~cfgranda/pages/stuff/probability_stats_for_DS.pdf) (available online).
2. [Peter and Andrew Bruce. Practical Statistics for Data Scientists.](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/) [[Sample code in R](https://github.com/andrewgbruce/statistics-for-data-scientists)].
@ -269,9 +268,9 @@ Probability and statistics is such a broad topic that it deserves its own course
## Credits
This lesson was authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)
This lesson was authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "870a0086adbc313a8eea5489bdcb2522",
"translation_date": "2025-08-31T10:58:39+00:00",
"original_hash": "11b166fbcb7eaf82308cdc24b562f687",
"translation_date": "2025-09-05T07:39:59+00:00",
"source_file": "2-Working-With-Data/05-relational-databases/README.md",
"language_code": "en"
}
@ -13,15 +13,15 @@ CO_OP_TRANSLATOR_METADATA:
|:---:|
| Working With Data: Relational Databases - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
You've probably used a spreadsheet before to store information. It consists of rows and columns, where the rows hold the data and the columns describe the data (sometimes referred to as metadata). A relational database builds on this concept of rows and columns in tables, enabling you to spread information across multiple tables. This approach allows you to work with more complex data, reduce duplication, and gain flexibility in how you analyze the data. Let's dive into the basics of relational databases.
Chances are you've used a spreadsheet before to store information. You had rows and columns, where the rows contained the data, and the columns described the data (sometimes called metadata). A relational database is based on this same principle of rows and columns in tables, but it allows you to spread information across multiple tables. This makes it possible to work with more complex data, avoid duplication, and have more flexibility in exploring the data. Let's dive into the concepts of relational databases.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/8)
## It all starts with tables
At the heart of a relational database are tables. Similar to a spreadsheet, a table is a collection of rows and columns. Rows contain the data you want to work with, such as the name of a city or the amount of rainfall, while columns describe the type of data stored.
At the heart of a relational database are tables. Similar to a spreadsheet, a table is a collection of rows and columns. The rows contain the data you want to work with, such as the name of a city or the amount of rainfall, while the columns describe the type of data stored.
Let's start by creating a table to store information about cities. For example, we might want to store their name and country. This could look like the following table:
Let's start by creating a table to store information about cities. We might begin with their name and country. You could organize this in a table like this:
| City | Country |
| -------- | ------------- |
@ -29,11 +29,11 @@ Lets start by creating a table to store information about cities. For example
| Atlanta | United States |
| Auckland | New Zealand |
Notice how the column names **city**, **country**, and **population** describe the data being stored, and each row contains information about a specific city.
Notice how the column names **city**, **country**, and **population** describe the data being stored, and each row contains information about one city.
## The shortcomings of a single table approach
The table above might look familiar to you. Now, let's add more data to our growing database—annual rainfall (in millimeters) for the years 2018, 2019, and 2020. If we were to add this data for Tokyo, it might look like this:
The table above might look familiar to you. Now let's add more data to our growing database—annual rainfall (in millimeters). We'll focus on the years 2018, 2019, and 2020. If we were to add this data for Tokyo, it might look like this:
| City | Country | Year | Amount |
| ----- | ------- | ---- | ------ |
@ -41,9 +41,9 @@ The table above might look familiar to you. Now, lets add more data to our gr
| Tokyo | Japan | 2019 | 1874 |
| Tokyo | Japan | 2018 | 1445 |
What do you notice about this table? You might see that we're repeating the name and country of the city multiple times. This repetition can take up unnecessary storage space. After all, Tokyo only has one name and one country.
What do you notice about this table? You might see that we're repeating the name and country of the city multiple times. This could take up a lot of storage and is unnecessary since Tokyo only has one name and country.
Let's try a different approach by adding new columns for each year:
Let's try another approach. We'll add new columns for each year:
| City | Country | 2018 | 2019 | 2020 |
| -------- | ------------- | ---- | ---- | ---- |
@ -51,13 +51,13 @@ Lets try a different approach by adding new columns for each year:
| Atlanta | United States | 1779 | 1111 | 1683 |
| Auckland | New Zealand | 1386 | 942 | 1176 |
While this eliminates row duplication, it introduces other challenges. For instance, we'd need to modify the table structure every time a new year is added. Additionally, as the dataset grows, having years as columns makes it harder to retrieve and calculate values.
While this avoids repeating rows, it introduces other challenges. We'd need to change the table structure every time a new year is added. Additionally, as the data grows, having years as columns would make it harder to retrieve and calculate values.
This is why relational databases use multiple tables and relationships. By breaking data into separate tables, we can avoid duplication and gain more flexibility in how we work with the data.
This is why we need multiple tables and relationships. By splitting the data into separate tables, we can avoid duplication and gain more flexibility in working with the data.
## The concepts of relationships
Let's revisit our data and decide how to split it into multiple tables. We know we want to store the name and country of each city, so this information can go into one table:
Let's revisit our data and decide how to divide it. We know we want to store the name and country of each city, so this will work best in one table.
| City | Country |
| -------- | ------------- |
@ -65,7 +65,7 @@ Lets revisit our data and decide how to split it into multiple tables. We kno
| Atlanta | United States |
| Auckland | New Zealand |
Before creating the next table, we need a way to reference each city. This requires an identifier, often called an ID or, in database terminology, a primary key. A primary key is a unique value used to identify a specific row in a table. While we could use the city name as the identifier, it's better to use a number or another unique value that won't change. Most primary keys are auto-generated numbers.
Before creating the next table, we need a way to reference each city. We need an identifier, ID, or (in database terms) a primary key. A primary key is a unique value used to identify a specific row in a table. While it could be based on an existing value (like the city name), it's better to use a number or other identifier that won't change. If the ID changes, it would break the relationship. In most cases, the primary key or ID is an auto-generated number.
> ✅ Primary key is often abbreviated as PK
@ -77,9 +77,9 @@ Before creating the next table, we need a way to reference each city. This requi
| 2 | Atlanta | United States |
| 3 | Auckland | New Zealand |
> ✅ Throughout this lesson, you'll notice we use the terms "id" and "primary key" interchangeably. These concepts also apply to DataFrames, which you'll explore later. While DataFrames don't use the term "primary key," they function similarly.
> ✅ You'll notice we use the terms "id" and "primary key" interchangeably in this lesson. These concepts also apply to DataFrames, which you'll explore later. While DataFrames don't use the term "primary key," they behave similarly.
With our cities table created, let's store the rainfall data. Instead of duplicating city information, we can use the city ID. The new table should also have its own ID or primary key.
With our cities table created, let's store the rainfall data. Instead of repeating the full city information, we can use the ID. The new table should also have an *id* column, as all tables should have a primary key.
### rainfall
@ -95,15 +95,15 @@ With our cities table created, lets store the rainfall data. Instead of dupli
| 8 | 3 | 2019 | 942 |
| 9 | 3 | 2020 | 1176 |
Notice the **city_id** column in the **rainfall** table. This column contains values that reference the IDs in the **cities** table. In relational database terms, this is called a **foreign key**—a primary key from another table. You can think of it as a reference or pointer: **city_id** 1 refers to Tokyo.
> [!NOTE] Foreign key is often abbreviated as FK
## Retrieving the data
With our data split into two tables, you might wonder how to retrieve it. Relational databases such as MySQL, SQL Server, and Oracle use a language called Structured Query Language (SQL) for this purpose. SQL (sometimes pronounced "sequel") is a standard language for retrieving and modifying data in relational databases.
To retrieve data, you use the `SELECT` command. Essentially, you **select** the columns you want to see **from** the table they're in. For example, to display just the names of the cities, you could use:
```sql
SELECT city
@ -117,9 +117,9 @@ FROM cities;
`SELECT` specifies the columns, and `FROM` specifies the table.
> [!NOTE] SQL syntax is case-insensitive, meaning `select` and `SELECT` are treated the same. However, depending on the database, column and table names might be case-sensitive. As a best practice, treat everything in programming as case-sensitive. In SQL, it's common to write keywords in uppercase.
The query above will display all cities. If you only want to display cities in New Zealand, you can use a filter. The SQL keyword for filtering is `WHERE`, which specifies a condition.
```sql
SELECT city
@ -132,13 +132,13 @@ WHERE country = 'New Zealand';
## Joining data
So far, we've retrieved data from a single table. Now let's combine data from both **cities** and **rainfall**. This is done by *joining* the tables: you create a connection between the two tables by matching values in a column from each table.
In our example, we'll match the **city_id** column in **rainfall** with the **city_id** column in **cities**. This links the rainfall data to its corresponding city. The type of join we'll use is called an *inner* join, which means rows without a match in the other table won't be displayed. In our case, every city has rainfall data, so all rows will be displayed.
Let's retrieve the rainfall data for 2019 for all cities.
We'll do this step by step. First, join the tables by specifying the columns to connect—**city_id** in both tables.
```sql
SELECT cities.city
@ -147,7 +147,7 @@ FROM cities
INNER JOIN rainfall ON cities.city_id = rainfall.city_id
```
We've highlighted the columns we want and specified that we're joining the tables by **city_id**. Now we can add a `WHERE` clause to filter for the year 2019.
```sql
SELECT cities.city
@ -167,15 +167,15 @@ WHERE rainfall.year = 2019
## Summary
Relational databases are designed to divide information across multiple tables, which can then be combined for display and analysis. This approach provides flexibility for calculations and data manipulation. You've learned the core concepts of relational databases and how to join data from two tables.
## 🚀 Challenge
There are many relational databases available online. Use the skills you've learned to explore and analyze their data.
## Post-Lecture Quiz
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "32ddfef8121650f2ca2f3416fd283c37",
"translation_date": "2025-08-31T10:57:17+00:00",
"original_hash": "54c5a1c74aecb69d2f9099300a4b7eea",
"translation_date": "2025-09-05T07:38:44+00:00",
"source_file": "2-Working-With-Data/06-non-relational/README.md",
"language_code": "en"
}
@ -19,15 +19,15 @@ Data isn't limited to relational databases. This lesson focuses on non-relationa
## Spreadsheets
Spreadsheets are a widely used method for storing and analyzing data because they are easy to set up and require minimal effort to get started. In this lesson, you'll learn the fundamental components of a spreadsheet, along with formulas and functions. Examples are demonstrated using Microsoft Excel, but most spreadsheet software offers similar features and terminology.
![An empty Microsoft Excel workbook with two worksheets](../../../../2-Working-With-Data/06-non-relational/images/parts-of-spreadsheet.png)
A spreadsheet is a file that can be accessed on a computer, device, or cloud-based file system. The software itself might be browser-based or require installation as an application or app. In Excel, these files are referred to as **workbooks**, and this term will be used throughout the lesson.
A workbook contains one or more **worksheets**, each labeled with a tab. Within a worksheet are rectangular areas called **cells**, which hold the actual data. A cell is the intersection of a row and a column, with columns labeled alphabetically and rows labeled numerically. Some spreadsheets include headers in the first few rows to describe the data in the cells.
Using these basic elements of an Excel workbook, we'll explore an example from [Microsoft Templates](https://templates.office.com/) focused on inventory management to demonstrate additional features of a spreadsheet.
### Managing an Inventory
@ -35,15 +35,15 @@ The spreadsheet file named "InventoryExample" is a formatted inventory spreadshe
![A highlighted formula from an example inventory list in Microsoft Excel](../../../../2-Working-With-Data/06-non-relational/images/formula-excel.png)
Sometimes, a cell's value depends on the values of other cells. For example, the Inventory List spreadsheet tracks the cost of each item in the inventory, but what if we need to calculate the total value of the inventory? [**Formulas**](https://support.microsoft.com/en-us/office/overview-of-formulas-34519a4e-1e8d-4f4b-84d4-d642c4f63263) perform operations on cell data, and here a formula in the Inventory Value column calculates the value of each item by multiplying the quantity (under the QTY header) by the cost (under the COST header). Double-clicking or highlighting a cell reveals the formula. Formulas always start with an equals sign, followed by the calculation or operation.
![A highlighted function from an example inventory list in Microsoft Excel](../../../../2-Working-With-Data/06-non-relational/images/function-excel.png)
To find the total inventory value, we can use another formula to sum all the values in the Inventory Value column. While adding each cell manually is possible, it can be tedious. Excel provides [**functions**](https://support.microsoft.com/en-us/office/sum-function-043e1c7d-7726-4e80-8f32-07b23e057f89), which are predefined formulas for performing calculations on cell values. Functions require arguments, which are the values needed for the calculation. If a function requires multiple arguments, they must be listed in a specific order to ensure accurate results. In this example, the SUM function adds up the values in the Inventory Value column, with the total displayed in row 3, column B (B3).
## NoSQL
NoSQL is a broad term encompassing various methods for storing non-relational data. It can be interpreted as "non-SQL," "non-relational," or "not only SQL." These database systems are generally categorized into four types.
![Graphical representation of a key-value data store showing 4 unique numerical keys that are associated with 4 various values](../../../../2-Working-With-Data/06-non-relational/images/kv-db.png)
> Source from [Michał Białecki Blog](https://www.michalbialecki.com/2018/03/18/azure-cosmos-db-key-value-database-cloud/)
@ -53,19 +53,19 @@ NoSQL is a broad term encompassing various methods for storing non-relational da
![Graphical representation of a graph data store showing the relationships between people, their interests and locations](../../../../2-Working-With-Data/06-non-relational/images/graph-db.png)
> Source from [Microsoft](https://docs.microsoft.com/en-us/azure/cosmos-db/graph/graph-introduction#graph-database-by-example)
[Graph](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#graph-data-stores) databases represent relationships between data as collections of nodes and edges. Nodes represent entities, such as a student or bank statement, while edges represent relationships between entities. Both nodes and edges have properties that provide additional information.
![Graphical representation of a columnar data store showing a customer database with two column families named Identity and Contact Info](../../../../2-Working-With-Data/06-non-relational/images/columnar-db.png)
[Columnar](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#columnar-data-stores) data stores organize data into rows and columns, similar to relational databases. However, columns are grouped into column families, where all data within a column family is related and can be retrieved or modified as a single unit.
### Document Data Stores with Azure Cosmos DB
[Document](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#document-data-stores) data stores expand on the concept of key-value stores by using fields and objects. This section explores document databases using the Cosmos DB emulator.
Cosmos DB fits the "Not Only SQL" definition, as its document database uses SQL for querying data. The [previous lesson](../05-relational-databases/README.md) on SQL covers the basics of the language, and we'll apply some of those concepts to a document database here. The Cosmos DB Emulator allows you to create and explore a document database locally on your computer. Learn more about the Emulator [here](https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator?tabs=ssl-netstd21).
A document consists of fields and object values, where fields describe the object values. Below is an example of a document.
```json
{
@ -80,17 +80,17 @@ A document is a collection of fields and object values, where fields describe th
}
```
Key fields in this document include `firstname`, `id`, and `age`. The other fields with underscores are automatically generated by Cosmos DB.
#### Exploring Data with the Cosmos DB Emulator
You can download and install the emulator [for Windows here](https://aka.ms/cosmosdb-emulator). For macOS and Linux, refer to this [documentation](https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator?tabs=ssl-netstd21#run-on-linux-macos).
The Emulator opens in a browser window, where the Explorer view lets you navigate documents.
![The Explorer view of the Cosmos DB Emulator](../../../../2-Working-With-Data/06-non-relational/images/cosmosdb-emulator-explorer.png)
If you're following along, click "Start with Sample" to generate a sample database called SampleDB. Expanding SampleDB reveals a container called `Persons`. A container holds a collection of items, which are the documents within it. You can explore the four individual documents under `Items`.
![Exploring sample data in the Cosmos DB Emulator](../../../../2-Working-With-Data/06-non-relational/images/cosmosdb-emulator-persons.png)
@ -98,7 +98,7 @@ If you're following along, click "Start with Sample" to generate a sample databa
You can query the sample data by clicking the "New SQL Query" button (second button from the left).
`SELECT * FROM c` retrieves all documents in the container. To find everyone younger than 40, add a WHERE clause:
`SELECT * FROM c where c.age < 40`
@ -110,7 +110,7 @@ The query returns two documents, both with age values less than 40.
If you're familiar with JavaScript Object Notation (JSON), you'll notice that documents resemble JSON. A `PersonsData.json` file in this directory contains additional data that can be uploaded to the Persons container in the Emulator using the `Upload Item` button.
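To make the connection to code concrete, here is a small, hypothetical Python sketch (it deliberately avoids the Cosmos DB SDK): documents are JSON objects, so they map naturally onto Python dictionaries, and the list comprehension mirrors the `WHERE c.age < 40` query above. The names and ages are invented.

```python
import json

# A hypothetical collection of documents, similar in shape to the Persons sample
raw = """
[
  {"id": "1", "firstname": "Eva", "age": 44},
  {"id": "2", "firstname": "Franklin", "age": 32},
  {"id": "3", "firstname": "Hanzla", "age": 28}
]
"""

persons = json.loads(raw)

# Rough equivalent of: SELECT * FROM c WHERE c.age < 40
under_40 = [doc for doc in persons if doc["age"] < 40]
print(under_40)
```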
APIs that return JSON data can often be directly stored in document databases. Below is another document, representing tweets from the Microsoft Twitter account. This data was retrieved using the Twitter API and inserted into Cosmos DB.
```json
{
@ -128,25 +128,25 @@ Key fields in this document include `created_at`, `id`, and `text`.
## 🚀 Challenge
A `TwitterData.json` file is available for upload to the SampleDB database. It's recommended to add it to a separate container. To do this:
1. Click the "New Container" button in the top right.
2. Select the existing database (SampleDB) and enter a container ID for the new container.
3. Set the partition key to `/id`.
4. Click OK (you can ignore the other settings since this is a small dataset running locally).
5. Open your new container and upload the Twitter Data file using the `Upload Item` button.
Try running a few SELECT queries to find documents containing "Microsoft" in the text field. Hint: Use the [LIKE keyword](https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-keywords#using-like-with-the--wildcard-character).
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
- This lesson doesn't cover all the formatting and features available in spreadsheets. Microsoft offers a [comprehensive library of documentation and videos](https://support.microsoft.com/excel) for Excel if you'd like to learn more.
- This architectural documentation explains the characteristics of different types of non-relational data: [Non-relational Data and NoSQL](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data).
- Cosmos DB is a cloud-based non-relational database that supports the NoSQL types discussed in this lesson. Learn more through this [Cosmos DB Microsoft Learn Module](https://docs.microsoft.com/en-us/learn/paths/work-with-nosql-data-in-azure-cosmos-db/).
## Assignment
@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "116c5d361fbe812e59a73f37ce721d36",
"translation_date": "2025-08-31T10:57:48+00:00",
"original_hash": "57f7db1f4c3ae3361c1d1fbafcdd690c",
"translation_date": "2025-09-05T07:39:13+00:00",
"source_file": "2-Working-With-Data/07-python/README.md",
"language_code": "en"
}
@ -15,55 +15,57 @@ CO_OP_TRANSLATOR_METADATA:
[![Intro Video](../../../../2-Working-With-Data/07-python/images/video-ds-python.png)](https://youtu.be/dZjWOGbsN4Y)
Databases provide efficient ways to store and query data using query languages, but the most flexible method for processing data is writing your own program to manipulate it. Often, database queries are more effective, but in cases where complex data processing is required, SQL may not be sufficient.
Data processing can be done in any programming language, but some languages are better suited for working with data. Data scientists typically use one of the following:
* **[Python](https://www.python.org/)**: A general-purpose programming language often considered ideal for beginners due to its simplicity. Python has numerous libraries that can help solve practical problems, such as extracting data from ZIP archives or converting images to grayscale. Beyond data science, Python is widely used for web development.
* **[R](https://www.r-project.org/)**: A traditional tool designed for statistical data processing. It has a large library repository (CRAN), making it a strong choice for data analysis. However, R is not a general-purpose language and is rarely used outside the data science domain.
* **[Julia](https://julialang.org/)**: A language specifically developed for data science, offering better performance than Python, making it ideal for scientific experimentation.
In this lesson, we will focus on using Python for basic data processing. We assume you have a basic understanding of the language. For a deeper dive into Python, consider the following resources:
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse) - A quick GitHub-based introduction to Python programming.
* [Take your First Steps with Python](https://docs.microsoft.com/en-us/learn/paths/python-first-steps/?WT.mc_id=academic-77958-bethanycheum) - A learning path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=academic-77958-bethanycheum).
Data can exist in various forms. In this lesson, we will focus on three types: **tabular data**, **text**, and **images**.
Rather than covering all related libraries, we will focus on a few examples of data processing. This approach will give you a clear understanding of what's possible and help you know where to find solutions when needed.
> **Best advice**: If you're unsure how to perform a specific data operation, search online. [Stackoverflow](https://stackoverflow.com/) often has useful Python code samples for common tasks.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/12)
## Tabular Data and Dataframes
You've already encountered tabular data when discussing relational databases. When dealing with large datasets stored in multiple linked tables, SQL is often the best tool. However, there are situations where you have a single table of data and need to derive **insights** or **understanding** from it, such as analyzing distributions or correlations between values. In data science, transforming the original data and visualizing it are common tasks, and Python makes both steps straightforward.
Two key Python libraries for working with tabular data are:
* **[Pandas](https://pandas.pydata.org/)**: Enables manipulation of **DataFrames**, which are similar to relational tables. You can work with named columns and perform operations on rows, columns, or entire DataFrames.
* **[Numpy](https://numpy.org/)**: A library for working with **tensors**, or multi-dimensional **arrays**. Arrays contain values of the same type and are simpler than DataFrames, offering more mathematical operations with less overhead.
Other useful libraries include:
* **[Matplotlib](https://matplotlib.org/)**: Used for data visualization and graph plotting.
* **[SciPy](https://www.scipy.org/)**: Provides additional scientific functions. You may recall this library from discussions on probability and statistics.
Heres a typical code snippet for importing these libraries at the start of a Python program:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import ... # you need to specify exact sub-packages that you need
```
Pandas revolves around a few fundamental concepts.
### Series
A **Series** is a sequence of values, similar to a list or numpy array. The key difference is that a Series also has an **index**, which is considered during operations (e.g., addition). The index can be as simple as an integer row number (the default when creating a Series from a list or array) or more complex, such as a date range.
> **Note**: Some introductory Pandas code is available in the accompanying notebook [`notebook.ipynb`](../../../../2-Working-With-Data/07-python/notebook.ipynb). We'll outline a few examples here, but feel free to explore the full notebook.
For example, let's analyze sales data for an ice cream shop. We'll generate a Series of sales numbers (items sold each day) over a specific time period:
```python
start_date = "Jan 1, 2020"
end_date = "Mar 31, 2020"
@ -71,49 +73,48 @@ idx = pd.date_range(start_date,end_date)
print(f"Length of index is {len(idx)}")
items_sold = pd.Series(np.random.randint(25,50,size=len(idx)),index=idx)
items_sold.plot()
```
![Time Series Plot](../../../../2-Working-With-Data/07-python/images/timeseries-1.png)
Suppose we host a weekly party and take an additional 10 packs of ice cream for the event. We can create another Series, indexed by week, to represent this:
```python
additional_items = pd.Series(10,index=pd.date_range(start_date,end_date,freq="W"))
```
Adding the two Series gives us the total number:
```python
total_items = items_sold.add(additional_items,fill_value=0)
total_items.plot()
```
![Time Series Plot](../../../../2-Working-With-Data/07-python/images/timeseries-2.png)
> **Note**: We don't use the simple syntax `total_items + additional_items`. Doing so would result in many `NaN` (*Not a Number*) values in the resulting Series, because missing values in the `additional_items` index lead to `NaN` when added. To avoid this, we specify the `fill_value` parameter during addition.
With time series, we can also **resample** data at different intervals. For instance, to calculate average monthly sales:
```python
monthly = total_items.resample("1M").mean()
ax = monthly.plot(kind='bar')
```
![Monthly Time Series Averages](../../../../2-Working-With-Data/07-python/images/timeseries-3.png)
### DataFrame
A DataFrame is essentially a collection of Series with the same index. You can combine multiple Series into a DataFrame:
```python
a = pd.Series(range(1,10))
b = pd.Series(["I","like","to","use","Python","and","Pandas","very","much"],index=range(0,9))
df = pd.DataFrame([a,b])
```
This creates a horizontal table like this:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | ---- | --- | --- | ------ | --- | ------ | ---- | ---- |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 1 | I | like | to | use | Python | and | Pandas | very | much |
You can also use Series as columns and specify column names using a dictionary:
```python
df = pd.DataFrame({ 'A' : a, 'B' : b })
```
This results in a table like this:
| | A | B |
| --- | --- | ------ |
@ -127,39 +128,39 @@ This results in the following table:
| 7 | 8 | very |
| 8 | 9 | much |
**Note**: You can achieve this layout by transposing the previous table using:
```python
df = pd.DataFrame([a,b]).T.rename(columns={ 0 : 'A', 1 : 'B' })
```
Here, `.T` transposes the DataFrame (swapping rows and columns), and `rename` allows column renaming to match the previous example.
Key operations on DataFrames include:
**Column selection**: Select individual columns with `df['A']` (returns a Series) or a subset of columns with `df[['B','A']]` (returns another DataFrame).
**Filtering rows**: Filter rows based on criteria, e.g., `df[df['A']>5]` keeps rows where column `A` is greater than 5.
> **Note**: Filtering works by creating a boolean Series (`df['A']<5`) that indicates whether the condition is `True` or `False` for each element. Using this boolean Series as an index returns the filtered rows. Avoid using regular Python boolean expressions like `df[df['A']>5 and df['A']<7]`. Instead, use `&` for boolean Series: `df[(df['A']>5) & (df['A']<7)]` (*brackets are required*).
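As a quick, hedged illustration of selection and filtering, the snippet below rebuilds the example DataFrame so it runs on its own (assuming the "use Python and Pandas" words shown in the tables above):

```python
import pandas as pd

# Rebuild the example DataFrame from above
a = pd.Series(range(1, 10))
b = pd.Series(["I", "like", "to", "use", "Python", "and", "Pandas", "very", "much"])
df = pd.DataFrame({'A': a, 'B': b})

# Select one column as a Series, or a subset of columns as a new DataFrame
col_a = df['A']
subset = df[['B', 'A']]

# Boolean filtering: note the & operator and the parentheses around each condition
middle = df[(df['A'] > 5) & (df['A'] < 7)]
print(middle)  # the single row where A == 6 ("and")
```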
**Creating new columns**: Add new columns with expressions like:
```python
df['DivA'] = df['A']-df['A'].mean()
```
This calculates the divergence of `A` from its mean value. The operation computes a Series and assigns it to the new column. Avoid incompatible operations, such as:
```python
# Wrong code -> df['ADescr'] = "Low" if df['A'] < 5 else "Hi"
df['LenB'] = len(df['B']) # <- Wrong result
```
This example assigns the length of Series `B` to all values in the column, not the length of individual elements.
For complex expressions, use the `apply` function:
```python
df['LenB'] = df['B'].apply(lambda x : len(x))
# or
df['LenB'] = df['B'].apply(len)
```
After these operations, the DataFrame looks like this:
| | A | B | DivA | LenB |
| --- | --- | ------ | ---- | ---- |
@ -173,22 +174,22 @@ After these operations, the resulting DataFrame will look like this:
| 7 | 8 | very | 3.0 | 4 |
| 8 | 9 | much | 4.0 | 4 |
**Selecting rows by index**: Use `iloc` to select rows by position, e.g., the first 5 rows:
```python
df.iloc[:5]
```
**Grouping**: Grouping is useful for results similar to *pivot tables* in Excel. For example, to calculate the mean of column `A` for each unique `LenB` value:
```python
df.groupby(by='LenB').mean()
```
To compute both the mean and the count of elements in each group, use the `aggregate` function:
```python
df.groupby(by='LenB') \
.aggregate({ 'DivA' : len, 'A' : lambda x: x.mean() }) \
.rename(columns={ 'DivA' : 'Count', 'A' : 'Mean'})
```
This produces the following table:
| LenB | Count | Mean |
| ---- | ----- | -------- |
@ -199,7 +200,7 @@ This produces the following table:
| 6 | 2 | 6.000000 |
### Getting Data
We have seen how simple it is to create Series and DataFrames from Python objects. However, data is often stored in text files or Excel tables. Fortunately, Pandas provides an easy way to load data from disk. For instance, reading a CSV file is as straightforward as this:
```python
df = pd.read_csv('file.csv')
```
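A couple of hedged variations on the same idea; the file names and the `date` column are placeholders, and `read_excel` needs an engine such as `openpyxl` installed.

```python
import pandas as pd

# Read a semicolon-separated file and parse a date column while loading
df = pd.read_csv('file.csv', sep=';', parse_dates=['date'])

# Excel workbooks can be loaded in a similar way
sales = pd.read_excel('inventory.xlsx', sheet_name=0)

# Saving a DataFrame back to disk is just as short
df.to_csv('cleaned.csv', index=False)
```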
@ -211,13 +212,13 @@ A Data Scientist frequently needs to explore data, so being able to visualize it
We've also seen how to use the `plot` function to visualize specific columns. While `plot` is highly versatile and supports various graph types via the `kind=` parameter, you can always use the raw `matplotlib` library for more complex visualizations. We will delve deeper into data visualization in separate course lessons.
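For instance, a minimal plotting sketch in the same spirit as the ice cream example above (the data here is synthetic):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# A small synthetic daily series, similar to the sales data used earlier
idx = pd.date_range("Jan 1, 2020", "Mar 31, 2020")
sales = pd.Series(np.random.randint(25, 50, size=len(idx)), index=idx)

sales.plot()          # line plot of the raw daily values
plt.show()

sales.resample("1M").mean().plot(kind='bar')  # monthly averages as a bar chart
plt.show()
```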
This overview covers the key concepts of Pandas, but the library is incredibly rich, offering endless possibilities! Let's now apply this knowledge to solve specific problems.
## 🚀 Challenge 1: Analyzing COVID Spread
The first problem we'll tackle is modeling the spread of the COVID-19 epidemic. To do this, we'll use data on the number of infected individuals in various countries, provided by the [Center for Systems Science and Engineering](https://systems.jhu.edu/) (CSSE) at [Johns Hopkins University](https://jhu.edu/). The dataset is available in [this GitHub Repository](https://github.com/CSSEGISandData/COVID-19).
To demonstrate how to work with data, we encourage you to open [`notebook-covidspread.ipynb`](../../../../2-Working-With-Data/07-python/notebook-covidspread.ipynb) and read through it from start to finish. You can also execute the cells and try out some challenges we've included at the end.
![COVID Spread](../../../../2-Working-With-Data/07-python/images/covidspread.png)
@ -239,7 +240,7 @@ A complete example of analyzing this dataset using the [Text Analytics for Healt
> **NOTE**: This repository does not include a copy of the dataset. You may need to download the [`metadata.csv`](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv) file from [this Kaggle dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Registration with Kaggle may be required. Alternatively, you can download the dataset without registration [from here](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html), which includes all full texts in addition to the metadata file.
Open [`notebook-papers.ipynb`](../../../../2-Working-With-Data/07-python/notebook-papers.ipynb) and read through it from start to finish. You can also execute the cells and try out some challenges we've included at the end.
![Covid Medical Treatment](../../../../2-Working-With-Data/07-python/images/covidtreat.png)
@ -247,21 +248,21 @@ Open [`notebook-papers.ipynb`](../../../../2-Working-With-Data/07-python/noteboo
Recently, powerful AI models have been developed to analyze images. Many tasks can be accomplished using pre-trained neural networks or cloud services. Examples include:
* **Image Classification**, which categorizes images into predefined classes. You can train your own classifiers using services like [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum).
* **Object Detection**, which identifies various objects in an image. Services like [Computer Vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum) can detect common objects, and you can train [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum) models to detect specific objects of interest.
* **Face Detection**, including age, gender, and emotion recognition. This can be achieved using [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum).
All these cloud services can be accessed via [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=academic-77958-bethanycheum), making it easy to integrate them into your data exploration workflow.
Here are some examples of working with image data sources:
* In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/), we analyze Instagram photos to understand what makes people like a photo more. We extract information from images using [Computer Vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum) and use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=academic-77958-bethanycheum) to build an interpretable model.
* In [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies), we use [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum) to analyze emotions in event photographs to understand what makes people happy.
## Conclusion
Whether you're working with structured or unstructured data, Python allows you to perform all steps related to data processing and analysis. It's one of the most flexible tools for data processing, which is why most data scientists use Python as their primary tool. If you're serious about pursuing data science, learning Python in depth is highly recommended!
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
@ -282,7 +283,7 @@ Whether youre working with structured or unstructured data, Python allows you
## Credits
This lesson was created with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)
---
@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "3ade580a06b5f04d57cc83a768a8fb77",
"translation_date": "2025-08-31T10:59:11+00:00",
"original_hash": "90a815d332aea41a222f4c6372e7186e",
"translation_date": "2025-09-05T07:40:23+00:00",
"source_file": "2-Working-With-Data/08-data-preparation/README.md",
"language_code": "en"
}
@ -21,24 +21,24 @@ Raw data, depending on its source, may have inconsistencies that make analysis a
- **Ease of use and reuse**: Properly organized and normalized data is easier to search, use, and share with others.
- **Consistency**: Data science often involves working with multiple datasets, which may need to be combined. Ensuring that each dataset follows common standards helps maintain its usefulness when merged.
- **Model accuracy**: Clean data improves the accuracy of models that depend on it.
## Common cleaning goals and strategies
- **Exploring a dataset**: Data exploration, covered in a [later lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/4-Data-Science-Lifecycle/15-analyzing), helps identify data that needs cleaning. Observing values visually can set expectations or highlight issues to resolve. Exploration may involve querying, visualizations, and sampling.
- **Formatting**: Data from different sources may have inconsistencies in presentation, causing issues in searches or visualizations. Common formatting problems include whitespace, dates, and data types. Resolving these issues often depends on the user's needs, as standards for dates and numbers vary by region.
- **Duplications**: Duplicate data can lead to inaccurate results and often needs to be removed. This is common when merging datasets. However, some duplicates may contain additional information and should be preserved (see the short sketch after this list).
- **Missing Data**: Missing data can lead to inaccuracies or biased results. Solutions include reloading the data, filling in missing values programmatically, or removing the affected data. The approach depends on the reasons behind the missing data.
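As a minimal, hedged sketch of the duplication goal above (the toy rows are made up), pandas can flag repeated rows and keep only the first occurrence:

```python
import pandas as pd

# A tiny made-up dataset in which one row appears twice
df = pd.DataFrame({
    "name": ["Ana", "Bruno", "Ana"],
    "city": ["Lisbon", "Porto", "Lisbon"],
})

print(df.duplicated())       # the second "Ana, Lisbon" row is marked True
print(df.drop_duplicates())  # keeps the first occurrence of each row
```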
## Exploring DataFrame information
> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.
Once data is loaded into pandas, it is typically stored in a DataFrame (refer to the previous [lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/2-Working-With-Data/07-python#dataframe) for details). If your DataFrame contains 60,000 rows and 400 columns, how do you start understanding it? Fortunately, [pandas](https://pandas.pydata.org/) offers tools to quickly view overall information about a DataFrame, as well as its first and last few rows.
To explore this functionality, we will use the Python scikit-learn library and the well-known **Iris dataset**.
@ -57,7 +57,7 @@ iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
|3 |4.6 |3.1 |1.5 |0.2 |
|4 |5.0 |3.6 |1.4 |0.2 |
- **DataFrame.info**: The `info()` method provides a summary of the content in a `DataFrame`. Let's examine this dataset:
```python
iris_df.info()
```
@ -73,7 +73,7 @@ Data columns (total 4 columns):
dtypes: float64(4)
memory usage: 4.8 KB
```
This tells us the *Iris* dataset has 150 entries across four columns, with no null values. All data is stored as 64-bit floating-point numbers.
- **DataFrame.head()**: To view the first few rows of the `DataFrame`, use the `head()` method:
```python
@ -87,7 +87,7 @@ iris_df.head()
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
```
- **DataFrame.tail()**: To view the last few rows, use the `tail()` method:
```python
iris_df.tail()
```
@ -99,18 +99,18 @@ iris_df.tail()
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
```
> **Takeaway:** By examining metadata or the first and last few rows of a DataFrame, you can quickly understand its size, structure, and content.
## Dealing with Missing Data
> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.
Datasets often contain missing values. How you handle missing data can impact your analysis and real-world outcomes.
Pandas uses two methods to represent missing values: `NaN` (Not a Number) for floating-point values and `None` for other types. While this dual approach may seem confusing, it provides flexibility for most use cases. However, both `None` and `NaN` have limitations you should be aware of.
Learn more about `NaN` and `None` in the [notebook](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb)!
- **Detecting null values**: Use the `isnull()` and `notnull()` methods in pandas to detect null data. Both return Boolean masks over your data. We'll use `numpy` for `NaN` values:
```python
import numpy as np
@ -124,13 +124,13 @@ example1.isnull()
3 True
dtype: bool
```
Notice that `0` is treated as a valid integer, not null. Similarly, `''` (an empty string) is considered a valid string, not null.
You can use Boolean masks directly as a `Series` or `DataFrame` index to isolate missing or present values.
> **Takeaway**: The `isnull()` and `notnull()` methods provide results with indices, making it easier to work with your data.
> **Takeaway**: The `isnull()` and `notnull()` methods provide results along with their indices, making it easier to work with your data.
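As a quick sketch of that masking idea (assuming `example1` is the small `Series` of `0`, `np.nan`, `''`, and `None` discussed above), you can index with the mask directly:

```python
import numpy as np
import pandas as pd

example1 = pd.Series([0, np.nan, '', None])

# Keep only the entries pandas treats as missing (NaN and None)
print(example1[example1.isnull()])

# Keep only the entries that are present (0 and '' count as valid values)
print(example1[example1.notnull()])
```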
- **Dropping null values**: Pandas offers a convenient way to remove null values from `Series` and `DataFrame`s. For large datasets, removing missing values is often more practical than other approaches. Let's revisit `example1`:
- **Dropping null values**: Pandas allows you to remove null values from `Series` and `DataFrame`s. For large datasets, removing missing values is often more practical than other approaches. Let's revisit `example1`:
```python
example1 = example1.dropna()
example1
@ -142,7 +142,7 @@ dtype: object
```
This output matches `example3[example3.notnull()]`, but `dropna` removes missing values directly from the `Series`.
For `DataFrame`s, you can drop entire rows or columns. By default, `dropna()` removes rows with any null values:
For `DataFrame`s, you can drop rows or columns containing null values:
```python
example2 = pd.DataFrame([[1, np.nan, 7],
[2, 5, 8],
@ -155,9 +155,7 @@ example2
|1 |2.0|5.0|8 |
|2 |NaN|6.0|9 |
(Pandas converts columns to floats to accommodate `NaN`s.)
To drop columns with null values, use `axis=1`:
You can drop rows (default behavior) or columns using `axis=1`:
```python
example2.dropna()
```
@ -165,7 +163,16 @@ example2.dropna()
0 1 2
1 2.0 5.0 8
```
You can also drop rows or columns with all null values using `how='all'`. For finer control, use the `thresh` parameter to specify the minimum number of non-null values required to keep a row or column:
```python
example2.dropna(axis='columns')
```
```
2
0 7
1 8
2 9
```
To drop rows or columns with all null values, use `how='all'`. To drop based on a threshold of non-null values, use the `thresh` parameter:
```python
example2[3] = np.nan
example2
@ -175,7 +182,6 @@ example2
|0 |1.0|NaN|7 |NaN|
|1 |2.0|5.0|8 |NaN|
|2 |NaN|6.0|9 |NaN|
```python
example2.dropna(axis='rows', thresh=3)
```
@ -183,9 +189,8 @@ example2.dropna(axis='rows', thresh=3)
0 1 2 3
1 2.0 5.0 8 NaN
```
Here, rows with fewer than three non-null values are dropped.
- **Filling null values**: Instead of dropping null values, you can replace them with valid ones using `fillna`. This method is more efficient than manually replacing values. Let's create another example `Series`:
- **Filling null values**: Instead of dropping null values, you can replace them with valid ones using `fillna`. Let's create another example `Series`:
```python
example3 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example3
@ -198,7 +203,7 @@ d NaN
e 3.0
dtype: float64
```
You can replace all null entries with a single value, like `0`:
Replace all null entries with a single value, like `0`:
```python
example3.fillna(0)
```
@ -210,7 +215,7 @@ d 0.0
e 3.0
dtype: float64
```
You can **forward-fill** null values using the last valid value:
Use **forward-fill** to propagate the last valid value:
```python
example3.fillna(method='ffill')
```
@ -222,7 +227,7 @@ d 2.0
e 3.0
dtype: float64
```
You can also **back-fill** null values using the next valid value:
Use **back-fill** to propagate the next valid value backward:
```python
example3.fillna(method='bfill')
```
@ -234,7 +239,7 @@ d 3.0
e 3.0
dtype: float64
```
This works similarly for `DataFrame`s, where you can specify an `axis` for filling null values. Using `example2` again:
For `DataFrame`s, you can specify an `axis` to fill null values:
```python
example2.fillna(method='ffill', axis=1)
```
@ -245,15 +250,15 @@ example2.fillna(method='ffill', axis=1)
2 NaN 6.0 9.0 9.0
```
If no previous value exists for forward-filling, the null value remains.
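Beyond forward- and back-filling, another common pattern (a sketch, not part of the snippet above) is to fill each column's gaps with a summary statistic such as the column mean:

```python
import numpy as np
import pandas as pd

example2 = pd.DataFrame([[1, np.nan, 7],
                         [2, 5, 8],
                         [np.nan, 6, 9]])

# fillna accepts a Series, so passing the column means fills each NaN
# with the mean of its own column
filled = example2.fillna(example2.mean())
print(filled)
```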
> **Takeaway:** There are several ways to handle missing values in your datasets. The specific approach you choose (removing them, replacing them, or even how you replace them) should depend on the characteristics of the data. The more you work with and explore datasets, the better you'll become at managing missing values.
> **Takeaway:** There are several ways to handle missing values in your datasets. The approach you choose—whether it's removing them, replacing them, or deciding how to replace them—should depend on the specifics of your data. The more you work with datasets, the better you'll understand how to manage missing values effectively.
## Removing duplicate data
> **Learning goal:** By the end of this subsection, you should feel confident identifying and removing duplicate values from DataFrames.
In addition to missing data, real-world datasets often contain duplicate entries. Luckily, `pandas` offers a straightforward way to detect and remove duplicates.
In addition to missing data, duplicated data is another common issue in real-world datasets. Luckily, `pandas` makes it simple to detect and remove duplicate entries.
- **Identifying duplicates: `duplicated`**: You can easily identify duplicate values using the `duplicated` method in pandas. This method returns a Boolean mask that indicates whether an entry in a `DataFrame` is a duplicate of a previous one. Let's create another example `DataFrame` to see how this works.
- **Identifying duplicates: `duplicated`**: You can easily identify duplicate values using the `duplicated` method in pandas. This method returns a Boolean mask that indicates whether an entry in a `DataFrame` is a duplicate of a previous one. Let's create another example `DataFrame` to see how this works.
```python
example4 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
'numbers': [1, 2, 1, 3, 3]})
@ -278,7 +283,7 @@ example4.duplicated()
4 True
dtype: bool
```
- **Dropping duplicates: `drop_duplicates`:** This method simply returns a copy of the data where all `duplicated` values are `False`:
- **Dropping duplicates: `drop_duplicates`:** This method returns a copy of the data containing only the rows for which `duplicated()` is `False`, i.e., with the duplicate entries removed:
```python
example4.drop_duplicates()
```
@ -298,17 +303,17 @@ letters numbers
1 B 2
```
> **Takeaway:** Removing duplicate data is a crucial step in almost every data science project. Duplicate data can skew your analysis and lead to inaccurate results!
> **Takeaway:** Removing duplicate data is a crucial step in nearly every data science project. Duplicate data can skew your analyses and lead to inaccurate results!
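If it helps, here is a short sketch of two related options on the same `example4` frame: counting how many rows are flagged before dropping them, and deduplicating on a subset of columns (the `subset` and `keep` parameters are standard pandas):

```python
import pandas as pd

example4 = pd.DataFrame({'letters': ['A', 'B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})

# How many rows are duplicates of an earlier row?
print(example4.duplicated().sum())

# Treat rows as duplicates whenever 'letters' repeats, keeping the last occurrence
print(example4.drop_duplicates(subset=['letters'], keep='last'))
```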
## 🚀 Challenge
All the materials covered are available as a [Jupyter Notebook](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/2-Working-With-Data/08-data-preparation/notebook.ipynb). Additionally, there are exercises at the end of each section—give them a try!
All the materials covered in this lesson are available as a [Jupyter Notebook](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/2-Working-With-Data/08-data-preparation/notebook.ipynb). Additionally, there are exercises at the end of each section—give them a try!
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/15)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
There are many ways to explore and approach preparing your data for analysis and modeling. Cleaning your data is a critical step that requires hands-on practice. Try these Kaggle challenges to learn techniques not covered in this lesson:
There are many ways to explore and approach preparing your data for analysis and modeling. Cleaning your data is a critical step that requires hands-on experience. Try these Kaggle challenges to learn techniques not covered in this lesson:
- [Data Cleaning Challenge: Parsing Dates](https://www.kaggle.com/rtatman/data-cleaning-challenge-parsing-dates/)

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "43c402d9d90ae6da55d004519ada5033",
"translation_date": "2025-08-31T11:05:55+00:00",
"original_hash": "69b32b6789a91f796ebc7a02f5575e03",
"translation_date": "2025-09-05T07:43:38+00:00",
"source_file": "3-Data-Visualization/09-visualization-quantities/README.md",
"language_code": "en"
}
@ -19,7 +19,7 @@ In this lesson, you'll learn how to use one of the many Python libraries availab
## Observe wingspan with Matplotlib
[Matplotlib](https://matplotlib.org/stable/index.html) is an excellent library for creating both simple and complex plots and charts of various types. Generally, the process of plotting data with these libraries involves identifying the parts of your dataframe to target, performing any necessary transformations, assigning x and y axis values, choosing the type of plot, and displaying the plot. Matplotlib offers a wide range of visualizations, but for this lesson, we'll focus on those best suited for visualizing quantities: line charts, scatterplots, and bar plots.
[Matplotlib](https://matplotlib.org/stable/index.html) is an excellent library for creating both simple and complex plots and charts of various types. Generally, the process of plotting data with these libraries involves identifying the parts of your dataframe to target, performing any necessary transformations, assigning x and y axis values, choosing the type of plot, and then displaying it. Matplotlib offers a wide range of visualizations, but for this lesson, we'll focus on those best suited for visualizing quantities: line charts, scatterplots, and bar plots.
> ✅ Choose the chart type that best fits your data structure and the story you want to tell.
> - To analyze trends over time: line
@ -29,7 +29,7 @@ In this lesson, you'll learn how to use one of the many Python libraries availab
> - To show trends: line, column
> - To show relationships between values: line, scatterplot, bubble
If you have a dataset and need to determine how much of a specific item is included, one of your first tasks will be to inspect its values.
If you have a dataset and need to determine how much of a particular item is included, one of your first tasks will be to inspect its values.
✅ There are excellent 'cheat sheets' for Matplotlib available [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
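The import-and-plot step itself is elided by the diff above; a minimal sketch of what it plausibly looks like (the CSV path and the `MaxWingspan` column name are assumptions based on the rest of the lesson):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed location of the Minnesota birds data within the lesson repo
birds = pd.read_csv('../../data/birds.csv')

# Plot the maximum wingspan column as a simple line chart
wingspan = birds['MaxWingspan']
wingspan.plot()
plt.show()
```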
@ -63,11 +63,11 @@ wingspan.plot()
```
![Max Wingspan](../../../../3-Data-Visualization/09-visualization-quantities/images/max-wingspan-02.png)
What stands out immediately? There seems to be at least one outlier—what a wingspan! A 2300-centimeter wingspan equals 23 meters—are there Pterodactyls in Minnesota? Let's investigate.
What stands out immediately? There appears to be at least one outlier—what a wingspan! A 2300-centimeter wingspan equals 23 meters—are there Pterodactyls in Minnesota? Let's investigate.
While you could quickly sort the data in Excel to find these outliers (likely typos), continue the visualization process by working directly within the plot.
Add labels to the x-axis to show the types of birds in question:
Add labels to the x-axis to indicate the types of birds being analyzed:
```
plt.title('Max Wingspan in Centimeters')
@ -81,9 +81,9 @@ plt.plot(x, y)
plt.show()
```
![Wingspan with labels](../../../../3-Data-Visualization/09-visualization-quantities/images/max-wingspan-labels-02.png)
![wingspan with labels](../../../../3-Data-Visualization/09-visualization-quantities/images/max-wingspan-labels-02.png)
Even with the labels rotated 45 degrees, there are too many to read. Let's try a different approach: label only the outliers and set the labels within the chart. You can use a scatter chart to make room for the labeling:
Even with the labels rotated 45 degrees, there are too many to read. Let's try a different approach: label only the outliers and place the labels within the chart. You can use a scatter chart to make room for the labeling:
```python
plt.title('Max Wingspan in Centimeters')
@ -99,15 +99,15 @@ for i in range(len(birds)):
plt.show()
```
What's happening here? You used `tick_params` to hide the bottom labels and then created a loop over your birds dataset. By plotting the chart with small round blue dots using `bo`, you checked for any bird with a maximum wingspan over 500 and displayed its label next to the dot. You offset the labels slightly on the y-axis (`y * (1 - 0.05)`) and used the bird name as the label.
What's happening here? You used `tick_params` to hide the bottom labels and then looped through your birds dataset. By plotting the chart with small round blue dots (`bo`), you checked for any bird with a maximum wingspan over 500 and displayed its label next to the dot. You offset the labels slightly on the y-axis (`y * (1 - 0.05)`) and used the bird name as the label.
What did you discover?
![Outliers](../../../../3-Data-Visualization/09-visualization-quantities/images/labeled-wingspan-02.png)
![outliers](../../../../3-Data-Visualization/09-visualization-quantities/images/labeled-wingspan-02.png)
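Since the middle of that code block is cut off by the diff, here is a hedged reconstruction of the labeling loop described above (the `Name` and `MaxWingspan` column names and the CSV path are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

birds = pd.read_csv('../../data/birds.csv')  # assumed path

plt.title('Max Wingspan in Centimeters')
plt.ylabel('Wingspan (CM)')
plt.tick_params(axis='both', which='both', labelbottom=False, bottom=False)

for i in range(len(birds)):
    x = birds['Name'][i]
    y = birds['MaxWingspan'][i]
    plt.plot(x, y, 'bo')  # small round blue dots
    if birds['MaxWingspan'][i] > 500:
        # Label only the outliers, nudged slightly below the point
        plt.text(x, y * (1 - 0.05), birds['Name'][i], fontsize=12)

plt.show()
```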
## Filter your data
Both the Bald Eagle and the Prairie Falcon, while likely large birds, appear to be mislabeled with an extra `0` added to their maximum wingspan. It's unlikely you'll encounter a Bald Eagle with a 25-meter wingspan, but if you do, let us know! Let's create a new dataframe without these two outliers:
Both the Bald Eagle and the Prairie Falcon, while likely large birds, seem to have been mislabeled with an extra `0` added to their maximum wingspan. It's unlikely you'll encounter a Bald Eagle with a 25-meter wingspan, but if you do, let us know! Let's create a new dataframe without these two outliers:
```python
plt.title('Max Wingspan in Centimeters')
@ -122,11 +122,11 @@ for i in range(len(birds)):
plt.show()
```
By filtering out outliers, your data becomes more cohesive and easier to understand.
By filtering out the outliers, your data becomes more cohesive and easier to interpret.
![Scatterplot of wingspans](../../../../3-Data-Visualization/09-visualization-quantities/images/scatterplot-wingspan-02.png)
![scatterplot of wingspans](../../../../3-Data-Visualization/09-visualization-quantities/images/scatterplot-wingspan-02.png)
Now that we have a cleaner dataset, at least in terms of wingspan, let's explore more about these birds.
Now that we have a cleaner dataset in terms of wingspan, let's explore more about these birds.
While line and scatter plots can display information about data values and their distributions, we want to focus on the quantities inherent in this dataset. You could create visualizations to answer questions like:
@ -136,7 +136,7 @@ While line and scatter plots can display information about data values and their
## Explore bar charts
Bar charts are useful for showing groupings of data. Let's explore the bird categories in this dataset to see which is the most common.
Bar charts are useful for showing groupings of data. Let's examine the bird categories in this dataset to determine which is the most common.
In the notebook file, create a basic bar chart.
@ -151,13 +151,13 @@ birds.plot(x='Category',
title='Birds of Minnesota')
```
![Full data as a bar chart](../../../../3-Data-Visualization/09-visualization-quantities/images/full-data-bar-02.png)
![full data as a bar chart](../../../../3-Data-Visualization/09-visualization-quantities/images/full-data-bar-02.png)
This bar chart, however, is unreadable due to too much ungrouped data. You need to select only the data you want to plot, so let's examine the length of birds based on their category.
This bar chart, however, is difficult to read because the data isn't grouped. You need to select only the data you want to plot, so let's examine the length of birds based on their category.
Filter your data to include only the bird's category.
✅ Notice how you use Pandas to manage the data and let Matplotlib handle the charting.
✅ Note: Use Pandas to manage the data, and let Matplotlib handle the charting.
Since there are many categories, display this chart vertically and adjust its height to accommodate all the data:
@ -166,15 +166,15 @@ category_count = birds.value_counts(birds['Category'].values, sort=True)
plt.rcParams['figure.figsize'] = [6, 12]
category_count.plot.barh()
```
![Category and length](../../../../3-Data-Visualization/09-visualization-quantities/images/category-counts-02.png)
![category and length](../../../../3-Data-Visualization/09-visualization-quantities/images/category-counts-02.png)
This bar chart provides a clear view of the number of birds in each category. At a glance, you can see that the largest number of birds in this region belong to the Ducks/Geese/Waterfowl category. Given Minnesota's nickname as the 'land of 10,000 lakes,' this isn't surprising!
This bar chart provides a clear view of the number of birds in each category. At a glance, you can see that the Ducks/Geese/Waterfowl category has the largest number of birds in this region. Given that Minnesota is the "land of 10,000 lakes," this isn't surprising!
✅ Try counting other aspects of this dataset. Does anything surprise you?
✅ Try counting other attributes in this dataset. Do any results surprise you?
## Comparing data
You can compare grouped data by creating new axes. Try comparing the MaxLength of birds based on their category:
You can explore different comparisons of grouped data by creating new axes. For example, compare the MaxLength of birds based on their category:
```python
maxlength = birds['MaxLength']
@ -182,7 +182,7 @@ plt.barh(y=birds['Category'], width=maxlength)
plt.rcParams['figure.figsize'] = [6, 12]
plt.show()
```
![Comparing data](../../../../3-Data-Visualization/09-visualization-quantities/images/category-length-02.png)
![comparing data](../../../../3-Data-Visualization/09-visualization-quantities/images/category-length-02.png)
Nothing surprising here: hummingbirds have the smallest MaxLength compared to pelicans or geese. It's reassuring when data aligns with logic!
@ -198,19 +198,19 @@ plt.barh(category, minLength)
plt.show()
```
In this plot, you can see the range of Minimum and Maximum Length for each bird category. You can confidently say that, based on this data, larger birds tend to have a wider length range. Fascinating!
In this plot, you can see the range of Minimum and Maximum Length for each bird category. Based on this data, you can confidently say that larger birds tend to have a wider length range. Fascinating!
![Superimposed values](../../../../3-Data-Visualization/09-visualization-quantities/images/superimposed-02.png)
![superimposed values](../../../../3-Data-Visualization/09-visualization-quantities/images/superimposed-02.png)
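The setup for that superimposed chart is mostly elided by the diff; one way it might be written (whether the lesson uses the full `birds` frame or the filtered one is an assumption):

```python
import matplotlib.pyplot as plt

# Assumes the birds dataframe loaded earlier in the lesson
category = birds['Category']
maxLength = birds['MaxLength']
minLength = birds['MinLength']

plt.rcParams['figure.figsize'] = [6, 12]

# Draw the longer MaxLength bars first, then overlay the shorter MinLength bars
plt.barh(category, maxLength)
plt.barh(category, minLength)

plt.show()
```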
## 🚀 Challenge
This bird dataset offers a wealth of information about different bird types within a specific ecosystem. Search online for other bird-related datasets. Practice building charts and graphs to uncover facts you didn't know.
This bird dataset offers a wealth of information about various bird types within a specific ecosystem. Search online for other bird-related datasets and practice building charts and graphs to uncover surprising facts.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/17)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
This lesson introduced you to using Matplotlib for visualizing quantities. Research other ways to work with datasets for visualization. [Plotly](https://github.com/plotly/plotly.py) is one library we won't cover in these lessons, so explore what it can offer.
This lesson introduced you to using Matplotlib for visualizing quantities. Research other methods for working with datasets to create visualizations. [Plotly](https://github.com/plotly/plotly.py) is one library we won't cover in these lessons, so explore its features.
## Assignment
@ -219,4 +219,4 @@ This lesson introduced you to using Matplotlib for visualizing quantities. Resea
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "87faccac113d772551486a67a607153e",
"translation_date": "2025-08-31T11:07:26+00:00",
"original_hash": "02ce904bc1e2bfabb7dc05c25aae375c",
"translation_date": "2025-09-05T07:45:01+00:00",
"source_file": "3-Data-Visualization/10-visualization-distributions/README.md",
"language_code": "en"
}
@ -13,14 +13,15 @@ CO_OP_TRANSLATOR_METADATA:
|:---:|
| Visualizing Distributions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In the previous lesson, you explored an interesting dataset about the birds of Minnesota. You identified some erroneous data by visualizing outliers and examined the differences between bird categories based on their maximum length.
In the previous lesson, you explored a dataset about the birds of Minnesota. You identified some erroneous data by visualizing outliers and examined differences between bird categories based on their maximum length.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/18)
## Explore the birds dataset
Another way to analyze data is by examining its distribution, or how the data is spread along an axis. For instance, you might want to understand the general distribution of maximum wingspan or maximum body mass for the birds of Minnesota in this dataset.
Let's uncover some insights about the data distributions in this dataset. In the _notebook.ipynb_ file located in the root of this lesson folder, import Pandas, Matplotlib, and your data:
Let's uncover some insights about the distributions in this dataset. In the _notebook.ipynb_ file located in the root of this lesson folder, import Pandas, Matplotlib, and your data:
```python
import pandas as pd
@ -37,7 +38,7 @@ birds.head()
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64 | 1066 | 1567 | 113 | 116 |
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 81 | 1930 | 3310 | 130 | 165 |
In general, you can quickly visualize how data is distributed by using a scatter plot, as demonstrated in the previous lesson:
In general, you can quickly examine how data is distributed by using a scatter plot, as demonstrated in the previous lesson:
```python
birds.plot(kind='scatter',x='MaxLength',y='Order',figsize=(12,8))
@ -54,7 +55,7 @@ This provides an overview of the general distribution of body length per bird Or
## Working with histograms
Matplotlib provides excellent tools for visualizing data distributions using histograms. A histogram is similar to a bar chart, but it shows the distribution of data through the rise and fall of the bars. To create a histogram, you need numeric data. You can plot a histogram by setting the chart type to 'hist'. This chart displays the distribution of MaxBodyMass across the dataset's numeric range. By dividing the data into smaller bins, it reveals the distribution of values:
Matplotlib provides excellent tools for visualizing data distributions using histograms. A histogram is similar to a bar chart, but it shows the distribution through the rise and fall of the bars. To create a histogram, you need numeric data. You can plot a histogram by specifying the chart type as 'hist'. This type of chart displays the distribution of MaxBodyMass across the dataset's numeric range. By dividing the data into smaller bins, it reveals the spread of values:
```python
birds['MaxBodyMass'].plot(kind = 'hist', bins = 10, figsize = (12,12))
@ -62,7 +63,7 @@ plt.show()
```
![distribution over the entire dataset](../../../../3-Data-Visualization/10-visualization-distributions/images/dist1-wb.png)
As shown, most of the 400+ birds in this dataset have a Max Body Mass under 2000. You can gain more insight by increasing the `bins` parameter to a higher value, such as 30:
As shown, most of the 400+ birds in this dataset have a Max Body Mass under 2000. You can gain more detailed insights by increasing the `bins` parameter to a higher number, such as 30:
```python
birds['MaxBodyMass'].plot(kind = 'hist', bins = 30, figsize = (12,12))
@ -70,7 +71,7 @@ plt.show()
```
![distribution over the entire dataset with larger bins param](../../../../3-Data-Visualization/10-visualization-distributions/images/dist2-wb.png)
This chart provides a more detailed view of the distribution. To create a chart that's less skewed to the left, you can filter the data to include only birds with a body mass under 60 and set the `bins` parameter to 40:
This chart provides a more granular view of the distribution. To create a chart that's less skewed to the left, you can filter the data to include only birds with a body mass under 60 and set the `bins` parameter to 40:
```python
filteredBirds = birds[(birds['MaxBodyMass'] > 1) & (birds['MaxBodyMass'] < 60)]
@ -83,7 +84,7 @@ plt.show()
Histograms also allow for color and labeling enhancements:
Create a 2D histogram to compare the relationship between two distributions. For example, compare `MaxBodyMass` and `MaxLength`. Matplotlib provides a built-in way to show convergence using brighter colors:
Create a 2D histogram to compare the relationship between two distributions. For example, compare `MaxBodyMass` vs. `MaxLength`. Matplotlib provides a built-in method to show convergence using brighter colors:
```python
x = filteredBirds['MaxBodyMass']
@ -92,17 +93,17 @@ y = filteredBirds['MaxLength']
fig, ax = plt.subplots(tight_layout=True)
hist = ax.hist2d(x, y)
```
There seems to be a clear correlation between these two variables along an expected axis, with one particularly strong point of convergence:
There seems to be a predictable correlation between these two variables along an expected axis, with one particularly strong point of convergence:
![2D plot](../../../../3-Data-Visualization/10-visualization-distributions/images/2D-wb.png)
Histograms are ideal for numeric data. But what if you want to analyze distributions based on text data?
Histograms work well for numeric data by default. But what if you want to analyze distributions based on text data?
## Explore the dataset for distributions using text data
This dataset also contains valuable information about bird categories, genus, species, family, and conservation status. Let's explore the conservation status. What is the distribution of birds based on their conservation status?
This dataset also contains valuable information about bird categories, genus, species, family, and conservation status. Let's explore the conservation status data. What is the distribution of birds based on their conservation status?
> ✅ In the dataset, several acronyms are used to describe conservation status. These acronyms are derived from the [IUCN Red List Categories](https://www.iucnredlist.org/), which classify species' statuses:
> ✅ In the dataset, several acronyms are used to describe conservation status. They come from the [IUCN Red List Categories](https://www.iucnredlist.org/), the classification system the IUCN uses to track species' statuses.
>
> - CR: Critically Endangered
> - EN: Endangered
@ -111,7 +112,7 @@ This dataset also contains valuable information about bird categories, genus, sp
> - NT: Near Threatened
> - VU: Vulnerable
Since these are text-based values, you'll need to transform them to create a histogram. Using the filteredBirds dataframe, display its conservation status alongside its Minimum Wingspan. What do you observe?
Since these are text-based values, you'll need to transform the data to create a histogram. Using the filteredBirds dataframe, display its conservation status alongside its Minimum Wingspan. What do you observe?
```python
x1 = filteredBirds.loc[filteredBirds.ConservationStatus=='EX', 'MinWingspan']
@ -136,15 +137,15 @@ plt.legend();
![wingspan and conservation collation](../../../../3-Data-Visualization/10-visualization-distributions/images/histogram-conservation-wb.png)
There doesn't appear to be a strong correlation between minimum wingspan and conservation status. Test other elements of the dataset using this method. Try different filters as well. Do you notice any correlations?
There doesn't appear to be a strong correlation between minimum wingspan and conservation status. Test other elements of the dataset using this method. Try different filters as well. Do you find any correlations?
## Density plots
You may have noticed that the histograms we've examined so far are 'stepped' and don't flow smoothly. To create a smoother density chart, you can use a density plot.
You may have noticed that the histograms we've examined so far are 'stepped' and don't flow smoothly in an arc. To create a smoother density chart, you can use a density plot.
To work with density plots, familiarize yourself with a new plotting library, [Seaborn](https://seaborn.pydata.org/generated/seaborn.kdeplot.html).
Load Seaborn and try a basic density plot:
Load Seaborn and try creating a basic density plot:
```python
import seaborn as sns
@ -154,9 +155,9 @@ plt.show()
```
![Density plot](../../../../3-Data-Visualization/10-visualization-distributions/images/density1.png)
This plot mirrors the previous one for Minimum Wingspan data but appears smoother. According to Seaborn's documentation, "Relative to a histogram, KDE can produce a plot that is less cluttered and more interpretable, especially when drawing multiple distributions. But it has the potential to introduce distortions if the underlying distribution is bounded or not smooth. Like a histogram, the quality of the representation also depends on the selection of good smoothing parameters." [source](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) In other words, outliers can still negatively impact your charts.
This plot mirrors the previous one for Minimum Wingspan data but is smoother. According to Seaborn's documentation, "Relative to a histogram, KDE can produce a plot that is less cluttered and more interpretable, especially when drawing multiple distributions. But it has the potential to introduce distortions if the underlying distribution is bounded or not smooth. Like a histogram, the quality of the representation also depends on the selection of good smoothing parameters." [source](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) In other words, outliers can still negatively impact your charts.
If you revisit the jagged MaxBodyMass line from the second chart, you can smooth it out using this method:
If you want to smooth out the jagged MaxBodyMass line from the second chart you created, you can recreate it using this method:
```python
sns.kdeplot(filteredBirds['MaxBodyMass'])
@ -164,7 +165,7 @@ plt.show()
```
![smooth bodymass line](../../../../3-Data-Visualization/10-visualization-distributions/images/density2.png)
To create a line that's smooth but not overly so, adjust the `bw_adjust` parameter:
If you prefer a smoother but not overly smooth line, adjust the `bw_adjust` parameter:
```python
sns.kdeplot(filteredBirds['MaxBodyMass'], bw_adjust=.2)
@ -174,7 +175,7 @@ plt.show()
✅ Explore the available parameters for this type of plot and experiment!
This type of chart provides visually appealing and explanatory visualizations. For instance, with just a few lines of code, you can display the max body mass density per bird Order:
This type of chart provides visually appealing and explanatory visualizations. For example, with just a few lines of code, you can display the max body mass density per bird Order:
```python
sns.kdeplot(
@ -186,7 +187,7 @@ sns.kdeplot(
![bodymass per order](../../../../3-Data-Visualization/10-visualization-distributions/images/density4.png)
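Most of that `kdeplot` call's arguments are hidden by the diff; one plausible way to fill them in (the specific parameter choices are assumptions) is:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.kdeplot(
    data=filteredBirds,   # the outlier-free frame built earlier in the lesson
    x="MaxBodyMass",
    hue="Order",          # one density curve per bird Order
    fill=True,
    common_norm=False,    # scale each Order's curve independently
    alpha=0.5,
    linewidth=0,
)
plt.show()
```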
You can also map the density of multiple variables in one chart. Compare the MaxLength and MinLength of a bird to their conservation status:
You can also map the density of multiple variables in one chart. Compare the MaxLength and MinLength of a bird to its conservation status:
```python
sns.kdeplot(data=filteredBirds, x="MinLength", y="MaxLength", hue="ConservationStatus")
@ -194,13 +195,13 @@ sns.kdeplot(data=filteredBirds, x="MinLength", y="MaxLength", hue="ConservationS
![multiple densities, superimposed](../../../../3-Data-Visualization/10-visualization-distributions/images/multi.png)
It might be worth investigating whether the cluster of 'Vulnerable' birds based on their lengths has any significance.
It might be worth investigating whether the cluster of 'Vulnerable' birds based on their lengths is significant.
## 🚀 Challenge
Histograms are a more advanced type of chart compared to basic scatterplots, bar charts, or line charts. Search online for examples of histograms. How are they used, what do they reveal, and in which fields or areas of study are they commonly applied?
Histograms are a more advanced type of chart compared to basic scatterplots, bar charts, or line charts. Search online for examples of histograms. How are they used, what do they demonstrate, and in which fields or areas of study are they commonly applied?
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/19)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
@ -213,4 +214,4 @@ In this lesson, you used Matplotlib and began working with Seaborn to create mor
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "af6a12015c6e250e500b570a9fa42593",
"translation_date": "2025-08-31T11:05:02+00:00",
"original_hash": "cc490897ee2d276870472bcb31602d03",
"translation_date": "2025-09-05T07:42:52+00:00",
"source_file": "3-Data-Visualization/11-visualization-proportions/README.md",
"language_code": "en"
}
@ -13,26 +13,26 @@ CO_OP_TRANSLATOR_METADATA:
|:---:|
|Visualizing Proportions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you'll work with a nature-focused dataset to visualize proportions, such as the distribution of different types of fungi in a dataset about mushrooms. We'll dive into these fascinating fungi using a dataset from Audubon that provides details about 23 species of gilled mushrooms in the Agaricus and Lepiota families. You'll experiment with fun visualizations like:
In this lesson, you'll work with a nature-focused dataset to visualize proportions, such as the distribution of different types of fungi in a dataset about mushrooms. We'll dive into these fascinating fungi using a dataset from Audubon that details 23 species of gilled mushrooms in the Agaricus and Lepiota families. You'll experiment with fun visualizations like:
- Pie charts 🥧
- Donut charts 🍩
- Waffle charts 🧇
> 💡 Microsoft Research has an interesting project called [Charticulator](https://charticulator.com), which offers a free drag-and-drop interface for creating data visualizations. One of their tutorials uses this mushroom dataset! You can explore the data and learn the library simultaneously: [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html).
> 💡 Microsoft Research has an interesting project called [Charticulator](https://charticulator.com), which provides a free drag-and-drop interface for creating data visualizations. One of their tutorials uses this mushroom dataset! You can explore the data and learn the library simultaneously: [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html).
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/20)
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Get to know your mushrooms 🍄
Mushrooms are fascinating organisms. Let's import a dataset to study them:
Mushrooms are fascinating. Let's import a dataset to study them:
```python
import pandas as pd
import matplotlib.pyplot as plt
mushrooms = pd.read_csv('../../data/mushrooms.csv')
mushrooms.head()
```
A table is displayed with some great data for analysis:
| class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
@ -42,11 +42,11 @@ A table is displayed with some great data for analysis:
| Edible | Bell | Smooth | White | Bruises | Anise | Free | Close | Broad | Brown | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Meadows |
| Poisonous | Convex | Scaly | White | Bruises | Pungent | Free | Close | Narrow | Brown | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
You'll notice that all the data is textual. To use it in a chart, you'll need to convert it. Most of the data is represented as an object:
Immediately, you notice that all the data is textual. You'll need to convert this data to make it usable in a chart. Most of the data is represented as an object:
```python
print(mushrooms.select_dtypes(["object"]).columns)
```
The output is:
@ -58,20 +58,20 @@ Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
'ring-type', 'spore-print-color', 'population', 'habitat'],
dtype='object')
```
Convert the 'class' column into a category:
```python
cols = mushrooms.select_dtypes(["object"]).columns
mushrooms[cols] = mushrooms[cols].astype('category')
```
```python
edibleclass=mushrooms.groupby(['class']).count()
edibleclass
```
Now, if you print the mushrooms data, you'll see it grouped into categories based on the poisonous/edible class:
Now, if you print the mushrooms data, you'll see it has been grouped into categories based on the poisonous/edible class:
| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --------- | --------- | ----------- | --------- | ------- | ---- | --------------- | ------------ | --------- | ---------- | ----------- | --- | ------------------------ | ---------------------- | ---------------------- | --------- | ---------- | ----------- | --------- | ----------------- | ---------- | ------- |
@ -79,7 +79,7 @@ Now, if you print the mushrooms data, you'll see it grouped into categories base
| Edible | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | ... | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 |
| Poisonous | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | ... | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 |
Using the order in this table to create your class category labels, you can build a pie chart:
Using the order presented in this table to create your class category labels, you can build a pie chart:
## Pie!
@ -88,20 +88,22 @@ labels=['Edible','Poisonous']
plt.pie(edibleclass['population'],labels=labels,autopct='%.1f %%')
plt.title('Edible?')
plt.show()
```
And voilà, a pie chart showing the proportions of the two mushroom classes. It's crucial to get the label order correct, so double-check the array when building the labels!
And there you have it—a pie chart showing the proportions of the data based on the two mushroom classes. It's crucial to get the order of the labels correct, especially here, so double-check the order when building the label array!
![pie chart](../../../../3-Data-Visualization/11-visualization-proportions/images/pie1-wb.png)
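One way to keep the labels from falling out of order is to take them straight from the grouped frame's index rather than typing them by hand; a small sketch using the `edibleclass` frame from above (if your CSV stores codes such as 'e'/'p', map them to friendlier names first):

```python
import matplotlib.pyplot as plt

# The grouped frame's index holds the class values in the order pandas grouped them
print(edibleclass.index)

labels = list(edibleclass.index)
plt.pie(edibleclass['population'], labels=labels, autopct='%.1f %%')
plt.title('Edible?')
plt.show()
```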
## Donuts!
A donut chart is a visually appealing variation of a pie chart, with a hole in the center. Let's use this method to explore the habitats where mushrooms grow:
A visually appealing variation of a pie chart is a donut chart, which is essentially a pie chart with a hole in the center. Let's use this method to examine our data.
Look at the various habitats where mushrooms grow:
```python
habitat=mushrooms.groupby(['habitat']).count()
habitat
```
Group the data by habitat. There are seven listed habitats, so use them as labels for your donut chart:
Here, you're grouping the data by habitat. There are seven listed habitats, so use those as labels for your donut chart:
```python
labels=['Grasses','Leaves','Meadows','Paths','Urban','Waste','Wood']
@ -117,30 +119,30 @@ fig.gca().add_artist(center_circle)
plt.title('Mushroom Habitats')
plt.show()
```
![donut chart](../../../../3-Data-Visualization/11-visualization-proportions/images/donut-wb.png)
This code draws the chart and a center circle, then adds the circle to the chart. You can adjust the width of the center circle by changing `0.40` to another value.
This code draws a chart and a center circle, then adds the center circle to the chart. You can adjust the width of the center circle by changing `0.40` to another value.
Donut charts can be customized in various ways, especially the labels for better readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut).
Donut charts can be customized in various ways, including tweaking the labels for better readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut).
Now that you know how to group data and display it as a pie or donut chart, let's explore another type of chart: the waffle chart.
Now that you know how to group your data and display it as a pie or donut chart, let's explore another type of chart: the waffle chart.
## Waffles!
A waffle chart visualizes quantities as a 2D array of squares. Let's use it to examine the proportions of mushroom cap colors in the dataset. First, install the helper library [PyWaffle](https://pypi.org/project/pywaffle/) and use Matplotlib:
A waffle chart is a unique way to visualize quantities as a 2D array of squares. Let's visualize the different quantities of mushroom cap colors in this dataset. To do this, you'll need to install a helper library called [PyWaffle](https://pypi.org/project/pywaffle/) and use Matplotlib:
```python
pip install pywaffle
```
Select a segment of your data to group:
```python
capcolor=mushrooms.groupby(['cap-color']).count()
capcolor
```
Create a waffle chart by defining labels and grouping your data:
@ -163,32 +165,34 @@ fig = plt.figure(
figsize = (30,30),
colors=["brown", "tan", "maroon", "green", "pink", "purple", "red", "whitesmoke", "yellow"],
)
```
The waffle chart clearly shows the proportions of cap colors in the mushroom dataset. Interestingly, there are many green-capped mushrooms!
Using a waffle chart, you can clearly see the proportions of cap colors in this mushroom dataset. Interestingly, there are many green-capped mushrooms!
![waffle chart](../../../../3-Data-Visualization/11-visualization-proportions/images/waffle.png)
✅ PyWaffle supports icons within the charts, using any icon available in [Font Awesome](https://fontawesome.com/). Experiment with icons to create even more engaging waffle charts.
✅ PyWaffle supports icons within the charts, allowing you to use any icon available in [Font Awesome](https://fontawesome.com/). Experiment with creating a more engaging waffle chart using icons instead of squares.
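The middle of the `plt.figure` call above is cut off by the diff; a hedged reconstruction of a complete waffle chart using PyWaffle's `FigureClass=Waffle` interface (the grid size, title, and label formatting are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd
from pywaffle import Waffle

mushrooms = pd.read_csv('../../data/mushrooms.csv')  # assumed path
capcolor = mushrooms.groupby(['cap-color']).count()

# Map each cap color to its row count ('class' is simply a convenient column to count on)
data = dict(zip(capcolor.index, capcolor['class']))

fig = plt.figure(
    FigureClass=Waffle,  # PyWaffle draws the 2D grid of squares
    rows=100,            # assumed grid height; adjust to taste
    values=data,
    title={'label': 'Mushroom Cap Colors', 'loc': 'left'},
    labels=[f"{color} ({count})" for color, count in data.items()],
    figsize=(30, 30),
)
plt.show()
```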
In this lesson, you learned three ways to visualize proportions. First, group your data into categories, then choose the best visualization method—pie, donut, or waffle. Each offers a quick and intuitive snapshot of the dataset.
In this lesson, you learned three ways to visualize proportions. First, group your data into categories, then decide the best way to display it—pie, donut, or waffle. All are visually appealing and provide an instant snapshot of the dataset.
## 🚀 Challenge
Try recreating these charts in [Charticulator](https://charticulator.com).
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/21)
## Review & Self Study
Choosing between pie, donut, or waffle charts isn't always straightforward. Here are some articles to help you decide:
Sometimes it's not clear when to use a pie, donut, or waffle chart. Here are some articles to help you decide:
https://www.beautiful.ai/blog/battle-of-the-charts-pie-chart-vs-donut-chart
https://medium.com/@hypsypops/pie-chart-vs-donut-chart-showdown-in-the-ring-5d24fd86a9ce
https://www.mit.edu/~mbarker/formula1/f1help/11-ch-c6.htm
https://medium.datadriveninvestor.com/data-visualization-done-the-right-way-with-tableau-waffle-chart-fdf2a19be402
Do some research to learn more about this decision-making process.
Do some research to find more information on this tricky decision.
## Assignment
@ -197,4 +201,4 @@ Do some research to learn more about this decision-making process.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "cad419b574d5c35eaa417e9abfdcb0c8",
"translation_date": "2025-08-31T11:06:58+00:00",
"original_hash": "b29e427401499e81f4af55a8c4afea76",
"translation_date": "2025-09-05T07:44:29+00:00",
"source_file": "3-Data-Visualization/12-visualization-relationships/README.md",
"language_code": "en"
}
@ -13,21 +13,21 @@ CO_OP_TRANSLATOR_METADATA:
|:---:|
|Visualizing Relationships - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Continuing with the nature focus of our research, let's explore fascinating ways to visualize the relationships between different types of honey, based on a dataset from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
Continuing with the nature focus of our research, let's explore some fascinating visualizations to illustrate the relationships between different types of honey, based on a dataset from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
This dataset, containing around 600 entries, showcases honey production across various U.S. states. For instance, it includes data on the number of colonies, yield per colony, total production, stocks, price per pound, and the value of honey produced in each state from 1998 to 2012, with one row per year for each state.
This dataset, containing around 600 entries, showcases honey production across various U.S. states. For instance, you can examine the number of colonies, yield per colony, total production, stocks, price per pound, and the value of honey produced in a specific state from 1998 to 2012, with one row per year for each state.
It would be intriguing to visualize the relationship between a state's annual production and, for example, the price of honey in that state. Alternatively, you could examine the relationship between honey yield per colony across states. This time frame also includes the emergence of the devastating 'CCD' or 'Colony Collapse Disorder' first identified in 2006 (http://npic.orst.edu/envir/ccd.html), making this dataset particularly significant to study. 🐝
It would be intriguing to visualize the relationship between a state's annual production and, for example, the price of honey in that state. Alternatively, you could explore the relationship between states' honey yield per colony. This time period includes the emergence of the devastating 'CCD' or 'Colony Collapse Disorder,' first observed in 2006 (http://npic.orst.edu/envir/ccd.html), making this dataset particularly significant. 🐝
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/22)
In this lesson, you can use Seaborn, a library you've worked with before, to effectively visualize relationships between variables. One particularly useful function in Seaborn is `relplot`, which enables scatter plots and line plots to quickly illustrate '[statistical relationships](https://seaborn.pydata.org/tutorial/relational.html?highlight=relationships)', helping data scientists better understand how variables interact.
In this lesson, you'll use Seaborn, a library you've worked with before, to visualize relationships between variables. A particularly useful feature is Seaborn's `relplot` function, which enables scatter plots and line plots to quickly visualize '[statistical relationships](https://seaborn.pydata.org/tutorial/relational.html?highlight=relationships).' This helps data scientists better understand how variables interact.
## Scatterplots
Use a scatterplot to visualize how the price of honey has changed year over year in each state. Seaborn's `relplot` conveniently organizes state data and displays data points for both categorical and numeric data.
Use a scatterplot to illustrate how the price of honey has changed year over year in each state. Seaborn's `relplot` makes it easy to group state data and display both categorical and numerical data points.
Let's begin by importing the data and Seaborn:
Let's begin by importing the data and Seaborn:
```python
import pandas as pd
@ -36,7 +36,7 @@ import seaborn as sns
honey = pd.read_csv('../../data/honey.csv')
honey.head()
```
You'll notice that the honey dataset includes several interesting columns, such as year and price per pound. Let's explore this data, grouped by U.S. state:
You'll notice that the honey dataset contains several interesting columns, including year and price per pound. Let's explore this data, grouped by U.S. state:
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
| ----- | ------ | ----------- | --------- | -------- | ---------- | --------- | ---- |
@ -46,14 +46,14 @@ You'll notice that the honey dataset includes several interesting columns, such
| CA | 450000 | 83 | 37350000 | 12326000 | 0.62 | 23157000 | 1998 |
| CO | 27000 | 72 | 1944000 | 1594000 | 0.7 | 1361000 | 1998 |
Create a basic scatterplot to show the relationship between the price per pound of honey and its U.S. state of origin. Adjust the `y` axis to ensure all states are visible:
Create a basic scatterplot to show the relationship between the price per pound of honey and its state of origin. Adjust the `y` axis to ensure all states are visible:
```python
sns.relplot(x="priceperlb", y="state", data=honey, height=15, aspect=.5);
```
![scatterplot 1](../../../../3-Data-Visualization/12-visualization-relationships/images/scatter1.png)
Next, use a honey-inspired color scheme to illustrate how the price changes over the years. Add a 'hue' parameter to highlight year-over-year variations:
Next, use a honey-inspired color scheme to show how the price evolves over the years. Add a 'hue' parameter to highlight year-over-year changes:
> ✅ Learn more about the [color palettes you can use in Seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) - try a beautiful rainbow color scheme!
@ -62,7 +62,7 @@ sns.relplot(x="priceperlb", y="state", hue="year", palette="YlOrBr", data=honey,
```
![scatterplot 2](../../../../3-Data-Visualization/12-visualization-relationships/images/scatter2.png)
With this color scheme, you can clearly see a strong upward trend in honey prices over the years. If you examine a specific state, such as Arizona, you can observe a consistent pattern of price increases year over year, with only a few exceptions:
With this color scheme, you can clearly see a strong upward trend in honey prices over the years. For example, if you examine Arizona's data, you'll notice a consistent pattern of price increases year over year, with only a few exceptions:
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
| ----- | ------ | ----------- | --------- | ------- | ---------- | --------- | ---- |
@ -82,22 +82,22 @@ With this color scheme, you can clearly see a strong upward trend in honey price
| AZ | 23000 | 53 | 1219000 | 427000 | 1.55 | 1889000 | 2011 |
| AZ | 22000 | 46 | 1012000 | 253000 | 1.79 | 1811000 | 2012 |
Another way to visualize this trend is by using size instead of color. For colorblind users, this might be a better option. Modify your visualization to represent price increases with larger dot sizes:
Another way to visualize this trend is by using size instead of color. For colorblind users, this might be a better option. Modify your visualization to show price increases through larger dot sizes:
```python
sns.relplot(x="priceperlb", y="state", size="year", data=honey, height=15, aspect=.5);
```
You can observe the dots growing larger over time.
You'll notice the dots gradually increasing in size.
![scatterplot 3](../../../../3-Data-Visualization/12-visualization-relationships/images/scatter3.png)
Is this simply a case of supply and demand? Could factors like climate change and colony collapse be reducing honey availability year over year, thereby driving up prices?
Could this simply be a case of supply and demand? Are factors like climate change and colony collapse reducing the honey supply year over year, leading to higher prices?
To explore correlations between variables in this dataset, let's examine some line charts.
To investigate correlations between variables in this dataset, let's explore some line charts.
## Line charts
Question: Is there a clear upward trend in honey prices per pound year over year? The easiest way to determine this is by creating a single line chart:
Question: Is there a clear upward trend in honey prices per pound year over year? A single line chart can help answer this:
```python
sns.relplot(x="year", y="priceperlb", kind="line", data=honey);
@ -106,7 +106,7 @@ Answer: Yes, although there are some exceptions around 2003:
![line chart 1](../../../../3-Data-Visualization/12-visualization-relationships/images/line1.png)
✅ Seaborn aggregates data into one line by plotting the mean and a 95% confidence interval around the mean. [Source](https://seaborn.pydata.org/tutorial/relational.html). You can disable this behavior by adding `ci=None`.
✅ Seaborn aggregates data into one line, displaying "multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean." [Source](https://seaborn.pydata.org/tutorial/relational.html). You can disable this behavior by adding `ci=None`.
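As a quick sketch of that option (assuming the `honey` DataFrame is loaded as in the earlier snippets, from an assumed path), the same yearly line chart can be drawn without the shaded band:

```python
import pandas as pd
import seaborn as sns

honey = pd.read_csv("../../data/honey.csv")  # assumed path to the lesson's dataset

# Same yearly line chart, but without the shaded 95% confidence band.
# On Seaborn 0.12+ the equivalent argument is errorbar=None.
sns.relplot(x="year", y="priceperlb", kind="line", ci=None, data=honey);
```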
Question: In 2003, was there also a spike in honey supply? What happens if you examine total production year over year?
@ -116,17 +116,17 @@ sns.relplot(x="year", y="totalprod", kind="line", data=honey);
![line chart 2](../../../../3-Data-Visualization/12-visualization-relationships/images/line2.png)
Answer: Not really. Total production appears to have increased in 2003, even though overall honey production has been declining during these years.
Answer: Not really. Total production actually seems to have increased in 2003, even though honey production generally declined during these years.
Question: In that case, what might have caused the spike in honey prices around 2003?
Question: If not supply, what could have caused the price spike in 2003?
To investigate further, you can use a facet grid.
To investigate, let's use a facet grid.
## Facet grids
Facet grids allow you to focus on one aspect of your dataset (e.g., 'year') and create a plot for each facet using your chosen x and y coordinates. This makes comparisons easier. Does 2003 stand out in this type of visualization?
Facet grids allow you to break down your dataset into smaller subsets (facets). Here, using 'year' as the facet keeps the number of panels manageable, and Seaborn plots each facet side by side for easier comparison. Does 2003 stand out in this comparison?
Create a facet grid using `relplot`, as recommended by [Seaborn's documentation](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html?highlight=facetgrid#seaborn.FacetGrid).
Create a facet grid using `relplot` as recommended in [Seaborn's documentation](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html?highlight=facetgrid#seaborn.FacetGrid).
```python
sns.relplot(
@ -136,7 +136,7 @@ sns.relplot(
col_wrap=3,
kind="line"
```
In this visualization, you can compare yield per colony and number of colonies year over year, side by side, with a column wrap set to 3:
In this visualization, you can compare yield per colony and number of colonies year over year, with columns wrapped at 3:
![facet grid](../../../../3-Data-Visualization/12-visualization-relationships/images/facet.png)
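A complete version of the facet-grid call might look like the sketch below; the column choices (`yieldpercol` and `numcol`, faceted on `year`) are inferred from the description above, so the lesson's notebook may differ slightly:

```python
import pandas as pd
import seaborn as sns

honey = pd.read_csv("../../data/honey.csv")  # assumed path to the lesson's dataset

# One small line plot per year, wrapped into rows of three facets.
sns.relplot(
    data=honey,
    x="yieldpercol", y="numcol",
    col="year",
    col_wrap=3,
    kind="line"
)
```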
@ -144,7 +144,7 @@ For this dataset, nothing particularly stands out regarding the number of coloni
## Dual-line Plots
Try a multiline plot by overlaying two line plots, using Seaborn's 'despine' to remove the top and right spines, and `ax.twinx` [from Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.twinx.html). Twinx allows a chart to share the x-axis while displaying two y-axes. Superimpose yield per colony and number of colonies:
Try a multiline plot by overlaying two line plots, using Seaborn's 'despine' to remove the top and right spines, and `ax.twinx` [from Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.twinx.html). Twinx allows you to share the x-axis while displaying two y-axes. Plot yield per colony and number of colonies together:
```python
fig, ax = plt.subplots(figsize=(12,6))
@ -170,12 +170,14 @@ Go, bees, go!
🐝❤️
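Filling in the rest of that dual-axis plot, a minimal sketch (assuming the `numcol` and `yieldpercol` columns shown in the table earlier; labels and colors are illustrative) might look like this:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

honey = pd.read_csv("../../data/honey.csv")  # assumed path to the lesson's dataset

fig, ax = plt.subplots(figsize=(12, 6))

# First line: number of colonies on the left-hand y-axis.
sns.lineplot(x="year", y="numcol", data=honey, ax=ax, color="b")
sns.despine()  # drop the top and right spines
ax.set_ylabel("number of colonies")

# Second line: yield per colony on a twinned right-hand y-axis.
ax2 = ax.twinx()
sns.lineplot(x="year", y="yieldpercol", data=honey, ax=ax2, color="r")
sns.despine(right=False)  # keep the right spine for the second axis
ax2.set_ylabel("yield per colony")

plt.title("Number of colonies vs. yield per colony, year over year")
plt.show()
```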
## 🚀 Challenge
In this lesson, you learned more about scatterplots and line grids, including facet grids. Challenge yourself to create a facet grid using a different dataset, perhaps one you've used in previous lessons. Note how long it takes to generate and consider how many grids are practical to draw using these techniques.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/23)
In this lesson, you explored scatterplots, line grids, and facet grids. Challenge yourself to create a facet grid using a different dataset, perhaps one you've used in previous lessons. Note how long the grid takes to render and consider how many facets are practical to draw with these techniques.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
Line plots can range from simple to complex. Spend some time reading the [Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.lineplot.html) to learn about the various ways to build them. Try enhancing the line charts you created in this lesson using methods described in the documentation.
Line plots can range from simple to complex. Spend some time reading the [Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.lineplot.html) to learn about the various ways to build them. Try enhancing the line charts you created in this lesson using other methods from the documentation.
## Assignment
[Dive into the beehive](assignment.md)
@ -183,4 +185,4 @@ Line plots can range from simple to complex. Spend some time reading the [Seabor
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -1,13 +1,13 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "4ec4747a9f4f7d194248ea29903ae165",
"translation_date": "2025-08-31T11:06:21+00:00",
"original_hash": "0b380bb6d34102bb061eb41de23d9834",
"translation_date": "2025-09-05T07:44:02+00:00",
"source_file": "3-Data-Visualization/13-meaningful-visualizations/README.md",
"language_code": "en"
}
-->
# Creating Meaningful Visualizations
# Making Meaningful Visualizations
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/13-MeaningfulViz.png)|
|:---:|
@ -15,7 +15,7 @@ CO_OP_TRANSLATOR_METADATA:
> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase)
One of the essential skills for a data scientist is the ability to create meaningful data visualizations that help answer specific questions. Before visualizing your data, you need to ensure it has been cleaned and prepared, as covered in previous lessons. Once that's done, you can start deciding how best to present the data.
One of the essential skills for a data scientist is the ability to create meaningful data visualizations that help answer questions. Before visualizing your data, you need to ensure it has been cleaned and prepared, as you learned in previous lessons. Once that's done, you can start deciding how best to present the data.
In this lesson, you will explore:
@ -28,11 +28,11 @@ In this lesson, you will explore:
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/24)
## Selecting the appropriate chart type
## Choose the right chart type
In earlier lessons, you experimented with creating various types of data visualizations using Matplotlib and Seaborn. Generally, you can choose the [appropriate chart type](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) based on the question you're trying to answer using the following table:
In earlier lessons, you experimented with creating various data visualizations using Matplotlib and Seaborn. Generally, you can use the [right type of chart](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) based on the question you're trying to answer, as shown in this table:
| Task | Recommended Chart Type |
| You need to: | You should use: |
| -------------------------- | ------------------------------- |
| Show data trends over time | Line |
| Compare categories | Bar, Pie |
@ -41,21 +41,21 @@ In earlier lessons, you experimented with creating various types of data visuali
| Show distributions | Scatter, Histogram, Box |
| Show proportions | Pie, Donut, Waffle |
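As a small illustration of the 'Compare categories' row above, a bar chart sketch with placeholder data might look like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder data -- comparing categories calls for a bar chart (see table above).
categories = ["A", "B", "C", "D"]
counts = [12, 7, 15, 4]

sns.barplot(x=categories, y=counts)
plt.xlabel("category")
plt.ylabel("count")
plt.show()
```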
> ✅ Depending on the structure of your data, you may need to convert it from text to numeric format to make certain charts work.
> ✅ Depending on the structure of your data, you might need to convert it from text to numeric to make certain charts work.
## Avoiding misleading visualizations
## Avoid deception
Even when a data scientist carefully selects the right chart for the data, there are still ways to present data in a misleading manner, often to support a specific narrative at the expense of accuracy. There are numerous examples of deceptive charts and infographics!
Even when a data scientist carefully selects the right chart for the data, there are many ways data can be presented misleadingly to support a particular narrative, often at the expense of the data's integrity. There are countless examples of deceptive charts and infographics!
[![How Charts Lie by Alberto Cairo](../../../../3-Data-Visualization/13-meaningful-visualizations/images/tornado.png)](https://www.youtube.com/watch?v=oX74Nge8Wkw "How charts lie")
> 🎥 Click the image above to watch a conference talk about misleading charts.
This chart flips the X-axis to present the opposite of the truth based on dates:
This chart flips the X-axis to show the opposite of the truth based on the dates:
![bad chart 1](../../../../3-Data-Visualization/13-meaningful-visualizations/images/bad-chart-1.png)
[This chart](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg) is even more misleading. At first glance, it appears that COVID cases have declined over time in various counties. However, upon closer inspection, the dates have been rearranged to create a deceptive downward trend.
[This chart](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg) is even more misleading. At first glance, it appears that COVID cases have declined over time in various counties. However, upon closer inspection, the dates have been rearranged to create a false downward trend.
![bad chart 2](../../../../3-Data-Visualization/13-meaningful-visualizations/images/bad-chart-2.jpg)
@ -63,23 +63,23 @@ This infamous example uses both color and a flipped Y-axis to mislead viewers. I
![bad chart 3](../../../../3-Data-Visualization/13-meaningful-visualizations/images/bad-chart-3.jpg)
This peculiar chart manipulates proportions to a comical degree:
This peculiar chart demonstrates how proportions can be manipulated, often to humorous effect:
![bad chart 4](../../../../3-Data-Visualization/13-meaningful-visualizations/images/bad-chart-4.jpg)
Another deceptive tactic is comparing things that are not truly comparable. A [fascinating website](https://tylervigen.com/spurious-correlations) showcases 'spurious correlations,' such as the divorce rate in Maine being linked to margarine consumption. A Reddit group also collects [examples of poor data usage](https://www.reddit.com/r/dataisugly/top/?t=all).
Another deceptive tactic is comparing things that aren't truly comparable. A [fascinating website](https://tylervigen.com/spurious-correlations) showcases 'spurious correlations,' such as the divorce rate in Maine being linked to margarine consumption. A Reddit group also collects [examples of poor data usage](https://www.reddit.com/r/dataisugly/top/?t=all).
Understanding how easily the eye can be tricked by misleading charts is crucial. Even with good intentions, a poorly chosen chart type—like a pie chart with too many categories—can lead to confusion.
It's crucial to understand how easily the eye can be tricked by misleading charts. Even with good intentions, a poorly chosen chart type—like a pie chart with too many categories—can lead to confusion.
## Using color effectively
## Color
The 'Florida gun violence' chart above demonstrates how color can add another layer of meaning to visualizations. Libraries like Matplotlib and Seaborn come with pre-designed color palettes, but if you're creating a chart manually, it's worth studying [color theory](https://colormatters.com/color-and-design/basic-color-theory).
The 'Florida gun violence' chart above illustrates how color can add another layer of meaning to visualizations, especially when charts aren't created using libraries like Matplotlib or Seaborn, which offer pre-vetted color palettes. If you're designing a chart manually, take some time to study [color theory](https://colormatters.com/color-and-design/basic-color-theory).
> ✅ Keep accessibility in mind when designing charts. Some users may be colorblind—does your chart work well for those with visual impairments?
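As a small illustrative sketch (not taken from the lesson), Seaborn's pre-designed palettes can be previewed and applied globally, including a colorblind-friendly option:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Preview one of Seaborn's pre-designed, colorblind-friendly palettes...
sns.palplot(sns.color_palette("colorblind"))

# ...and apply it to every plot that follows.
sns.set_palette("colorblind")
plt.show()
```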
Be cautious when selecting colors for your chart, as they can convey unintended meanings. For example, the 'pink ladies' in the 'height' chart above add a gendered implication that makes the chart even more bizarre.
Be cautious when selecting colors for your chart, as they can convey unintended meanings. For example, the 'pink ladies' in the 'height' chart above add a distinctly 'feminine' connotation, which contributes to the chart's oddness.
While [color meanings](https://colormatters.com/color-symbolism/the-meanings-of-colors) can vary across cultures and change depending on the shade, general associations include:
While [color meanings](https://colormatters.com/color-symbolism/the-meanings-of-colors) can vary across cultures and change depending on the shade, here are some general associations:
| Color | Meaning |
| ------ | ------------------- |
@ -92,9 +92,9 @@ While [color meanings](https://colormatters.com/color-symbolism/the-meanings-of-
If you're tasked with creating a chart with custom colors, ensure that your choices align with the intended message and that the chart remains accessible.
## Styling charts for better readability
## Styling your charts for readability
Charts lose their value if they are difficult to read. Take time to adjust the width and height of your chart to ensure it scales well with your data. For example, if you need to display all 50 states, consider showing them vertically on the Y-axis to avoid horizontal scrolling.
Charts lose their value if they're not easy to read! Take time to adjust the width and height of your chart to ensure it scales well with your data. For example, if you're displaying data for all 50 states, consider showing them vertically on the Y-axis to avoid horizontal scrolling.
Label your axes, include a legend if necessary, and provide tooltips for better data comprehension.
@ -102,33 +102,33 @@ If your data includes verbose text on the X-axis, you can angle the text for imp
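For instance, a short Matplotlib sketch of two of these readability tweaks, using placeholder data: sizing the figure to the data and angling crowded x-axis labels.

```python
import matplotlib.pyplot as plt

# Placeholder data -- a handful of states and made-up values.
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California"]
values = [5, 3, 8, 4, 9]

fig, ax = plt.subplots(figsize=(10, 4))  # size the chart to fit the data
ax.bar(states, values)
ax.set_xlabel("state")
ax.set_ylabel("value")
plt.xticks(rotation=45, ha="right")  # angle verbose x-axis labels
plt.tight_layout()
plt.show()
```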
![3d plots](../../../../3-Data-Visualization/13-meaningful-visualizations/images/3d.png)
## Animation and 3D visualizations
## Animation and 3D chart display
Some of the most engaging visualizations today are animated. Shirley Wu has created stunning examples using D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/),' where each flower represents a movie. Another example is 'Bussed Out,' an interactive experience for the Guardian that combines visualizations with Greensock and D3, along with a scrollytelling article format, to illustrate how NYC addresses homelessness by bussing people out of the city.
Some of the most compelling data visualizations today are animated. Shirley Wu has created stunning examples using D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/),' where each flower represents a movie. Another example is 'Bussed Out,' an interactive visualization for the Guardian that combines Greensock and D3 with a scrollytelling article format to illustrate how NYC addresses homelessness by bussing people out of the city.
![busing](../../../../3-Data-Visualization/13-meaningful-visualizations/images/busing.png)
> "Bussed Out: How America Moves its Homeless" from [the Guardian](https://www.theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study). Visualizations by Nadieh Bremer & Shirley Wu
While this lesson doesn't delve deeply into these powerful visualization libraries, you can experiment with D3 in a Vue.js app to create an animated visualization of the book "Dangerous Liaisons" as a social network.
While this lesson doesn't delve deeply into these powerful visualization libraries, you can experiment with D3 in a Vue.js app to create an animated social network visualization of the book "Dangerous Liaisons."
> "Les Liaisons Dangereuses" is an epistolary novel, presented as a series of letters. Written in 1782 by Choderlos de Laclos, it tells the story of the morally corrupt social maneuvers of two French aristocrats, the Vicomte de Valmont and the Marquise de Merteuil. Both meet their downfall, but not before causing significant social damage. The novel unfolds through letters written to various individuals in their circles, plotting revenge or simply creating chaos. Create a visualization of these letters to identify the key players in the narrative.
You will complete a web app that displays an animated view of this social network. It uses a library designed to create a [network visualization](https://github.com/emiliorizzo/vue-d3-network) with Vue.js and D3. Once the app is running, you can drag nodes around the screen to rearrange the data.
You will complete a web app that displays an animated view of this social network. It uses a library designed to create a [network visualization](https://github.com/emiliorizzo/vue-d3-network) with Vue.js and D3. Once the app is running, you can drag the nodes around to rearrange the data visually.
![liaisons](../../../../3-Data-Visualization/13-meaningful-visualizations/images/liaisons.png)
## Project: Build a network chart using D3.js
## Project: Build a chart to show a network using D3.js
> This lesson folder includes a `solution` folder with the completed project for reference.
> This lesson folder includes a `solution` folder where you can find the completed project for reference.
1. Follow the instructions in the README.md file located in the starter folder's root. Ensure you have NPM and Node.js installed on your machine before setting up the project's dependencies.
1. Follow the instructions in the README.md file in the starter folder's root. Ensure you have NPM and Node.js installed on your machine before setting up the project's dependencies.
2. Open the `starter/src` folder. Inside, you'll find an `assets` folder containing a .json file with all the letters from the novel, annotated with 'to' and 'from' fields.
3. Complete the code in `components/Nodes.vue` to enable the visualization. Locate the method called `createLinks()` and add the following nested loop.
Loop through the .json object to extract the 'to' and 'from' data for the letters and build the `links` object for the visualization library:
Loop through the .json object to extract the 'to' and 'from' data for the letters, building the `links` object for the visualization library:
```javascript
//loop through letters
@ -152,9 +152,9 @@ Run your app from the terminal (npm run serve) and enjoy the visualization!
## 🚀 Challenge
Explore the internet to find examples of misleading visualizations. How does the author mislead the audience, and is it intentional? Try correcting the visualizations to show how they should appear.
Explore the internet to find examples of misleading visualizations. How does the author mislead the viewer, and is it intentional? Try correcting the visualizations to show how they should look.
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/25)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
@ -164,19 +164,19 @@ https://gizmodo.com/how-to-lie-with-data-visualization-1563576606
http://ixd.prattsi.org/2017/12/visual-lies-usability-in-deceptive-data-visualizations/
Explore these interesting visualizations of historical assets and artifacts:
Check out these interesting visualizations of historical assets and artifacts:
https://handbook.pubpub.org/
Read this article on how animation can enhance visualizations:
Read this article on how animation can enhance your visualizations:
https://medium.com/@EvanSinar/use-animation-to-supercharge-data-visualization-cd905a882ad4
## Assignment
[Create your own custom visualization](assignment.md)
[Build your own custom visualization](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "c368f8f2506fe56bca0f7be05c4eb71d",
"translation_date": "2025-08-31T11:00:47+00:00",
"original_hash": "79ca8a5a3135e94d2d43f56ba62d5205",
"translation_date": "2025-09-05T07:41:32+00:00",
"source_file": "4-Data-Science-Lifecycle/14-Introduction/README.md",
"language_code": "en"
}
@ -15,7 +15,7 @@ CO_OP_TRANSLATOR_METADATA:
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/26)
By now, you've likely realized that data science is a process. This process can be divided into five stages:
By now, you've likely realized that data science is a structured process. This process can be divided into five stages:
- Capturing
- Processing
@ -31,7 +31,7 @@ This lesson focuses on three parts of the lifecycle: capturing, processing, and
## Capturing
The first stage of the lifecycle is crucial because the subsequent stages depend on it. It essentially combines two steps: acquiring the data and defining the purpose and problems to be addressed.
Defining the project's goals requires a deeper understanding of the problem or question. First, we need to identify and engage with those whose problem needs solving. These could be stakeholders in a business or project sponsors who can clarify who or what will benefit from the project, as well as what they need and why. A well-defined goal should be measurable and quantifiable to determine an acceptable outcome.
Defining the project's goals requires a deeper understanding of the problem or question. First, you need to identify and engage with those who need their problem solved. These could be stakeholders in a business or project sponsors who can help determine who or what will benefit from the project, as well as what they need and why. A well-defined goal should be measurable and quantifiable to establish an acceptable outcome.
Questions a data scientist might ask:
- Has this problem been tackled before? What was discovered?
@ -39,9 +39,9 @@ Questions a data scientist might ask:
- Is there any ambiguity, and how can it be reduced?
- What are the constraints?
- What might the end result look like?
- What resources (time, people, computational) are available?
- What resources (time, personnel, computational) are available?
Next, the focus shifts to identifying, collecting, and exploring the data needed to achieve these defined goals. During the acquisition step, data scientists must also evaluate the quantity and quality of the data. This involves some data exploration to ensure that the acquired data will support achieving the desired outcome.
Next, you need to identify, collect, and explore the data required to achieve these defined goals. During the acquisition step, data scientists must also assess the quantity and quality of the data. This involves some data exploration to ensure that the acquired data will support achieving the desired outcome.
Questions a data scientist might ask about the data:
- What data is already available to me?
@ -49,53 +49,53 @@ Questions a data scientist might ask about the data:
- What are the privacy concerns?
- Do I have enough data to solve this problem?
- Is the data of sufficient quality for this problem?
- If additional insights are uncovered through this data, should we consider revising or redefining the goals?
- If additional insights are discovered through this data, should the goals be reconsidered or redefined?
## Processing
The processing stage of the lifecycle focuses on uncovering patterns in the data and building models. Some techniques used in this stage involve statistical methods to identify patterns. For large datasets, this task would be too time-consuming for a human, so computers are used to speed up the process. This stage is also where data science and machine learning intersect. As you learned in the first lesson, machine learning involves building models to understand the data. Models represent the relationships between variables in the data and help predict outcomes.
The processing stage of the lifecycle focuses on uncovering patterns in the data and building models. Some techniques used in this stage rely on statistical methods to identify patterns. For large datasets, this task is typically too time-consuming for humans and requires computers to handle the workload efficiently. This stage is also where data science intersects with machine learning. As you learned in the first lesson, machine learning involves building models to understand the data. Models represent the relationships between variables in the data and help predict outcomes.
Common techniques used in this stage are covered in the ML for Beginners curriculum. Follow the links to learn more about them:
- [Classification](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification): Organizing data into categories for more efficient use.
- [Clustering](https://github.com/microsoft/ML-For-Beginners/tree/main/5-Clustering): Grouping data into similar clusters.
- [Regression](https://github.com/microsoft/ML-For-Beginners/tree/main/2-Regression): Determining relationships between variables to predict or forecast values.
- [Regression](https://github.com/microsoft/ML-For-Beginners/tree/main/2-Regression): Identifying relationships between variables to predict or forecast values.
## Maintaining
In the lifecycle diagram, you may have noticed that maintenance sits between capturing and processing. Maintenance is an ongoing process of managing, storing, and securing the data throughout the project and should be considered throughout the project's duration.
In the lifecycle diagram, you may notice that maintenance is positioned between capturing and processing. Maintenance is an ongoing process of managing, storing, and securing the data throughout the project and should be considered throughout its entirety.
### Storing Data
Decisions about how and where data is stored can impact storage costs and the performance of data access. These decisions are unlikely to be made by a data scientist alone, but they may influence how the data scientist works with the data based on its storage method.
Decisions about how and where data is stored can impact storage costs and the performance of data access. These decisions are unlikely to be made solely by a data scientist, but they may influence how the data is handled based on its storage method.
Here are some aspects of modern data storage systems that can affect these decisions:
**On-premise vs. off-premise vs. public or private cloud**
**On-premise vs off-premise vs public or private cloud**
On-premise refers to hosting and managing data on your own equipment, such as owning a server with hard drives to store the data. Off-premise relies on equipment you don't own, such as a data center. The public cloud is a popular choice for storing data, requiring no knowledge of how or where the data is stored. Public refers to a shared underlying infrastructure used by all cloud users. Some organizations have strict security policies requiring complete control over the equipment where the data is hosted, so they use a private cloud that provides dedicated cloud services. You'll learn more about data in the cloud in [later lessons](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/5-Data-Science-In-Cloud).
On-premise refers to hosting and managing data on your own equipment, such as owning a server with hard drives to store the data. Off-premise relies on equipment you don't own, such as a data center. The public cloud is a popular choice for storing data, requiring no knowledge of how or where the data is stored. Public refers to a shared underlying infrastructure used by all cloud users. Some organizations have strict security policies requiring complete access to the equipment hosting their data and may opt for a private cloud that offers dedicated cloud services. You'll learn more about cloud data in [later lessons](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/5-Data-Science-In-Cloud).
**Cold vs. hot data**
**Cold vs hot data**
When training models, you may need more training data. Once your model is finalized, additional data will still arrive for the model to fulfill its purpose. In either case, the cost of storing and accessing data increases as more data accumulates. Separating rarely used data (cold data) from frequently accessed data (hot data) can be a more cost-effective storage solution using hardware or software services. Accessing cold data may take longer compared to hot data.
When training models, you may need more training data. Once satisfied with your model, additional data will arrive for the model to fulfill its purpose. In either case, the cost of storing and accessing data increases as more data accumulates. Separating rarely used data (cold data) from frequently accessed data (hot data) can be a cost-effective storage solution using hardware or software services. Accessing cold data may take longer compared to hot data.
### Managing Data
As you work with data, you may find that some of it needs cleaning using techniques covered in the lesson on [data preparation](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/2-Working-With-Data/08-data-preparation) to build accurate models. When new data arrives, it will require similar processing to maintain quality consistency. Some projects use automated tools for cleansing, aggregation, and compression before moving the data to its final location. Azure Data Factory is an example of such a tool.
As you work with data, you may find that some of it needs cleaning using techniques covered in the lesson on [data preparation](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/2-Working-With-Data/08-data-preparation) to build accurate models. When new data arrives, similar techniques will need to be applied to maintain quality consistency. Some projects use automated tools for cleansing, aggregation, and compression before moving the data to its final location. Azure Data Factory is an example of such a tool.
### Securing the Data
One of the main goals of securing data is ensuring that those working with it have control over what is collected and how it is used. Keeping data secure involves limiting access to only those who need it, complying with local laws and regulations, and maintaining ethical standards, as discussed in the [ethics lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/1-Introduction/02-ethics).
A key goal of securing data is ensuring that those working with it control what is collected and how it is used. Keeping data secure involves limiting access to only those who need it, adhering to local laws and regulations, and maintaining ethical standards, as discussed in the [ethics lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/1-Introduction/02-ethics).
Here are some steps a team might take to ensure security:
- Ensure all data is encrypted.
- Provide customers with information about how their data is used.
- Remove data access for individuals who have left the project.
- Restrict data modification to specific project members.
Here are some security measures a team might take:
- Ensure all data is encrypted
- Provide customers with information on how their data is used
- Remove data access for individuals who leave the project
- Restrict data modification to specific project members
## 🚀 Challenge
There are many versions of the Data Science Lifecycle, with different names and numbers of stages, but they all include the processes discussed in this lesson.
There are various versions of the Data Science Lifecycle, where steps may have different names and numbers of stages but include the same processes discussed in this lesson.
Explore the [Team Data Science Process lifecycle](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle) and the [Cross-industry standard process for data mining](https://www.datascience-pm.com/crisp-dm-2/). Identify three similarities and differences between the two.
@ -104,7 +104,7 @@ Explore the [Team Data Science Process lifecycle](https://docs.microsoft.com/en-
|![Team Data Science Lifecycle](../../../../4-Data-Science-Lifecycle/14-Introduction/images/tdsp-lifecycle2.png) | ![Data Science Process Alliance Image](../../../../4-Data-Science-Lifecycle/14-Introduction/images/CRISP-DM.png) |
| Image by [Microsoft](https://docs.microsoft.com/azure/architecture/data-science-process/lifecycle) | Image by [Data Science Process Alliance](https://www.datascience-pm.com/crisp-dm-2/) |
## [Post-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/27)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
@ -120,4 +120,4 @@ Applying the Data Science Lifecycle involves multiple roles and tasks, with some
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "d92f57eb110dc7f765c05cbf0f837c77",
"translation_date": "2025-08-31T11:00:04+00:00",
"original_hash": "a167aa0bfb1c46ece1b3d21ae939cc0d",
"translation_date": "2025-09-05T07:40:54+00:00",
"source_file": "4-Data-Science-Lifecycle/15-analyzing/README.md",
"language_code": "en"
}
@ -17,39 +17,39 @@ CO_OP_TRANSLATOR_METADATA:
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/28)
The "Analyzing" phase in the data lifecycle ensures that the data can address the questions posed or solve a specific problem. This step may also involve verifying that a model is effectively tackling these questions and issues. This lesson focuses on Exploratory Data Analysis (EDA), which includes techniques for identifying features and relationships within the data, as well as preparing the data for modeling.
The "Analyzing" phase in the data lifecycle ensures that the data can address the questions posed or solve a specific problem. This step may also focus on verifying that a model is effectively tackling these questions and problems. This lesson centers on Exploratory Data Analysis (EDA), which involves techniques for identifying features and relationships within the data and preparing it for modeling.
We'll use an example dataset from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv/version/1) to demonstrate how this can be done using Python and the Pandas library. This dataset contains counts of common words found in emails, with the sources of these emails anonymized. Use the [notebook](../../../../4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb) in this directory to follow along.
We'll use an example dataset from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv/version/1) to demonstrate how this can be applied using Python and the Pandas library. This dataset includes counts of common words found in emails, with the sources of these emails anonymized. Use the [notebook](../../../../4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb) in this directory to follow along.
## Exploratory Data Analysis
The "Capture" phase of the lifecycle involves acquiring data and defining the problems and questions at hand. But how can we confirm that the data will support the desired outcomes?
A data scientist might ask the following questions when working with acquired data:
The "Capture" phase of the lifecycle involves acquiring data as well as defining the problems and questions at hand. But how can we confirm that the data will support the desired outcomes?
A data scientist might ask the following questions when acquiring data:
- Do I have enough data to solve this problem?
- Is the data of sufficient quality for this problem?
- If new insights emerge from the data, should we consider revising or redefining the goals?
- If new insights emerge from this data, should we consider revising or redefining the goals?
Exploratory Data Analysis is the process of familiarizing yourself with the data and can help answer these questions, as well as identify challenges associated with the dataset. Let's explore some techniques used to achieve this.
## Data Profiling, Descriptive Statistics, and Pandas
How can we determine if we have enough data to solve the problem? Data profiling provides a summary and general overview of the dataset using descriptive statistics techniques. Data profiling helps us understand what is available, while descriptive statistics help us understand how much is available.
How can we determine if we have enough data to solve the problem? Data profiling provides a summary and general overview of the dataset using descriptive statistics techniques. Data profiling helps us understand what is available, while descriptive statistics help us understand the quantity and characteristics of the data.
In previous lessons, we used Pandas to generate descriptive statistics with the [`describe()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html). This function provides the count, maximum and minimum values, mean, standard deviation, and quantiles for numerical data. Using descriptive statistics like `describe()` can help you evaluate whether you have sufficient data or need more.
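As a brief sketch with the spam email dataset mentioned above (the file name is an assumption), `describe()` profiles every numeric column at a glance:

```python
import pandas as pd

# Assumed file name for the Kaggle spam email dataset used in this lesson.
emails = pd.read_csv("emails.csv")

# Shape gives a quick sense of how much data is available...
print(emails.shape)

# ...and describe() summarizes count, mean, std, min, max and quartiles
# for every numeric column.
print(emails.describe())
```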
## Sampling and Querying
Exploring every detail in a large dataset can be time-consuming and is often left to computers. However, sampling is a useful technique for gaining a better understanding of the data and what it represents. By working with a sample, you can apply probability and statistics to draw general conclusions about the dataset. While there's no strict rule for how much data to sample, it's worth noting that larger samples lead to more accurate generalizations about the data.
Exploring every detail in a large dataset can be time-consuming and is often left to computers. However, sampling is a useful technique for gaining a better understanding of the data and what it represents. By working with a sample, you can apply probability and statistics to draw general conclusions about the dataset. While there's no strict rule for how much data to sample, it's important to note that larger samples lead to more accurate generalizations.
Pandas includes the [`sample()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), which allows you to specify the number of random samples you want to extract and use.
Pandas includes the [`sample()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), which allows you to specify the number of random samples you want to extract and analyze.
General querying of the data can help answer specific questions or test theories you may have. Unlike sampling, queries allow you to focus on particular parts of the dataset that are relevant to your questions. The [`query()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library lets you select columns and retrieve rows to answer specific questions about the data.
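A short sketch of both functions, using the same assumed `emails` DataFrame; the column name in the query is illustrative rather than taken from the dataset's actual schema:

```python
import pandas as pd

emails = pd.read_csv("emails.csv")  # assumed file name, as above

# Pull a random subset to inspect instead of scanning every row.
subset = emails.sample(n=100, random_state=1)
print(subset.head())

# Query rows that match a condition -- the column name here is illustrative.
frequent = emails.query("the > 10")
print(len(frequent))
```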
## Exploring with Visualizations
You don't need to wait until the data is fully cleaned and analyzed to start creating visualizations. In fact, visualizations can be helpful during exploration, as they can reveal patterns, relationships, and issues within the data. Additionally, visualizations provide a way to communicate findings to people who aren't directly involved in managing the data, offering an opportunity to share and refine questions that may not have been addressed during the "Capture" phase. Refer to the [section on Visualizations](../../../../../../../../../3-Data-Visualization) to learn more about popular methods for visual exploration.
You don't need to wait until the data is fully cleaned and analyzed to start creating visualizations. In fact, visual representations during exploration can help identify patterns, relationships, and issues within the data. Additionally, visualizations provide a way to communicate findings to stakeholders who may not be directly involved in data management. This can also be an opportunity to address new questions that weren't considered during the "Capture" phase. Refer to the [section on Visualizations](../../../../../../../../../3-Data-Visualization) to learn more about popular methods for visual exploration.
## Exploring to Identify Inconsistencies
The techniques covered in this lesson can help identify missing or inconsistent values, but Pandas also offers specific functions for detecting these issues. [isna() or isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) can be used to check for missing values. An important part of exploring these values is understanding why they are missing in the first place. This insight can guide you in deciding what [actions to take to resolve them](../../../../../../../../../2-Working-With-Data/08-data-preparation/notebook.ipynb).
The techniques covered in this lesson can help identify missing or inconsistent values, but Pandas also offers specific functions for detecting these issues. [isna() or isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) can be used to check for missing values. An important aspect of exploring these values is understanding why they occurred in the first place. This insight can guide you in deciding what [actions to take to resolve them](../../../../../../../../../2-Working-With-Data/08-data-preparation/notebook.ipynb).
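A minimal sketch of that missing-value check, under the same assumptions as above:

```python
import pandas as pd

emails = pd.read_csv("emails.csv")  # assumed file name, as above

# Per-column count of missing values.
print(emails.isna().sum())

# Rows containing at least one missing value, for closer inspection.
print(emails[emails.isna().any(axis=1)])
```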
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/27)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Assignment

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1ac43023e78bfe76481a32c878ace516",
"translation_date": "2025-08-31T11:01:18+00:00",
"original_hash": "fcac117b5c4793907fb76f55c98fd40e",
"translation_date": "2025-09-05T07:42:05+00:00",
"source_file": "4-Data-Science-Lifecycle/16-communication/README.md",
"language_code": "en"
}
@ -20,100 +20,94 @@ Test your knowledge of the upcoming content with the Pre-Lecture Quiz above!
# Introduction
### What is Communication?
Let's begin this lesson by defining communication. **To communicate is to convey or exchange information.** Information can include ideas, thoughts, feelings, messages, subtle signals, data—anything that a **_sender_** (someone sharing information) wants a **_receiver_** (someone receiving information) to understand. In this lesson, we'll refer to senders as communicators and receivers as the audience.
Let's begin this lesson by defining communication. **To communicate is to share or exchange information.** This information can include ideas, thoughts, feelings, messages, signals, data—anything that a **_sender_** (someone sharing information) wants a **_receiver_** (someone receiving information) to understand. In this lesson, we'll refer to senders as communicators and receivers as the audience.
### Data Communication & Storytelling
When communicating, the goal is to convey or exchange information. However, when communicating data, your goal shouldn't just be to pass along numbers. Instead, you should aim to tell a story informed by your data—effective data communication and storytelling go hand-in-hand. Your audience is more likely to remember a story you tell than a number you share. Later in this lesson, we'll explore ways to use storytelling to communicate your data more effectively.
When communicating, the goal is to share or exchange information. However, when communicating data, the goal isn't just to pass along numbers. Instead, the aim is to tell a story informed by your data—effective data communication and storytelling go hand-in-hand. Your audience is more likely to remember a story than a set of numbers. Later in this lesson, we'll explore ways to use storytelling to communicate your data more effectively.
### Types of Communication
This lesson will cover two types of communication: One-Way Communication and Two-Way Communication.
**One-way communication** occurs when a sender shares information with a receiver without expecting feedback or a response. Examples of one-way communication include mass emails, news broadcasts, or TV commercials that inform you about a product. In these cases, the sender's goal is to deliver information, not to engage in an exchange.
**One-way communication** occurs when a sender shares information with a receiver without expecting feedback or a response. Examples include mass emails, news broadcasts, or TV commercials. In these cases, the sender's goal is simply to deliver information, not to engage in an exchange.
**Two-way communication** happens when all parties act as both senders and receivers. A sender begins by sharing information, and the receiver provides feedback or a response. This is the type of communication we typically think of, such as conversations in person, over the phone, on social media, or via text messages.
**Two-way communication** involves all parties acting as both senders and receivers. A sender shares information, and the receiver provides feedback or a response. This is the type of communication we typically associate with conversations—whether in person, over the phone, via social media, or through text messages.
When communicating data, you may use one-way communication (e.g., presenting at a conference or to a large group where questions won't be asked immediately) or two-way communication (e.g., persuading stakeholders for buy-in or convincing a teammate to invest time and effort in a new project).
When communicating data, you may use one-way communication (e.g., presenting at a conference or to a large group where questions aren't asked immediately) or two-way communication (e.g., persuading stakeholders or convincing a teammate to invest time and effort in a new initiative).
# Effective Communication
### Your Responsibilities as a Communicator
As a communicator, it's your responsibility to ensure that your audience takes away the information you want them to understand. When communicating data, you don't just want your audience to remember numbers—you want them to grasp a story informed by your data. A good data communicator is also a good storyteller.
As a communicator, it's your responsibility to ensure your audience understands the information you want them to take away. When communicating data, your goal isn't just for your audience to remember numbers—it's for them to grasp a story informed by your data. A skilled data communicator is also a skilled storyteller.
How do you tell a story with data? There are countless ways, but here are six strategies we'll discuss in this lesson:
1. Understand Your Audience, Your Medium, & Your Communication Method
1. Understand Your Audience, Your Channel, & Your Communication Method
2. Begin with the End in Mind
3. Approach it Like an Actual Story
3. Approach It Like an Actual Story
4. Use Meaningful Words & Phrases
5. Use Emotion
Each of these strategies is explained in detail below.
Each strategy is explained in detail below.
### 1. Understand Your Audience, Your Channel & Your Communication Method
The way you communicate with family members is likely different from how you communicate with friends. You probably use different words and phrases tailored to the people you're speaking to. The same principle applies when communicating data. Consider who your audience is, their goals, and the context they have about the situation you're explaining.
The way you communicate with family members is likely different from how you communicate with friends. You use words and phrases tailored to the people you're speaking to. The same principle applies when communicating data. Consider who your audience is, their goals, and the context they have regarding the situation you're explaining.
You can often categorize your audience. In a _Harvard Business Review_ article, “[How to Tell a Story with Data](http://blogs.hbr.org/2013/04/how-to-tell-a-story-with-data/),” Dell Executive Strategist Jim Stikeleather identifies five audience categories:
You can often categorize your audience into one of five groups, as outlined by Dell Executive Strategist Jim Stikeleather in a _Harvard Business Review_ article, “[How to Tell a Story with Data](http://blogs.hbr.org/2013/04/how-to-tell-a-story-with-data/).”
- **Novice**: First exposure to the subject but doesn't want oversimplification.
- **Generalist**: Aware of the topic but looking for an overview and major themes.
- **Managerial**: Seeks an in-depth, actionable understanding of intricacies and interrelationships, with access to details.
- **Expert**: Prefers exploration and discovery over storytelling, with a focus on great detail.
- **Executive**: Has limited time and wants to understand the significance and conclusions of weighted probabilities.
- **Novice**: First exposure to the subject, but doesn't want oversimplification.
- **Generalist**: Familiar with the topic, seeking an overview and major themes.
- **Managerial**: Requires actionable understanding of details and interrelationships.
- **Expert**: Prefers exploration and discovery, with less emphasis on storytelling.
- **Executive**: Focused on significance and conclusions, with limited time.
These categories can guide how you present data to your audience.
Additionally, consider the channel you're using to communicate. Your approach will differ if you're writing a memo or email versus presenting at a meeting or conference.
Additionally, consider the channel you're using to communicate—whether it's a memo, email, meeting, or conference presentation. Your approach should adapt accordingly.
Understanding whether you'll use one-way or two-way communication is also critical. For example:
- If your audience is mostly Novices and you're using one-way communication, you'll need to educate them and provide context before presenting your data and explaining its significance. Clarity is key since they can't ask direct questions.
- If your audience is mostly Managerial and you're using two-way communication, you can likely skip the context and dive into the data and its implications. However, you'll need to manage timing and keep the discussion on track, as questions may arise that could derail your story.
Finally, understand whether you'll be using one-way or two-way communication. For example:
- If your audience is primarily Novice and you're using one-way communication, you'll need to educate them, provide context, and clearly explain your data and its significance.
- If your audience is primarily Managerial and you're using two-way communication, you can likely skip the context and dive straight into the data. However, you'll need to manage timing and keep the discussion focused on your story, as questions may arise that could derail the conversation.
### 2. Begin With The End In Mind
Starting with the end in mind means knowing your intended takeaways for the audience before you begin communicating. Being clear about what you want your audience to learn helps you craft a coherent story. This approach works for both one-way and two-way communication.
Starting with the end in mind means identifying your intended takeaways for your audience before you begin communicating. This helps you craft a coherent story that your audience can follow. This approach works for both one-way and two-way communication.
How do you start with the end in mind? Before communicating your data, write down your key takeaways. As you prepare your story, continually ask yourself, “How does this fit into the story I'm telling?”
How do you start with the end in mind? Write down your key takeaways before you begin. As you prepare your story, continually ask yourself, “How does this fit into the story I'm telling?”
**Caution**: While starting with the end in mind is ideal, avoid cherry-picking data—only sharing data that supports your point while ignoring other data. If some of your data contradicts your takeaways, share it honestly and explain why you're sticking with your conclusions despite the conflicting data.
Be cautious, though—don't cherry-pick data. Cherry-picking occurs when a communicator only shares data that supports their point while ignoring other data. If some of your data contradicts your intended takeaways, share it anyway. Be transparent with your audience about why you're sticking to your story despite conflicting data.
### 3. Approach it Like an Actual Story
Traditional stories often follow five phases: Exposition, Rising Action, Climax, Falling Action, and Denouement. Or, more simply: Context, Conflict, Climax, Closure, and Conclusion. You can use a similar structure when communicating data.
### 3. Approach It Like an Actual Story
Traditional stories follow five phases: Exposition, Rising Action, Climax, Falling Action, and Denouement—or, more simply, Context, Conflict, Climax, Closure, and Conclusion. You can use a similar structure when communicating your data.
- **Context**: Set the stage and ensure everyone is on the same page.
- **Conflict**: Explain why you collected the data and the problem youre addressing.
- **Climax**: Present the data, its meaning, and the solutions it suggests.
- **Closure**: Reiterate the problem and proposed solutions.
- **Conclusion**: Summarize key takeaways and recommend next steps.
Start with context to ensure your audience is on the same page. Introduce the conflict—why did you collect this data? What problem were you trying to solve? Then present the climax—what does the data reveal? What solutions does it suggest? Follow with closure, reiterating the problem and proposed solutions. Finally, conclude by summarizing key takeaways and recommending next steps.
### 4. Use Meaningful Words & Phrases
Vague language can lead to confusion. For example, if you say, “Our users take a long time to onboard onto our platform,” your audience might interpret “long time” differently—some might think an hour, others a week. Instead, say, “Our users take, on average, 3 minutes to sign up and onboard onto our platform.”
When communicating data, don't assume your audience thinks like you. Clarity is your responsibility. If your data or story isn't clear, your audience may struggle to follow and miss your key takeaways.
Use specific, meaningful words and phrases instead of vague ones. For example:
- “We had an *impressive* year!” (How impressive? 2% growth? 50% growth?)
- “Our users' success rates increased *dramatically*.” (What does “dramatically” mean?)
- “This project will require *significant* effort.” (How much effort is “significant”?)
While vague words can be useful for introductions or summaries, ensure the core of your presentation is clear.
### 5. Use Emotion
Emotion is a powerful tool in storytelling, especially when communicating data. Evoking emotion helps your audience empathize, increases the likelihood they'll remember your message, and motivates them to take action.
You've likely seen this in TV commercials—some use somber tones to highlight serious issues, while others use upbeat emotions to associate their data with positivity.
Here are a few ways to incorporate emotion into your data communication:
- **Testimonials and Personal Stories**: Collect both quantitative and qualitative data, and integrate personal stories to complement your numbers.
- **Imagery**: Use visuals to help your audience connect emotionally with your data.
- **Color**: Different colors evoke different emotions. For example:
- Blue: Peace and trust
- Green: Nature and environment
- Red: Passion and excitement
- Yellow: Optimism and happiness
Be mindful that color meanings can vary across cultures.
# Communication Case Study
Emerson is a Product Manager for a mobile app. Emerson notices that customers submit 42% more complaints and bug reports on weekends. Additionally, customers whose complaints go unanswered for more than 48 hours are 32% more likely to rate the app 1 or 2 stars in the app store.
After researching, Emerson identifies two solutions to address the issue. Emerson schedules a 30-minute meeting with the three company leads to present the data and proposed solutions.
@ -121,8 +115,8 @@ The goal of the meeting is to help the company leads understand that the followi
**Solution 1.** Hire customer service reps to work on weekends.
**Solution 2.** Purchase a new customer service ticketing system that allows reps to prioritize complaints based on how long they've been in the queue.
In the meeting, Emerson spends 5 minutes explaining why having a low rating on the app store is problematic, 10 minutes discussing the research process and how trends were identified, 10 minutes reviewing recent customer complaints, and the last 5 minutes briefly mentioning two potential solutions.
Was this an effective way for Emerson to communicate during this meeting?
@ -142,15 +136,13 @@ It could be presented like this: “Users submit 42% more complaints and bug rep
**Climax** After establishing the context and conflict, Emerson could move to the climax for about 5 minutes.
Here, Emerson could introduce the proposed solutions, explain how they address the outlined issues, detail how they could be integrated into current workflows, provide cost estimates, highlight the ROI, and perhaps even share screenshots or wireframes of how the solutions would look in practice. Emerson could also include testimonials from users who experienced delays in complaint resolution and feedback from a current customer service representative about the existing ticketing system.
**Closure** Emerson could then spend 5 minutes summarizing the company's challenges, revisiting the proposed solutions, and reinforcing why these solutions are the right choice.
**Conclusion** Since this is a meeting with a few stakeholders involving two-way communication, Emerson could allocate 10 minutes for questions to ensure any confusion among the team leads is addressed before the meeting ends.
---
If Emerson adopted this second approach, it's much more likely the team leads would leave the meeting with the intended takeaways: that the way complaints and bugs are handled needs improvement, and there are two actionable solutions to make that improvement happen. This approach would be far more effective in communicating the data and the story Emerson wants to convey.
# Conclusion
### Summary of main points
@ -203,14 +195,10 @@ If Emerson adopted approach #2, its far more likely the team leads would leav
[1. Communicating Data - Communicating Data with Tableau [Book] (oreilly.com)](https://www.oreilly.com/library/view/communicating-data-with/9781449372019/ch01.html)
---
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
Review what you've just learned with the Post-Lecture Quiz above!
---
## Assignment
[Market Research](assignment.md)

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "408c55cab2880daa4e78616308bd5db7",
"translation_date": "2025-08-31T10:56:02+00:00",
"original_hash": "6a0556b17de4c8d1a9470b02247b01d4",
"translation_date": "2025-09-05T07:37:30+00:00",
"source_file": "5-Data-Science-In-Cloud/17-Introduction/README.md",
"language_code": "en"
}
@ -13,63 +13,62 @@ CO_OP_TRANSLATOR_METADATA:
|:---:|
| Data Science In The Cloud: Introduction - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you will learn the basic concepts of the Cloud, understand why using Cloud services can be beneficial for your data science projects, and explore examples of data science projects implemented in the Cloud.
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/32)
## What is the Cloud?
The Cloud, or Cloud Computing, refers to the delivery of a variety of pay-as-you-go computing services hosted on infrastructure over the internet. These services include storage, databases, networking, software, analytics, and intelligent services.
Clouds are typically categorized into Public, Private, and Hybrid clouds:
* Public cloud: A public cloud is owned and operated by a third-party cloud service provider that delivers its computing resources over the internet to the general public.
* Private cloud: Refers to cloud computing resources used exclusively by a single business or organization, with services and infrastructure maintained on a private network.
* Hybrid cloud: A hybrid cloud combines public and private clouds. Users can maintain an on-premises datacenter while running data and applications on one or more public clouds.
Most cloud computing services fall into three main categories: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
* Infrastructure as a Service (IaaS): Users rent IT infrastructure such as servers, virtual machines (VMs), storage, networks, and operating systems.
* Platform as a Service (PaaS): Users rent an environment for developing, testing, delivering, and managing software applications without worrying about the underlying infrastructure like servers, storage, networks, and databases.
* Software as a Service (SaaS): Users access software applications over the internet, typically on a subscription basis, without managing hosting, infrastructure, or maintenance tasks like upgrades and security patches.
Some of the largest Cloud providers include Amazon Web Services, Google Cloud Platform, and Microsoft Azure.
## Why Choose the Cloud for Data Science?
Developers and IT professionals opt for the Cloud for various reasons, including:
* Innovation: Integrate cutting-edge services provided by Cloud providers directly into your applications.
* Flexibility: Pay only for the services you need, with a wide range of options. Typically, you pay as you go and adjust services based on your evolving needs.
* Budget: Avoid upfront investments in hardware and software, as well as the setup and operation of on-site datacenters. Pay only for what you use.
* Scalability: Scale resources based on project needs, allowing applications to adjust computing power, storage, and bandwidth in response to external factors.
* Productivity: Focus on your business rather than spending time on tasks like managing datacenters.
* Reliability: Benefit from continuous data backups and disaster recovery plans to ensure business continuity during crises.
* Security: Leverage policies, technologies, and controls to enhance the security of your projects.
These are some of the most common reasons for using Cloud services. Now that we understand what the Cloud is and its benefits, let's explore how it can help data scientists and developers address challenges such as:
* Storing large amounts of data: Instead of managing and protecting large servers, store data directly in the Cloud using solutions like Azure Cosmos DB, Azure SQL Database, and Azure Data Lake Storage.
* Performing Data Integration: Data integration is crucial for transitioning from data collection to actionable insights. Cloud services like Data Factory enable you to collect, transform, and integrate data from various sources into a single data warehouse.
* Processing data: Processing large datasets requires significant computing power, which many individuals lack. The Cloud's vast computing resources can be harnessed to run and deploy solutions.
* Using data analytics services: Services like Azure Synapse Analytics, Azure Stream Analytics, and Azure Databricks help transform data into actionable insights.
* Using Machine Learning and data intelligence services: Instead of building algorithms from scratch, leverage machine learning services like AzureML. Cognitive services such as speech-to-text, text-to-speech, and computer vision are also available.
## Examples of Data Science in the Cloud
Let's explore a couple of scenarios to make this more concrete.
### Real-time social media sentiment analysis
A common beginner project in machine learning is real-time sentiment analysis of social media data.
Imagine you run a news media website and want to use live data to understand what content your readers might be interested in. You can build a program to analyze sentiment in real-time from Twitter posts on topics relevant to your audience.
Key indicators include the volume of tweets on specific topics (hashtags) and sentiment analysis using tools designed for this purpose.
Steps to create this project:
* Create an event hub to collect streaming input from Twitter.
* Configure and start a Twitter client application to call the Twitter Streaming APIs.
* Create a Stream Analytics job.
* Specify the job input and query.
@ -79,21 +78,20 @@ Steps to create this project:
For the full process, refer to the [documentation](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?WT.mc_id=academic-77958-bethanycheum&ocid=AID30411099).
### Scientific papers analysis
Here's another example: a project by [Dmitry Soshnikov](http://soshnikov.com), one of the authors of this curriculum.
Dmitry created a tool to analyze COVID-related scientific papers. This project demonstrates how to extract knowledge from scientific papers, gain insights, and help researchers navigate large collections of papers efficiently.
Steps involved:
* Extract and preprocess information using [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
* Use [Azure ML](https://azure.microsoft.com/services/machine-learning?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) to parallelize processing.
* Store and query information using [Cosmos DB](https://azure.microsoft.com/services/cosmos-db?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
* Create an interactive dashboard for data exploration and visualization using Power BI.
For the full process, visit [Dmitry's blog](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/).
As demonstrated, Cloud services offer numerous ways to perform Data Science.
## Footnote
@ -104,7 +102,7 @@ Sources:
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Assignment
@ -113,4 +111,4 @@ Sources:
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "14b2a7f1c63202920bd98eeb913f5614",
"translation_date": "2025-08-31T10:54:46+00:00",
"original_hash": "39f3b3a9d873eaa522c2e792ce0ca503",
"translation_date": "2025-09-05T07:36:26+00:00",
"source_file": "5-Data-Science-In-Cloud/18-Low-Code/README.md",
"language_code": "en"
}
@ -36,33 +36,34 @@ Table of contents:
- [Review & Self Study](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [Assignment](../../../../5-Data-Science-In-Cloud/18-Low-Code)
## [Pre-Lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## 1. Introduction
### 1.1 What is Azure Machine Learning?
The Azure cloud platform offers over 200 products and services designed to help you create innovative solutions. Data scientists spend a significant amount of time exploring and preparing data, as well as testing various model-training algorithms to achieve accurate results. These tasks can be time-intensive and often lead to inefficient use of costly compute resources.
[Azure ML](https://docs.microsoft.com/azure/machine-learning/overview-what-is-azure-machine-learning?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) is a cloud-based platform for developing and managing machine learning solutions in Azure. It provides a variety of tools and features to help data scientists prepare data, train models, deploy predictive services, and monitor their usage. Most importantly, it enhances efficiency by automating many of the repetitive tasks involved in training models. It also allows the use of scalable cloud-based compute resources, enabling the handling of large datasets while incurring costs only when resources are actively used.
Azure ML offers a comprehensive suite of tools for developers and data scientists, including:
- **Azure Machine Learning Studio**: A web portal for low-code and no-code options for model training, deployment, automation, tracking, and asset management. It integrates seamlessly with the Azure Machine Learning SDK.
- **Jupyter Notebooks**: For quickly prototyping and testing ML models.
- **Azure Machine Learning Designer**: A drag-and-drop interface for building experiments and deploying pipelines in a low-code environment.
- **Automated machine learning UI (AutoML)**: Automates repetitive tasks in model development, enabling the creation of scalable, efficient, and high-quality machine learning models.
- **Data Labeling**: A tool that assists in automatically labeling data.
- **Machine learning extension for Visual Studio Code**: A full-featured development environment for building and managing ML projects.
- **Machine learning CLI**: Command-line tools for managing Azure ML resources.
- **Integration with open-source frameworks**: Compatibility with PyTorch, TensorFlow, Scikit-learn, and other frameworks for end-to-end machine learning processes.
- **MLflow**: An open-source library for managing the lifecycle of machine learning experiments. **MLflow Tracking** logs and tracks metrics and artifacts from training runs, regardless of the experiment's environment.
### 1.2 The Heart Failure Prediction Project:
Building projects is one of the best ways to test your skills and knowledge. In this lesson, we will explore two approaches to building a data science project for predicting heart failure in Azure ML Studio: the Low code/No code method and the Azure ML SDK method, as illustrated in the following diagram:
![project-schema](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/project-schema.PNG)
Each approach has its advantages and disadvantages. The Low code/No code method is easier to start with, as it involves interacting with a graphical user interface (GUI) and requires no prior coding knowledge. This method is ideal for quickly testing a project's feasibility and creating a proof of concept (POC). However, as the project scales and needs to be production-ready, relying solely on the GUI becomes impractical. At this stage, programmatically automating tasks such as resource creation and model deployment becomes essential, which is where the Azure ML SDK comes into play.
| | Low code/No code | Azure ML SDK |
|-------------------|------------------|---------------------------|
@ -72,7 +73,7 @@ Each approach has its advantages and disadvantages. The Low code/No code method
### 1.3 The Heart Failure Dataset:
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for 31% of all global deaths. Factors such as tobacco use, unhealthy diets, obesity, physical inactivity, and excessive alcohol consumption can serve as features for predictive models. Estimating the likelihood of developing a CVD can be invaluable in preventing heart attacks in high-risk individuals.
Kaggle provides a publicly available [Heart Failure dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data), which we will use for this project. You can download the dataset now. It is a tabular dataset with 13 columns (12 features and 1 target variable) and 299 rows.
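If you want a quick sanity check of the download before uploading it, a short pandas sketch is enough. The file name below is the one Kaggle typically uses for this dataset; adjust it if your copy is named differently.

```python
import pandas as pd

# Assumed local file name for the Kaggle download (rename if yours differs)
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

print(df.shape)                          # expected: (299, 13) -> 299 rows, 12 features + 1 target
print(df["DEATH_EVENT"].value_counts())  # DEATH_EVENT is the target column used later in AutoML
```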
@ -97,18 +98,18 @@ Once you have the dataset, we can begin the project in Azure.
## 2. Low code/No code training of a model in Azure ML Studio
### 2.1 Create an Azure ML workspace
To train a model in Azure ML, you first need to create an Azure ML workspace. The workspace is the top-level resource in Azure Machine Learning, serving as a centralized hub for all the artifacts you create. It keeps a history of all training runs, including logs, metrics, outputs, and snapshots of your scripts. This information helps you identify which training run produced the best model. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-workspace?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109)
It is recommended to use the latest browser version compatible with your operating system. Supported browsers include:
- Microsoft Edge (latest version, not legacy)
- Safari (latest version, Mac only)
- Chrome (latest version)
- Firefox (latest version)
To use Azure Machine Learning, create a workspace in your Azure subscription. This workspace will allow you to manage data, compute resources, code, models, and other artifacts related to your machine learning workloads.
> **_NOTE:_** Your Azure subscription will incur a small charge for data storage as long as the Azure Machine Learning workspace exists. It is recommended to delete the workspace when it is no longer in use.
1. Sign in to the [Azure portal](https://ms.portal.azure.com/) using the Microsoft credentials associated with your Azure subscription.
2. Select **Create a resource**.
@ -135,9 +136,9 @@ To use Azure Machine Learning, create a workspace in your Azure subscription. Th
![workspace-4](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-4.PNG)
- Click **Review + create**, then click the **Create** button.
3. Wait for your workspace to be created (this may take a few minutes). Once it's ready, navigate to it in the portal. You can find it under the Machine Learning Azure service.
4. On the Overview page for your workspace, launch Azure Machine Learning Studio (or open a new browser tab and go to https://ml.azure.com). Sign in to Azure Machine Learning Studio using your Microsoft account. If prompted, select your Azure directory, subscription, and workspace.
![workspace-5](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-5.PNG)
@ -145,11 +146,11 @@ To use Azure Machine Learning, create a workspace in your Azure subscription. Th
![workspace-6](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-6.PNG)
You can manage your workspace through the Azure portal, but Azure Machine Learning Studio provides a more user-friendly interface tailored for data scientists and ML engineers.
### 2.2 Compute Resources
Compute resources are cloud-based resources used for running model training and data exploration processes. There are four types of compute resources you can create:
- **Compute Instances**: Development workstations for data scientists to work with data and models. This involves creating a Virtual Machine (VM) and launching a notebook instance. Models can then be trained by calling a compute cluster from the notebook.
- **Compute Clusters**: Scalable clusters of VMs for on-demand processing of experiment code. These are essential for training models and can include specialized GPU or CPU resources.
@ -158,13 +159,13 @@ Compute Resources are cloud-based resources used for running model training and
#### 2.2.1 Choosing the right options for your compute resources
There are several important factors to consider when creating a compute resource, and these choices can significantly impact your project.
**Do you need CPU or GPU?**
A CPU (Central Processing Unit) is the main electronic circuitry that executes instructions in a computer program. A GPU (Graphics Processing Unit) is a specialized electronic circuit designed to process graphics-related tasks at high speeds.
The key difference between CPU and GPU architecture is that CPUs are optimized for handling a wide range of tasks quickly (measured by clock speed) but are limited in the number of tasks they can run concurrently. GPUs, on the other hand, are designed for parallel computing, making them ideal for deep learning tasks.
| CPU | GPU |
|-----------------------------------------|-----------------------------|
@ -182,30 +183,30 @@ You can adjust the size of your RAM, disk, number of cores, and clock speed base
**Dedicated or Low-Priority Instances?**
Low-priority instances are interruptible, meaning Microsoft Azure can reassign these resources to other tasks, potentially interrupting your job. Dedicated instances, which are non-interruptible, ensure your job won't be terminated without your permission. This is another trade-off between time and cost, as interruptible instances are cheaper than dedicated ones.
#### 2.2.2 Creating a compute cluster
In the [Azure ML workspace](https://ml.azure.com/) we created earlier, navigate to the "Compute" section to view the different compute resources discussed (e.g., compute instances, compute clusters, inference clusters, and attached compute). For this project, we need a compute cluster for model training. In the Studio, click on the "Compute" menu, then the "Compute cluster" tab, and click the "+ New" button to create a compute cluster.
![22](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/cluster-1.PNG)
1. Choose your options: Dedicated vs Low priority, CPU or GPU, VM size, and core number (you can keep the default settings for this project).
2. Click the "Next" button.
![23](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/cluster-2.PNG)
3. Assign a name to the cluster.
4. Configure options such as the minimum/maximum number of nodes, idle seconds before scale-down, and SSH access. Note that setting the minimum number of nodes to 0 will save money when the cluster is idle. A higher maximum number of nodes will shorten training time. The recommended maximum number of nodes is 3.
5. Click the "Create" button. This step may take a few minutes.
![29](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/cluster-3.PNG)
Great! Now that we have a compute cluster, we need to load the data into Azure ML Studio.
### 2.3 Loading the Dataset
1. In the [Azure ML workspace](https://ml.azure.com/) we created earlier, click on "Datasets" in the left menu and then click the "+ Create dataset" button. Select the "From local files" option and upload the Kaggle dataset we downloaded earlier.
![24](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/dataset-1.PNG)
@ -217,52 +218,52 @@ Great! Now that the compute cluster is ready, the next step is to load the data
![26](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/dataset-3.PNG)
Excellent! With the dataset uploaded and the compute cluster created, we can now begin training the model.
### 2.4 Low code/No Code training with AutoML
Developing traditional machine learning models is resource-intensive, requiring significant domain expertise and time to compare multiple models. Automated machine learning (AutoML) simplifies this process by automating repetitive tasks in model development. AutoML enables data scientists, analysts, and developers to efficiently build high-quality ML models, reducing the time needed to create production-ready models. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
1. In the [Azure ML workspace](https://ml.azure.com/) we created earlier, click on "Automated ML" in the left menu and select the dataset you just uploaded. Click "Next."
![27](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/aml-1.PNG)
2. Enter a new experiment name, specify the target column (DEATH_EVENT), and select the compute cluster we created. Click "Next."
![28](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/aml-2.PNG)
3. Choose "Classification" and click "Finish." This step may take 30 minutes to 1 hour, depending on your compute cluster size.
![30](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/aml-3.PNG)
4. Once the run is complete, click on the "Automated ML" tab, select your run, and click on the algorithm listed in the "Best model summary" card.
![31](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/aml-4.PNG)
Here, you can view detailed information about the best model generated by AutoML. You can also explore other models in the "Models" tab. Spend some time reviewing the explanations (preview button) for the models. Once you've chosen the model you want to use (in this case, we'll select the best model chosen by AutoML), we can proceed to deploy it.
## 3. Low code/No Code model deployment and endpoint consumption
### 3.1 Model deployment
The AutoML interface allows you to deploy the best model as a web service in just a few steps. Deployment integrates the model so it can make predictions based on new data, enabling applications to identify opportunities or risks. For this project, deploying the model as a web service allows medical applications to make live predictions about patients' risk of heart failure.
In the best model description, click the "Deploy" button.
![deploy-1](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/deploy-1.PNG)
15. Provide a name, description, compute type (Azure Container Instance), enable authentication, and click "Deploy." This step may take about 20 minutes. The deployment process involves registering the model, generating resources, and configuring them for the web service. A status message will appear under "Deploy status." Click "Refresh" periodically to check the status. The deployment is complete and running when the status is "Healthy."
![deploy-2](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/deploy-2.PNG)
16. Once deployed, click on the "Endpoint" tab and select the endpoint you just deployed. Here, you can find all the details about the endpoint.
![deploy-3](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/deploy-3.PNG)
Fantastic! With the model deployed, we can now consume the endpoint.
### 3.2 Endpoint consumption
Click on the "Consume" tab. Here, you'll find the REST endpoint and a Python script under the consumption options. Take some time to review the Python code.
This script can be run directly from your local machine to consume the endpoint.
@ -274,13 +275,13 @@ Pay attention to these two lines of code:
url = 'http://98e3715f-xxxx-xxxx-xxxx-9ec22d57b796.centralus.azurecontainer.io/score'
api_key = '' # Replace this with the API key for the web service
```
The `url` variable contains the REST endpoint found in the "Consume" tab, and the `api_key` variable contains the primary key (if authentication is enabled). These variables allow the script to consume the endpoint.
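If you would rather see the whole call in one place, a minimal sketch of such a request looks like the following. This is not the exact script generated by the Studio: the endpoint URL, key, and payload values are placeholders, and the payload schema should be copied from your own "Consume" tab.

```python
import json
import urllib.request

# Placeholders: copy the real values from your endpoint's "Consume" tab
url = "http://<your-endpoint>.azurecontainer.io/score"
api_key = "<primary key, if authentication is enabled>"

# One illustrative record using the dataset's 12 feature columns
data = {"data": [{
    "age": 60, "anaemia": 0, "creatinine_phosphokinase": 500, "diabetes": 1,
    "ejection_fraction": 38, "high_blood_pressure": 0, "platelets": 260000,
    "serum_creatinine": 1.1, "serum_sodium": 137, "sex": 0, "smoking": 0, "time": 130
}]}

body = json.dumps(data).encode("utf-8")
headers = {"Content-Type": "application/json", "Authorization": "Bearer " + api_key}

request = urllib.request.Request(url, body, headers)
with urllib.request.urlopen(request) as response:
    print(response.read())  # the service answers with the predicted DEATH_EVENT value(s)
```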
18. Running the script should produce the following output:
```python
b'"{\\"result\\": [true]}"'
```
This indicates that the prediction for heart failure based on the given data is true. This makes sense because the default data in the script has all values set to 0 or false. You can modify the data using the following sample input:
```python
data = {
@ -324,19 +325,19 @@ The script should return:
Congratulations! You've successfully trained, deployed, and consumed a model on Azure ML!
> **_NOTE:_** Once you're done with the project, don't forget to delete all the resources.
## 🚀 Challenge
Examine the model explanations and details generated by AutoML for the top models. Try to understand why the best model outperformed the others. What algorithms were compared? What are the differences between them? Why is the best model more effective in this case?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
In this lesson, you learned how to train, deploy, and consume a model to predict heart failure risk using a Low code/No code approach in the cloud. If you haven't already, explore the model explanations generated by AutoML for the top models and try to understand why the best model is superior.
You can further explore Low code/No code AutoML by reading this [documentation](https://docs.microsoft.com/azure/machine-learning/tutorial-first-experiment-automated-ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
## Assignment

@ -1,8 +1,8 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "73dead89dc2ddda4d6ec0232814a191e",
"translation_date": "2025-08-31T10:56:26+00:00",
"original_hash": "5da2d6b3736f6d668b89de9bf3bdd31b",
"translation_date": "2025-09-05T07:37:53+00:00",
"source_file": "5-Data-Science-In-Cloud/19-Azure/README.md",
"language_code": "en"
}
@ -51,11 +51,11 @@ Key features of the SDK include:
- Manage cloud resources for monitoring, logging, and organizing machine learning experiments.
- Train models locally or using cloud resources, including GPU-accelerated training.
- Use automated machine learning, which takes configuration parameters and training data, and automatically tests algorithms and hyperparameter settings to find the best model for predictions.
- Deploy web services to turn trained models into RESTful services that can be integrated into applications.
[Learn more about the Azure Machine Learning SDK](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109)
In the [previous lesson](../18-Low-Code/README.md), we learned how to train, deploy, and consume a model using a Low code/No code approach. We used the Heart Failure dataset to create a heart failure prediction model. In this lesson, we will achieve the same goal but using the Azure Machine Learning SDK.
![project-schema](../../../../5-Data-Science-In-Cloud/19-Azure/images/project-schema.PNG)
@ -66,49 +66,49 @@ Refer to [this section](../18-Low-Code/README.md) for an introduction to the Hea
## 2. Training a model with the Azure ML SDK
### 2.1 Create an Azure ML workspace
For simplicity, we will work in a Jupyter notebook. This assumes you already have a Workspace and a compute instance. If you already have a Workspace, you can skip to section 2.3 Notebook creation.
If not, follow the instructions in the section **2.1 Create an Azure ML workspace** in the [previous lesson](../18-Low-Code/README.md) to create a workspace.
### 2.2 Create a compute instance
In the [Azure ML workspace](https://ml.azure.com/) created earlier, navigate to the compute menu to view available compute resources.
![compute-instance-1](../../../../5-Data-Science-In-Cloud/19-Azure/images/compute-instance-1.PNG)
To create a compute instance for provisioning a Jupyter notebook:
1. Click the + New button.
2. Assign a name to your compute instance.
3. Select options: CPU or GPU, VM size, and core count.
4. Click the Create button.
Congratulations! You've created a compute instance. We'll use this instance to create a Notebook in the [Creating Notebooks section](../../../../5-Data-Science-In-Cloud/19-Azure).
### 2.3 Loading the Dataset
If you haven't uploaded the dataset yet, refer to the section **2.3 Loading the Dataset** in the [previous lesson](../18-Low-Code/README.md).
### 2.4 Creating Notebooks
> **_NOTE:_** For the next step, you can either create a new notebook from scratch or upload the [notebook we created](../../../../5-Data-Science-In-Cloud/19-Azure/notebook.ipynb) to your Azure ML Studio. To upload it, click on the "Notebook" menu and upload the file.
Notebooks are a crucial part of the data science workflow. They can be used for Exploratory Data Analysis (EDA), training models on compute clusters, and deploying endpoints on inference clusters.
To create a Notebook, you need a compute node running the Jupyter notebook instance. Return to the [Azure ML workspace](https://ml.azure.com/) and click on Compute instances. In the list, locate the [compute instance created earlier](../../../../5-Data-Science-In-Cloud/19-Azure).
1. In the Applications section, click the Jupyter option.
2. Check the "Yes, I understand" box and click Continue.
![notebook-1](../../../../5-Data-Science-In-Cloud/19-Azure/images/notebook-1.PNG)
3. A new browser tab will open with your Jupyter notebook instance. Click the "New" button to create a notebook.
![notebook-2](../../../../5-Data-Science-In-Cloud/19-Azure/images/notebook-2.PNG)
Now that we have a Notebook, we can begin training the model using the Azure ML SDK.
### 2.5 Training a model
If you have any doubts, refer to the [Azure ML SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109). It contains all the necessary details about the modules we'll use in this lesson.
#### 2.5.1 Set up the workspace, experiment, compute cluster, and dataset
Load the `workspace` from the configuration file using the following code:
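A minimal sketch, assuming the Azure ML SDK v1 (`azureml-core`) is installed and the workspace's `config.json` file is present in the working directory:
```python
from azureml.core import Workspace

# Reads config.json (downloadable from the Azure ML studio) and authenticates the session
ws = Workspace.from_config()
```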
Next, create an `Experiment` to group and track your training runs:
```python
from azureml.core import Experiment
experiment_name = 'aml-experiment'
experiment = Experiment(ws, experiment_name)
```
To get or create an experiment in a workspace, request the experiment by name. Experiment names must be 3-36 characters long, start with a letter or number, and contain only letters, numbers, underscores, and dashes. If the experiment doesn't exist, a new one is created.
Now, create a compute cluster for training using the following code. Note that this step may take a few minutes.
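A minimal sketch, assuming the SDK v1 compute classes; the cluster name, VM size, and node counts below are placeholders you can adjust:
```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "aml-cluster"  # placeholder name

try:
    # Reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    # Otherwise provision a small CPU cluster that scales down to zero nodes when idle
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2",
                                                   min_nodes=0,
                                                   max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, config)

compute_target.wait_for_completion(show_output=True)
```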
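With the compute in place, retrieve the registered dataset and inspect it as a pandas dataframe. This is a sketch; the dataset name is an assumption and must match the name you used when registering it:
```python
from azureml.core import Dataset

dataset = Dataset.get_by_name(ws, name="heart-failure-records")  # assumed registered name

# Convert to pandas to look at the data before training
df = dataset.to_pandas_dataframe()
df.describe()
```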
Set the AutoML configuration using the [AutoMLConfig class](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig(class)?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
The documentation lists many parameters you can customize. For this project, we'll use the following:
- `experiment_timeout_minutes`: Maximum time (in minutes) for the experiment to run before stopping automatically.
- `max_concurrent_iterations`: Maximum number of concurrent training iterations allowed.
- `primary_metric`: Metric used to evaluate the experiment's progress.
- `compute_target`: Azure Machine Learning compute target for the experiment.
- `task`: Type of task (e.g., 'classification', 'regression', or 'forecasting').
- `training_data`: Training data containing features and a label column (optionally sample weights).
- `label_column_name`: Name of the label column.
- `path`: Full path to the Azure Machine Learning project folder.
- `enable_early_stopping`: Whether to stop early if the score doesn't improve.
- `featurization`: Whether to automate or customize the featurization step.
- `debug_log`: Log file for debug information.
```python
automl_config = AutoMLConfig(compute_target=compute_target,
                             **automl_settings
                             )
```
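Since `automl_settings` and the remaining arguments also need to be defined, a complete configuration might look like the following sketch. The timeout, metric, dataset variable, and label column name are assumptions to adapt to your own experiment:
```python
from azureml.train.automl import AutoMLConfig

# Assumed example settings; tune these for your experiment
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 3,
    "primary_metric": "AUC_weighted",
}

automl_config = AutoMLConfig(compute_target=compute_target,
                             task="classification",            # assumed task type
                             training_data=dataset,            # the registered dataset loaded above
                             label_column_name="DEATH_EVENT",  # assumed label column; use your own
                             enable_early_stopping=True,
                             featurization="auto",
                             debug_log="automl_errors.log",
                             **automl_settings)
```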
With the configuration set, train the model using the following code. This step may take up to an hour depending on your cluster size.
```python
remote_run = experiment.submit(automl_config)
```
Use the RunDetails widget to display the experiment progress.
```python
from azureml.widgets import RunDetails
RunDetails(remote_run).show()
```
### 3.1 Saving the best model
The `remote_run` object is of type [AutoMLRun](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109). It has a `get_output()` method that returns the best run and its fitted model.
```python
best_run, fitted_model = remote_run.get_output()
```
View the parameters of the best model by printing `fitted_model`. Check its properties using the [get_properties()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#azureml_core_Run_get_properties?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method.
```python
best_run.get_properties()
```
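The best run's model can then be registered in the workspace so it can be deployed later. A minimal sketch, assuming the default `./outputs/` folder where AutoML stores the fitted model and a placeholder description:
```python
# The AutoML run records the generated model's name in its run properties
model_name = best_run.properties['model_name']

model = best_run.register_model(model_name=model_name,
                                model_path='./outputs/',          # assumed AutoML output folder
                                description='AutoML best model',  # placeholder description
                                tags=None)
```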
### 3.2 Model Deployment
After saving the best model, deploy it using the [InferenceConfig](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model.inferenceconfig?view=azure-ml-py?ocid=AID3041109) class. InferenceConfig specifies settings for a custom environment used for deployment. The [AciWebservice](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.aciwebservice?view=azure-ml-py) class represents a machine learning model deployed as a web service endpoint on Azure Container Instances. The deployed service is a load-balanced HTTP endpoint with a REST API. You can send data to this API and receive predictions from the model.
Deploy the model using the [deploy](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py#deploy-workspace--name--models--inference-config-none--deployment-config-none--deployment-target-none--overwrite-false--show-output-false-?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method.
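Below is a minimal sketch of such a deployment, assuming the scoring script generated by AutoML is downloaded from the best run's outputs; the service name, file names, and CPU/memory sizes are placeholders:
```python
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Download the scoring script produced by AutoML (assumed output file name)
script_file_name = 'score.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)

inference_config = InferenceConfig(entry_script=script_file_name,
                                   environment=best_run.get_environment())

aci_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                memory_gb=1,
                                                description='AutoML model endpoint')  # placeholder

aci_service = Model.deploy(ws, 'automl-endpoint', [model], inference_config, aci_config)
aci_service.wait_for_deployment(show_output=True)
print(aci_service.state)
```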
To test the endpoint, build a sample input and encode it as JSON:
```python
import json

data = {
    # one or more records with the same feature columns used for training
}
test_sample = str.encode(json.dumps(data))
```
Then send this input to your model for predictions:
```python
response = aci_service.run(input_data=test_sample)
response
```
There are many other things you can do with the SDK, but unfortunately, we can't cover them all in this lesson.
**HINT:** Visit the [SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) and use keywords like "Pipeline" in the search bar. You should find the `azureml.pipeline.core.Pipeline` class in the search results.
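If you want a head start, a minimal one-step pipeline might look like the sketch below; the script name and experiment name are placeholders:
```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# A single step that runs a training script on the compute cluster created earlier
train_step = PythonScriptStep(name="train",
                              script_name="train.py",       # placeholder script
                              source_directory=".",
                              compute_target=compute_target)

pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline_run = Experiment(ws, "pipeline-experiment").submit(pipeline)  # placeholder experiment name
pipeline_run.wait_for_completion()
```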
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "f95679140c7cb39c30ccba535cd8f03f",
"translation_date": "2025-09-05T07:46:42+00:00",
"source_file": "6-Data-Science-In-Wild/20-Real-World-Examples/README.md",
"language_code": "en"
}
-->
We're nearing the end of this learning journey!
We began by defining data science and ethics, explored various tools and techniques for data analysis and visualization, reviewed the data science lifecycle, and examined how to scale and automate workflows using cloud computing services. Now, you might be wondering: _"How do I apply all these learnings to real-world scenarios?"_
In this lesson, we'll delve into real-world applications of data science across industries and explore specific examples in research, digital humanities, and sustainability. We'll also discuss student project opportunities and wrap up with resources to help you continue your learning journey.
## Pre-Lecture Quiz
[Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Data Science + Industry
The democratization of AI has made it easier for developers to design and integrate AI-driven decision-making and data-driven insights into user experiences and workflows. Here are some examples of how data science is applied in real-world industry scenarios:
* [Google Flu Trends](https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/) used data science to correlate search terms with flu trends. While the approach had its flaws, it highlighted the potential (and challenges) of data-driven healthcare predictions.
* [UPS Routing Predictions](https://www.technologyreview.com/2018/11/21/139000/how-ups-uses-ai-to-outsmart-bad-weather/) - explains how UPS uses data science and machine learning to predict optimal delivery routes, factoring in weather, traffic, deadlines, and more.
* [NYC Taxicab Route Visualization](http://chriswhong.github.io/nyctaxi/) - data obtained through [Freedom Of Information Laws](https://chriswhong.com/open-data/foil_nyc_taxi/) was used to visualize a day in the life of NYC taxis, providing insights into navigation, earnings, and trip durations over a 24-hour period.
* [Uber Data Science Workbench](https://eng.uber.com/dsw/) - leverages data from millions of daily Uber trips (pickup/dropoff locations, trip durations, preferred routes, etc.) to build analytics tools for pricing, safety, fraud detection, and navigation decisions.
* [Sports Analytics](https://towardsdatascience.com/scope-of-analytics-in-sports-world-37ed09c39860) - focuses on _predictive analytics_ (team and player analysis, like [Moneyball](https://datasciencedegree.wisconsin.edu/blog/moneyball-proves-importance-big-data-big-ideas/), and fan management) and _data visualization_ (team dashboards, fan engagement, etc.) with applications in talent scouting, sports betting, and venue management.
* [Data Science in Banking](https://data-flair.training/blogs/data-science-in-banking/) - showcases the role of data science in finance, including risk modeling, fraud detection, customer segmentation, real-time predictions, and recommender systems. Predictive analytics also play a key role in measures like [credit scores](https://dzone.com/articles/using-big-data-and-predictive-analytics-for-credit).
* [Data Science in Healthcare](https://data-flair.training/blogs/data-science-in-healthcare/) - highlights applications such as medical imaging (MRI, X-Ray, CT-Scan), genomics (DNA sequencing), drug development (risk assessment, success prediction), predictive analytics (patient care and logistics), and disease tracking/prevention.
![Data Science Applications in The Real World](../../../../6-Data-Science-In-Wild/20-Real-World-Examples/images/data-science-applications.png) Image Credit: [Data Flair: 6 Amazing Data Science Applications](https://data-flair.training/blogs/data-science-applications/)
The figure illustrates other domains and examples of data science applications. Want to explore more? Check out the [Review & Self Study](../../../../6-Data-Science-In-Wild/20-Real-World-Examples) section below.
## Data Science + Research
Data Science & Research - _Sketchnote by [@nitya](https://twitter.com/nitya)_
While industry applications often focus on large-scale use cases, research projects can provide value in two key areas:
* _Innovation opportunities_ - enabling rapid prototyping of advanced concepts and testing user experiences for next-generation applications.
* _Deployment challenges_ - examining potential harms or unintended consequences of data science technologies in real-world contexts.
For students, research projects offer learning and collaboration opportunities that deepen understanding and foster connections with experts in areas of interest. So, what do research projects look like, and how can they make an impact?
Consider the [MIT Gender Shades Study](http://gendershades.org/overview.html) by Joy Buolamwini (MIT Media Labs), with a [signature research paper](http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf) co-authored with Timnit Gebru (then at Microsoft Research):
* **What:** The study aimed to _evaluate bias in automated facial analysis algorithms and datasets_ based on gender and skin type.
* **Why:** Facial analysis is used in areas like law enforcement, airport security, and hiring systems—contexts where inaccurate classifications (e.g., due to bias) can lead to economic and social harm. Addressing bias is crucial for fairness.
* **How:** Researchers noted that existing benchmarks predominantly featured lighter-skinned subjects. They curated a new dataset (1000+ images) balanced by gender and skin type, which was used to evaluate the accuracy of three gender classification products (Microsoft, IBM, Face++).
Results revealed that while overall accuracy was good, error rates varied significantly across subgroups, with **misgendering** being higher for females and individuals with darker skin tones, indicating bias.
**Key Outcomes:** The study emphasized the need for _representative datasets_ (balanced subgroups) and _inclusive teams_ (diverse backgrounds) to identify and address biases early in AI solutions. Such research has influenced organizations to adopt principles and practices for _responsible AI_ to enhance fairness in their products and processes.
**Interested in related research at Microsoft?**
* Explore [Microsoft Research Projects](https://www.microsoft.com/research/research-area/artificial-intelligence/?facet%5Btax%5D%5Bmsr-research-area%5D%5B%5D=13556&facet%5Btax%5D%5Bmsr-content-type%5D%5B%5D=msr-project) on Artificial Intelligence.
* Check out student projects from [Microsoft Research Data Science Summer School](https://www.microsoft.com/en-us/research/academic-program/data-science-summer-school/).
Data Science & Digital Humanities - _Sketchnote by [@nitya](https://twitter.com/nitya)_
Digital Humanities [is defined](https://digitalhumanities.stanford.edu/about-dh-stanford) as "a collection of practices and approaches combining computational methods with humanistic inquiry." [Stanford projects](https://digitalhumanities.stanford.edu/projects) like _"rebooting history"_ and _"poetic thinking"_ highlight the connection between [Digital Humanities and Data Science](https://digitalhumanities.stanford.edu/digital-humanities-and-data-science), showcasing techniques like network analysis, information visualization, spatial analysis, and text analysis to uncover new insights from historical and literary datasets.
*Want to explore or expand a project in this field?*
Check out ["Emily Dickinson and the Meter of Mood"](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671) by [Jen Looper](https://twitter.com/jenlooper). This project examines how data science can reinterpret familiar poetry and reevaluate its meaning and the author's contributions. For example, _can we predict the season in which a poem was written by analyzing its tone or sentiment?_ What does this reveal about the author's mindset during that time?
Check out ["Emily Dickinson and the Meter of Mood"](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671) by [Jen Looper](https://twitter.com/jenlooper), which uses data science to revisit familiar poetry and reinterpret its meaning. For example, _can we predict the season in which a poem was written by analyzing its tone or sentiment?_ What might this reveal about the author's state of mind during that time?
To answer this, follow the steps of the data science lifecycle:
* [`Data Acquisition`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#acquiring-the-dataset) - collect a relevant dataset using APIs (e.g., [Poetry DB API](https://poetrydb.org/index.html)) or web scraping tools (e.g., [Project Gutenberg](https://www.gutenberg.org/files/12242/12242-h/12242-h.htm)).
* [`Data Cleaning`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#clean-the-data) - format and sanitize text using tools like Visual Studio Code and Microsoft Excel.
* [`Data Analysis`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#working-with-the-data-in-a-notebook) - import the dataset into "Notebooks" for analysis using Python packages like pandas, numpy, and matplotlib.
* [`Sentiment Analysis`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#sentiment-analysis-using-cognitive-services) - integrate cloud services like Text Analytics and use low-code tools like [Power Automate](https://flow.microsoft.com/en-us/) for automated workflows.
This workflow allows you to explore seasonal impacts on poem sentiment and develop your own interpretations of the author. Try it out, then extend the notebook to ask new questions or visualize the data differently!
> Use tools from the [Digital Humanities toolkit](https://github.com/Digital-Humanities-Toolkit) to explore further.
## Data Science + Sustainability
Data Science & Sustainability - _Sketchnote by [@nitya](https://twitter.com/nitya)_
The [2030 Agenda For Sustainable Development](https://sdgs.un.org/2030agenda), adopted by all United Nations members in 2015, outlines 17 goals, including those aimed at **Protecting the Planet** from degradation and climate change. The [Microsoft Sustainability](https://www.microsoft.com/en-us/sustainability) initiative supports these goals by leveraging technology to build a more sustainable future, focusing on [four key objectives](https://dev.to/azure/a-visual-guide-to-sustainable-software-engineering-53hh): being carbon negative, water positive, zero waste, and bio-diverse by 2030.
Addressing these challenges requires large-scale data and cloud-based solutions. The [Planetary Computer](https://planetarycomputer.microsoft.com/) initiative offers four components to assist data scientists and developers:
* [Data Catalog](https://planetarycomputer.microsoft.com/catalog) - provides petabytes of Earth Systems data (free and hosted on Azure).
* [Planetary API](https://planetarycomputer.microsoft.com/docs/reference/stac/) - enables users to search for data across space and time.
* [Hub](https://planetarycomputer.microsoft.com/docs/overview/environment/) - a managed environment for processing massive geospatial datasets.
* [Applications](https://planetarycomputer.microsoft.com/applications) - showcases tools and use cases for sustainability insights.
**The Planetary Computer Project is currently in preview (as of Sep 2021)** - here's how you can start contributing to sustainability solutions using data science.
* [Request access](https://planetarycomputer.microsoft.com/account/request) to begin exploring and connect with others in the community.
* [Explore documentation](https://planetarycomputer.microsoft.com/docs/overview/about) to learn about the available datasets and APIs.
* Check out applications like [Ecosystem Monitoring](https://analytics-lab.org/ecosystemmonitoring/) for ideas and inspiration.
Consider how you can use data visualization to highlight or amplify insights related to issues like climate change and deforestation. Or think about how these insights can be leveraged to design new user experiences that encourage behavioral changes for more sustainable living.
## Data Science + Students
We've discussed real-world applications in industry and research, and looked at examples of data science applications in digital humanities and sustainability. So how can you, as beginners in data science, develop your skills and share your expertise?
Here are some examples of student data science projects to inspire you:
* [MSR Data Science Summer School](https://www.microsoft.com/en-us/research/academic-program/data-science-summer-school/#!projects) with GitHub [projects](https://github.com/msr-ds3) exploring topics such as:
- [Racial Bias in Police Use of Force](https://www.microsoft.com/en-us/research/video/data-science-summer-school-2019-replicating-an-empirical-analysis-of-racial-differences-in-police-use-of-force/) | [Github](https://github.com/msr-ds3/stop-question-frisk)
- [Reliability of NYC Subway System](https://www.microsoft.com/en-us/research/video/data-science-summer-school-2018-exploring-the-reliability-of-the-nyc-subway-system/) | [Github](https://github.com/msr-ds3/nyctransit)
* [Digitizing Material Culture: Exploring socio-economic distributions in Sirkap](https://claremont.maps.arcgis.com/apps/Cascade/index.html?appid=bdf2aef0f45a4674ba41cd373fa23afc) - a project by [Ornella Altunyan](https://twitter.com/ornelladotcom) and her team at Claremont, using [ArcGIS StoryMaps](https://storymaps.arcgis.com/).
## 🚀 Challenge
Look for articles that suggest beginner-friendly data science projects, such as [these 50 topic areas](https://www.upgrad.com/blog/data-science-project-ideas-topics-beginners/), [these 21 project ideas](https://www.intellspot.com/data-science-project-ideas), or [these 16 projects with source code](https://data-flair.training/blogs/data-science-project-ideas/) that you can analyze and adapt. Don't forget to blog about your learning experiences and share your insights with the community.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ds/)
## Review & Self Study