Merge pull request #634 from microsoft/update-translations
🌐 Update translations via Co-op Translator
commit c934b074fd (pull/635/head)
@@ -0,0 +1,50 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "2583a9894af7123b2fcae3376b14c035",
"translation_date": "2025-08-31T11:09:34+00:00",
"source_file": "1-Introduction/01-defining-data-science/README.md",
"language_code": "en"
}
-->

We can also analyze the test results to identify which questions are most often answered incorrectly. This could indicate areas where the material might need to be clarified or expanded. Additionally, we could track how students interact with the course content—such as which videos they replay, which sections they skip, or how often they participate in discussions. This data could help us understand how students engage with the material and identify opportunities to make the course more engaging and effective.

By collecting and analyzing this data, we are essentially digitizing the learning process. Once we have this data, we can apply data science techniques to gain insights and make informed decisions about how to improve the course. This is an example of digital transformation in education.

Digital transformation is not limited to education—it can be applied to virtually any industry. For example:

- In **healthcare**, digital transformation might involve using patient data to predict disease outbreaks or personalize treatment plans.
- In **retail**, it could mean analyzing customer purchase data to optimize inventory or create personalized marketing campaigns.
- In **manufacturing**, it might involve using sensor data from machines to predict maintenance needs and reduce downtime.

The key idea is that by digitizing processes and applying data science, businesses can gain valuable insights, improve efficiency, and make better decisions.

You might say this method isn't perfect, as modules can vary in length. It might be more reasonable to divide the time by the module's length (measured in the number of characters) and compare those results instead.
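A quick sketch of that normalization with pandas is shown below. The module names, times, and column names are invented for illustration; they are not part of the course data.

```python
# Illustrative only: normalize time spent on a module by the module's length in characters.
import pandas as pd

modules = pd.DataFrame({
    "module": ["Defining Data Science", "Data Ethics", "Statistics"],
    "time_spent_min": [20, 35, 90],        # hypothetical average completion times
    "length_chars": [8000, 15000, 30000],  # hypothetical module lengths
})

# Minutes per 1,000 characters makes modules of different lengths comparable.
modules["min_per_1000_chars"] = modules["time_spent_min"] / modules["length_chars"] * 1000
print(modules.sort_values("min_per_1000_chars", ascending=False))
```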
When we start analyzing the results of multiple-choice tests, we can try to identify which concepts students struggle to understand and use that information to improve the content. To achieve this, we need to design tests so that each question corresponds to a specific concept or piece of knowledge.

If we want to go a step further, we can compare the time taken for each module with the age category of the students. We might discover that for certain age groups, it takes an unusually long time to complete the module, or that students drop out before finishing it. This can help us provide age-appropriate recommendations for the module and reduce dissatisfaction caused by unmet expectations.

## 🚀 Challenge

In this challenge, we will try to identify concepts relevant to the field of Data Science by analyzing texts. We will take a Wikipedia article on Data Science, download and process the text, and then create a word cloud like this one:


Visit [`notebook.ipynb`](../../../../../../../../../1-Introduction/01-defining-data-science/notebook.ipynb ':ignore') to review the code. You can also run the code and observe how it performs all the data transformations in real time.

> If you are unfamiliar with running code in a Jupyter Notebook, check out [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).

## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/1)

## Assignments

* **Task 1**: Modify the code above to identify related concepts for the fields of **Big Data** and **Machine Learning**.
* **Task 2**: [Think About Data Science Scenarios](assignment.md)

## Credits

This lesson was created with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,46 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "4e0f1773b9bee1be3b28f9fe2c71b3de",
"translation_date": "2025-08-31T11:09:48+00:00",
"source_file": "1-Introduction/01-defining-data-science/assignment.md",
"language_code": "en"
}
-->

# Assignment: Data Science Scenarios

In this first assignment, we ask you to think about some real-life processes or problems in different domains, and how you can improve them using the Data Science process. Consider the following:

1. What data can you collect?
1. How would you collect it?
1. How would you store the data? How large is the data likely to be?
1. What insights might you be able to derive from this data? What decisions could be made based on the data?

Try to think about 3 different problems/processes and describe each of the points above for each domain.

Here are some domains and problems to help you start thinking:

1. How can data be used to improve the education process for children in schools?
1. How can data be used to manage vaccination during a pandemic?
1. How can data be used to ensure productivity at work?

## Instructions

Fill in the following table (replace the suggested domains with your own if needed):

| Problem Domain | Problem | What data to collect | How to store the data | What insights/decisions we can make |
|----------------|---------|-----------------------|-----------------------|--------------------------------------|
| Education | | | | |
| Vaccination | | | | |
| Productivity | | | | |

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
The solution identifies reasonable data sources, methods of storing data, and possible decisions/insights for all domains | Some aspects of the solution lack detail, data storage is not discussed, at least 2 domains are described | Only parts of the data solution are described, only one domain is considered.

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,48 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a8f79b9c0484c35b4f26e8aec7fc4d56",
"translation_date": "2025-08-31T11:09:55+00:00",
"source_file": "1-Introduction/01-defining-data-science/solution/assignment.md",
"language_code": "en"
}
-->

# Assignment: Data Science Scenarios

In this first assignment, we ask you to think about some real-life processes or problems in different domains, and how you can improve them using the Data Science process. Consider the following:

1. What data can you collect?
1. How would you collect it?
1. How would you store the data? How large is the data likely to be?
1. What insights might you be able to derive from this data? What decisions could be made based on the data?

Try to think about three different problems or processes and describe each of the points above for each domain.

Here are some domains and problems to help you start thinking:

1. How can you use data to improve the education process for children in schools?
1. How can you use data to manage vaccination during a pandemic?
1. How can you use data to ensure you are being productive at work?

## Instructions

Fill in the following table (replace the suggested domains with your own if needed):

| Problem Domain | Problem | What data to collect | How to store the data | What insights/decisions we can make |
|----------------|---------|-----------------------|-----------------------|--------------------------------------|
| Education | At universities, lecture attendance is often low, and we hypothesize that students who attend lectures more frequently tend to perform better in exams. We want to encourage attendance and test this hypothesis. | Attendance can be tracked using photos taken by security cameras in classrooms or by tracking the Bluetooth/Wi-Fi addresses of students' mobile phones in class. Exam data is already available in the university database. | If we use security camera images, we need to store a few (5-10) photos taken during class (unstructured data) and then use AI to identify students' faces (convert data to structured form). | We can calculate average attendance for each student and check for correlations with exam grades. We'll discuss correlation further in the [probability and statistics](../../04-stats-and-probability/README.md) section. To encourage attendance, we can publish weekly attendance rankings on the school portal and hold prize draws for students with the highest attendance. |
| Vaccination | | | | |
| Productivity | | | | |

> *We provide just one example answer to give you an idea of what is expected in this assignment.*

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
Reasonable data sources, storage methods, and possible decisions/insights are identified for all domains | Some aspects of the solution lack detail, data storage is not discussed, at least two domains are described | Only parts of the data solution are described, and only one domain is considered.

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,35 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b588c0fc73014f52520c666efc3e0cc3",
"translation_date": "2025-08-31T11:11:32+00:00",
"source_file": "1-Introduction/02-ethics/assignment.md",
"language_code": "en"
}
-->

## Write A Data Ethics Case Study

## Instructions

You've learned about various [Data Ethics Challenges](README.md#2-ethics-challenges) and seen some examples of [Case Studies](README.md#3-case-studies) reflecting data ethics challenges in real-world contexts.

In this assignment, you'll write your own case study reflecting a data ethics challenge from your own experience, or from a relevant real-world context you are familiar with. Just follow these steps:

1. `Pick a Data Ethics Challenge`. Look at [the lesson examples](README.md#2-ethics-challenges) or explore online examples like [the Deon Checklist](https://deon.drivendata.org/examples/) to get inspiration.

2. `Describe a Real World Example`. Think about a situation you have heard of (headlines, research study etc.) or experienced (local community), where this specific challenge occurred. Think about the data ethics questions related to the challenge - and discuss the potential harms or unintended consequences that arise because of this issue. Bonus points: think about potential solutions or processes that may be applied here to help eliminate or mitigate the adverse impact of this challenge.

3. `Provide a Related Resources list`. Share one or more resources (links to an article, a personal blog post or image, online research paper etc.) to prove this was a real-world occurrence. Bonus points: share resources that also showcase the potential harms & consequences from the incident, or highlight positive steps taken to prevent its recurrence.

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
One or more data ethics challenges are identified. <br/> <br/> The case study clearly describes a real-world incident reflecting that challenge, and highlights undesirable consequences or harms it caused. <br/><br/> There is at least one linked resource to prove this occurred. | One data ethics challenge is identified. <br/><br/> At least one relevant harm or consequence is discussed briefly. <br/><br/> However discussion is limited or lacks proof of real-world occurrence. | A data challenge is identified. <br/><br/> However the description or resources do not adequately reflect the challenge or prove its real-world occurrence. |

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,85 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "356d12cffc3125db133a2d27b827a745",
"translation_date": "2025-08-31T11:10:03+00:00",
"source_file": "1-Introduction/03-defining-data/README.md",
"language_code": "en"
}
-->

# Defining Data

|](../../sketchnotes/03-DefiningData.png)|
|:---:|
|Defining Data - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |

Data consists of facts, information, observations, and measurements that are used to make discoveries and support informed decisions. A data point is a single unit of data within a dataset, which is a collection of data points. Datasets can come in various formats and structures, often depending on their source or origin. For instance, a company's monthly earnings might be stored in a spreadsheet, while hourly heart rate data from a smartwatch might be in [JSON](https://stackoverflow.com/a/383699) format. It's common for data scientists to work with different types of data within a dataset.

This lesson focuses on identifying and classifying data based on its characteristics and sources.

## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/4)

## How Data is Described

### Raw Data
Raw data refers to data in its original state, directly from its source, without any analysis or organization. To make sense of a dataset, it needs to be organized into a format that can be understood by humans and the technology used for further analysis. The structure of a dataset describes how it is organized and can be classified as structured, unstructured, or semi-structured. These classifications depend on the source but ultimately fall into one of these three categories.

### Quantitative Data
Quantitative data consists of numerical observations within a dataset that can typically be analyzed, measured, and used mathematically. Examples of quantitative data include a country's population, a person's height, or a company's quarterly earnings. With further analysis, quantitative data can be used to identify seasonal trends in the Air Quality Index (AQI) or estimate the likelihood of rush hour traffic on a typical workday.

### Qualitative Data
Qualitative data, also known as categorical data, cannot be measured objectively like quantitative data. It often consists of subjective information that captures the quality of something, such as a product or process. Sometimes, qualitative data is numerical but not typically used mathematically, like phone numbers or timestamps. Examples of qualitative data include video comments, the make and model of a car, or your closest friends' favorite color. Qualitative data can be used to understand which products consumers prefer or to identify popular keywords in job application resumes.

### Structured Data
Structured data is organized into rows and columns, where each row has the same set of columns. Columns represent specific types of values and are identified by names describing what the values represent, while rows contain the actual data. Columns often have rules or restrictions to ensure the values accurately represent the column. For example, imagine a spreadsheet of customers where each row must include a phone number, and the phone numbers cannot contain alphabetical characters. Rules might be applied to ensure the phone number column is never empty and only contains numbers.

One advantage of structured data is that it can be organized in a way that allows it to relate to other structured data. However, because the data is designed to follow a specific structure, making changes to its overall organization can require significant effort. For instance, adding an email column to the customer spreadsheet that cannot be empty would require figuring out how to populate this column for existing rows.

Examples of structured data: spreadsheets, relational databases, phone numbers, bank statements.

### Unstructured Data
Unstructured data cannot typically be organized into rows or columns and lacks a defined format or set of rules. Because unstructured data has fewer restrictions, it is easier to add new information compared to structured datasets. For example, if a sensor measuring barometric pressure every two minutes receives an update to also record temperature, it wouldn't require altering the existing data if it's unstructured. However, analyzing or investigating unstructured data can take longer. For instance, a scientist trying to calculate the average temperature for the previous month might find that the sensor recorded an "e" in some entries to indicate it was broken, resulting in incomplete data.

Examples of unstructured data: text files, text messages, video files.

### Semi-structured Data
Semi-structured data combines features of both structured and unstructured data. It doesn't typically conform to rows and columns but is organized in a way that is considered structured and may follow a fixed format or set of rules. The structure can vary between sources, ranging from a well-defined hierarchy to something more flexible that allows for easy integration of new information. Metadata provides indicators for how the data is organized and stored, with various names depending on the type of data. Common names for metadata include tags, elements, entities, and attributes. For example, a typical email message includes a subject, body, and recipients, and can be organized by sender or date.

Examples of semi-structured data: HTML, CSV files, JavaScript Object Notation (JSON).
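To make the distinction concrete, here is a small illustrative sketch (the readings and field names are invented for the example): the same heart-rate data shown as semi-structured JSON, then flattened into a structured table with pandas.

```python
# Illustrative only: semi-structured JSON records flattened into structured rows and columns.
import json
import pandas as pd

semi_structured = """
[
  {"time": "2025-08-31T11:00:00Z", "heart_rate": 72, "device": {"model": "watch-x"}},
  {"time": "2025-08-31T11:01:00Z", "heart_rate": 75}
]
"""

records = json.loads(semi_structured)   # nested fields, and some fields may be missing
table = pd.json_normalize(records)      # structured: one row per reading, one column per field
print(table)
```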
## Sources of Data

A data source refers to the original location where the data was generated or "lives," and it varies based on how and when it was collected. Data generated by its user(s) is known as primary data, while secondary data comes from a source that has collected data for general use. For example, observations collected by scientists in a rainforest would be considered primary data, and if they share that data with other scientists, it becomes secondary data for those users.

Databases are a common data source and rely on a database management system to host and maintain the data. Users explore the data using commands called queries. Files can also serve as data sources, including audio, image, and video files, as well as spreadsheets like Excel. The internet is another common location for hosting data, where both databases and files can be found. Application programming interfaces (APIs) allow programmers to create ways to share data with external users over the internet, while web scraping extracts data from web pages. The [lessons in Working with Data](../../../../../../../../../2-Working-With-Data) focus on how to use various data sources.
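As a small illustration of the API route, the sketch below requests JSON from a placeholder endpoint and loads it into a dataframe. The URL and fields are hypothetical, not a real service used by these lessons.

```python
# Hypothetical sketch: pull JSON records from a web API and turn them into a table for analysis.
import pandas as pd
import requests

response = requests.get("https://example.com/api/air-quality?city=Minneapolis")  # placeholder URL
response.raise_for_status()          # stop early if the request failed
readings = response.json()           # expected to be a list of JSON records
df = pd.DataFrame(readings)
print(df.head())
```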
## Conclusion

In this lesson, we have learned:

- What data is
- How data is described
- How data is classified and categorized
- Where data can be found

## 🚀 Challenge

Kaggle is an excellent source of open datasets. Use the [dataset search tool](https://www.kaggle.com/datasets) to find some interesting datasets and classify 3-5 datasets using the following criteria:

- Is the data quantitative or qualitative?
- Is the data structured, unstructured, or semi-structured?

## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/5)

## Review & Self Study

- This Microsoft Learn unit, titled [Classify your Data](https://docs.microsoft.com/en-us/learn/modules/choose-storage-approach-in-azure/2-classify-data), provides a detailed breakdown of structured, semi-structured, and unstructured data.

## Assignment

[Classifying Datasets](assignment.md)

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,79 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "2e5cacb967c1e9dfd07809bfc441a0b4",
"translation_date": "2025-08-31T11:10:19+00:00",
"source_file": "1-Introduction/03-defining-data/assignment.md",
"language_code": "en"
}
-->

# Classifying Datasets

## Instructions

Follow the prompts in this assignment to identify and classify each dataset with one of each of the following data types:

**Structure Types**: Structured, Semi-Structured, or Unstructured

**Value Types**: Qualitative or Quantitative

**Source Types**: Primary or Secondary

1. A company has been acquired and now has a parent company. The data scientists have received a spreadsheet of customer phone numbers from the parent company.

Structure Type:

Value Type:

Source Type:

---

2. A smart watch has been collecting heart rate data from its wearer, and the raw data is in JSON format.

Structure Type:

Value Type:

Source Type:

---

3. A workplace survey of employee morale that is stored in a CSV file.

Structure Type:

Value Type:

Source Type:

---

4. Astrophysicists are accessing a database of galaxies that has been collected by a space probe. The data contains the number of planets within each galaxy.

Structure Type:

Value Type:

Source Type:

---

5. A personal finance app uses APIs to connect to a user's financial accounts in order to calculate their net worth. They can see all of their transactions in a format of rows and columns that looks similar to a spreadsheet.

Structure Type:

Value Type:

Source Type:

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
Correctly identifies all structure, value, and source types | Correctly identifies 3 of the structure, value, and source types | Correctly identifies 2 or fewer of the structure, value, and source types |

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,40 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "01d1b493e8b51a6ebb42524f6b1bcfff",
"translation_date": "2025-08-31T11:09:12+00:00",
"source_file": "1-Introduction/04-stats-and-probability/assignment.md",
"language_code": "en"
}
-->

# Small Diabetes Study

In this assignment, we will work with a small dataset of diabetes patients taken from [here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html).

| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|-----|-----|-----|----|----|----|----|----|----|----|----|
| 0 | 59 | 2 | 32.1 | 101. | 157 | 93.2 | 38.0 | 4. | 4.8598 | 87 | 151 |
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70. | 3. | 3.8918 | 69 | 75 |
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4. | 85 | 141 |
| ... | ... | ... | ... | ...| ...| ...| ...| ...| ...| ...| ... |

## Instructions

* Open the [assignment notebook](../../../../1-Introduction/04-stats-and-probability/assignment.ipynb) in a Jupyter notebook environment
* Complete all tasks listed in the notebook, namely:
  * [ ] Calculate the mean values and variance for all variables
  * [ ] Create boxplots for BMI, BP, and Y based on gender
  * [ ] Analyze the distribution of Age, Sex, BMI, and Y variables
  * [ ] Examine the correlation between different variables and disease progression (Y)
  * [ ] Test the hypothesis that the progression of diabetes differs between men and women
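If you want a running start inside the notebook, the sketch below covers the first few tasks. It assumes the tab-separated data file linked from the dataset page above and standard pandas/SciPy; treat it as a starting point, not the full solution.

```python
# Starter sketch for the first tasks (not the complete assignment).
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Tab-separated file linked from the dataset page referenced above (assumption).
df = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep="\t")

print(df.mean())   # mean of every variable
print(df.var())    # variance of every variable

# Boxplot of BMI split by gender (SEX is coded as 1 and 2 in this dataset).
df.boxplot(column="BMI", by="SEX")
plt.show()

# Correlation of each variable with disease progression Y.
print(df.corr()["Y"].sort_values(ascending=False))

# Two-sample t-test: does progression (Y) differ between the two SEX groups?
group1 = df[df["SEX"] == 1]["Y"]
group2 = df[df["SEX"] == 2]["Y"]
print(stats.ttest_ind(group1, group2, equal_var=False))
```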
## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
All required tasks are completed, visually represented, and explained | Most tasks are completed, but explanations or insights from graphs and/or calculated values are missing | Only basic tasks like calculating mean/variance and creating simple plots are completed, with no conclusions drawn from the data

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,31 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "696a8474a01054281704cbfb09148949",
"translation_date": "2025-08-31T11:08:01+00:00",
"source_file": "1-Introduction/README.md",
"language_code": "en"
}
-->

# Introduction to Data Science


> Photo by <a href="https://unsplash.com/@dawson2406?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Stephen Dawson</a> on <a href="https://unsplash.com/s/photos/data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

In these lessons, you will explore what Data Science is and learn about the ethical responsibilities that a data scientist must keep in mind. You will also understand what data is and get an introduction to statistics and probability, which are the foundational academic fields of Data Science.

### Topics

1. [Defining Data Science](01-defining-data-science/README.md)
2. [Data Science Ethics](02-ethics/README.md)
3. [Defining Data](03-defining-data/README.md)
4. [Introduction to Statistics and Probability](04-stats-and-probability/README.md)

### Credits

These lessons were created with ❤️ by [Nitya Narasimhan](https://twitter.com/nitya) and [Dmitry Soshnikov](https://twitter.com/shwars).

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,73 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "2f2d7693f28e4b2675f275e489dc5aac",
"translation_date": "2025-08-31T10:59:00+00:00",
"source_file": "2-Working-With-Data/05-relational-databases/assignment.md",
"language_code": "en"
}
-->

# Displaying airport data

You have been provided a [database](https://raw.githubusercontent.com/Microsoft/Data-Science-For-Beginners/main/2-Working-With-Data/05-relational-databases/airports.db) built on [SQLite](https://sqlite.org/index.html) which contains information about airports. The schema is shown below. You will use the [SQLite extension](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-77958-bethanycheum) in [Visual Studio Code](https://code.visualstudio.com?WT.mc_id=academic-77958-bethanycheum) to display information about airports in various cities.

## Instructions

To begin the assignment, you'll need to complete a few steps. This involves installing some tools and downloading the sample database.

### Set up your system

You can use Visual Studio Code and the SQLite extension to interact with the database.

1. Go to [code.visualstudio.com](https://code.visualstudio.com?WT.mc_id=academic-77958-bethanycheum) and follow the instructions to install Visual Studio Code.
1. Install the [SQLite extension](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-77958-bethanycheum) as described on the Marketplace page.

### Download and open the database

Next, download and open the database.

1. Download the [database file from GitHub](https://raw.githubusercontent.com/Microsoft/Data-Science-For-Beginners/main/2-Working-With-Data/05-relational-databases/airports.db) and save it to a folder.
1. Open Visual Studio Code.
1. Open the database in the SQLite extension by pressing **Ctrl-Shift-P** (or **Cmd-Shift-P** on a Mac) and typing `SQLite: Open database`.
1. Select **Choose database from file** and open the **airports.db** file you downloaded earlier.
1. After opening the database (you won't see any visible changes on the screen), create a new query window by pressing **Ctrl-Shift-P** (or **Cmd-Shift-P** on a Mac) and typing `SQLite: New query`.

Once the query window is open, you can use it to execute SQL statements against the database. Use the command **Ctrl-Shift-Q** (or **Cmd-Shift-Q** on a Mac) to run queries on the database.
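If you prefer to sanity-check things from Python rather than the extension, a minimal `sqlite3` sketch (assuming the file was saved as `airports.db` in your working folder) looks like this:

```python
# Optional: inspect the database from Python instead of the VS Code SQLite extension.
import sqlite3

conn = sqlite3.connect("airports.db")   # path to the file downloaded above
cursor = conn.cursor()

# List the tables defined in the database; run any other SQL statement the same way.
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
print(cursor.fetchall())

conn.close()
```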
> [!NOTE]
> For more details about the SQLite extension, refer to the [documentation](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-77958-bethanycheum).

## Database schema

A database's schema defines its table design and structure. The **airports** database contains two tables: `cities`, which lists cities in the United Kingdom and Ireland, and `airports`, which lists all airports. Since some cities may have multiple airports, two separate tables were created to store this information. In this exercise, you will use joins to display data for various cities.

| Cities           |
| ---------------- |
| id (PK, integer) |
| city (text)      |
| country (text)   |

| Airports                         |
| -------------------------------- |
| id (PK, integer)                 |
| name (text)                      |
| code (text)                      |
| city_id (FK to id in **Cities**) |

## Assignment

Write queries to retrieve the following information:

1. All city names in the `Cities` table.
1. All cities in Ireland from the `Cities` table.
1. All airport names along with their city and country.
1. All airports located in London, United Kingdom.

## Rubric

| Exemplary | Adequate | Needs Improvement |
| --------- | -------- | ----------------- |

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,33 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "f824bfdb8b12d33293913f76f5c787c5",
"translation_date": "2025-08-31T10:57:41+00:00",
"source_file": "2-Working-With-Data/06-non-relational/assignment.md",
"language_code": "en"
}
-->

# Soda Profits

## Instructions

The [Coca Cola Co spreadsheet](../../../../2-Working-With-Data/06-non-relational/CocaColaCo.xlsx) is missing some calculations. Your task is to:

1. Calculate the Gross profits for FY '15, '16, '17, and '18
   - Gross Profit = Net Operating revenues - Cost of goods sold
1. Calculate the average of all the gross profits. Try to do this using a function.
   - Average = Sum of gross profits divided by the number of fiscal years (10)
   - Documentation on the [AVERAGE function](https://support.microsoft.com/en-us/office/average-function-047bac88-d466-426c-a32b-8f33eb960cf6)
1. This is an Excel file, but it should be editable in any spreadsheet platform

[Data source credit to Yiyi Wang](https://www.kaggle.com/yiyiwang0826/cocacola-excel)

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | --- |

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,37 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "dc8f035ce92e4eaa078ab19caa68267a",
"translation_date": "2025-08-31T10:58:32+00:00",
"source_file": "2-Working-With-Data/07-python/assignment.md",
"language_code": "en"
}
-->

# Assignment for Data Processing in Python

In this assignment, we will ask you to expand upon the code we started developing in our challenges. The assignment consists of two parts:

## COVID-19 Spread Modeling

- [ ] Plot *R* graphs for 5-6 different countries on one plot for comparison, or using several plots side-by-side.
- [ ] Analyze how the number of deaths and recoveries correlates with the number of infected cases.
- [ ] Determine how long a typical disease lasts by visually correlating infection rates and death rates, and identifying any anomalies. You may need to examine data from different countries to figure this out.
- [ ] Calculate the fatality rate and observe how it changes over time. *You may want to account for the duration of the disease in days to shift one time series before performing calculations.*

## COVID-19 Papers Analysis

- [ ] Build a co-occurrence matrix for different medications and identify which medications are frequently mentioned together (i.e., in the same abstract). You can adapt the code for building a co-occurrence matrix for medications and diagnoses.
- [ ] Visualize this matrix using a heatmap.
- [ ] As a stretch goal, visualize the co-occurrence of medications using [chord diagram](https://en.wikipedia.org/wiki/Chord_diagram). [This library](https://pypi.org/project/chord/) may assist you in creating a chord diagram.
- [ ] As another stretch goal, extract dosages of different medications (e.g., **400mg** in *take 400mg of chloroquine daily*) using regular expressions, and build a dataframe that displays various dosages for different medications. **Note**: Consider numeric values that are located near the medication name in the text.
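For the dosage-extraction stretch goal, a rough starting sketch is shown below. The sample abstracts and medication list are invented placeholders for the data assembled in the challenges, and the pattern ignores the "near the medication name" refinement mentioned in the note above.

```python
# Rough sketch of dosage extraction with regular expressions (not the full solution).
import re
import pandas as pd

medications = ["chloroquine", "hydroxychloroquine", "remdesivir"]   # placeholder list
abstracts = [                                                        # placeholder abstracts
    "Patients were advised to take 400mg of chloroquine daily.",
    "A 200 mg dose of hydroxychloroquine was administered twice a day.",
]

dose_pattern = re.compile(r"\d+\s?mg", re.IGNORECASE)
rows = []
for text in abstracts:
    for med in medications:
        if med in text.lower():
            for dose in dose_pattern.findall(text):
                rows.append({"medication": med, "dosage": dose.replace(" ", "")})

print(pd.DataFrame(rows))
```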
## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | --- |
All tasks are completed, graphically illustrated, and explained, including at least one of the two stretch goals | More than 5 tasks are completed, no stretch goals are attempted, or the results are unclear | Fewer than 5 (but more than 3) tasks are completed, and visualizations do not effectively demonstrate the point

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,28 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "f9d5a7275e046223fa6474477674b810",
"translation_date": "2025-08-31T10:59:43+00:00",
"source_file": "2-Working-With-Data/08-data-preparation/assignment.md",
"language_code": "en"
}
-->

# Evaluating Data from a Form

A client has been testing a [small form](../../../../2-Working-With-Data/08-data-preparation/index.html) to collect some basic information about their customer base. They have shared their findings with you to validate the data they have gathered. You can open the `index.html` page in your browser to review the form.

You have been provided with a [dataset of csv records](../../../../data/form.csv) containing entries from the form, along with some basic visualizations. The client has noted that some of the visualizations appear incorrect, but they are unsure how to fix them. You can explore this further in the [assignment notebook](../../../../2-Working-With-Data/08-data-preparation/assignment.ipynb).

## Instructions

Use the techniques covered in this lesson to provide recommendations for improving the form so that it collects accurate and consistent information.

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | --- |

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@@ -0,0 +1,31 @@

<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "abc3309ab41bc5a7846f70ee1a055838",
"translation_date": "2025-08-31T10:57:11+00:00",
"source_file": "2-Working-With-Data/README.md",
"language_code": "en"
}
-->

# Working with Data


> Photo by <a href="https://unsplash.com/@swimstaralex?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Alexander Sinn</a> on <a href="https://unsplash.com/s/photos/data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

In these lessons, you will explore various methods for managing, manipulating, and utilizing data in applications. You'll dive into relational and non-relational databases and learn how data is stored within them. Additionally, you'll gain foundational knowledge of using Python to handle data and uncover numerous ways to leverage Python for data management and analysis.

### Topics

1. [Relational databases](05-relational-databases/README.md)
2. [Non-relational databases](06-non-relational/README.md)
3. [Working with Python](07-python/README.md)
4. [Preparing data](08-data-preparation/README.md)

### Credits

These lessons were created with ❤️ by [Christopher Harrison](https://twitter.com/geektrainer), [Dmitry Soshnikov](https://twitter.com/shwars), and [Jasmine Greenaway](https://twitter.com/paladique)

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,222 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "43c402d9d90ae6da55d004519ada5033",
|
||||||
|
"translation_date": "2025-08-31T11:05:55+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/09-visualization-quantities/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Visualizing Quantities
|
||||||
|
|
||||||
|
| ](../../sketchnotes/09-Visualizing-Quantities.png)|
|
||||||
|
|:---:|
|
||||||
|
| Visualizing Quantities - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
|
||||||
|
|
||||||
|
In this lesson, you'll learn how to use one of the many Python libraries available to create engaging visualizations focused on the concept of quantity. Using a cleaned dataset about the birds of Minnesota, you'll uncover fascinating insights about local wildlife.
|
||||||
|
|
||||||
|
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/16)
|
||||||
|
|
||||||
|
## Observe wingspan with Matplotlib
|
||||||
|
|
||||||
|
[Matplotlib](https://matplotlib.org/stable/index.html) is an excellent library for creating both simple and complex plots and charts of various types. Generally, the process of plotting data with these libraries involves identifying the parts of your dataframe to target, performing any necessary transformations, assigning x and y axis values, choosing the type of plot, and displaying the plot. Matplotlib offers a wide range of visualizations, but for this lesson, we'll focus on those best suited for visualizing quantities: line charts, scatterplots, and bar plots.
|
||||||
|
|
||||||
|
> ✅ Choose the chart type that best fits your data structure and the story you want to tell.
|
||||||
|
> - To analyze trends over time: line
|
||||||
|
> - To compare values: bar, column, pie, scatterplot
|
||||||
|
> - To show how parts relate to a whole: pie
|
||||||
|
> - To show data distribution: scatterplot, bar
|
||||||
|
> - To show trends: line, column
|
||||||
|
> - To show relationships between values: line, scatterplot, bubble
|
||||||
|
|
||||||
|
If you have a dataset and need to determine how much of a specific item is included, one of your first tasks will be to inspect its values.
|
||||||
|
|
||||||
|
✅ There are excellent 'cheat sheets' for Matplotlib available [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
|
||||||
|
|
||||||
|
## Build a line plot about bird wingspan values
|
||||||
|
|
||||||
|
Open the `notebook.ipynb` file located at the root of this lesson folder and add a cell.
|
||||||
|
|
||||||
|
> Note: The data is stored in the root of this repository in the `/data` folder.
|
||||||
|
|
||||||
|
```python
|
||||||
|
import pandas as pd
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
birds = pd.read_csv('../../data/birds.csv')
|
||||||
|
birds.head()
|
||||||
|
```
|
||||||
|
This data contains a mix of text and numbers:
|
||||||
|
|
||||||
|
| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
|
||||||
|
| ---: | :--------------------------- | :--------------------- | :-------------------- | :----------- | :------- | :---------- | :----------------- | --------: | --------: | ----------: | ----------: | ----------: | ----------: |
|
||||||
|
| 0 | Black-bellied whistling-duck | Dendrocygna autumnalis | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 47 | 56 | 652 | 1020 | 76 | 94 |
|
||||||
|
| 1 | Fulvous whistling-duck | Dendrocygna bicolor | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 45 | 53 | 712 | 1050 | 85 | 93 |
|
||||||
|
| 2 | Snow goose | Anser caerulescens | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 79 | 2050 | 4050 | 135 | 165 |
|
||||||
|
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64 | 1066 | 1567 | 113 | 116 |
|
||||||
|
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 81 | 1930 | 3310 | 130 | 165 |
|
||||||
|
|
||||||
|
Let's start by plotting some of the numeric data using a basic line plot. Suppose you wanted to visualize the maximum wingspan of these fascinating birds.
|
||||||
|
|
||||||
|
```python
|
||||||
|
wingspan = birds['MaxWingspan']
|
||||||
|
wingspan.plot()
|
||||||
|
```
|
||||||
|

|
||||||
|
|
||||||
|
What stands out immediately? There seems to be at least one outlier—what a wingspan! A 2300-centimeter wingspan equals 23 meters—are there Pterodactyls in Minnesota? Let's investigate.
|
||||||
|
|
||||||
|
While you could quickly sort the data in Excel to find these outliers (likely typos), continue the visualization process by working directly within the plot.
|
||||||
|
|
||||||
|
Add labels to the x-axis to show the types of birds in question:
|
||||||
|
|
||||||
|
```
|
||||||
|
plt.title('Max Wingspan in Centimeters')
|
||||||
|
plt.ylabel('Wingspan (CM)')
|
||||||
|
plt.xlabel('Birds')
|
||||||
|
plt.xticks(rotation=45)
|
||||||
|
x = birds['Name']
|
||||||
|
y = birds['MaxWingspan']
|
||||||
|
|
||||||
|
plt.plot(x, y)
|
||||||
|
|
||||||
|
plt.show()
|
||||||
|
```
|
||||||
|

|
||||||
|
|
||||||
|
Even with the labels rotated 45 degrees, there are too many to read. Let's try a different approach: label only the outliers and set the labels within the chart. You can use a scatter chart to make room for the labeling:
|
||||||
|
|
||||||
|
```python
|
||||||
|
plt.title('Max Wingspan in Centimeters')
|
||||||
|
plt.ylabel('Wingspan (CM)')
|
||||||
|
plt.tick_params(axis='both',which='both',labelbottom=False,bottom=False)
|
||||||
|
|
||||||
|
for i in range(len(birds)):
|
||||||
|
x = birds['Name'][i]
|
||||||
|
y = birds['MaxWingspan'][i]
|
||||||
|
plt.plot(x, y, 'bo')
|
||||||
|
if birds['MaxWingspan'][i] > 500:
|
||||||
|
plt.text(x, y * (1 - 0.05), birds['Name'][i], fontsize=12)
|
||||||
|
|
||||||
|
plt.show()
|
||||||
|
```
|
||||||
|
What's happening here? You used `tick_params` to hide the bottom labels and then created a loop over your birds dataset. By plotting the chart with small round blue dots using `bo`, you checked for any bird with a maximum wingspan over 500 and displayed its label next to the dot. You offset the labels slightly on the y-axis (`y * (1 - 0.05)`) and used the bird name as the label.
|
||||||
|
|
||||||
|
What did you discover?
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Filter your data
|
||||||
|
|
||||||
|
Both the Bald Eagle and the Prairie Falcon, while likely large birds, appear to be mislabeled with an extra `0` added to their maximum wingspan. It's unlikely you'll encounter a Bald Eagle with a 25-meter wingspan, but if you do, let us know! Let's create a new dataframe without these two outliers:
|
||||||
|
|
||||||
|
```python
|
||||||
|
plt.title('Max Wingspan in Centimeters')
|
||||||
|
plt.ylabel('Wingspan (CM)')
|
||||||
|
plt.xlabel('Birds')
|
||||||
|
plt.tick_params(axis='both',which='both',labelbottom=False,bottom=False)
|
||||||
|
for i in range(len(birds)):
|
||||||
|
x = birds['Name'][i]
|
||||||
|
y = birds['MaxWingspan'][i]
|
||||||
|
if birds['Name'][i] not in ['Bald eagle', 'Prairie falcon']:
|
||||||
|
plt.plot(x, y, 'bo')
|
||||||
|
plt.show()
|
||||||
|
```
|
||||||
|
|
||||||
|
By filtering out outliers, your data becomes more cohesive and easier to understand.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Now that we have a cleaner dataset, at least in terms of wingspan, let's explore more about these birds.
|
||||||
|
|
||||||
|
While line and scatter plots can display information about data values and their distributions, we want to focus on the quantities inherent in this dataset. You could create visualizations to answer questions like:
|
||||||
|
|
||||||
|
> How many categories of birds are there, and what are their counts?
|
||||||
|
> How many birds are extinct, endangered, rare, or common?
|
||||||
|
> How many birds belong to various genera and orders in Linnaeus's classification?
|
||||||
|
|
||||||
|
## Explore bar charts
|
||||||
|
|
||||||
|
Bar charts are useful for showing groupings of data. Let's explore the bird categories in this dataset to see which is the most common.
|
||||||
|
|
||||||
|
In the notebook file, create a basic bar chart.
|
||||||
|
|
||||||
|
✅ Note: You can either filter out the two outlier birds identified earlier, correct the typo in their wingspan, or leave them in for these exercises, which don't depend on wingspan values.
|
||||||
|
|
||||||
|
To create a bar chart, select the data you want to focus on. Bar charts can be created from raw data:
|
||||||
|
|
||||||
|
```python
|
||||||
|
birds.plot(x='Category',
|
||||||
|
kind='bar',
|
||||||
|
stacked=True,
|
||||||
|
title='Birds of Minnesota')
|
||||||
|
|
||||||
|
```
|
||||||
|

|
||||||
|
|
||||||
|
This bar chart, however, is unreadable due to too much ungrouped data. You need to select only the data you want to plot, so let's examine the length of birds based on their category.
|
||||||
|
|
||||||
|
Filter your data to include only the bird's category.
|
||||||
|
|
||||||
|
✅ Notice how you use Pandas to manage the data and let Matplotlib handle the charting.
|
||||||
|
|
||||||
|
Since there are many categories, display this chart vertically and adjust its height to accommodate all the data:
|
||||||
|
|
||||||
|
```python
|
||||||
|
category_count = birds.value_counts(birds['Category'].values, sort=True)
|
||||||
|
plt.rcParams['figure.figsize'] = [6, 12]
|
||||||
|
category_count.plot.barh()
|
||||||
|
```
|
||||||
|

|
||||||
|
|
||||||
|
This bar chart provides a clear view of the number of birds in each category. At a glance, you can see that the largest number of birds in this region belong to the Ducks/Geese/Waterfowl category. Given Minnesota's nickname as the 'land of 10,000 lakes,' this isn't surprising!
|
||||||
|
|
||||||
|
✅ Try counting other aspects of this dataset. Does anything surprise you?
|
||||||
|
|
||||||
|
## Comparing data
|
||||||
|
|
||||||
|
You can compare grouped data by creating new axes. Try comparing the MaxLength of birds based on their category:
|
||||||
|
|
||||||
|
```python
maxlength = birds['MaxLength']
plt.barh(y=birds['Category'], width=maxlength)
plt.rcParams['figure.figsize'] = [6, 12]
plt.show()
```
|
||||||
|

|
||||||
|
|
||||||
|
Nothing surprising here: hummingbirds have the smallest MaxLength compared to pelicans or geese. It's reassuring when data aligns with logic!
|
||||||
|
|
||||||
|
You can create more engaging bar chart visualizations by overlaying data. Let's overlay Minimum and Maximum Length for each bird category:
|
||||||
|
|
||||||
|
```python
minLength = birds['MinLength']
maxLength = birds['MaxLength']
category = birds['Category']

plt.barh(category, maxLength)
plt.barh(category, minLength)

plt.show()
```
|
||||||
|
In this plot, you can see the range of Minimum and Maximum Length for each bird category. You can confidently say that, based on this data, larger birds tend to have a wider length range. Fascinating!
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## 🚀 Challenge
|
||||||
|
|
||||||
|
This bird dataset offers a wealth of information about different bird types within a specific ecosystem. Search online for other bird-related datasets. Practice building charts and graphs to uncover facts you didn't know.
|
||||||
|
|
||||||
|
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/17)
|
||||||
|
|
||||||
|
## Review & Self Study
|
||||||
|
|
||||||
|
This lesson introduced you to using Matplotlib for visualizing quantities. Research other ways to work with datasets for visualization. [Plotly](https://github.com/plotly/plotly.py) is one library we won't cover in these lessons, so explore what it can offer.
|
||||||
|
|
||||||
|
## Assignment
|
||||||
|
|
||||||
|
[Lines, Scatters, and Bars](assignment.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "ad163c4fda72c8278280b61cad317ff4",
|
||||||
|
"translation_date": "2025-08-31T11:06:16+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/09-visualization-quantities/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Lines, Scatters and Bars
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
In this lesson, you explored line charts, scatterplots, and bar charts to highlight interesting insights from the dataset. For this assignment, delve deeper into the dataset to uncover a fact about a specific type of bird. For instance, create a notebook that visualizes all the fascinating data you can find about Snow Geese. Use the three types of plots mentioned above to craft a compelling narrative in your notebook.
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
Exemplary | Adequate | Needs Improvement
|
||||||
|
--- | --- | --- |
|
||||||
|
The notebook includes clear annotations, a strong narrative, and visually appealing graphs | The notebook lacks one of these elements | The notebook lacks two of these elements
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "40eeb9b9f94009c537c7811f9f27f037",
|
||||||
|
"translation_date": "2025-08-31T11:07:55+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/10-visualization-distributions/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Apply your skills
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
Up to this point, you have worked with the Minnesota birds dataset to uncover information about bird numbers and population density. Now, practice applying these techniques by exploring a different dataset, perhaps one from [Kaggle](https://www.kaggle.com/). Create a notebook that tells a story about this dataset, and be sure to include histograms in your analysis.
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
Exemplary | Adequate | Needs Improvement
|
||||||
|
--- | --- | --- |
|
||||||
|
A notebook is provided with annotations about the dataset, including its source, and uses at least 5 histograms to uncover insights about the data. | A notebook is provided with incomplete annotations or contains errors. | A notebook is provided without annotations and contains errors.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,200 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "af6a12015c6e250e500b570a9fa42593",
|
||||||
|
"translation_date": "2025-08-31T11:05:02+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/11-visualization-proportions/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Visualizing Proportions
|
||||||
|
|
||||||
|
| ](../../sketchnotes/11-Visualizing-Proportions.png)|
|
||||||
|
|:---:|
|
||||||
|
|Visualizing Proportions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
|
||||||
|
|
||||||
|
In this lesson, you'll work with a nature-focused dataset to visualize proportions, such as the distribution of different types of fungi in a dataset about mushrooms. We'll dive into these fascinating fungi using a dataset from Audubon that provides details about 23 species of gilled mushrooms in the Agaricus and Lepiota families. You'll experiment with fun visualizations like:
|
||||||
|
|
||||||
|
- Pie charts 🥧
|
||||||
|
- Donut charts 🍩
|
||||||
|
- Waffle charts 🧇
|
||||||
|
|
||||||
|
> 💡 Microsoft Research has an interesting project called [Charticulator](https://charticulator.com), which offers a free drag-and-drop interface for creating data visualizations. One of their tutorials uses this mushroom dataset! You can explore the data and learn the library simultaneously: [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html).
|
||||||
|
|
||||||
|
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/20)
|
||||||
|
|
||||||
|
## Get to know your mushrooms 🍄
|
||||||
|
|
||||||
|
Mushrooms are fascinating organisms. Let's import a dataset to study them:
|
||||||
|
|
||||||
|
```python
import pandas as pd
import matplotlib.pyplot as plt
mushrooms = pd.read_csv('../../data/mushrooms.csv')
mushrooms.head()
```
|
||||||
|
A table is displayed with some great data for analysis:
|
||||||
|
|
||||||
|
| class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|
||||||
|
| --------- | --------- | ----------- | --------- | ------- | ------- | --------------- | ------------ | --------- | ---------- | ----------- | ---------- | ------------------------ | ------------------------ | ---------------------- | ---------------------- | --------- | ---------- | ----------- | --------- | ----------------- | ---------- | ------- |
|
||||||
|
| Poisonous | Convex | Smooth | Brown | Bruises | Pungent | Free | Close | Narrow | Black | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
|
||||||
|
| Edible | Convex | Smooth | Yellow | Bruises | Almond | Free | Close | Broad | Black | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Grasses |
|
||||||
|
| Edible | Bell | Smooth | White | Bruises | Anise | Free | Close | Broad | Brown | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Meadows |
|
||||||
|
| Poisonous | Convex | Scaly | White | Bruises | Pungent | Free | Close | Narrow | Brown | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
|
||||||
|
|
||||||
|
You'll notice that all the data is textual. To use it in a chart, you'll need to convert it. Most of the data is represented as an object:
|
||||||
|
|
||||||
|
```python
|
||||||
|
print(mushrooms.select_dtypes(["object"]).columns)
|
||||||
|
```
|
||||||
|
|
||||||
|
The output is:
|
||||||
|
|
||||||
|
```output
|
||||||
|
Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
|
||||||
|
'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
|
||||||
|
'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
|
||||||
|
'stalk-surface-below-ring', 'stalk-color-above-ring',
|
||||||
|
'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
|
||||||
|
'ring-type', 'spore-print-color', 'population', 'habitat'],
|
||||||
|
dtype='object')
|
||||||
|
```
|
||||||
|
Convert the object-typed columns, including 'class', into categorical data:
|
||||||
|
|
||||||
|
```python
|
||||||
|
cols = mushrooms.select_dtypes(["object"]).columns
|
||||||
|
mushrooms[cols] = mushrooms[cols].astype('category')
|
||||||
|
```
|
||||||
|
|
||||||
|
```python
|
||||||
|
edibleclass=mushrooms.groupby(['class']).count()
|
||||||
|
edibleclass
|
||||||
|
```
|
||||||
|
|
||||||
|
Now, if you print `edibleclass`, you'll see the counts grouped by the poisonous/edible class:
|
||||||
|
|
||||||
|
| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|
||||||
|
| --------- | --------- | ----------- | --------- | ------- | ---- | --------------- | ------------ | --------- | ---------- | ----------- | --- | ------------------------ | ---------------------- | ---------------------- | --------- | ---------- | ----------- | --------- | ----------------- | ---------- | ------- |
|
||||||
|
| class | | | | | | | | | | | | | | | | | | | | | |
|
||||||
|
| Edible | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | ... | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 |
|
||||||
|
| Poisonous | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | ... | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 |
|
||||||
|
|
||||||
|
Using the order in this table to create your class category labels, you can build a pie chart:
|
||||||
|
|
||||||
|
## Pie!
|
||||||
|
|
||||||
|
```python
labels=['Edible','Poisonous']
plt.pie(edibleclass['population'],labels=labels,autopct='%.1f %%')
plt.title('Edible?')
plt.show()
```
|
||||||
|
And voilà, a pie chart showing the proportions of the two mushroom classes. It's crucial to get the label order correct, so double-check the array when building the labels!
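A quick sanity check is to read the labels straight from the grouped dataframe instead of typing them by hand. This is a small sketch that assumes the `edibleclass` dataframe built above:

```python
# The index order of the grouped dataframe is the order plt.pie() will use
labels = list(edibleclass.index)
print(labels)  # expected: ['Edible', 'Poisonous']
```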
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Donuts!
|
||||||
|
|
||||||
|
A donut chart is a visually appealing variation of a pie chart, with a hole in the center. Let's use this method to explore the habitats where mushrooms grow:
|
||||||
|
|
||||||
|
```python
|
||||||
|
habitat=mushrooms.groupby(['habitat']).count()
|
||||||
|
habitat
|
||||||
|
```
|
||||||
|
Group the data by habitat. There are seven listed habitats, so use them as labels for your donut chart:
|
||||||
|
|
||||||
|
```python
labels=['Grasses','Leaves','Meadows','Paths','Urban','Waste','Wood']

plt.pie(habitat['class'], labels=labels,
        autopct='%1.1f%%', pctdistance=0.85)

center_circle = plt.Circle((0, 0), 0.40, fc='white')
fig = plt.gcf()
fig.gca().add_artist(center_circle)

plt.title('Mushroom Habitats')

plt.show()
```
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This code draws the chart and a center circle, then adds the circle to the chart. You can adjust the width of the center circle by changing `0.40` to another value.
|
||||||
|
|
||||||
|
Donut charts can be customized in various ways, especially the labels for better readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut).
|
||||||
|
|
||||||
|
Now that you know how to group data and display it as a pie or donut chart, let's explore another type of chart: the waffle chart.
|
||||||
|
|
||||||
|
## Waffles!
|
||||||
|
|
||||||
|
A waffle chart visualizes quantities as a 2D array of squares. Let's use it to examine the proportions of mushroom cap colors in the dataset. First, install the helper library [PyWaffle](https://pypi.org/project/pywaffle/) and use Matplotlib:
|
||||||
|
|
||||||
|
```python
|
||||||
|
pip install pywaffle
|
||||||
|
```
|
||||||
|
|
||||||
|
Select a segment of your data to group:
|
||||||
|
|
||||||
|
```python
|
||||||
|
capcolor=mushrooms.groupby(['cap-color']).count()
|
||||||
|
capcolor
|
||||||
|
```
|
||||||
|
|
||||||
|
Create a waffle chart by defining labels and grouping your data:
|
||||||
|
|
||||||
|
```python
import pandas as pd
import matplotlib.pyplot as plt
from pywaffle import Waffle

data = {'color': ['brown', 'buff', 'cinnamon', 'green', 'pink', 'purple', 'red', 'white', 'yellow'],
        'amount': capcolor['class']
        }

df = pd.DataFrame(data)

fig = plt.figure(
    FigureClass = Waffle,
    rows = 100,
    values = df.amount,
    labels = list(df.color),
    figsize = (30, 30),
    colors = ["brown", "tan", "maroon", "green", "pink", "purple", "red", "whitesmoke", "yellow"],
)
```
|
||||||
|
|
||||||
|
The waffle chart clearly shows the proportions of cap colors in the mushroom dataset. Interestingly, there are many green-capped mushrooms!
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
✅ PyWaffle supports icons within the charts, using any icon available in [Font Awesome](https://fontawesome.com/). Experiment with icons to create even more engaging waffle charts.
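As a rough sketch of what that can look like, reusing the `df` built above; the `icons`, `icon_size`, and `icon_legend` parameters come from PyWaffle, and the icon name here is just an example:

```python
# Same cap-color data, but each cell is drawn as a Font Awesome icon
fig = plt.figure(
    FigureClass=Waffle,
    rows=20,
    values=df.amount,
    labels=list(df.color),
    icons='circle',       # any Font Awesome icon name
    icon_size=12,
    icon_legend=True,
    figsize=(20, 20),
)
```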
|
||||||
|
|
||||||
|
In this lesson, you learned three ways to visualize proportions. First, group your data into categories, then choose the best visualization method—pie, donut, or waffle. Each offers a quick and intuitive snapshot of the dataset.
|
||||||
|
|
||||||
|
## 🚀 Challenge
|
||||||
|
|
||||||
|
Try recreating these charts in [Charticulator](https://charticulator.com).
|
||||||
|
|
||||||
|
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/21)
|
||||||
|
|
||||||
|
## Review & Self Study
|
||||||
|
|
||||||
|
Choosing between pie, donut, or waffle charts isn't always straightforward. Here are some articles to help you decide:
|
||||||
|
|
||||||
|
https://www.beautiful.ai/blog/battle-of-the-charts-pie-chart-vs-donut-chart
|
||||||
|
https://medium.com/@hypsypops/pie-chart-vs-donut-chart-showdown-in-the-ring-5d24fd86a9ce
|
||||||
|
https://www.mit.edu/~mbarker/formula1/f1help/11-ch-c6.htm
|
||||||
|
https://medium.datadriveninvestor.com/data-visualization-done-the-right-way-with-tableau-waffle-chart-fdf2a19be402
|
||||||
|
|
||||||
|
Do some research to learn more about this decision-making process.
|
||||||
|
|
||||||
|
## Assignment
|
||||||
|
|
||||||
|
[Try it in Excel](assignment.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "1e00fe6a244c2f8f9a794c862661dd4f",
|
||||||
|
"translation_date": "2025-08-31T11:05:29+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/11-visualization-proportions/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Try it in Excel
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
Did you know you can create donut, pie, and waffle charts in Excel? Using a dataset of your choice, create these three charts directly in an Excel spreadsheet.
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
| Outstanding | Satisfactory | Needs Improvement |
|
||||||
|
| ------------------------------------------------------- | ------------------------------------------------- | ------------------------------------------------------ |
|
||||||
|
| An Excel spreadsheet is provided with all three charts | An Excel spreadsheet is provided with two charts | An Excel spreadsheet is provided with only one chart |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,186 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "cad419b574d5c35eaa417e9abfdcb0c8",
|
||||||
|
"translation_date": "2025-08-31T11:06:58+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/12-visualization-relationships/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Visualizing Relationships: All About Honey 🍯
|
||||||
|
|
||||||
|
| ](../../sketchnotes/12-Visualizing-Relationships.png)|
|
||||||
|
|:---:|
|
||||||
|
|Visualizing Relationships - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
|
||||||
|
|
||||||
|
Continuing with the nature focus of our research, let's explore fascinating ways to visualize the relationships between different types of honey, based on a dataset from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
|
||||||
|
|
||||||
|
This dataset, containing around 600 entries, showcases honey production across various U.S. states. For instance, it includes data on the number of colonies, yield per colony, total production, stocks, price per pound, and the value of honey produced in each state from 1998 to 2012, with one row per year for each state.
|
||||||
|
|
||||||
|
It would be intriguing to visualize the relationship between a state's annual production and, for example, the price of honey in that state. Alternatively, you could examine the relationship between honey yield per colony across states. This time frame also includes the emergence of the devastating 'CCD' or 'Colony Collapse Disorder' first identified in 2006 (http://npic.orst.edu/envir/ccd.html), making this dataset particularly significant to study. 🐝
|
||||||
|
|
||||||
|
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/22)
|
||||||
|
|
||||||
|
In this lesson, you can use Seaborn, a library you've worked with before, to effectively visualize relationships between variables. One particularly useful function in Seaborn is `relplot`, which enables scatter plots and line plots to quickly illustrate '[statistical relationships](https://seaborn.pydata.org/tutorial/relational.html?highlight=relationships)', helping data scientists better understand how variables interact.
|
||||||
|
|
||||||
|
## Scatterplots
|
||||||
|
|
||||||
|
Use a scatterplot to visualize how the price of honey has changed year over year in each state. Seaborn's `relplot` conveniently organizes state data and displays data points for both categorical and numeric data.
|
||||||
|
|
||||||
|
Let's begin by importing the data and Seaborn:
|
||||||
|
|
||||||
|
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
honey = pd.read_csv('../../data/honey.csv')
honey.head()
```
|
||||||
|
You'll notice that the honey dataset includes several interesting columns, such as year and price per pound. Let's explore this data, grouped by U.S. state:
|
||||||
|
|
||||||
|
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
|
||||||
|
| ----- | ------ | ----------- | --------- | -------- | ---------- | --------- | ---- |
|
||||||
|
| AL | 16000 | 71 | 1136000 | 159000 | 0.72 | 818000 | 1998 |
|
||||||
|
| AZ | 55000 | 60 | 3300000 | 1485000 | 0.64 | 2112000 | 1998 |
|
||||||
|
| AR | 53000 | 65 | 3445000 | 1688000 | 0.59 | 2033000 | 1998 |
|
||||||
|
| CA | 450000 | 83 | 37350000 | 12326000 | 0.62 | 23157000 | 1998 |
|
||||||
|
| CO | 27000 | 72 | 1944000 | 1594000 | 0.7 | 1361000 | 1998 |
|
||||||
|
|
||||||
|
Create a basic scatterplot to show the relationship between the price per pound of honey and its U.S. state of origin. Adjust the `y` axis to ensure all states are visible:
|
||||||
|
|
||||||
|
```python
|
||||||
|
sns.relplot(x="priceperlb", y="state", data=honey, height=15, aspect=.5);
|
||||||
|
```
|
||||||
|

|
||||||
|
|
||||||
|
Next, use a honey-inspired color scheme to illustrate how the price changes over the years. Add a 'hue' parameter to highlight year-over-year variations:
|
||||||
|
|
||||||
|
> ✅ Learn more about the [color palettes you can use in Seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) - try a beautiful rainbow color scheme!
|
||||||
|
|
||||||
|
```python
|
||||||
|
sns.relplot(x="priceperlb", y="state", hue="year", palette="YlOrBr", data=honey, height=15, aspect=.5);
|
||||||
|
```
|
||||||
|

|
||||||
|
|
||||||
|
With this color scheme, you can clearly see a strong upward trend in honey prices over the years. If you examine a specific state, such as Arizona, you can observe a consistent pattern of price increases year over year, with only a few exceptions:
|
||||||
|
|
||||||
|
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
|
||||||
|
| ----- | ------ | ----------- | --------- | ------- | ---------- | --------- | ---- |
|
||||||
|
| AZ | 55000 | 60 | 3300000 | 1485000 | 0.64 | 2112000 | 1998 |
|
||||||
|
| AZ | 52000 | 62 | 3224000 | 1548000 | 0.62 | 1999000 | 1999 |
|
||||||
|
| AZ | 40000 | 59 | 2360000 | 1322000 | 0.73 | 1723000 | 2000 |
|
||||||
|
| AZ | 43000 | 59 | 2537000 | 1142000 | 0.72 | 1827000 | 2001 |
|
||||||
|
| AZ | 38000 | 63 | 2394000 | 1197000 | 1.08 | 2586000 | 2002 |
|
||||||
|
| AZ | 35000 | 72 | 2520000 | 983000 | 1.34 | 3377000 | 2003 |
|
||||||
|
| AZ | 32000 | 55 | 1760000 | 774000 | 1.11 | 1954000 | 2004 |
|
||||||
|
| AZ | 36000 | 50 | 1800000 | 720000 | 1.04 | 1872000 | 2005 |
|
||||||
|
| AZ | 30000 | 65 | 1950000 | 839000 | 0.91 | 1775000 | 2006 |
|
||||||
|
| AZ | 30000 | 64 | 1920000 | 902000 | 1.26 | 2419000 | 2007 |
|
||||||
|
| AZ | 25000 | 64 | 1600000 | 336000 | 1.26 | 2016000 | 2008 |
|
||||||
|
| AZ | 20000 | 52 | 1040000 | 562000 | 1.45 | 1508000 | 2009 |
|
||||||
|
| AZ | 24000 | 77 | 1848000 | 665000 | 1.52 | 2809000 | 2010 |
|
||||||
|
| AZ | 23000 | 53 | 1219000 | 427000 | 1.55 | 1889000 | 2011 |
|
||||||
|
| AZ | 22000 | 46 | 1012000 | 253000 | 1.79 | 1811000 | 2012 |
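To reproduce a per-state slice like the Arizona table above, a simple boolean filter is enough. A sketch, assuming the same `honey` dataframe:

```python
# All Arizona rows, ordered by year
az = honey[honey['state'] == 'AZ'].sort_values('year')
print(az[['year', 'priceperlb', 'totalprod']])
```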
|
||||||
|
|
||||||
|
Another way to visualize this trend is by using size instead of color. For colorblind users, this might be a better option. Modify your visualization to represent price increases with larger dot sizes:
|
||||||
|
|
||||||
|
```python
|
||||||
|
sns.relplot(x="priceperlb", y="state", size="year", data=honey, height=15, aspect=.5);
|
||||||
|
```
|
||||||
|
You can observe the dots growing larger over time.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Is this simply a case of supply and demand? Could factors like climate change and colony collapse be reducing honey availability year over year, thereby driving up prices?
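Before turning to line charts, a quick numeric check can hint at how strongly yearly production and price move together. A sketch, assuming the same `honey` dataframe:

```python
# Average production and price per year, then their correlation
yearly = honey.groupby('year')[['totalprod', 'priceperlb']].mean()
print(yearly.corr())  # a strongly negative value would support the supply-and-demand idea
```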
|
||||||
|
|
||||||
|
To explore correlations between variables in this dataset, let's examine some line charts.
|
||||||
|
|
||||||
|
## Line charts
|
||||||
|
|
||||||
|
Question: Is there a clear upward trend in honey prices per pound year over year? The easiest way to determine this is by creating a single line chart:
|
||||||
|
|
||||||
|
```python
|
||||||
|
sns.relplot(x="year", y="priceperlb", kind="line", data=honey);
|
||||||
|
```
|
||||||
|
Answer: Yes, although there are some exceptions around 2003:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
✅ Seaborn aggregates data into one line by plotting the mean and a 95% confidence interval around the mean. [Source](https://seaborn.pydata.org/tutorial/relational.html). You can disable this behavior by adding `ci=None`.
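For example (newer Seaborn releases prefer `errorbar=None`, but `ci=None` matches the text above):

```python
# Draw the mean line without the shaded confidence band
sns.relplot(x="year", y="priceperlb", kind="line", ci=None, data=honey);
```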
|
||||||
|
|
||||||
|
Question: In 2003, was there also a spike in honey supply? What happens if you examine total production year over year?
|
||||||
|
|
||||||
|
```python
|
||||||
|
sns.relplot(x="year", y="totalprod", kind="line", data=honey);
|
||||||
|
```
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Answer: Not really. Total production appears to have increased in 2003, even though overall honey production has been declining during these years.
|
||||||
|
|
||||||
|
Question: In that case, what might have caused the spike in honey prices around 2003?
|
||||||
|
|
||||||
|
To investigate further, you can use a facet grid.
|
||||||
|
|
||||||
|
## Facet grids
|
||||||
|
|
||||||
|
Facet grids allow you to focus on one aspect of your dataset (e.g., 'year') and create a plot for each facet using your chosen x and y coordinates. This makes comparisons easier. Does 2003 stand out in this type of visualization?
|
||||||
|
|
||||||
|
Create a facet grid using `relplot`, as recommended by [Seaborn's documentation](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html?highlight=facetgrid#seaborn.FacetGrid).
|
||||||
|
|
||||||
|
```python
sns.relplot(
    data=honey,
    x="yieldpercol", y="numcol",
    col="year",
    col_wrap=3,
    kind="line"
)
```
|
||||||
|
In this visualization, you can compare yield per colony and number of colonies year over year, side by side, with a column wrap set to 3:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
For this dataset, nothing particularly stands out regarding the number of colonies and their yield year over year or state by state. Is there another way to explore correlations between these variables?
|
||||||
|
|
||||||
|
## Dual-line Plots
|
||||||
|
|
||||||
|
Try a multiline plot by overlaying two line plots, using Seaborn's 'despine' to remove the top and right spines, and `ax.twinx` [from Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.twinx.html). Twinx allows a chart to share the x-axis while displaying two y-axes. Superimpose yield per colony and number of colonies:
|
||||||
|
|
||||||
|
```python
fig, ax = plt.subplots(figsize=(12,6))
lineplot = sns.lineplot(x=honey['year'], y=honey['numcol'], data=honey,
                        label = 'Number of bee colonies', legend=False)
sns.despine()
plt.ylabel('# colonies')
plt.title('Honey Production Year over Year');

ax2 = ax.twinx()
lineplot2 = sns.lineplot(x=honey['year'], y=honey['yieldpercol'], ax=ax2, color="r",
                         label ='Yield per colony', legend=False)
sns.despine(right=False)
plt.ylabel('colony yield')
ax.figure.legend();
```
|
||||||
|

|
||||||
|
|
||||||
|
While nothing particularly stands out around 2003, this visualization ends the lesson on a slightly positive note: although the number of colonies is declining overall, it seems to be stabilizing, even if their yield per colony is decreasing.
|
||||||
|
|
||||||
|
Go, bees, go!
|
||||||
|
|
||||||
|
🐝❤️
|
||||||
|
## 🚀 Challenge
|
||||||
|
|
||||||
|
In this lesson, you learned more about scatterplots and line grids, including facet grids. Challenge yourself to create a facet grid using a different dataset, perhaps one you've used in previous lessons. Note how long it takes to generate and consider how many grids are practical to draw using these techniques.
|
||||||
|
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/23)
|
||||||
|
|
||||||
|
## Review & Self Study
|
||||||
|
|
||||||
|
Line plots can range from simple to complex. Spend some time reading the [Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.lineplot.html) to learn about the various ways to build them. Try enhancing the line charts you created in this lesson using methods described in the documentation.
|
||||||
|
## Assignment
|
||||||
|
|
||||||
|
[Dive into the beehive](assignment.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "680419753c086eef51be86607c623945",
|
||||||
|
"translation_date": "2025-08-31T11:07:22+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/12-visualization-relationships/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Explore the Beehive
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
In this lesson, you began examining a dataset about bees and their honey production over a period marked by overall declines in bee colony populations. Dive deeper into this dataset and create a notebook that narrates the story of the bee population's health, broken down by state and year. Do you uncover anything intriguing in this dataset?
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
| Outstanding | Satisfactory | Needs Improvement |
|
||||||
|
| ------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------- | ---------------------------------------- |
|
||||||
|
| A notebook is provided with a narrative supported by at least three distinct charts illustrating aspects of the dataset, comparing states and years | The notebook is missing one of these elements | The notebook is missing two of these elements |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,182 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "4ec4747a9f4f7d194248ea29903ae165",
|
||||||
|
"translation_date": "2025-08-31T11:06:21+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/13-meaningful-visualizations/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Creating Meaningful Visualizations
|
||||||
|
|
||||||
|
| ](../../sketchnotes/13-MeaningfulViz.png)|
|
||||||
|
|:---:|
|
||||||
|
| Meaningful Visualizations - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
|
||||||
|
|
||||||
|
> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase)
|
||||||
|
|
||||||
|
One of the essential skills for a data scientist is the ability to create meaningful data visualizations that help answer specific questions. Before visualizing your data, you need to ensure it has been cleaned and prepared, as covered in previous lessons. Once that's done, you can start deciding how best to present the data.
|
||||||
|
|
||||||
|
In this lesson, you will explore:
|
||||||
|
|
||||||
|
1. How to select the appropriate chart type
|
||||||
|
2. How to avoid misleading visualizations
|
||||||
|
3. How to use color effectively
|
||||||
|
4. How to style charts for better readability
|
||||||
|
5. How to create animated or 3D visualizations
|
||||||
|
6. How to design creative visualizations
|
||||||
|
|
||||||
|
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/24)
|
||||||
|
|
||||||
|
## Selecting the appropriate chart type
|
||||||
|
|
||||||
|
In earlier lessons, you experimented with creating various types of data visualizations using Matplotlib and Seaborn. Generally, you can choose the [appropriate chart type](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) based on the question you're trying to answer using the following table:
|
||||||
|
|
||||||
|
| Task | Recommended Chart Type |
| -------------------------- | ------------------------------- |
| Show data trends over time | Line |
| Compare categories | Bar, Pie |
| Compare totals | Pie, Stacked Bar |
| Show relationships | Scatter, Line, Facet, Dual Line |
| Show distributions | Scatter, Histogram, Box |
| Show proportions | Pie, Donut, Waffle |
|
||||||
|
|
||||||
|
> ✅ Depending on the structure of your data, you may need to convert it from text to numeric format to make certain charts work.
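A minimal sketch of that conversion step with pandas; the column names here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'category': ['Ducks', 'Owls', 'Ducks'], 'count': ['12', '3', '7']})
df['count'] = pd.to_numeric(df['count'])                            # numeric strings -> numbers
df['category_code'] = df['category'].astype('category').cat.codes  # text labels -> integer codes
```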
|
||||||
|
|
||||||
|
## Avoiding misleading visualizations
|
||||||
|
|
||||||
|
Even when a data scientist carefully selects the right chart for the data, there are still ways to present data in a misleading manner, often to support a specific narrative at the expense of accuracy. There are numerous examples of deceptive charts and infographics!
|
||||||
|
|
||||||
|
[](https://www.youtube.com/watch?v=oX74Nge8Wkw "How charts lie")
|
||||||
|
|
||||||
|
> 🎥 Click the image above to watch a conference talk about misleading charts.
|
||||||
|
|
||||||
|
This chart flips the X-axis to present the opposite of the truth based on dates:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
[This chart](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg) is even more misleading. At first glance, it appears that COVID cases have declined over time in various counties. However, upon closer inspection, the dates have been rearranged to create a deceptive downward trend.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This infamous example uses both color and a flipped Y-axis to mislead viewers. Instead of showing that gun deaths increased after the passage of gun-friendly legislation, the chart tricks the eye into believing the opposite:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This peculiar chart manipulates proportions to a comical degree:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Another deceptive tactic is comparing things that are not truly comparable. A [fascinating website](https://tylervigen.com/spurious-correlations) showcases 'spurious correlations,' such as the divorce rate in Maine being linked to margarine consumption. A Reddit group also collects [examples of poor data usage](https://www.reddit.com/r/dataisugly/top/?t=all).
|
||||||
|
|
||||||
|
Understanding how easily the eye can be tricked by misleading charts is crucial. Even with good intentions, a poorly chosen chart type—like a pie chart with too many categories—can lead to confusion.
|
||||||
|
|
||||||
|
## Using color effectively
|
||||||
|
|
||||||
|
The 'Florida gun violence' chart above demonstrates how color can add another layer of meaning to visualizations. Libraries like Matplotlib and Seaborn come with pre-designed color palettes, but if you're creating a chart manually, it's worth studying [color theory](https://colormatters.com/color-and-design/basic-color-theory).
|
||||||
|
|
||||||
|
> ✅ Keep accessibility in mind when designing charts. Some users may be colorblind—does your chart work well for those with visual impairments?
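If you're building the chart in code, a colorblind-friendly palette is an easy safeguard. A sketch using palettes that ship with Matplotlib and Seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('tableau-colorblind10')              # Matplotlib's built-in colorblind-safe style
sns.set_palette(sns.color_palette('colorblind'))   # Seaborn's colorblind palette
```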
|
||||||
|
|
||||||
|
Be cautious when selecting colors for your chart, as they can convey unintended meanings. For example, the 'pink ladies' in the 'height' chart above add a gendered implication that makes the chart even more bizarre.
|
||||||
|
|
||||||
|
While [color meanings](https://colormatters.com/color-symbolism/the-meanings-of-colors) can vary across cultures and change depending on the shade, general associations include:
|
||||||
|
|
||||||
|
| Color  | Meaning             |
| ------ | ------------------- |
| red    | power               |
| blue   | trust, loyalty      |
| yellow | happiness, caution  |
| green  | ecology, luck, envy |
| purple | happiness           |
| orange | vibrance            |
|
||||||
|
|
||||||
|
If you're tasked with creating a chart with custom colors, ensure that your choices align with the intended message and that the chart remains accessible.
|
||||||
|
|
||||||
|
## Styling charts for better readability
|
||||||
|
|
||||||
|
Charts lose their value if they are difficult to read. Take time to adjust the width and height of your chart to ensure it scales well with your data. For example, if you need to display all 50 states, consider showing them vertically on the Y-axis to avoid horizontal scrolling.
|
||||||
|
|
||||||
|
Label your axes, include a legend if necessary, and provide tooltips for better data comprehension.
|
||||||
|
|
||||||
|
If your data includes verbose text on the X-axis, you can angle the text for improved readability. [Matplotlib](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html) also supports 3D plotting if your data warrants it. Advanced visualizations can be created using `mpl_toolkits.mplot3d`.
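For instance, here is a small sketch of angling crowded x-axis labels; the category names and counts are made up for illustration:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(['Ducks/Geese/Waterfowl', 'Shorebirds/Allies', 'Owls/Allies'], [30, 25, 10])
ax.tick_params(axis='x', labelrotation=45)   # angle verbose labels so they stay readable
plt.tight_layout()
plt.show()
```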
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Animation and 3D visualizations
|
||||||
|
|
||||||
|
Some of the most engaging visualizations today are animated. Shirley Wu has created stunning examples using D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/),' where each flower represents a movie. Another example is 'Bussed Out,' an interactive experience for the Guardian that combines visualizations with Greensock and D3, along with a scrollytelling article format, to illustrate how NYC addresses homelessness by bussing people out of the city.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
> "Bussed Out: How America Moves its Homeless" from [the Guardian](https://www.theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study). Visualizations by Nadieh Bremer & Shirley Wu
|
||||||
|
|
||||||
|
While this lesson doesn't delve deeply into these powerful visualization libraries, you can experiment with D3 in a Vue.js app to create an animated visualization of the book "Dangerous Liaisons" as a social network.
|
||||||
|
|
||||||
|
> "Les Liaisons Dangereuses" is an epistolary novel, presented as a series of letters. Written in 1782 by Choderlos de Laclos, it tells the story of the morally corrupt social maneuvers of two French aristocrats, the Vicomte de Valmont and the Marquise de Merteuil. Both meet their downfall, but not before causing significant social damage. The novel unfolds through letters written to various individuals in their circles, plotting revenge or simply creating chaos. Create a visualization of these letters to identify the key players in the narrative.
|
||||||
|
|
||||||
|
You will complete a web app that displays an animated view of this social network. It uses a library designed to create a [network visualization](https://github.com/emiliorizzo/vue-d3-network) with Vue.js and D3. Once the app is running, you can drag nodes around the screen to rearrange the data.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Project: Build a network chart using D3.js
|
||||||
|
|
||||||
|
> This lesson folder includes a `solution` folder with the completed project for reference.
|
||||||
|
|
||||||
|
1. Follow the instructions in the README.md file located in the starter folder's root. Ensure you have NPM and Node.js installed on your machine before setting up the project's dependencies.
|
||||||
|
|
||||||
|
2. Open the `starter/src` folder. Inside, you'll find an `assets` folder containing a .json file with all the letters from the novel, annotated with 'to' and 'from' fields.
|
||||||
|
|
||||||
|
3. Complete the code in `components/Nodes.vue` to enable the visualization. Locate the method called `createLinks()` and add the following nested loop.
|
||||||
|
|
||||||
|
Loop through the .json object to extract the 'to' and 'from' data for the letters and build the `links` object for the visualization library:
|
||||||
|
|
||||||
|
```javascript
//loop through letters
let f = 0;
let t = 0;
for (var i = 0; i < letters.length; i++) {
    for (var j = 0; j < characters.length; j++) {
        if (characters[j] == letters[i].from) {
            f = j;
        }
        if (characters[j] == letters[i].to) {
            t = j;
        }
    }
    this.links.push({ sid: f, tid: t });
}
```
|
||||||
|
|
||||||
|
Run your app from the terminal (npm run serve) and enjoy the visualization!
|
||||||
|
|
||||||
|
## 🚀 Challenge
|
||||||
|
|
||||||
|
Explore the internet to find examples of misleading visualizations. How does the author mislead the audience, and is it intentional? Try correcting the visualizations to show how they should appear.
|
||||||
|
|
||||||
|
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/25)
|
||||||
|
|
||||||
|
## Review & Self Study
|
||||||
|
|
||||||
|
Here are some articles about misleading data visualizations:
|
||||||
|
|
||||||
|
https://gizmodo.com/how-to-lie-with-data-visualization-1563576606
|
||||||
|
|
||||||
|
http://ixd.prattsi.org/2017/12/visual-lies-usability-in-deceptive-data-visualizations/
|
||||||
|
|
||||||
|
Explore these interesting visualizations of historical assets and artifacts:
|
||||||
|
|
||||||
|
https://handbook.pubpub.org/
|
||||||
|
|
||||||
|
Read this article on how animation can enhance visualizations:
|
||||||
|
|
||||||
|
https://medium.com/@EvanSinar/use-animation-to-supercharge-data-visualization-cd905a882ad4
|
||||||
|
|
||||||
|
## Assignment
|
||||||
|
|
||||||
|
[Create your own custom visualization](assignment.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "e56df4c0f49357e30ac8fc77aa439dd4",
|
||||||
|
"translation_date": "2025-08-31T11:06:43+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/13-meaningful-visualizations/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Build your own custom vis
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
Using the code sample provided in this project, create a social network by mocking up data based on your own social interactions. You could map out your social media usage or design a diagram of your family members. Build an engaging web app that showcases a unique visualization of a social network.
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
Exemplary | Adequate | Needs Improvement
|
||||||
|
--- | --- | --- |
|
||||||
|
A GitHub repository is provided with code that functions correctly (consider deploying it as a static web app) and includes a well-annotated README explaining the project | The repository either does not function correctly or lacks proper documentation | The repository neither functions correctly nor includes proper documentation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,40 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "5c51a54dd89075a7a362890117b7ed9e",
|
||||||
|
"translation_date": "2025-08-31T11:06:48+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/13-meaningful-visualizations/solution/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Dangerous Liaisons data visualization project
|
||||||
|
|
||||||
|
To begin, make sure NPM and Node are installed and running on your computer. Install the dependencies (npm install) and then launch the project locally (npm run serve):
|
||||||
|
|
||||||
|
## Project setup
|
||||||
|
```
|
||||||
|
npm install
|
||||||
|
```
|
||||||
|
|
||||||
|
### Compiles and hot-reloads for development
|
||||||
|
```
|
||||||
|
npm run serve
|
||||||
|
```
|
||||||
|
|
||||||
|
### Compiles and minifies for production
|
||||||
|
```
|
||||||
|
npm run build
|
||||||
|
```
|
||||||
|
|
||||||
|
### Lints and fixes files
|
||||||
|
```
|
||||||
|
npm run lint
|
||||||
|
```
|
||||||
|
|
||||||
|
### Customize configuration
|
||||||
|
Refer to [Configuration Reference](https://cli.vuejs.org/config/).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,40 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "5c51a54dd89075a7a362890117b7ed9e",
|
||||||
|
"translation_date": "2025-08-31T11:06:52+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/13-meaningful-visualizations/starter/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Dangerous Liaisons data visualization project
|
||||||
|
|
||||||
|
To begin, make sure NPM and Node are installed and running on your computer. Install the dependencies (npm install) and then launch the project locally (npm run serve):
|
||||||
|
|
||||||
|
## Project setup
|
||||||
|
```
|
||||||
|
npm install
|
||||||
|
```
|
||||||
|
|
||||||
|
### Compiles and hot-reloads for development
|
||||||
|
```
|
||||||
|
npm run serve
|
||||||
|
```
|
||||||
|
|
||||||
|
### Compiles and minifies for production
|
||||||
|
```
|
||||||
|
npm run build
|
||||||
|
```
|
||||||
|
|
||||||
|
### Lints and fixes files
|
||||||
|
```
|
||||||
|
npm run lint
|
||||||
|
```
|
||||||
|
|
||||||
|
### Customize configuration
|
||||||
|
Refer to [Configuration Reference](https://cli.vuejs.org/config/).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,234 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "22acf28f518a4769ea14fa42f4734b9f",
|
||||||
|
"translation_date": "2025-08-31T11:03:10+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/R/09-visualization-quantities/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Visualizing Quantities
|
||||||
|
| ](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/sketchnotes/09-Visualizing-Quantities.png)|
|
||||||
|
|:---:|
|
||||||
|
| Visualizing Quantities - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
|
||||||
|
|
||||||
|
In this lesson, you'll learn how to use some of the many R packages and libraries to create engaging visualizations focused on the concept of quantity. Using a cleaned dataset about the birds of Minnesota, you can uncover fascinating insights about local wildlife.
|
||||||
|
|
||||||
|
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/16)
|
||||||
|
|
||||||
|
## Observing Wingspan with ggplot2
|
||||||
|
An excellent library for creating both simple and complex plots and charts is [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html). Generally, the process of plotting data with these libraries involves identifying the parts of your dataframe to target, performing any necessary transformations, assigning x and y axis values, choosing the type of plot, and then displaying it.
|
||||||
|
|
||||||
|
`ggplot2` is a system for declaratively creating graphics, based on The Grammar of Graphics. The [Grammar of Graphics](https://en.wikipedia.org/wiki/Ggplot2) is a general framework for data visualization that breaks graphs into semantic components like scales and layers. In simpler terms, the ease of creating plots and graphs for univariate or multivariate data with minimal code makes `ggplot2` the most popular visualization package in R. The user specifies how to map variables to aesthetics, chooses graphical primitives, and `ggplot2` handles the rest.
|
||||||
|
|
||||||
|
> ✅ Plot = Data + Aesthetics + Geometry
|
||||||
|
> - Data refers to the dataset
|
||||||
|
> - Aesthetics indicate the variables to study (x and y variables)
|
||||||
|
> - Geometry refers to the type of plot (line plot, bar plot, etc.)
|
||||||
|
|
||||||
|
Choose the best geometry (type of plot) based on your data and the story you want to tell through the visualization.
|
||||||
|
|
||||||
|
> - To analyze trends: line, column
|
||||||
|
> - To compare values: bar, column, pie, scatterplot
|
||||||
|
> - To show how parts relate to a whole: pie
|
||||||
|
> - To show data distribution: scatterplot, bar
|
||||||
|
> - To show relationships between values: line, scatterplot, bubble
|
||||||
|
|
||||||
|
✅ You can also check out this helpful [cheatsheet](https://nyu-cdsc.github.io/learningr/assets/data-visualization-2.1.pdf) for ggplot2.
|
||||||
|
|
||||||
|
## Build a Line Plot for Bird Wingspan Values
|
||||||
|
|
||||||
|
Open the R console and import the dataset.
|
||||||
|
> Note: The dataset is stored in the root of this repo in the `/data` folder.
|
||||||
|
|
||||||
|
Let's import the dataset and view the first five rows of the data.
|
||||||
|
|
||||||
|
```r
|
||||||
|
birds <- read.csv("../../data/birds.csv",fileEncoding="UTF-8-BOM")
|
||||||
|
head(birds)
|
||||||
|
```
|
||||||
|
The first few rows of the data contain a mix of text and numbers:
|
||||||
|
|
||||||
|
| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
|
||||||
|
| ---: | :--------------------------- | :--------------------- | :-------------------- | :----------- | :------- | :---------- | :----------------- | --------: | --------: | ----------: | ----------: | ----------: | ----------: |
|
||||||
|
| 0 | Black-bellied whistling-duck | Dendrocygna autumnalis | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 47 | 56 | 652 | 1020 | 76 | 94 |
|
||||||
|
| 1 | Fulvous whistling-duck | Dendrocygna bicolor | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 45 | 53 | 712 | 1050 | 85 | 93 |
|
||||||
|
| 2 | Snow goose | Anser caerulescens | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 79 | 2050 | 4050 | 135 | 165 |
|
||||||
|
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64 | 1066 | 1567 | 113 | 116 |
|
||||||
|
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 81 | 1930 | 3310 | 130 | 165 |
|
||||||
|
|
||||||
|
Let's start by plotting some of the numeric data using a basic line plot. Suppose you want to visualize the maximum wingspan of these birds.
|
||||||
|
|
||||||
|
```r
install.packages("ggplot2")
library("ggplot2")
ggplot(data=birds, aes(x=Name, y=MaxWingspan, group=1)) +
  geom_line()
```
|
||||||
|
Here, you install the `ggplot2` package and import it into the workspace using the `library("ggplot2")` command. To create any plot in ggplot, the `ggplot()` function is used, where you specify the dataset, x and y variables as attributes. In this case, we use the `geom_line()` function to create a line plot.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
What do you notice right away? There seems to be at least one outlier—what a wingspan! A 2000+ centimeter wingspan equals more than 20 meters—are there Pterodactyls in Minnesota? Let's investigate.
|
||||||
|
|
||||||
|
While you could sort the data in Excel to find these outliers (likely typos), let's continue the visualization process directly within the plot.
|
||||||
|
|
||||||
|
Add labels to the x-axis to show the bird species:
|
||||||
|
|
||||||
|
```r
ggplot(data = birds, aes(x = Name, y = MaxWingspan, group = 1)) +
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Birds") +
  ylab("Wingspan (CM)") +
  ggtitle("Max Wingspan in Centimeters")
```

We specify the label angle in the `theme` and set the x- and y-axis labels using `xlab()` and `ylab()`. `ggtitle()` adds a title to the graph.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Even with the labels rotated 45 degrees, there are too many to read. Let's try a different approach: label only the outliers and place the labels within the chart. You can use a scatter plot to make room for the labels:
|
||||||
|
|
||||||
|
```r
ggplot(data = birds, aes(x = Name, y = MaxWingspan, group = 1)) +
  geom_point() +
  geom_text(aes(label = ifelse(MaxWingspan > 500, as.character(Name), '')), hjust = 0, vjust = 0) +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  ylab("Wingspan (CM)") +
  ggtitle("Max Wingspan in Centimeters")
```

What happens here? You use the `geom_point()` function to plot scatter points. You also add labels for birds with `MaxWingspan > 500` and hide the x-axis labels to declutter the plot.
|
||||||
|
|
||||||
|
What do you discover?
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Filter Your Data
|
||||||
|
|
||||||
|
Both the Bald Eagle and the Prairie Falcon, while likely large birds, seem to have been mislabeled with an extra 0 in their maximum wingspan. A Bald Eagle with a 25-meter wingspan is unlikely, but if you see one, let us know! Let's create a new dataframe without these two outliers:
|
||||||
|
|
||||||
|
```r
birds_filtered <- subset(birds, MaxWingspan < 500)

ggplot(data = birds_filtered, aes(x = Name, y = MaxWingspan, group = 1)) +
  geom_point() +
  ylab("Wingspan (CM)") +
  xlab("Birds") +
  ggtitle("Max Wingspan in Centimeters") +
  geom_text(aes(label = ifelse(MaxWingspan > 500, as.character(Name), '')), hjust = 0, vjust = 0) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
```

We create a new dataframe `birds_filtered` and plot a scatter plot. By filtering out outliers, your data becomes more cohesive and easier to interpret.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Now that we have a cleaner dataset in terms of wingspan, let's explore more about these birds.
|
||||||
|
|
||||||
|
While line and scatter plots can display data values and distributions, we want to think about the quantities in this dataset. You could create visualizations to answer questions like:
|
||||||
|
|
||||||
|
> How many categories of birds are there, and how many birds are in each?
|
||||||
|
> How many birds are extinct, endangered, rare, or common?
|
||||||
|
> How many birds belong to various genera and orders in Linnaeus's classification?
|
||||||
|
|
||||||
|
## Explore Bar Charts
|
||||||
|
|
||||||
|
Bar charts are useful for showing groupings of data. Let's explore the bird categories in this dataset to see which is the most common.
|
||||||
|
|
||||||
|
Let's create a bar chart using the filtered data.
|
||||||
|
|
||||||
|
```r
install.packages("dplyr")
install.packages("tidyverse")

library(lubridate)
library(scales)
library(dplyr)
library(ggplot2)
library(tidyverse)

birds_filtered %>%
  group_by(Category) %>%
  summarise(n = n(),
            MinLength = mean(MinLength),
            MaxLength = mean(MaxLength),
            MinBodyMass = mean(MinBodyMass),
            MaxBodyMass = mean(MaxBodyMass),
            MinWingspan = mean(MinWingspan),
            MaxWingspan = mean(MaxWingspan)) %>%
  gather("key", "value", -c(Category, n)) %>%
  ggplot(aes(x = Category, y = value, group = key, fill = key)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#D62728", "#FF7F0E", "#8C564B", "#2CA02C", "#1F77B4", "#9467BD")) +
  xlab("Category") + ggtitle("Birds of Minnesota")
```

In this snippet, we install the [dplyr](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8) and tidyverse packages and load [lubridate](https://www.rdocumentation.org/packages/lubridate/versions/1.8.0), scales, dplyr, ggplot2, and tidyverse to help manipulate and group the data for a stacked bar chart. First, we group the data by the bird's `Category` and compute the mean of columns like `MinLength`, `MaxLength`, `MinBodyMass`, `MaxBodyMass`, `MinWingspan`, and `MaxWingspan`. Then we reshape the summary with `gather()` and use `ggplot2` to plot the stacked bar chart, specifying colors and labels for the categories.
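
`gather()` still works but has been superseded in newer versions of tidyr. If you prefer the current API, the reshaping step could be written with `pivot_longer()` instead — a sketch that should be functionally equivalent here:

```r
# Equivalent reshaping with tidyr's newer pivot_longer()
library(tidyr)

birds_long <- birds_filtered %>%
  group_by(Category) %>%
  summarise(across(c(MinLength, MaxLength, MinBodyMass,
                     MaxBodyMass, MinWingspan, MaxWingspan), mean)) %>%
  pivot_longer(-Category, names_to = "key", values_to = "value")
```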
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This bar chart is hard to read because there's too much ungrouped data. Let's focus on the length of birds based on their category.
|
||||||
|
|
||||||
|
Filter the data to include only the bird categories.
|
||||||
|
|
||||||
|
Since there are many categories, display the chart vertically and adjust its height to fit all the data:
|
||||||
|
|
||||||
|
```r
birds_count <- dplyr::count(birds_filtered, Category, sort = TRUE)
birds_count$Category <- factor(birds_count$Category, levels = birds_count$Category)
ggplot(birds_count, aes(Category, n)) + geom_bar(stat = "identity") + coord_flip()
```

We count the unique values in the `Category` column and sort them into a new dataframe, `birds_count`. The sorted categories are then set as factor levels so they plot in that order. Using `ggplot2`, we create a bar chart, and `coord_flip()` turns it into horizontal bars.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This bar chart provides a clear view of the number of birds in each category. At a glance, you can see that the Ducks/Geese/Waterfowl category has the most birds. Given that Minnesota is the "land of 10,000 lakes," this makes sense!
|
||||||
|
|
||||||
|
✅ Try counting other attributes in this dataset. Do any results surprise you?
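
For example, counting by conservation status works the same way (a small sketch, assuming the `ConservationStatus` column shown in the table above):

```r
# How many birds fall into each conservation status?
dplyr::count(birds_filtered, ConservationStatus, sort = TRUE)
```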
|
||||||
|
|
||||||
|
## Comparing Data
|
||||||
|
|
||||||
|
You can compare grouped data by creating new axes. For example, compare the MaxLength of birds based on their category:
|
||||||
|
|
||||||
|
```r
birds_grouped <- birds_filtered %>%
  group_by(Category) %>%
  summarise(
    MaxLength = max(MaxLength, na.rm = TRUE),
    MinLength = max(MinLength, na.rm = TRUE)
  ) %>%
  arrange(Category)

ggplot(birds_grouped, aes(Category, MaxLength)) + geom_bar(stat = "identity") + coord_flip()
```

We group the `birds_filtered` data by `Category` and plot a bar graph of the maximum length per category.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
No surprises here: hummingbirds have the smallest MaxLength compared to pelicans or geese. It's reassuring when data aligns with logic!
|
||||||
|
|
||||||
|
You can make bar charts more interesting by superimposing data. For example, compare the Minimum and Maximum Length of birds within each category:
|
||||||
|
|
||||||
|
```r
ggplot(data = birds_grouped, aes(x = Category)) +
  geom_bar(aes(y = MaxLength), stat = "identity", position = "identity", fill = 'blue') +
  geom_bar(aes(y = MinLength), stat = "identity", position = "identity", fill = 'orange') +
  coord_flip()
```
|
||||||
|

|
||||||
|
|
||||||
|
## 🚀 Challenge
|
||||||
|
|
||||||
|
This bird dataset offers a wealth of information about different bird species in a specific ecosystem. Search online for other bird-related datasets. Practice creating charts and graphs to uncover new insights about birds.
|
||||||
|
|
||||||
|
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/17)
|
||||||
|
|
||||||
|
## Review & Self-Study
|
||||||
|
|
||||||
|
This lesson introduced you to using `ggplot2` for visualizing quantities. Research other ways to work with datasets for visualization. Look for datasets you can visualize using other packages like [Lattice](https://stat.ethz.ch/R-manual/R-devel/library/lattice/html/Lattice.html) and [Plotly](https://github.com/plotly/plotly.R#readme).
|
||||||
|
|
||||||
|
## Assignment
|
||||||
|
[Lines, Scatters, and Bars](assignment.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "0ea21b6513df5ade7419c6b7d65f10b1",
|
||||||
|
"translation_date": "2025-08-31T11:03:42+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/R/09-visualization-quantities/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Lines, Scatters and Bars
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
In this lesson, you explored line charts, scatterplots, and bar charts to highlight interesting insights from this dataset. For this assignment, dive deeper into the dataset to uncover a fact about a specific type of bird. For instance, create a script that visualizes all the intriguing data you can find about Snow Geese. Use the three types of plots mentioned above to craft a narrative in your notebook.
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
Exemplary | Adequate | Needs Improvement
|
||||||
|
--- | --- | --- |
|
||||||
|
The script includes clear annotations, a compelling narrative, and visually appealing graphs | The script lacks one of these elements | The script lacks two of these elements
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,179 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "a33c5d4b4156a2b41788d8720b6f724c",
|
||||||
|
"translation_date": "2025-08-31T11:03:47+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/R/12-visualization-relationships/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Visualizing Relationships: All About Honey 🍯
|
||||||
|
|
||||||
|
| ](../../../sketchnotes/12-Visualizing-Relationships.png)|
|
||||||
|
|:---:|
|
||||||
|
|Visualizing Relationships - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
|
||||||
|
|
||||||
|
Continuing with the nature focus of our research, let's explore fascinating ways to visualize the relationships between different types of honey, based on a dataset from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
|
||||||
|
|
||||||
|
This dataset, containing around 600 entries, showcases honey production across various U.S. states. For instance, it includes data on the number of colonies, yield per colony, total production, stocks, price per pound, and the value of honey produced in each state from 1998 to 2012, with one row per year for each state.
|
||||||
|
|
||||||
|
It would be intriguing to visualize the relationship between a state's annual production and, for example, the price of honey in that state. Alternatively, you could examine the relationship between honey yield per colony across states. This time period also includes the emergence of the devastating 'CCD' or 'Colony Collapse Disorder' first observed in 2006 (http://npic.orst.edu/envir/ccd.html), making this dataset particularly meaningful to study. 🐝
|
||||||
|
|
||||||
|
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/22)
|
||||||
|
|
||||||
|
In this lesson, you'll use ggplot2, a library you've worked with before, to visualize relationships between variables. One of the highlights of ggplot2 is its `geom_point` and `qplot` functions, which allow you to create scatter plots and line plots to quickly visualize '[statistical relationships](https://ggplot2.tidyverse.org/)'. These tools help data scientists better understand how variables interact with one another.
|
||||||
|
|
||||||
|
## Scatterplots
|
||||||
|
|
||||||
|
Use a scatterplot to illustrate how the price of honey has changed year over year in each state. ggplot2, with its `ggplot` and `geom_point` functions, makes it easy to group state data and display data points for both categorical and numeric variables.
|
||||||
|
|
||||||
|
Let's begin by importing the data and previewing it:
|
||||||
|
|
||||||
|
```r
honey <- read.csv('../../data/honey.csv')
head(honey)
```
|
||||||
|
You'll notice that the honey dataset contains several interesting columns, including year and price per pound. Let's explore this data, grouped by U.S. state:
|
||||||
|
|
||||||
|
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
| ----- | ------ | ----------- | --------- | -------- | ---------- | --------- | ---- |
| AL | 16000 | 71 | 1136000 | 159000 | 0.72 | 818000 | 1998 |
| AZ | 55000 | 60 | 3300000 | 1485000 | 0.64 | 2112000 | 1998 |
| AR | 53000 | 65 | 3445000 | 1688000 | 0.59 | 2033000 | 1998 |
| CA | 450000 | 83 | 37350000 | 12326000 | 0.62 | 23157000 | 1998 |
| CO | 27000 | 72 | 1944000 | 1594000 | 0.7 | 1361000 | 1998 |
| FL | 230000 | 98 | 22540000 | 4508000 | 0.64 | 14426000 | 1998 |
|
||||||
|
|
||||||
|
Create a basic scatterplot to show the relationship between the price per pound of honey and its U.S. state of origin. Adjust the `y` axis to ensure all states are visible:
|
||||||
|
|
||||||
|
```r
library(ggplot2)

ggplot(honey, aes(x = priceperlb, y = state)) +
  geom_point(colour = "blue")
```
|
||||||
|

|
||||||
|
|
||||||
|
Next, use a honey-inspired color scheme to visualize how the price changes over the years. You can achieve this by adding a `scale_color_gradientn()` color scale to highlight year-over-year changes:
|
||||||
|
|
||||||
|
> ✅ Learn more about the [scale_color_gradientn](https://www.rdocumentation.org/packages/ggplot2/versions/0.9.1/topics/scale_colour_gradientn) - try a beautiful rainbow color scheme!
|
||||||
|
|
||||||
|
```r
ggplot(honey, aes(x = priceperlb, y = state, color = year)) +
  geom_point() +
  scale_color_gradientn(colours = colorspace::heat_hcl(7))
```
|
||||||
|

|
||||||
|
|
||||||
|
With this color scheme, you can clearly see a strong upward trend in honey prices over the years. If you examine a specific state, such as Arizona, you'll notice a consistent pattern of price increases year over year, with only a few exceptions:
|
||||||
|
|
||||||
|
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
| ----- | ------ | ----------- | --------- | ------- | ---------- | --------- | ---- |
| AZ | 55000 | 60 | 3300000 | 1485000 | 0.64 | 2112000 | 1998 |
| AZ | 52000 | 62 | 3224000 | 1548000 | 0.62 | 1999000 | 1999 |
| AZ | 40000 | 59 | 2360000 | 1322000 | 0.73 | 1723000 | 2000 |
| AZ | 43000 | 59 | 2537000 | 1142000 | 0.72 | 1827000 | 2001 |
| AZ | 38000 | 63 | 2394000 | 1197000 | 1.08 | 2586000 | 2002 |
| AZ | 35000 | 72 | 2520000 | 983000 | 1.34 | 3377000 | 2003 |
| AZ | 32000 | 55 | 1760000 | 774000 | 1.11 | 1954000 | 2004 |
| AZ | 36000 | 50 | 1800000 | 720000 | 1.04 | 1872000 | 2005 |
| AZ | 30000 | 65 | 1950000 | 839000 | 0.91 | 1775000 | 2006 |
| AZ | 30000 | 64 | 1920000 | 902000 | 1.26 | 2419000 | 2007 |
| AZ | 25000 | 64 | 1600000 | 336000 | 1.26 | 2016000 | 2008 |
| AZ | 20000 | 52 | 1040000 | 562000 | 1.45 | 1508000 | 2009 |
| AZ | 24000 | 77 | 1848000 | 665000 | 1.52 | 2809000 | 2010 |
| AZ | 23000 | 53 | 1219000 | 427000 | 1.55 | 1889000 | 2011 |
| AZ | 22000 | 46 | 1012000 | 253000 | 1.79 | 1811000 | 2012 |
|
||||||
|
|
||||||
|
Another way to visualize this trend is by using size instead of color. For colorblind users, this might be a better option. Modify your visualization to represent price increases with larger dot sizes:
|
||||||
|
|
||||||
|
```r
ggplot(honey, aes(x = priceperlb, y = state)) +
  geom_point(aes(size = year), colour = "blue") +
  scale_size_continuous(range = c(0.25, 3))
```
|
||||||
|
You can observe the dots growing larger over time.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Is this simply a case of supply and demand? Could factors like climate change and colony collapse be reducing honey availability year over year, leading to price increases?
|
||||||
|
|
||||||
|
To investigate correlations between variables in this dataset, let's explore line charts.
|
||||||
|
|
||||||
|
## Line charts
|
||||||
|
|
||||||
|
Question: Is there a clear upward trend in honey prices per pound year over year? The simplest way to find out is by creating a single line chart:
|
||||||
|
|
||||||
|
```r
qplot(honey$year, honey$priceperlb, geom = 'smooth', span = 0.5, xlab = "year", ylab = "priceperlb")
```

Answer: Yes, although there are some exceptions around 2003:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Question: In 2003, can we also observe a spike in honey supply? What happens if you examine total production year over year?
|
||||||
|
|
||||||
|
```r
qplot(honey$year, honey$totalprod, geom = 'smooth', span = 0.5, xlab = "year", ylab = "totalprod")
```
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Answer: Not really. Total production seems to have increased in 2003, even though overall honey production appears to be declining during these years.
|
||||||
|
|
||||||
|
Question: In that case, what might have caused the spike in honey prices around 2003?
|
||||||
|
|
||||||
|
To explore this, let's use a facet grid.
|
||||||
|
|
||||||
|
## Facet grids
|
||||||
|
|
||||||
|
Facet grids allow you to focus on one aspect of your dataset (e.g., 'year') and create a plot for each facet based on your chosen x and y coordinates. This makes comparisons easier. Does 2003 stand out in this type of visualization?
|
||||||
|
|
||||||
|
Create a facet grid using `facet_wrap` as recommended by [ggplot2's documentation](https://ggplot2.tidyverse.org/reference/facet_wrap.html).
|
||||||
|
|
||||||
|
```r
ggplot(honey, aes(x = yieldpercol, y = numcol, group = 1)) +
  geom_line() + facet_wrap(vars(year), ncol = 3)
```

In this visualization, you can compare yield per colony and number of colonies year over year, with the wrap set to 3 columns:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
For this dataset, nothing particularly stands out regarding the number of colonies and their yield year over year or state by state. Is there another way to identify correlations between these two variables?
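
One quick, non-visual check is to compute the correlation coefficient directly (a sketch, assuming the `honey` dataframe loaded earlier in this lesson):

```r
# Pearson correlation between number of colonies and yield per colony
cor(honey$numcol, honey$yieldpercol, use = "complete.obs")
```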
|
||||||
|
|
||||||
|
## Dual-line Plots
|
||||||
|
|
||||||
|
Try a multiline plot by overlaying two line plots using R's `par` and `plot` functions. Plot the year on the x-axis and display two y-axes: yield per colony and number of colonies, superimposed:
|
||||||
|
|
||||||
|
```r
par(mar = c(5, 4, 4, 4) + 0.3)   # leave room for the second y-axis
plot(honey$year, honey$numcol, pch = 16, col = 2, type = "l")
par(new = TRUE)                  # overlay the next plot on the same device
plot(honey$year, honey$yieldpercol, pch = 17, col = 3,
     axes = FALSE, xlab = "", ylab = "", type = "l")
axis(side = 4, at = pretty(range(honey$yieldpercol)))
mtext("colony yield", side = 4, line = 3)
```
|
||||||
|

|
||||||
|
|
||||||
|
While nothing significant stands out around 2003, this visualization ends the lesson on a slightly positive note: although the number of colonies is declining overall, it appears to be stabilizing, even if their yield per colony is decreasing.
|
||||||
|
|
||||||
|
Go, bees, go!
|
||||||
|
|
||||||
|
🐝❤️
|
||||||
|
## 🚀 Challenge
|
||||||
|
|
||||||
|
In this lesson, you learned more about scatterplots and line grids, including facet grids. Challenge yourself to create a facet grid using a different dataset, perhaps one you've used in previous lessons. Note how long it takes to create and consider how many grids are practical to draw using these techniques.
|
||||||
|
|
||||||
|
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/23)
|
||||||
|
|
||||||
|
## Review & Self Study
|
||||||
|
|
||||||
|
Line plots can range from simple to complex. Spend some time reading the [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/geom_path.html#:~:text=geom_line()%20connects%20them%20in,which%20cases%20are%20connected%20together) to learn about the various ways to build them. Try enhancing the line charts you created in this lesson using other methods described in the documentation.
|
||||||
|
|
||||||
|
## Assignment
|
||||||
|
|
||||||
|
[Dive into the beehive](assignment.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,182 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "b4039f1c76548d144a0aee0bf28304ec",
|
||||||
|
"translation_date": "2025-08-31T11:04:38+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/R/13-meaningful-vizualizations/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Creating Meaningful Visualizations
|
||||||
|
|
||||||
|
| ](../../../sketchnotes/13-MeaningfulViz.png)|
|
||||||
|
|:---:|
|
||||||
|
| Meaningful Visualizations - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
|
||||||
|
|
||||||
|
> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase)
|
||||||
|
|
||||||
|
One of the essential skills for a data scientist is the ability to create meaningful data visualizations that help answer questions. Before visualizing your data, you need to ensure it has been cleaned and prepared, as covered in previous lessons. Once that's done, you can start deciding how best to present the data.
|
||||||
|
|
||||||
|
In this lesson, you will explore:
|
||||||
|
|
||||||
|
1. How to select the appropriate chart type
|
||||||
|
2. How to avoid misleading visualizations
|
||||||
|
3. How to use color effectively
|
||||||
|
4. How to style charts for better readability
|
||||||
|
5. How to create animated or 3D visualizations
|
||||||
|
6. How to design creative visualizations
|
||||||
|
|
||||||
|
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/24)
|
||||||
|
|
||||||
|
## Selecting the appropriate chart type
|
||||||
|
|
||||||
|
In earlier lessons, you experimented with creating various data visualizations using Matplotlib and Seaborn. Generally, you can choose the [appropriate chart type](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) based on the question you're trying to answer using this table:
|
||||||
|
|
||||||
|
| Task | Recommended Chart Type |
| -------------------------- | ------------------------------- |
| Show data trends over time | Line |
| Compare categories | Bar, Pie |
| Compare totals | Pie, Stacked Bar |
| Show relationships | Scatter, Line, Facet, Dual Line |
| Show distributions | Scatter, Histogram, Box |
| Show proportions | Pie, Donut, Waffle |
|
||||||
|
|
||||||
|
> ✅ Depending on the structure of your data, you may need to convert it from text to numeric format to make certain charts work.
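
For example, a column that was read in as text can usually be converted before plotting. This is a minimal sketch with a hypothetical dataframe `df` and columns `price` and `category` (not from this lesson's data):

```r
# Convert a character column such as "12.5" to numeric before charting
df$price <- as.numeric(df$price)

# Turn a text category into a factor so bar charts group it correctly
df$category <- as.factor(df$category)
```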
|
||||||
|
|
||||||
|
## Avoiding misleading visualizations
|
||||||
|
|
||||||
|
Even when a data scientist carefully selects the right chart for the data, there are many ways data can be presented to support a specific narrative—sometimes at the expense of accuracy. There are countless examples of misleading charts and infographics!
|
||||||
|
|
||||||
|
[](https://www.youtube.com/watch?v=oX74Nge8Wkw "How charts lie")
|
||||||
|
|
||||||
|
> 🎥 Click the image above to watch a conference talk about misleading charts.
|
||||||
|
|
||||||
|
This chart flips the X-axis to present the opposite of the truth based on dates:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
[This chart](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg) is even more misleading. At first glance, it appears that COVID cases have declined over time in various counties. However, upon closer inspection, the dates have been rearranged to create a deceptive downward trend.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This infamous example uses both color and a flipped Y-axis to mislead viewers. Instead of showing that gun deaths increased after the passage of gun-friendly legislation, the chart tricks the eye into believing the opposite:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This peculiar chart demonstrates how proportions can be manipulated, often to humorous effect:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Another deceptive tactic is comparing things that aren't comparable. A [fascinating website](https://tylervigen.com/spurious-correlations) showcases 'spurious correlations,' such as the divorce rate in Maine being linked to margarine consumption. A Reddit group also collects [examples of poor data usage](https://www.reddit.com/r/dataisugly/top/?t=all).
|
||||||
|
|
||||||
|
It's crucial to understand how easily the eye can be tricked by misleading charts. Even with good intentions, a poorly chosen chart type—like a pie chart with too many categories—can lead to confusion.
|
||||||
|
|
||||||
|
## Using color effectively
|
||||||
|
|
||||||
|
The 'Florida gun violence' chart above illustrates how color can add another layer of meaning to visualizations. Libraries like ggplot2 and RColorBrewer come with pre-designed color palettes, but if you're creating a chart manually, it's worth studying [color theory](https://colormatters.com/color-and-design/basic-color-theory).
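
As a small illustration of leaning on a pre-designed palette rather than hand-picking colors, something like this works in ggplot2 (a sketch using ggplot2's built-in `mpg` dataset; "Dark2" is just one of RColorBrewer's palette names):

```r
library(ggplot2)

# Use an RColorBrewer palette instead of manually chosen hex colors
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2")
```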
|
||||||
|
|
||||||
|
> ✅ Keep accessibility in mind when designing charts. Some users may be colorblind—does your chart work well for those with visual impairments?
|
||||||
|
|
||||||
|
Be cautious when selecting colors for your chart, as they can convey unintended meanings. For example, the 'pink ladies' in the 'height' chart above add a distinctly 'feminine' connotation, which contributes to the chart's oddness.
|
||||||
|
|
||||||
|
While [color meanings](https://colormatters.com/color-symbolism/the-meanings-of-colors) can vary across cultures and change depending on the shade, general associations include:
|
||||||
|
|
||||||
|
| Color | Meaning |
| ------ | ------------------- |
| red | power |
| blue | trust, loyalty |
| yellow | happiness, caution |
| green | ecology, luck, envy |
| purple | happiness |
| orange | vibrance |
|
||||||
|
|
||||||
|
If you're tasked with creating a chart with custom colors, ensure that your choices align with the intended message and that the chart remains accessible.
|
||||||
|
|
||||||
|
## Styling charts for better readability
|
||||||
|
|
||||||
|
Charts lose their value if they're hard to read! Take time to adjust the width and height of your chart to fit the data appropriately. For example, if you're displaying all 50 states, consider showing them vertically on the Y-axis to avoid horizontal scrolling.
|
||||||
|
|
||||||
|
Label your axes, include a legend if needed, and provide tooltips for better data comprehension.
|
||||||
|
|
||||||
|
If your data includes verbose text on the X-axis, you can angle the text for improved readability. [plot3D](https://cran.r-project.org/web/packages/plot3D/index.html) offers 3D plotting capabilities if your data supports it. This can help create more sophisticated visualizations.
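
As a rough sketch of what plot3D looks like in practice (using randomly generated points, since the exact call depends on your own data):

```r
library(plot3D)

# Three numeric vectors to plot in 3D space (random data for illustration)
x <- runif(50)
y <- runif(50)
z <- x + y + rnorm(50, sd = 0.1)

# A basic 3D scatter plot, colored by the z value
scatter3D(x, y, z, colvar = z, pch = 19)
```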
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Animation and 3D visualizations
|
||||||
|
|
||||||
|
Some of the most engaging data visualizations today are animated. Shirley Wu has created stunning examples using D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/),' where each flower represents a movie. Another example is 'Bussed Out,' an interactive experience combining visualizations with Greensock and D3, paired with a scrollytelling article format to explore how NYC addresses homelessness by bussing people out of the city.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
> "Bussed Out: How America Moves its Homeless" from [the Guardian](https://www.theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study). Visualizations by Nadieh Bremer & Shirley Wu
|
||||||
|
|
||||||
|
While this lesson doesn't delve deeply into these advanced visualization libraries, you can experiment with D3 in a Vue.js app to create an animated social network visualization based on the book "Dangerous Liaisons."
|
||||||
|
|
||||||
|
> "Les Liaisons Dangereuses" is an epistolary novel, meaning it is presented as a series of letters. Written in 1782 by Choderlos de Laclos, it tells the story of two morally corrupt French aristocrats, the Vicomte de Valmont and the Marquise de Merteuil, who engage in manipulative social schemes. Both ultimately meet their downfall, but not before causing significant social damage. The novel unfolds through letters exchanged among various characters, revealing plots for revenge and mischief. Create a visualization of these letters to identify the key players in the narrative.
|
||||||
|
|
||||||
|
You will build a web app that displays an animated view of this social network. The app uses a library designed to create a [network visualization](https://github.com/emiliorizzo/vue-d3-network) with Vue.js and D3. Once the app is running, you can drag nodes around the screen to rearrange the data.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Project: Create a network visualization using D3.js
|
||||||
|
|
||||||
|
> The lesson folder includes a `solution` folder with the completed project for reference.
|
||||||
|
|
||||||
|
1. Follow the instructions in the README.md file located in the starter folder's root. Ensure you have NPM and Node.js installed on your machine before setting up the project's dependencies.
|
||||||
|
|
||||||
|
2. Open the `starter/src` folder. Inside, you'll find an `assets` folder containing a .json file with all the letters from the novel, annotated with 'to' and 'from' fields.
|
||||||
|
|
||||||
|
3. Complete the code in `components/Nodes.vue` to enable the visualization. Locate the method called `createLinks()` and add the following nested loop.
|
||||||
|
|
||||||
|
Loop through the .json object to extract the 'to' and 'from' data for the letters, building the `links` object for the visualization library:
|
||||||
|
|
||||||
|
```javascript
// loop through letters
let f = 0;
let t = 0;
for (var i = 0; i < letters.length; i++) {
  for (var j = 0; j < characters.length; j++) {
    if (characters[j] == letters[i].from) {
      f = j;
    }
    if (characters[j] == letters[i].to) {
      t = j;
    }
  }
  this.links.push({ sid: f, tid: t });
}
```
|
||||||
|
|
||||||
|
Run your app from the terminal (npm run serve) and enjoy the visualization!
|
||||||
|
|
||||||
|
## 🚀 Challenge
|
||||||
|
|
||||||
|
Explore the internet to find examples of misleading visualizations. How does the author mislead the audience, and is it intentional? Try correcting the visualizations to show how they should appear.
|
||||||
|
|
||||||
|
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/25)
|
||||||
|
|
||||||
|
## Review & Self Study
|
||||||
|
|
||||||
|
Here are some articles about misleading data visualizations:
|
||||||
|
|
||||||
|
https://gizmodo.com/how-to-lie-with-data-visualization-1563576606
|
||||||
|
|
||||||
|
http://ixd.prattsi.org/2017/12/visual-lies-usability-in-deceptive-data-visualizations/
|
||||||
|
|
||||||
|
Check out these interesting visualizations of historical assets and artifacts:
|
||||||
|
|
||||||
|
https://handbook.pubpub.org/
|
||||||
|
|
||||||
|
Read this article on how animation can enhance visualizations:
|
||||||
|
|
||||||
|
https://medium.com/@EvanSinar/use-animation-to-supercharge-data-visualization-cd905a882ad4
|
||||||
|
|
||||||
|
## Assignment
|
||||||
|
|
||||||
|
[Create your own custom visualization](assignment.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,42 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "1441550a0d789796b2821e04f7f4cc94",
|
||||||
|
"translation_date": "2025-08-31T11:02:05+00:00",
|
||||||
|
"source_file": "3-Data-Visualization/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Visualizations
|
||||||
|
|
||||||
|

|
||||||
|
> Photo by <a href="https://unsplash.com/@jenna2980?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Jenna Lee</a> on <a href="https://unsplash.com/s/photos/bees-in-a-meadow?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
|
||||||
|
|
||||||
|
Visualizing data is one of the most essential tasks for a data scientist. A picture is worth a thousand words, and a visualization can help you uncover various interesting aspects of your data, such as spikes, anomalies, clusters, trends, and more, enabling you to understand the story your data is telling.
|
||||||
|
|
||||||
|
In these five lessons, you will work with data sourced from nature and create engaging and visually appealing visualizations using different techniques.
|
||||||
|
|
||||||
|
| Topic Number | Topic | Linked Lesson | Author |
|
||||||
|
| :-----------: | :--: | :-----------: | :----: |
|
||||||
|
| 1. | Visualizing quantities | <ul> <li> [Python](09-visualization-quantities/README.md)</li> <li>[R](../../../3-Data-Visualization/R/09-visualization-quantities) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
|
||||||
|
| 2. | Visualizing distribution | <ul> <li> [Python](10-visualization-distributions/README.md)</li> <li>[R](../../../3-Data-Visualization/R/10-visualization-distributions) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
|
||||||
|
| 3. | Visualizing proportions | <ul> <li> [Python](11-visualization-proportions/README.md)</li> <li>[R](../../../3-Data-Visualization) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
|
||||||
|
| 4. | Visualizing relationships | <ul> <li> [Python](12-visualization-relationships/README.md)</li> <li>[R](../../../3-Data-Visualization) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
|
||||||
|
| 5. | Making Meaningful Visualizations | <ul> <li> [Python](13-meaningful-visualizations/README.md)</li> <li>[R](../../../3-Data-Visualization) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
|
||||||
|
|
||||||
|
### Credits
|
||||||
|
|
||||||
|
These visualization lessons were created with 🌸 by [Jen Looper](https://twitter.com/jenlooper), [Jasleen Sondhi](https://github.com/jasleen101010), and [Vidushi Gupta](https://github.com/Vidushi-Gupta).
|
||||||
|
|
||||||
|
🍯 Data on US Honey Production is sourced from Jessica Li's project on [Kaggle](https://www.kaggle.com/jessicali9530/honey-production). The [data](https://usda.library.cornell.edu/concern/publications/rn301137d) originates from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
|
||||||
|
|
||||||
|
🍄 Data on mushrooms is also sourced from [Kaggle](https://www.kaggle.com/hatterasdunton/mushroom-classification-updated-dataset), revised by Hatteras Dunton. This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. The mushrooms are drawn from *The Audubon Society Field Guide to North American Mushrooms* (1981). This dataset was donated to UCI ML 27 in 1987.
|
||||||
|
|
||||||
|
🦆 Data on Minnesota Birds is sourced from [Kaggle](https://www.kaggle.com/hannahcollins/minnesota-birds), scraped from [Wikipedia](https://en.wikipedia.org/wiki/List_of_birds_of_Minnesota) by Hannah Collins.
|
||||||
|
|
||||||
|
All these datasets are licensed under [CC0: Creative Commons](https://creativecommons.org/publicdomain/zero/1.0/).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,37 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "564445c39ad29a491abcb9356fc4d47d",
|
||||||
|
"translation_date": "2025-08-31T11:01:12+00:00",
|
||||||
|
"source_file": "4-Data-Science-Lifecycle/14-Introduction/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Assessing a Dataset
|
||||||
|
|
||||||
|
A client has reached out to your team for assistance in analyzing the seasonal spending habits of taxi customers in New York City.
|
||||||
|
|
||||||
|
They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
|
||||||
|
|
||||||
|
Your team is currently in the [Capturing](Readme.md#Capturing) phase of the Data Science Lifecycle, and you are responsible for managing the dataset. You have been provided with a notebook and [data](../../../../data/taxi.csv) to examine.
|
||||||
|
|
||||||
|
In this directory, there is a [notebook](../../../../4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb) that uses Python to load yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets).
|
||||||
|
You can also open the taxi data file using a text editor or spreadsheet software like Excel.
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
- Evaluate whether the data in this dataset is sufficient to answer the question.
|
||||||
|
- Explore the [NYC Open Data catalog](https://data.cityofnewyork.us/browse?sortBy=most_accessed&utf8=%E2%9C%93). Identify an additional dataset that might be useful in addressing the client's question.
|
||||||
|
- Formulate 3 questions to ask the client for further clarification and a deeper understanding of the problem.
|
||||||
|
|
||||||
|
Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more details about the data.
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
Exemplary | Adequate | Needs Improvement
|
||||||
|
--- | --- | --- |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,36 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "fcc7547171f4530f159676dd73ed772e",
|
||||||
|
"translation_date": "2025-08-31T11:00:18+00:00",
|
||||||
|
"source_file": "4-Data-Science-Lifecycle/15-analyzing/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Exploring for answers
|
||||||
|
|
||||||
|
This is a continuation of the previous lesson's [assignment](../14-Introduction/assignment.md), where we briefly examined the dataset. Now, we will dive deeper into the data.
|
||||||
|
|
||||||
|
Once again, the question the client wants answered is: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
|
||||||
|
|
||||||
|
Your team is currently in the [Analyzing](README.md) stage of the Data Science Lifecycle, where you are tasked with performing exploratory data analysis (EDA) on the dataset. You have been provided with a notebook and a dataset containing 200 taxi transactions from January and July 2019.
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
In this directory, you will find a [notebook](../../../../4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb) and data from the [Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets). For more details about the data, refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf).
|
||||||
|
|
||||||
|
Use some of the techniques covered in this lesson to conduct your own EDA in the notebook (feel free to add cells if needed) and answer the following questions:
|
||||||
|
|
||||||
|
- What other factors in the data might influence the tip amount?
|
||||||
|
- Which columns are likely unnecessary for answering the client's question?
|
||||||
|
- Based on the data provided so far, does it appear to show any evidence of seasonal tipping patterns?
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
Exemplary | Adequate | Needs Improvement
|
||||||
|
--- | --- | --- |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,26 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "8980d7efd101c82d6d6ffc3458214120",
|
||||||
|
"translation_date": "2025-08-31T11:02:00+00:00",
|
||||||
|
"source_file": "4-Data-Science-Lifecycle/16-communication/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Tell a story
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
Data Science is all about storytelling. Choose any dataset and write a brief paper about the story it can tell. What insights do you hope to uncover from your dataset? How will you handle it if the findings are challenging or unexpected? What if the data doesn't easily reveal its patterns? Consider the different scenarios your dataset might present and document them.
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
Exemplary | Adequate | Needs Improvement
|
||||||
|
--- | --- | -- |
|
||||||
|
|
||||||
|
A one-page essay is submitted in .doc format, with the dataset clearly explained, properly documented, credited, and a well-structured story is developed using detailed examples from the data.| A shorter essay is submitted with less detail | The essay is missing one or more of the required elements.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,30 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "dd173fd30fc039a7a299898920680723",
|
||||||
|
"translation_date": "2025-08-31T10:59:58+00:00",
|
||||||
|
"source_file": "4-Data-Science-Lifecycle/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# The Data Science Lifecycle
|
||||||
|
|
||||||
|

|
||||||
|
> Photo by <a href="https://unsplash.com/@headwayio?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Headway</a> on <a href="https://unsplash.com/s/photos/communication?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
|
||||||
|
|
||||||
|
In these lessons, you'll dive into various aspects of the Data Science lifecycle, including data analysis and effective communication.
|
||||||
|
|
||||||
|
### Topics
|
||||||
|
|
||||||
|
1. [Introduction](14-Introduction/README.md)
|
||||||
|
2. [Analyzing](15-analyzing/README.md)
|
||||||
|
3. [Communication](16-communication/README.md)
|
||||||
|
|
||||||
|
### Credits
|
||||||
|
|
||||||
|
These lessons were created with ❤️ by [Jalen McGee](https://twitter.com/JalenMCG) and [Jasmine Greenaway](https://twitter.com/paladique)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "96f3696153d9ed54b19a1bb65438c104",
|
||||||
|
"translation_date": "2025-08-31T10:56:22+00:00",
|
||||||
|
"source_file": "5-Data-Science-In-Cloud/17-Introduction/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Market Research
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
In this lesson, you learned that there are several major cloud providers. Conduct market research to explore what each one offers to Data Scientists. Are their offerings similar? Write a paper describing the services provided by three or more of these cloud providers.
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
Exemplary | Adequate | Needs Improvement
|
||||||
|
--- | --- | --- |
|
||||||
|
A one-page paper thoroughly describes the data science offerings of three cloud providers and highlights the differences between them. | A shorter paper is provided. | A paper is submitted but lacks a complete analysis.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "8fdc4a5fd9bc27a8d2ebef995dfbf73f",
|
||||||
|
"translation_date": "2025-08-31T10:55:35+00:00",
|
||||||
|
"source_file": "5-Data-Science-In-Cloud/18-Low-Code/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Low code/No code Data Science project on Azure ML
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
We explored how to use the Azure ML platform to train, deploy, and consume a model in a Low code/No code manner. Now, find some data that you can use to train another model, deploy it, and consume it. You can search for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
| Exemplary | Adequate | Needs Improvement |
|
||||||
|
|-----------|----------|-------------------|
|
||||||
|
|When uploading the data, you ensured to change the feature's type if necessary. You also cleaned the data if needed. You trained a model on a dataset using AutoML, and you reviewed the model explanations. You deployed the best model and successfully consumed it. | When uploading the data, you ensured to change the feature's type if necessary. You trained a model on a dataset using AutoML, deployed the best model, and successfully consumed it. | You deployed the best model trained by AutoML and successfully consumed it. |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,25 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "386efdbc19786951341f6956247ee990",
|
||||||
|
"translation_date": "2025-08-31T10:57:04+00:00",
|
||||||
|
"source_file": "5-Data-Science-In-Cloud/19-Azure/assignment.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Data Science project using Azure ML SDK
|
||||||
|
|
||||||
|
## Instructions
|
||||||
|
|
||||||
|
We explored how to use the Azure ML platform to train, deploy, and consume a model with the Azure ML SDK. Now, find some data that you can use to train another model, deploy it, and consume it. You can search for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
|
||||||
|
|
||||||
|
## Rubric
|
||||||
|
|
||||||
|
| Exemplary | Adequate | Needs Improvement |
|
||||||
|
|-----------|----------|-------------------|
|
||||||
|
|When configuring AutoML, you referred to the SDK documentation to explore the parameters you could use. You trained a dataset using AutoML with the Azure ML SDK, reviewed the model explanations, deployed the best model, and successfully consumed it using the Azure ML SDK. | You trained a dataset using AutoML with the Azure ML SDK, reviewed the model explanations, deployed the best model, and successfully consumed it using the Azure ML SDK. | You trained a dataset using AutoML with the Azure ML SDK, deployed the best model, and successfully consumed it using the Azure ML SDK. |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,35 @@
|
|||||||
|
<!--
|
||||||
|
CO_OP_TRANSLATOR_METADATA:
|
||||||
|
{
|
||||||
|
"original_hash": "8dfe141a0f46f7d253e07f74913c7f44",
|
||||||
|
"translation_date": "2025-08-31T10:54:38+00:00",
|
||||||
|
"source_file": "5-Data-Science-In-Cloud/README.md",
|
||||||
|
"language_code": "en"
|
||||||
|
}
|
||||||
|
-->
|
||||||
|
# Data Science in the Cloud
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
> Photo by [Jelleke Vanooteghem](https://unsplash.com/@ilumire) from [Unsplash](https://unsplash.com/s/photos/cloud?orientation=landscape)
|
||||||
|
|
||||||
|
When working with big data in data science, the cloud can be a game changer. In the next three lessons, we will explore what the cloud is and why it can be incredibly useful. We will also analyze a heart failure dataset and build a model to estimate the likelihood of someone experiencing heart failure. Using the cloud's capabilities, we will train, deploy, and utilize the model in two different ways: one using only the user interface in a Low code/No code approach, and the other using the Azure Machine Learning Software Developer Kit (Azure ML SDK).
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
### Topics
|
||||||
|
|
||||||
|
1. [Why use Cloud for Data Science?](17-Introduction/README.md)
|
||||||
|
2. [Data Science in the Cloud: The "Low code/No code" way](18-Low-Code/README.md)
|
||||||
|
3. [Data Science in the Cloud: The "Azure ML SDK" way](19-Azure/README.md)
|
||||||
|
|
||||||
|
### Credits
|
||||||
|
These lessons were created with ☁️ and 💕 by [Maud Levy](https://twitter.com/maudstweets) and [Tiffany Souterre](https://twitter.com/TiffanySouterre).
|
||||||
|
|
||||||
|
The data for the Heart Failure Prediction project comes from [Larxel](https://www.kaggle.com/andrewmvd) on [Kaggle](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). It is licensed under the [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Disclaimer**:
|
||||||
|
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
|
@ -0,0 +1,155 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "67076ed50f54e7d26ba1ba378d6078f1",
"translation_date": "2025-08-31T11:11:55+00:00",
"source_file": "6-Data-Science-In-Wild/20-Real-World-Examples/README.md",
"language_code": "en"
}
-->

# Data Science in the Real World

| [](../../sketchnotes/20-DataScience-RealWorld.png) |
| :--------------------------------------------------------------------------------------------------------------: |
| Data Science In The Real World - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |

We're nearing the end of this learning journey!

We began by defining data science and ethics, explored tools and techniques for data analysis and visualization, reviewed the data science lifecycle, and examined how to scale and automate workflows using cloud computing services. Now, you might be wondering: _"How do I apply all these learnings to real-world scenarios?"_

In this lesson, we'll delve into real-world applications of data science across industries and explore specific examples in research, digital humanities, and sustainability. We'll also discuss student project opportunities and wrap up with resources to help you continue your learning journey.

## Pre-Lecture Quiz

[Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/38)

## Data Science + Industry

The democratization of AI has made it easier for developers to design and integrate AI-driven decision-making and data-driven insights into user experiences and development workflows. Here are some examples of how data science is applied in real-world industry scenarios:

* [Google Flu Trends](https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/) used data science to correlate search terms with flu trends. Although the approach had flaws, it highlighted the potential (and challenges) of data-driven healthcare predictions.

* [UPS Routing Predictions](https://www.technologyreview.com/2018/11/21/139000/how-ups-uses-ai-to-outsmart-bad-weather/) - explains how UPS uses data science and machine learning to predict optimal delivery routes, factoring in weather, traffic, deadlines, and more.

* [NYC Taxicab Route Visualization](http://chriswhong.github.io/nyctaxi/) - data obtained through [Freedom Of Information Laws](https://chriswhong.com/open-data/foil_nyc_taxi/) was used to visualize a day in the life of NYC cabs, providing insights into navigation, earnings, and trip durations over a 24-hour period.

* [Uber Data Science Workbench](https://eng.uber.com/dsw/) - leverages data from millions of daily Uber trips (pickup/dropoff locations, trip durations, preferred routes, etc.) to build analytics tools for pricing, safety, fraud detection, and navigation decisions.

* [Sports Analytics](https://towardsdatascience.com/scope-of-analytics-in-sports-world-37ed09c39860) - focuses on _predictive analytics_ (team and player analysis, e.g., [Moneyball](https://datasciencedegree.wisconsin.edu/blog/moneyball-proves-importance-big-data-big-ideas/)) and _data visualization_ (team dashboards, fan engagement, etc.) with applications like talent scouting, sports betting, and venue management.

* [Data Science in Banking](https://data-flair.training/blogs/data-science-in-banking/) - showcases the role of data science in finance, including risk modeling, fraud detection, customer segmentation, real-time predictions, and recommender systems. Predictive analytics also support critical measures like [credit scores](https://dzone.com/articles/using-big-data-and-predictive-analytics-for-credit).

* [Data Science in Healthcare](https://data-flair.training/blogs/data-science-in-healthcare/) - highlights applications such as medical imaging (MRI, X-Ray, CT-Scan), genomics (DNA sequencing), drug development (risk assessment, success prediction), predictive analytics (patient care and logistics), and disease tracking/prevention.

 Image Credit: [Data Flair: 6 Amazing Data Science Applications](https://data-flair.training/blogs/data-science-applications/)

The figure illustrates other domains and examples of data science applications. Interested in exploring more? Check out the [Review & Self Study](../../../../6-Data-Science-In-Wild/20-Real-World-Examples) section below.

## Data Science + Research

| [](../../sketchnotes/20-DataScience-Research.png) |
| :---------------------------------------------------------------------------------------------------------------: |
| Data Science & Research - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |

While industry applications often focus on large-scale use cases, research projects can provide valuable insights in two key areas:

* _Innovation opportunities_ - rapid prototyping of advanced concepts and testing user experiences for next-generation applications.
* _Deployment challenges_ - identifying potential harms or unintended consequences of data science technologies in real-world contexts.

For students, research projects offer learning and collaboration opportunities that deepen understanding and foster connections with experts in areas of interest. What do research projects look like, and how can they make an impact?

Consider the [MIT Gender Shades Study](http://gendershades.org/overview.html) by Joy Buolamwini (MIT Media Labs), co-authored with Timnit Gebru (then at Microsoft Research). This study focused on:

* **What:** Evaluating bias in automated facial analysis algorithms and datasets based on gender and skin type.
* **Why:** Facial analysis is used in critical areas like law enforcement, airport security, and hiring systems, where inaccuracies (e.g., due to bias) can lead to economic and social harm. Addressing bias is essential for fairness.
* **How:** Researchers noted that existing benchmarks predominantly featured lighter-skinned subjects. They curated a new dataset (1000+ images) balanced by gender and skin type, which was used to evaluate the accuracy of three gender classification products (Microsoft, IBM, Face++).

Results revealed that while overall accuracy was good, error rates varied significantly across subgroups, with **misgendering** being higher for females and individuals with darker skin tones, indicating bias.

**Key Outcomes:** The study emphasized the need for more _representative datasets_ (balanced subgroups) and _inclusive teams_ (diverse backgrounds) to identify and address biases early in AI solutions. Such research has influenced organizations to adopt principles and practices for _responsible AI_ to enhance fairness in their AI products and processes.

**Interested in Microsoft research efforts?**

* Explore [Microsoft Research Projects](https://www.microsoft.com/research/research-area/artificial-intelligence/?facet%5Btax%5D%5Bmsr-research-area%5D%5B%5D=13556&facet%5Btax%5D%5Bmsr-content-type%5D%5B%5D=msr-project) on Artificial Intelligence.
* Check out student projects from [Microsoft Research Data Science Summer School](https://www.microsoft.com/en-us/research/academic-program/data-science-summer-school/).
* Learn about the [Fairlearn](https://fairlearn.org/) project and [Responsible AI](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1%3aprimaryr6) initiatives.

## Data Science + Humanities

| [](../../sketchnotes/20-DataScience-Humanities.png) |
| :---------------------------------------------------------------------------------------------------------------: |
| Data Science & Digital Humanities - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |

Digital Humanities [is defined](https://digitalhumanities.stanford.edu/about-dh-stanford) as "a collection of practices and approaches combining computational methods with humanistic inquiry." [Stanford projects](https://digitalhumanities.stanford.edu/projects) like _"rebooting history"_ and _"poetic thinking"_ demonstrate the connection between [Digital Humanities and Data Science](https://digitalhumanities.stanford.edu/digital-humanities-and-data-science), using techniques like network analysis, information visualization, spatial analysis, and text analysis to revisit historical and literary datasets for new insights.

*Want to explore a project in this field?*

Check out ["Emily Dickinson and the Meter of Mood"](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671) by [Jen Looper](https://twitter.com/jenlooper). This project examines how data science can reinterpret familiar poetry and reevaluate its meaning and the author's contributions. For example, _can we predict the season in which a poem was written by analyzing its tone or sentiment?_ What does this reveal about the author's mindset during that time?

To explore this, follow the data science lifecycle (a minimal sketch of the acquisition and analysis steps follows the list):
* [`Data Acquisition`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#acquiring-the-dataset) - collect relevant datasets using APIs (e.g., [Poetry DB API](https://poetrydb.org/index.html)) or web scraping tools (e.g., [Project Gutenberg](https://www.gutenberg.org/files/12242/12242-h/12242-h.htm)).
* [`Data Cleaning`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#clean-the-data) - format and sanitize text using tools like Visual Studio Code and Microsoft Excel.
* [`Data Analysis`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#working-with-the-data-in-a-notebook) - import the dataset into "Notebooks" for analysis using Python packages (e.g., pandas, numpy, matplotlib) to organize and visualize the data.
* [`Sentiment Analysis`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#sentiment-analysis-using-cognitive-services) - integrate cloud services like Text Analytics and use low-code tools like [Power Automate](https://flow.microsoft.com/en-us/) for automated workflows.
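
As a minimal sketch of the acquisition and analysis steps, the snippet below pulls Emily Dickinson's poems from the PoetryDB API and scores them with two tiny, hand-picked word lists. The word lists and the "mood" score are illustrative stand-ins for a real sentiment model or the Text Analytics service, not part of the original project.

```python
# Sketch of data acquisition (PoetryDB API) plus a crude mood score, under the assumptions above.
import requests
import pandas as pd

poems = requests.get("https://poetrydb.org/author/Emily%20Dickinson").json()
# Each entry is a dict with "title", "author", "lines", and "linecount".

positive = {"sun", "summer", "bloom", "light", "bee"}      # illustrative word lists only
negative = {"frost", "winter", "grave", "dark", "chill"}

rows = []
for poem in poems:
    words = [w.strip(".,;:!?\"'") for w in " ".join(poem["lines"]).lower().split()]
    rows.append({
        "title": poem["title"],
        "positive_hits": sum(w in positive for w in words),
        "negative_hits": sum(w in negative for w in words),
    })

df = pd.DataFrame(rows)
df["mood"] = df["positive_hits"] - df["negative_hits"]
print(df.sort_values("mood").head())                       # the "darkest" poems by this crude measure
```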

This workflow allows you to explore seasonal impacts on poem sentiment and develop your own interpretations of the author. Try it out, then extend the notebook to ask new questions or visualize the data differently!

> Use tools from the [Digital Humanities toolkit](https://github.com/Digital-Humanities-Toolkit) to pursue similar inquiries.

## Data Science + Sustainability

| [](../../sketchnotes/20-DataScience-Sustainability.png) |
| :---------------------------------------------------------------------------------------------------------------: |
| Data Science & Sustainability - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |

The [2030 Agenda For Sustainable Development](https://sdgs.un.org/2030agenda), adopted by all United Nations members in 2015, outlines 17 goals, including those aimed at **Protecting the Planet** from degradation and climate change. The [Microsoft Sustainability](https://www.microsoft.com/en-us/sustainability) initiative supports these goals by leveraging technology to build a more sustainable future, focusing on 4 key objectives: being carbon negative, water positive, zero waste, and bio-diverse by 2030.

Addressing these challenges requires large-scale data and cloud-based solutions. The [Planetary Computer](https://planetarycomputer.microsoft.com/) initiative provides four components to assist data scientists and developers (a minimal search example follows the list):

* [Data Catalog](https://planetarycomputer.microsoft.com/catalog) - offers petabytes of Earth Systems data (free and Azure-hosted).
* [Planetary API](https://planetarycomputer.microsoft.com/docs/reference/stac/) - enables users to search for relevant data across space and time.
* [Hub](https://planetarycomputer.microsoft.com/docs/overview/environment/) - provides a managed environment for processing massive geospatial datasets.
* [Applications](https://planetarycomputer.microsoft.com/applications) - showcases use cases and tools for sustainability insights.
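
The Explorer and Hub are point-and-click, but the Planetary API can also be queried directly. Below is a minimal sketch using the `pystac-client` and `planetary-computer` packages (both are assumptions about your environment, and the collection, bounding box, and date range are illustrative choices only).

```python
# Minimal STAC search against the Planetary Computer API, under the assumptions above.
import planetary_computer
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,   # signs asset URLs so they can be downloaded
)

search = catalog.search(
    collections=["sentinel-2-l2a"],             # Sentinel-2 Level-2A imagery
    bbox=[2.22, 48.80, 2.47, 48.91],            # roughly the Paris area
    datetime="2021-06-01/2021-06-30",
    query={"eo:cloud_cover": {"lt": 10}},       # keep mostly cloud-free scenes
)

items = list(search.items())
print(f"Found {len(items)} items")
for item in items[:3]:
    print(item.id, item.properties["eo:cloud_cover"])
```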

**The Planetary Computer Project is currently in preview (as of Sep 2021)** - here's how you can start contributing to sustainability solutions using data science.

* [Request access](https://planetarycomputer.microsoft.com/account/request) to begin exploring and connect with others.
* [Explore documentation](https://planetarycomputer.microsoft.com/docs/overview/about) to learn about supported datasets and APIs.
* Check out applications like [Ecosystem Monitoring](https://analytics-lab.org/ecosystemmonitoring/) for inspiration on project ideas.

Consider how you can use data visualization to highlight or amplify insights into issues like climate change and deforestation. Or think about how these insights can be leveraged to design new user experiences that encourage behavioral changes for more sustainable living.

## Data Science + Students

We've discussed real-world applications in industry and research, and looked at examples of data science applications in digital humanities and sustainability. So how can you develop your skills and share your knowledge as data science beginners?

Here are some examples of student data science projects to inspire you:

* [MSR Data Science Summer School](https://www.microsoft.com/en-us/research/academic-program/data-science-summer-school/#!projects) with GitHub [projects](https://github.com/msr-ds3) exploring topics such as:
  - [Racial Bias in Police Use of Force](https://www.microsoft.com/en-us/research/video/data-science-summer-school-2019-replicating-an-empirical-analysis-of-racial-differences-in-police-use-of-force/) | [Github](https://github.com/msr-ds3/stop-question-frisk)
  - [Reliability of NYC Subway System](https://www.microsoft.com/en-us/research/video/data-science-summer-school-2018-exploring-the-reliability-of-the-nyc-subway-system/) | [Github](https://github.com/msr-ds3/nyctransit)
* [Digitizing Material Culture: Exploring socio-economic distributions in Sirkap](https://claremont.maps.arcgis.com/apps/Cascade/index.html?appid=bdf2aef0f45a4674ba41cd373fa23afc) - by [Ornella Altunyan](https://twitter.com/ornelladotcom) and her team at Claremont, using [ArcGIS StoryMaps](https://storymaps.arcgis.com/).

## 🚀 Challenge

Look for articles that suggest beginner-friendly data science projects - like [these 50 topic areas](https://www.upgrad.com/blog/data-science-project-ideas-topics-beginners/), [these 21 project ideas](https://www.intellspot.com/data-science-project-ideas), or [these 16 projects with source code](https://data-flair.training/blogs/data-science-project-ideas/) that you can analyze and remix. And don't forget to blog about your learning experiences and share your insights with the community.

## Post-Lecture Quiz

[Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/39)

## Review & Self Study

Want to dive deeper into use cases? Here are some relevant articles:

* [17 Data Science Applications and Examples](https://builtin.com/data-science/data-science-applications-examples) - Jul 2021
* [11 Breathtaking Data Science Applications in Real World](https://myblindbird.com/data-science-applications-real-world/) - May 2021
* [Data Science In The Real World](https://towardsdatascience.com/data-science-in-the-real-world/home) - Article Collection
* Data Science In: [Education](https://data-flair.training/blogs/data-science-in-education/), [Agriculture](https://data-flair.training/blogs/data-science-in-agriculture/), [Finance](https://data-flair.training/blogs/data-science-in-finance/), [Movies](https://data-flair.training/blogs/data-science-at-movies/) & more.

## Assignment

[Explore A Planetary Computer Dataset](assignment.md)

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,50 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "d1e05715f9d97de6c4f1fb0c5a4702c0",
"translation_date": "2025-08-31T11:12:29+00:00",
"source_file": "6-Data-Science-In-Wild/20-Real-World-Examples/assignment.md",
"language_code": "en"
}
-->

# Explore a Planetary Computer Dataset

## Instructions

In this lesson, we discussed various domains of data science applications, diving deeper into examples related to research, sustainability, and digital humanities. In this assignment, you'll explore one of these examples in greater detail and apply your knowledge of data visualizations and analysis to extract insights about sustainability data.

The [Planetary Computer](https://planetarycomputer.microsoft.com/) project offers datasets and APIs that can be accessed with an account—request one if you'd like to try the bonus step of the assignment. The site also includes an [Explorer](https://planetarycomputer.microsoft.com/explore) feature that you can use without needing to create an account.

`Steps:`

The Explorer interface (shown in the screenshot below) allows you to select a dataset (from the available options), a preset query (to filter the data), and a rendering option (to generate a relevant visualization). For this assignment, your task is to:

1. Read the [Explorer documentation](https://planetarycomputer.microsoft.com/docs/overview/explorer/) to understand the available options.
2. Explore the dataset [Catalog](https://planetarycomputer.microsoft.com/catalog) to learn the purpose of each dataset.
3. Use the Explorer to choose a dataset of interest, select a relevant query, and pick a rendering option.



`Your Task:`

Once you've studied the visualization rendered in the browser, answer the following questions:

* What _features_ does the dataset include?
* What _insights_ or results does the visualization reveal?
* What are the _implications_ of those insights for the sustainability goals of the project?
* What are the _limitations_ of the visualization (i.e., what insights were not provided)?
* If you had access to the raw data, what _alternative visualizations_ would you create, and why?

`Bonus Points:`

Apply for an account and log in once accepted.

* Use the _Launch Hub_ option to open the raw data in a Notebook.
* Explore the data interactively and implement the alternative visualizations you envisioned.
* Analyze your custom visualizations—were you able to uncover the insights you missed earlier?

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
All five core questions were answered. The student clearly identified how current and alternative visualizations could provide insights into sustainability objectives or outcomes. | The student answered at least the top three questions in detail, demonstrating practical experience with the Explorer. | The student failed to answer multiple questions or provided insufficient detail, indicating that no meaningful attempt was made for the project.

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "07faf02ff163e609edf0b0308dc5d4e6",
"translation_date": "2025-08-31T11:11:50+00:00",
"source_file": "6-Data-Science-In-Wild/README.md",
"language_code": "en"
}
-->

# Data Science in the Wild

Practical applications of data science across various industries.

### Topics

1. [Data Science in the Real World](20-Real-World-Examples/README.md)

### Credits

Written with ❤️ by [Nitya Narasimhan](https://twitter.com/nitya)

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,23 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "c06b12caf3c901eb3156e3dd5b0aea56",
"translation_date": "2025-08-31T10:54:34+00:00",
"source_file": "CODE_OF_CONDUCT.md",
"language_code": "en"
}
-->

# Microsoft Open Source Code of Conduct

This project follows the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- For questions or concerns, contact [opencode@microsoft.com](mailto:opencode@microsoft.com)

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,21 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "61aff2b3273d4ab66709493b43f91ca1",
"translation_date": "2025-08-31T10:54:09+00:00",
"source_file": "CONTRIBUTING.md",
"language_code": "en"
}
-->

# Contributing

This project encourages contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA), which confirms that you have the rights to, and are granting us the rights to, use your contribution. For more details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically check whether you need to provide a CLA and will update the PR accordingly (e.g., with a label or comment). Just follow the instructions provided by the bot. You only need to complete this process once for all repositories that use our CLA.

This project follows the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information, refer to the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or reach out to [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or feedback.

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,51 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "0d575483100c332b2dbaefef915bb3c4",
"translation_date": "2025-08-31T10:54:14+00:00",
"source_file": "SECURITY.md",
"language_code": "en"
}
-->

## Security

Microsoft prioritizes the security of its software products and services, including all source code repositories managed through our GitHub organizations, such as [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).

If you believe you've identified a security vulnerability in any Microsoft-owned repository that aligns with [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us using the instructions below.

## Reporting Security Issues

**Do not report security vulnerabilities through public GitHub issues.**

Instead, report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).

If you'd prefer to submit without logging in, you can email [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message using our PGP key, which can be downloaded from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).

You should receive a response within 24 hours. If you don't, please follow up via email to ensure we received your original message. Additional details can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the following information (as much as you can provide) to help us better understand the nature and scope of the issue:

* Type of issue (e.g., buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration needed to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if available)
* Impact of the issue, including how an attacker might exploit it

Providing this information will help us process your report more efficiently.

If you're submitting a report for a bug bounty, more detailed reports may result in a higher bounty award. Visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more information about our active programs.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft adheres to the principles of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,24 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "872be8bc1b93ef1dd9ac3d6e8f99f6ab",
"translation_date": "2025-08-31T10:54:04+00:00",
"source_file": "SUPPORT.md",
"language_code": "en"
}
-->

# Support

## How to report issues and seek assistance

This project uses GitHub Issues to track bug reports and feature requests. Before submitting a new issue, please check the existing ones to avoid duplicates. If you need to report a new issue, submit your bug or feature request as a new Issue.

For assistance or questions regarding the use of this project, please create an issue.

## Microsoft Support Policy

Support for this repository is limited to the resources listed above.

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,40 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "3767555b3cc28a2865c79202f4374204",
"translation_date": "2025-08-31T10:59:49+00:00",
"source_file": "docs/_sidebar.md",
"language_code": "en"
}
-->

- Introduction
  - [Defining Data Science](../1-Introduction/01-defining-data-science/README.md)
  - [Ethics of Data Science](../1-Introduction/02-ethics/README.md)
  - [Defining Data](../1-Introduction/03-defining-data/README.md)
  - [Probability and Stats](../1-Introduction/04-stats-and-probability/README.md)
- Working With Data
  - [Relational Databases](../2-Working-With-Data/05-relational-databases/README.md)
  - [Nonrelational Databases](../2-Working-With-Data/06-non-relational/README.md)
  - [Python](../2-Working-With-Data/07-python/README.md)
  - [Data Preparation](../2-Working-With-Data/08-data-preparation/README.md)
- Data Visualization
  - [Visualizing Quantities](../3-Data-Visualization/09-visualization-quantities/README.md)
  - [Visualizing Distributions](../3-Data-Visualization/10-visualization-distributions/README.md)
  - [Visualizing Proportions](../3-Data-Visualization/11-visualization-proportions/README.md)
  - [Visualizing Relationships](../3-Data-Visualization/12-visualization-relationships/README.md)
  - [Meaningful Visualizations](../3-Data-Visualization/13-meaningful-visualizations/README.md)
- Data Science Lifecycle
  - [Introduction](../4-Data-Science-Lifecycle/14-Introduction/README.md)
  - [Analyzing](../4-Data-Science-Lifecycle/15-analyzing/README.md)
  - [Communication](../4-Data-Science-Lifecycle/16-communication/README.md)
- Data Science in the Cloud
  - [Introduction](../5-Data-Science-In-Cloud/17-Introduction/README.md)
  - [Low Code](../5-Data-Science-In-Cloud/18-Low-Code/README.md)
  - [Azure](../5-Data-Science-In-Cloud/19-Azure/README.md)
- Data Science in the Wild
  - [DS In The Wild](../6-Data-Science-In-Wild/README.md)

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,78 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "87f157ea00d36c1d12c14390d9852b50",
"translation_date": "2025-08-31T10:54:23+00:00",
"source_file": "for-teachers.md",
"language_code": "en"
}
-->

## For Educators

Would you like to use this curriculum in your classroom? Go ahead!

In fact, you can use it directly on GitHub by leveraging GitHub Classroom.

To do this, fork this repository. You'll need to create a separate repository for each lesson, so you'll have to extract each folder into its own repository. This way, [GitHub Classroom](https://classroom.github.com/classrooms) can handle each lesson individually.

These [detailed instructions](https://github.blog/2020-03-18-set-up-your-digital-classroom-with-github-classroom/) will guide you on how to set up your classroom.

## Using the repository as is

If you'd prefer to use this repository as it currently exists, without GitHub Classroom, that's also possible. You'll need to coordinate with your students on which lesson to work through together.

In an online format (Zoom, Teams, or similar), you could create breakout rooms for quizzes and mentor students to prepare them for learning. Then, invite students to complete the quizzes and submit their answers as 'issues' at a designated time. You could follow the same approach for assignments if you want students to collaborate openly.

If you'd rather use a more private format, ask your students to fork the curriculum, lesson by lesson, into their own private GitHub repositories and grant you access. This way, they can complete quizzes and assignments privately and submit them to you via issues on your classroom repository.

There are many ways to adapt this for an online classroom setting. Let us know what works best for you!

## Included in this curriculum:

20 lessons, 40 quizzes, and 20 assignments. Sketchnotes are included to support visual learners. Many lessons are available in both Python and R and can be completed using Jupyter notebooks in VS Code. Learn more about setting up your classroom to use this tech stack: https://code.visualstudio.com/docs/datascience/jupyter-notebooks.

All sketchnotes, including a large-format poster, are located in [this folder](../../sketchnotes).

The entire curriculum is available [as a PDF](../../pdf/readme.pdf).

You can also run this curriculum as a standalone, offline-friendly website using [Docsify](https://docsify.js.org/#/). [Install Docsify](https://docsify.js.org/#/quickstart) on your local machine, then in the root folder of your local copy of this repository, type `docsify serve`. The website will be served on port 3000 on your localhost: `localhost:3000`.

An offline-friendly version of the curriculum will open as a standalone web page: https://localhost:3000

Lessons are organized into six parts:

- 1: Introduction
  - 1: Defining Data Science
  - 2: Ethics
  - 3: Defining Data
  - 4: Probability and Statistics Overview
- 2: Working with Data
  - 5: Relational Databases
  - 6: Non-Relational Databases
  - 7: Python
  - 8: Data Preparation
- 3: Data Visualization
  - 9: Visualization of Quantities
  - 10: Visualization of Distributions
  - 11: Visualization of Proportions
  - 12: Visualization of Relationships
  - 13: Meaningful Visualizations
- 4: Data Science Lifecycle
  - 14: Introduction
  - 15: Analyzing
  - 16: Communication
- 5: Data Science in the Cloud
  - 17: Introduction
  - 18: Low-Code Options
  - 19: Azure
- 6: Data Science in the Wild
  - 20: Overview

## Please give us your thoughts!

We want this curriculum to work for you and your students. Share your feedback in the discussion boards! Feel free to create a classroom section on the discussion boards for your students.

---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,21 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "3a848466cb63aff1a93411affb152c2a",
"translation_date": "2025-08-31T11:12:38+00:00",
"source_file": "sketchnotes/README.md",
"language_code": "en"
}
-->

Find all sketchnotes here!

## Credits

Nitya Narasimhan, artist



---

**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.