Merge pull request #634 from microsoft/update-translations

🌐 Update translations via Co-op Translator
pull/635/head
Lee Stott 2 weeks ago committed by GitHub
commit c934b074fd

@ -0,0 +1,50 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "2583a9894af7123b2fcae3376b14c035",
"translation_date": "2025-08-31T11:09:34+00:00",
"source_file": "1-Introduction/01-defining-data-science/README.md",
"language_code": "en"
}
-->
We can also analyze the test results to identify which questions are most often answered incorrectly. This could indicate areas where the material might need to be clarified or expanded. Additionally, we could track how students interact with the course content—such as which videos they replay, which sections they skip, or how often they participate in discussions. This data could help us understand how students engage with the material and identify opportunities to make the course more engaging and effective.
By collecting and analyzing this data, we are essentially digitizing the learning process. Once we have this data, we can apply data science techniques to gain insights and make informed decisions about how to improve the course. This is an example of digital transformation in education.
Digital transformation is not limited to education—it can be applied to virtually any industry. For example:
- In **healthcare**, digital transformation might involve using patient data to predict disease outbreaks or personalize treatment plans.
- In **retail**, it could mean analyzing customer purchase data to optimize inventory or create personalized marketing campaigns.
- In **manufacturing**, it might involve using sensor data from machines to predict maintenance needs and reduce downtime.
The key idea is that by digitizing processes and applying data science, businesses can gain valuable insights, improve efficiency, and make better decisions.
You might say this method isn't perfect, as modules can vary in length. It might be more reasonable to divide the time by the module's length (measured in the number of characters) and compare those results instead.
When we start analyzing the results of multiple-choice tests, we can try to identify which concepts students struggle to understand and use that information to improve the content. To achieve this, we need to design tests so that each question corresponds to a specific concept or piece of knowledge.
If we want to go a step further, we can compare the time taken for each module with the age category of the students. We might discover that for certain age groups, it takes an unusually long time to complete the module, or that students drop out before finishing it. This can help us provide age-appropriate recommendations for the module and reduce dissatisfaction caused by unmet expectations.
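The analyses described above are straightforward to sketch with pandas. The snippet below is a minimal illustration and not part of the lesson: the column names and values are hypothetical stand-ins for the kind of telemetry a course platform might export.

```python
# A minimal sketch (not part of the lesson) of comparing module completion time
# across age groups. Column names and values are hypothetical examples.
import pandas as pd

progress = pd.DataFrame(
    {
        "module": ["intro", "intro", "intro", "stats", "stats", "stats"],
        "age_group": ["13-15", "16-18", "19+", "13-15", "16-18", "19+"],
        "completion_minutes": [35, 22, 20, 90, 45, 40],
    }
)

# Average completion time per module, broken down by age group.
time_by_age = (
    progress.groupby(["module", "age_group"])["completion_minutes"].mean().unstack()
)

# Flag module/age-group combinations that take unusually long (say, more than
# 1.5x the module's overall average) as candidates for age-specific guidance.
overall = progress.groupby("module")["completion_minutes"].mean()
flagged = time_by_age.gt(1.5 * overall, axis=0)

print(time_by_age)
print(flagged)
```

Grouping by module and age group makes the "unusually long for this age group" pattern easy to spot at a glance.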
## 🚀 Challenge
In this challenge, we will try to identify concepts relevant to the field of Data Science by analyzing texts. We will take a Wikipedia article on Data Science, download and process the text, and then create a word cloud like this one:
![Word Cloud for Data Science](../../../../1-Introduction/01-defining-data-science/images/ds_wordcloud.png)
Visit [`notebook.ipynb`](../../../../../../../../../1-Introduction/01-defining-data-science/notebook.ipynb ':ignore') to review the code. You can also run the code and observe how it performs all the data transformations in real time.
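If you just want a rough idea of what such a word cloud involves, the sketch below shows one possible approach. It is not the notebook's code: it assumes the third-party `requests`, `wordcloud`, and `matplotlib` packages are installed and fetches the article text through the public MediaWiki API.

```python
# A minimal sketch (not the course notebook): fetch the plain text of the
# Wikipedia article on Data Science via the MediaWiki API and render a word cloud.
import requests
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Request the plain-text extract of the "Data science" article.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": "Data science",
        "format": "json",
    },
)
pages = resp.json()["query"]["pages"]
text = next(iter(pages.values()))["extract"]

# Build and display the word cloud; common stop words are removed by default.
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```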
> If you are unfamiliar with running code in a Jupyter Notebook, check out [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/1)
## Assignments
* **Task 1**: Modify the code above to identify related concepts for the fields of **Big Data** and **Machine Learning**.
* **Task 2**: [Think About Data Science Scenarios](assignment.md)
## Credits
This lesson was created with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,46 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "4e0f1773b9bee1be3b28f9fe2c71b3de",
"translation_date": "2025-08-31T11:09:48+00:00",
"source_file": "1-Introduction/01-defining-data-science/assignment.md",
"language_code": "en"
}
-->
# Assignment: Data Science Scenarios
In this first assignment, we ask you to think about some real-life processes or problems in different domains, and how you can improve them using the Data Science process. Consider the following:
1. What data can you collect?
1. How would you collect it?
1. How would you store the data? How large is the data likely to be?
1. What insights might you be able to derive from this data? What decisions could be made based on the data?
Try to think about 3 different problems/processes and describe each of the points above for each domain.
Here are some domains and problems to help you start thinking:
1. How can data be used to improve the education process for children in schools?
1. How can data be used to manage vaccination during a pandemic?
1. How can data be used to ensure productivity at work?
## Instructions
Fill in the following table (replace the suggested domains with your own if needed):
| Problem Domain | Problem | What data to collect | How to store the data | What insights/decisions we can make |
|----------------|---------|-----------------------|-----------------------|--------------------------------------|
| Education | | | | |
| Vaccination | | | | |
| Productivity | | | | |
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
The solution identifies reasonable data sources, methods of storing data, and possible decisions/insights for all domains | Some aspects of the solution lack detail, data storage is not discussed, at least 2 domains are described | Only parts of the data solution are described, and only one domain is considered.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,48 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a8f79b9c0484c35b4f26e8aec7fc4d56",
"translation_date": "2025-08-31T11:09:55+00:00",
"source_file": "1-Introduction/01-defining-data-science/solution/assignment.md",
"language_code": "en"
}
-->
# Assignment: Data Science Scenarios
In this first assignment, we ask you to think about some real-life processes or problems in different domains, and how you can improve them using the Data Science process. Consider the following:
1. What data can you collect?
1. How would you collect it?
1. How would you store the data? How large is the data likely to be?
1. What insights might you be able to derive from this data? What decisions could be made based on the data?
Try to think about three different problems or processes and describe each of the points above for each domain.
Here are some domains and problems to help you start thinking:
1. How can you use data to improve the education process for children in schools?
1. How can you use data to manage vaccination during a pandemic?
1. How can you use data to ensure you are being productive at work?
## Instructions
Fill in the following table (replace the suggested domains with your own if needed):
| Problem Domain | Problem | What data to collect | How to store the data | What insights/decisions we can make |
|----------------|---------|-----------------------|-----------------------|--------------------------------------|
| Education | At universities, lecture attendance is often low, and we hypothesize that students who attend lectures more frequently tend to perform better in exams. We want to encourage attendance and test this hypothesis. | Attendance can be tracked using photos taken by security cameras in classrooms or by tracking the Bluetooth/Wi-Fi addresses of students' mobile phones in class. Exam data is already available in the university database. | If we use security camera images, we need to store a few (5-10) photos taken during class (unstructured data) and then use AI to identify students' faces (convert data to structured form). | We can calculate average attendance for each student and check for correlations with exam grades. We'll discuss correlation further in the [probability and statistics](../../04-stats-and-probability/README.md) section. To encourage attendance, we can publish weekly attendance rankings on the school portal and hold prize draws for students with the highest attendance. |
| Vaccination | | | | |
| Productivity | | | | |
> *We provide just one example answer to give you an idea of what is expected in this assignment.*
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
Reasonable data sources, storage methods, and possible decisions/insights are identified for all domains | Some aspects of the solution lack detail, data storage is not discussed, at least two domains are described | Only parts of the data solution are described, and only one domain is considered.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,267 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "8796f41f566a0a8ebb72863a83d558ed",
"translation_date": "2025-08-31T11:10:27+00:00",
"source_file": "1-Introduction/02-ethics/README.md",
"language_code": "en"
}
-->
# Introduction to Data Ethics
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/02-Ethics.png)|
|:---:|
| Data Science Ethics - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
---
We are all data citizens living in a world shaped by data.
Market trends suggest that by 2022, 1 in 3 large organizations will buy and sell their data through online [Marketplaces and Exchanges](https://www.gartner.com/smarterwithgartner/gartner-top-10-trends-in-data-and-analytics-for-2020/). As **App Developers**, it will become easier and more affordable to integrate data-driven insights and algorithmic automation into everyday user experiences. However, as AI becomes more widespread, we must also understand the potential harms caused by the [weaponization](https://www.youtube.com/watch?v=TQHs8SA1qpk) of such algorithms on a large scale.
Trends also show that by 2025, we will create and consume over [180 zettabytes](https://www.statista.com/statistics/871513/worldwide-data-created/) of data. As **Data Scientists**, this gives us unprecedented access to personal data. This allows us to build behavioral profiles of users and influence decision-making in ways that create an [illusion of free choice](https://www.datasciencecentral.com/profiles/blogs/the-illusion-of-choice), potentially nudging users toward outcomes we prefer. It also raises broader questions about data privacy and user protections.
Data ethics now serve as _necessary guardrails_ for data science and engineering, helping us minimize potential harms and unintended consequences of our data-driven actions. The [Gartner Hype Cycle for AI](https://www.gartner.com/smarterwithgartner/2-megatrends-dominate-the-gartner-hype-cycle-for-artificial-intelligence-2020/) highlights trends in digital ethics, responsible AI, and AI governance as key drivers for larger megatrends around the _democratization_ and _industrialization_ of AI.
![Gartner's Hype Cycle for AI - 2020](https://images-cdn.newscred.com/Zz1mOWJhNzlkNDA2ZTMxMWViYjRiOGFiM2IyMjQ1YmMwZQ==)
In this lesson, we'll explore the fascinating field of data ethics—from core concepts and challenges to case studies and applied AI concepts like governance—that help establish an ethics culture in teams and organizations working with data and AI.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/2) 🎯
## Basic Definitions
Let's begin by understanding some basic terminology.
The word "ethics" comes from the [Greek word "ethikos"](https://en.wikipedia.org/wiki/Ethics) (and its root "ethos"), meaning _character or moral nature_.
**Ethics** refers to the shared values and moral principles that govern our behavior in society. Ethics are not based on laws but on widely accepted norms of what is "right vs. wrong." However, ethical considerations can influence corporate governance initiatives and government regulations, creating incentives for compliance.
**Data Ethics** is a [new branch of ethics](https://royalsocietypublishing.org/doi/full/10.1098/rsta.2016.0360#sec-1) that "studies and evaluates moral problems related to _data, algorithms, and corresponding practices_." Here, **"data"** focuses on actions like generation, recording, curation, processing, dissemination, sharing, and usage; **"algorithms"** focuses on AI, agents, machine learning, and robots; and **"practices"** focuses on topics like responsible innovation, programming, hacking, and ethics codes.
**Applied Ethics** is the [practical application of moral considerations](https://en.wikipedia.org/wiki/Applied_ethics). It involves actively investigating ethical issues in the context of _real-world actions, products, and processes_ and taking corrective measures to ensure alignment with defined ethical values.
**Ethics Culture** is about [_operationalizing_ applied ethics](https://hbr.org/2019/05/how-to-design-an-ethical-organization) to ensure that ethical principles and practices are adopted consistently and at scale across an organization. Successful ethics cultures define organization-wide ethical principles, provide meaningful incentives for compliance, and reinforce ethical norms by encouraging and amplifying desired behaviors at every level of the organization.
## Ethics Concepts
In this section, we'll discuss concepts like **shared values** (principles) and **ethical challenges** (problems) for data ethics—and explore **case studies** to help you understand these concepts in real-world contexts.
### 1. Ethics Principles
Every data ethics strategy begins by defining _ethical principles_—the "shared values" that describe acceptable behaviors and guide compliant actions in our data and AI projects. These can be defined at an individual or team level. However, most large organizations outline these in an _ethical AI_ mission statement or framework, defined at the corporate level and enforced consistently across all teams.
**Example:** Microsoft's [Responsible AI](https://www.microsoft.com/en-us/ai/responsible-ai) mission statement reads: _"We are committed to the advancement of AI driven by ethical principles that put people first"_—identifying six ethical principles in the framework below:
![Responsible AI at Microsoft](https://docs.microsoft.com/en-gb/azure/cognitive-services/personalizer/media/ethics-and-responsible-use/ai-values-future-computed.png)
Let's briefly explore these principles. _Transparency_ and _accountability_ are foundational values upon which other principles are built—so let's start there:
* [**Accountability**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) makes practitioners _responsible_ for their data and AI operations and compliance with these ethical principles.
* [**Transparency**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) ensures that data and AI actions are _understandable_ (interpretable) to users, explaining the what and why behind decisions.
* [**Fairness**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1%3aprimaryr6) focuses on ensuring AI treats _all people_ fairly, addressing any systemic or implicit socio-technical biases in data and systems.
* [**Reliability & Safety**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) ensures that AI behaves _consistently_ with defined values, minimizing potential harms or unintended consequences.
* [**Privacy & Security**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) is about understanding data lineage and providing _data privacy and related protections_ to users.
* [**Inclusiveness**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) is about designing AI solutions with intention, adapting them to meet a _broad range of human needs_ and capabilities.
> 🚨 Think about what your data ethics mission statement could be. Explore ethical AI frameworks from other organizations—here are examples from [IBM](https://www.ibm.com/cloud/learn/ai-ethics), [Google](https://ai.google/principles), and [Facebook](https://ai.facebook.com/blog/facebooks-five-pillars-of-responsible-ai/). What shared values do they have in common? How do these principles relate to the AI product or industry they operate in?
### 2. Ethics Challenges
Once ethical principles are defined, the next step is to evaluate our data and AI actions to see if they align with those shared values. Consider your actions in two categories: _data collection_ and _algorithm design_.
For data collection, actions often involve **personal data** or personally identifiable information (PII) for identifiable living individuals. This includes [various types of non-personal data](https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en) that, when combined, can identify an individual. Ethical challenges may relate to _data privacy_, _data ownership_, and topics like _informed consent_ and _intellectual property rights_ for users.
For algorithm design, actions involve collecting and curating **datasets**, then using them to train and deploy **data models** that predict outcomes or automate decisions in real-world contexts. Ethical challenges may arise from _dataset bias_, _data quality_ issues, _unfairness_, and _misrepresentation_ in algorithms—including systemic issues.
In both cases, ethical challenges highlight areas where actions may conflict with shared values. To detect, mitigate, minimize, or eliminate these concerns, we need to ask moral "yes/no" questions about our actions and take corrective measures as needed. Let's examine some ethical challenges and the moral questions they raise:
#### 2.1 Data Ownership
Data collection often involves personal data that can identify individuals. [Data ownership](https://permission.io/blog/data-ownership) is about _control_ and [_user rights_](https://permission.io/blog/data-ownership) related to the creation, processing, and dissemination of data.
Moral questions to consider:
* Who owns the data? (user or organization)
* What rights do data subjects have? (e.g., access, erasure, portability)
* What rights do organizations have? (e.g., rectifying malicious user reviews)
#### 2.2 Informed Consent
[Informed consent](https://legaldictionary.net/informed-consent/) refers to users agreeing to an action (like data collection) with a _full understanding_ of relevant facts, including the purpose, potential risks, and alternatives.
Questions to explore:
* Did the user (data subject) give permission for data capture and usage?
* Did the user understand the purpose for which the data was captured?
* Did the user understand the potential risks of their participation?
#### 2.3 Intellectual Property
[Intellectual property](https://en.wikipedia.org/wiki/Intellectual_property) refers to intangible creations resulting from human initiative that may _have economic value_ to individuals or businesses.
Questions to explore:
* Did the collected data have economic value to a user or business?
* Does the **user** have intellectual property here?
* Does the **organization** have intellectual property here?
* If these rights exist, how are we protecting them?
#### 2.4 Data Privacy
[Data privacy](https://www.northeastern.edu/graduate/blog/what-is-data-privacy/) refers to preserving user privacy and protecting user identity with respect to personally identifiable information.
Questions to explore:
* Is users' (personal) data secured against hacks and leaks?
* Is users' data accessible only to authorized users and contexts?
* Is users' anonymity preserved when data is shared or disseminated?
* Can a user be re-identified from anonymized datasets?
#### 2.5 Right To Be Forgotten
The [Right To Be Forgotten](https://en.wikipedia.org/wiki/Right_to_be_forgotten) or [Right to Erasure](https://www.gdpreu.org/right-to-be-forgotten/) provides additional personal data protection to users. It allows users to request deletion or removal of personal data from Internet searches and other locations, _under specific circumstances_—giving them a fresh start online without past actions being held against them.
Questions to explore:
* Does the system allow data subjects to request erasure?
* Should the withdrawal of user consent trigger automated erasure?
* Was data collected without consent or by unlawful means?
* Are we compliant with government regulations for data privacy?
#### 2.6 Dataset Bias
Dataset or [Collection Bias](http://researcharticles.com/index.php/bias-in-data-collection-in-research/) refers to selecting a _non-representative_ subset of data for algorithm development, potentially creating unfair outcomes for diverse groups. Types of bias include selection or sampling bias, volunteer bias, and instrument bias.
Questions to explore:
* Did we recruit a representative set of data subjects?
* Did we test our collected or curated dataset for various biases?
* Can we mitigate or remove any discovered biases?
#### 2.7 Data Quality
[Data Quality](https://lakefs.io/data-quality-testing/) examines the validity of the curated dataset used to develop algorithms, ensuring features and records meet the accuracy and consistency requirements for the AI's purpose.
Questions to explore:
* Did we capture valid _features_ for our use case?
* Was data captured _consistently_ across diverse data sources?
* Is the dataset _complete_ for diverse conditions or scenarios?
* Is information captured _accurately_ to reflect reality?
#### 2.8 Algorithm Fairness
[Algorithm Fairness](https://towardsdatascience.com/what-is-algorithm-fairness-3182e161cf9f) examines whether the design of an algorithm systematically discriminates against specific subgroups of individuals, potentially causing [harm](https://docs.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml) in areas like _allocation_ (where resources are denied or withheld from certain groups) and _quality of service_ (where AI performs less accurately for some subgroups compared to others).
Questions to consider:
* Have we assessed the model's accuracy across diverse subgroups and conditions?
* Have we analyzed the system for potential harms (e.g., stereotyping)?
* Can we adjust the data or retrain the models to address identified harms?
Explore resources like [AI Fairness checklists](https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4t6dA) for further learning.
#### 2.9 Misrepresentation
[Data Misrepresentation](https://www.sciencedirect.com/topics/computer-science/misrepresentation) involves questioning whether insights derived from data are being presented in a misleading way to support a specific narrative.
Questions to consider:
* Are we reporting incomplete or inaccurate data?
* Are we visualizing data in ways that lead to false conclusions?
* Are we using selective statistical methods to manipulate outcomes?
* Are there alternative explanations that could lead to different conclusions?
#### 2.10 Free Choice
The [Illusion of Free Choice](https://www.datasciencecentral.com/profiles/blogs/the-illusion-of-choice) arises when decision-making algorithms in "choice architectures" subtly push users toward a preferred outcome while giving the appearance of options and control. These [dark patterns](https://www.darkpatterns.org/) can result in social and economic harm to users. Since user decisions influence behavior profiles, these choices can amplify or perpetuate the impact of such harms over time.
Questions to consider:
* Did the user fully understand the consequences of their choice?
* Was the user aware of alternative options and the pros and cons of each?
* Can the user later reverse an automated or influenced decision?
### 3. Case Studies
To understand these ethical challenges in real-world scenarios, case studies can illustrate the potential harms and societal consequences when ethical principles are ignored.
Here are some examples:
| Ethics Challenge | Case Study |
|--- |--- |
| **Informed Consent** | 1972 - [Tuskegee Syphilis Study](https://en.wikipedia.org/wiki/Tuskegee_Syphilis_Study) - African American men were promised free medical care but were deceived by researchers who withheld their diagnosis and treatment options. Many participants died, and their families were affected. The study lasted 40 years. |
| **Data Privacy** | 2007 - The [Netflix data prize](https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/) provided researchers with _10M anonymized movie ratings from 50K customers_ to improve recommendation algorithms. However, researchers were able to link anonymized data to personally identifiable information in _external datasets_ (e.g., IMDb comments), effectively "de-anonymizing" some Netflix users.|
| **Collection Bias** | 2013 - The City of Boston [developed Street Bump](https://www.boston.gov/transportation/street-bump), an app for citizens to report potholes, helping the city collect better roadway data. However, [lower-income groups had less access to cars and smartphones](https://hbr.org/2013/04/the-hidden-biases-in-big-data), making their roadway issues invisible in the app. Developers collaborated with academics to address _equitable access and digital divide_ issues for fairness. |
| **Algorithmic Fairness** | 2018 - The MIT [Gender Shades Study](http://gendershades.org/overview.html) revealed accuracy gaps in gender classification AI products for women and people of color. In the [2019 Apple Card](https://www.wired.com/story/the-apple-card-didnt-see-genderand-thats-the-problem/) case, women were reportedly offered less credit than men, highlighting algorithmic bias and its socio-economic impacts.|
| **Data Misrepresentation** | 2020 - The [Georgia Department of Public Health released COVID-19 charts](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) that misled citizens about case trends by using non-chronological ordering on the x-axis. This demonstrates misrepresentation through visualization techniques. |
| **Illusion of free choice** | 2020 - Learning app [ABCmouse paid $10M to settle an FTC complaint](https://www.washingtonpost.com/business/2020/09/04/abcmouse-10-million-ftc-settlement/) where parents were trapped into paying for subscriptions they couldn't cancel. This illustrates dark patterns in choice architectures, nudging users toward harmful decisions. |
| **Data Privacy & User Rights** | 2021 - Facebook [Data Breach](https://www.npr.org/2021/04/09/986005820/after-data-breach-exposes-530-million-facebook-says-it-will-not-notify-users) exposed data from 530M users, resulting in a $5B settlement to the FTC. Facebook refused to notify users of the breach, violating their rights to data transparency and access. |
Want to explore more case studies? Check out these resources:
* [Ethics Unwrapped](https://ethicsunwrapped.utexas.edu/case-studies) - ethics dilemmas across various industries.
* [Data Science Ethics course](https://www.coursera.org/learn/data-science-ethics#syllabus) - landmark case studies explored.
* [Where things have gone wrong](https://deon.drivendata.org/examples/) - deon checklist with examples.
> 🚨 Reflect on the case studies you've reviewed. Have you experienced or been affected by a similar ethical challenge? Can you think of another case study that illustrates one of the ethical challenges discussed here?
## Applied Ethics
We've explored ethical concepts, challenges, and case studies in real-world contexts. But how can we start _applying_ ethical principles in our projects? And how can we _operationalize_ these practices for better governance? Let's look at some practical solutions:
### 1. Professional Codes
Professional Codes provide a way for organizations to "encourage" members to align with their ethical principles and mission. These codes act as _moral guidelines_ for professional behavior, helping employees or members make decisions consistent with the organization's values. Their effectiveness depends on voluntary compliance, but many organizations offer rewards or penalties to motivate adherence.
Examples include:
* [Oxford Munich](http://www.code-of-ethics.org/code-of-conduct/) Code of Ethics
* [Data Science Association](http://datascienceassn.org/code-of-conduct.html) Code of Conduct (created 2013)
* [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics) (since 1993)
> 🚨 Are you part of a professional engineering or data science organization? Check their website to see if they have a professional code of ethics. What does it say about their ethical principles? How do they "encourage" members to follow the code?
### 2. Ethics Checklists
While professional codes outline required _ethical behavior_, they [have limitations](https://resources.oreilly.com/examples/0636920203964/blob/master/of_oaths_and_checklists.md) in enforcement, especially for large-scale projects. Many data science experts [recommend checklists](https://resources.oreilly.com/examples/0636920203964/blob/master/of_oaths_and_checklists.md) to **translate principles into actionable practices**.
Checklists turn questions into "yes/no" tasks that can be integrated into standard workflows, making them easier to track during product development.
Examples include:
* [Deon](https://deon.drivendata.org/) - a general-purpose data ethics checklist based on [industry recommendations](https://deon.drivendata.org/#checklist-citations), with a command-line tool for easy integration.
* [Privacy Audit Checklist](https://cyber.harvard.edu/ecommerce/privacyaudit.html) - offers general guidance on handling information from legal and social perspectives.
* [AI Fairness Checklist](https://www.microsoft.com/en-us/research/project/ai-fairness-checklist/) - created by AI practitioners to integrate fairness checks into AI development cycles.
* [22 questions for ethics in data and AI](https://medium.com/the-organization/22-questions-for-ethics-in-data-and-ai-efb68fd19429) - an open-ended framework for exploring ethical issues in design, implementation, and organizational contexts.
### 3. Ethics Regulations
Ethics involves defining shared values and voluntarily doing the right thing. **Compliance** refers to _following the law_ where applicable. **Governance** encompasses all organizational efforts to enforce ethical principles and comply with legal requirements.
Governance in organizations takes two forms. First, it involves defining **ethical AI** principles and implementing practices to ensure adoption across all AI-related projects. Second, it requires compliance with government-mandated **data protection regulations** in the regions where the organization operates.
Examples of data protection and privacy regulations:
* `1974`, [US Privacy Act](https://www.justice.gov/opcl/privacy-act-1974) - regulates _federal government_ collection, use, and disclosure of personal information.
* `1996`, [US Health Insurance Portability & Accountability Act (HIPAA)](https://www.cdc.gov/phlp/publications/topic/hipaa.html) - protects personal health data.
* `1998`, [US Children's Online Privacy Protection Act (COPPA)](https://www.ftc.gov/enforcement/rules/rulemaking-regulatory-reform-proceedings/childrens-online-privacy-protection-rule) - safeguards the data privacy of children under 13.
* `2018`, [General Data Protection Regulation (GDPR)](https://gdpr-info.eu/) - provides user rights, data protection, and privacy.
* `2018`, [California Consumer Privacy Act (CCPA)](https://www.oag.ca.gov/privacy/ccpa) - grants consumers more control over their personal data.
* `2021`, China's [Personal Information Protection Law](https://www.reuters.com/world/china/china-passes-new-personal-data-privacy-law-take-effect-nov-1-2021-08-20/) - one of the strongest online data privacy regulations globally.
> 🚨 The European Union's GDPR (General Data Protection Regulation) is one of the most influential data privacy regulations. Did you know it defines [8 user rights](https://www.freeprivacypolicy.com/blog/8-user-rights-gdpr) to protect citizens' digital privacy and personal data? Learn about these rights and why they matter.
### 4. Ethics Culture
There is often a gap between _compliance_ (meeting legal requirements) and addressing [systemic issues](https://www.coursera.org/learn/data-science-ethics/home/week/4) (like ossification, information asymmetry, and distributional unfairness) that can accelerate the misuse of AI.
Addressing these issues requires [collaborative efforts to build ethics cultures](https://towardsdatascience.com/why-ai-ethics-requires-a-culture-driven-approach-26f451afa29f) that foster emotional connections and shared values across organizations and industries. This involves creating [formalized data ethics cultures](https://www.codeforamerica.org/news/formalizing-an-ethical-data-culture/) within organizations, empowering _anyone_ to [raise concerns early](https://en.wikipedia.org/wiki/Andon_(manufacturing)) and making ethical considerations (e.g., in hiring) a core part of team formation for AI projects.
---
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/3) 🎯
## Review & Self Study
Courses and books help build an understanding of core ethics concepts and challenges, while case studies and tools provide practical insights into applying ethics in real-world scenarios. Here are some resources to get started:
* [Machine Learning For Beginners](https://github.com/microsoft/ML-For-Beginners/blob/main/1-Introduction/3-fairness/README.md) - lesson on Fairness, from Microsoft.
* [Principles of Responsible AI](https://docs.microsoft.com/en-us/learn/modules/responsible-ai-principles/) - free learning path from Microsoft Learn.
* [Ethics and Data Science](https://resources.oreilly.com/examples/0636920203964) - O'Reilly EBook (M. Loukides, H. Mason et. al)
* [Data Science Ethics](https://www.coursera.org/learn/data-science-ethics#syllabus) - online course from the University of Michigan.
* [Ethics Unwrapped](https://ethicsunwrapped.utexas.edu/case-studies) - case studies from the University of Texas.
# Assignment
[Write A Data Ethics Case Study](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,35 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b588c0fc73014f52520c666efc3e0cc3",
"translation_date": "2025-08-31T11:11:32+00:00",
"source_file": "1-Introduction/02-ethics/assignment.md",
"language_code": "en"
}
-->
## Write A Data Ethics Case Study
## Instructions
You've learned about various [Data Ethics Challenges](README.md#2-ethics-challenges) and seen some examples of [Case Studies](README.md#3-case-studies) reflecting data ethics challenges in real-world contexts.
In this assignment, you'll write your own case study reflecting a data ethics challenge from your own experience, or from a relevant real-world context you are familiar with. Just follow these steps:
1. `Pick a Data Ethics Challenge`. Look at [the lesson examples](README.md#2-ethics-challenges) or explore online examples like [the Deon Checklist](https://deon.drivendata.org/examples/) to get inspiration.
2. `Describe a Real World Example`. Think about a situation you have heard of (headlines, research study etc.) or experienced (local community), where this specific challenge occurred. Think about the data ethics questions related to the challenge - and discuss the potential harms or unintended consequences that arise because of this issue. Bonus points: think about potential solutions or processes that may be applied here to help eliminate or mitigate the adverse impact of this challenge.
3. `Provide a Related Resources list`. Share one or more resources (links to an article, a personal blog post or image, online research paper etc.) to prove this was a real-world occurrence. Bonus points: share resources that also showcase the potential harms & consequences from the incident, or highlight positive steps taken to prevent its recurrence.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
One or more data ethics challenges are identified. <br/> <br/> The case study clearly describes a real-world incident reflecting that challenge, and highlights undesirable consequences or harms it caused. <br/><br/> There is at least one linked resource to prove this occurred. | One data ethics challenge is identified. <br/><br/> At least one relevant harm or consequence is discussed briefly. <br/><br/> However, the discussion is limited or lacks proof of real-world occurrence. | A data challenge is identified. <br/><br/> However, the description or resources do not adequately reflect the challenge or prove its real-world occurrence. |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,85 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "356d12cffc3125db133a2d27b827a745",
"translation_date": "2025-08-31T11:10:03+00:00",
"source_file": "1-Introduction/03-defining-data/README.md",
"language_code": "en"
}
-->
# Defining Data
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/03-DefiningData.png)|
|:---:|
|Defining Data - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Data consists of facts, information, observations, and measurements that are used to make discoveries and support informed decisions. A data point is a single unit of data within a dataset, which is a collection of data points. Datasets can come in various formats and structures, often depending on their source or origin. For instance, a company's monthly earnings might be stored in a spreadsheet, while hourly heart rate data from a smartwatch might be in [JSON](https://stackoverflow.com/a/383699) format. It's common for data scientists to work with different types of data within a dataset.
This lesson focuses on identifying and classifying data based on its characteristics and sources.
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/4)
## How Data is Described
### Raw Data
Raw data refers to data in its original state, directly from its source, without any analysis or organization. To make sense of a dataset, it needs to be organized into a format that can be understood by humans and the technology used for further analysis. The structure of a dataset describes how it is organized and can be classified as structured, unstructured, or semi-structured. These classifications depend on the source but ultimately fall into one of these three categories.
### Quantitative Data
Quantitative data consists of numerical observations within a dataset that can typically be analyzed, measured, and used mathematically. Examples of quantitative data include a country's population, a person's height, or a company's quarterly earnings. With further analysis, quantitative data can be used to identify seasonal trends in the Air Quality Index (AQI) or estimate the likelihood of rush hour traffic on a typical workday.
### Qualitative Data
Qualitative data, also known as categorical data, cannot be measured objectively like quantitative data. It often consists of subjective information that captures the quality of something, such as a product or process. Sometimes, qualitative data is numerical but not typically used mathematically, like phone numbers or timestamps. Examples of qualitative data include video comments, the make and model of a car, or your closest friends' favorite color. Qualitative data can be used to understand which products consumers prefer or to identify popular keywords in job application resumes.
### Structured Data
Structured data is organized into rows and columns, where each row has the same set of columns. Columns represent specific types of values and are identified by names describing what the values represent, while rows contain the actual data. Columns often have rules or restrictions to ensure the values accurately represent the column. For example, imagine a spreadsheet of customers where each row must include a phone number, and the phone numbers cannot contain alphabetical characters. Rules might be applied to ensure the phone number column is never empty and only contains numbers.
One advantage of structured data is that it can be organized in a way that allows it to relate to other structured data. However, because the data is designed to follow a specific structure, making changes to its overall organization can require significant effort. For instance, adding an email column to the customer spreadsheet that cannot be empty would require figuring out how to populate this column for existing rows.
Examples of structured data: spreadsheets, relational databases, phone numbers, bank statements.
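As a small illustration of the "rules or restrictions" idea, the sketch below (not from the lesson; the table and rules are made up) checks a phone-number column with pandas:

```python
# An illustrative sketch: enforcing the kind of column rules described above
# on a hypothetical customer table with pandas.
import pandas as pd

customers = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Alan"],
        "phone": ["5550101", "5550102", None],  # every row has the same columns
    }
)

# Rule 1: the phone column must never be empty.
missing_phone = customers["phone"].isna()

# Rule 2: phone numbers may contain only digits.
non_numeric = ~customers["phone"].fillna("").str.fullmatch(r"\d+")

print(customers[missing_phone | non_numeric])  # rows that violate the rules
```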
### Unstructured Data
Unstructured data cannot typically be organized into rows or columns and lacks a defined format or set of rules. Because unstructured data has fewer restrictions, it is easier to add new information compared to structured datasets. For example, if a sensor measuring barometric pressure every two minutes receives an update to also record temperature, it wouldn't require altering the existing data if it's unstructured. However, analyzing or investigating unstructured data can take longer. For instance, a scientist trying to calculate the average temperature for the previous month might find that the sensor recorded an "e" in some entries to indicate it was broken, resulting in incomplete data.
Examples of unstructured data: text files, text messages, video files.
### Semi-structured Data
Semi-structured data combines features of both structured and unstructured data. It doesn't typically conform to rows and columns but is organized in a way that is considered structured and may follow a fixed format or set of rules. The structure can vary between sources, ranging from a well-defined hierarchy to something more flexible that allows for easy integration of new information. Metadata provides indicators for how the data is organized and stored, with various names depending on the type of data. Common names for metadata include tags, elements, entities, and attributes. For example, a typical email message includes a subject, body, and recipients, and can be organized by sender or date.
Examples of semi-structured data: HTML, CSV files, JavaScript Object Notation (JSON).
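For instance, the email example above could be stored as a JSON record like the hypothetical one below, where the field names act as the metadata tags that describe how the data is organized, and new fields can be added without changing a schema:

```python
# A hypothetical semi-structured record: an email expressed as JSON.
import json

email = {
    "subject": "Quarterly earnings report",
    "from": "cfo@example.com",
    "to": ["ceo@example.com", "board@example.com"],
    "sent": "2025-01-15T09:30:00Z",
    "body": "Please find the Q4 figures attached.",
}

print(json.dumps(email, indent=2))
```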
## Sources of Data
A data source refers to the original location where the data was generated or "lives," and it varies based on how and when it was collected. Data generated by its user(s) is known as primary data, while secondary data comes from a source that has collected data for general use. For example, scientists collecting observations in a rainforest would be considered primary data, and if they share it with other scientists, it becomes secondary data for those users.
Databases are a common data source and rely on a database management system to host and maintain the data. Users explore the data using commands called queries. Files can also serve as data sources, including audio, image, and video files, as well as spreadsheets like Excel. The internet is another common location for hosting data, where both databases and files can be found. Application programming interfaces (APIs) allow programmers to create ways to share data with external users over the internet, while web scraping extracts data from web pages. The [lessons in Working with Data](../../../../../../../../../2-Working-With-Data) focus on how to use various data sources.
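To make the "database plus query" idea concrete, here is a tiny self-contained sketch using Python's built-in `sqlite3` module; the table and values are invented for illustration:

```python
# A tiny sketch of the "database + query" pattern: store a few sensor readings
# in an in-memory SQLite database and explore them with a query.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE readings (taken_at TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("2025-01-01T08:00", 21.5), ("2025-01-01T09:00", 22.1)],
)

# A query is the command used to explore the data.
for row in conn.execute(
    "SELECT taken_at, temperature FROM readings WHERE temperature > 21.8"
):
    print(row)
```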
## Conclusion
In this lesson, we have learned:
- What data is
- How data is described
- How data is classified and categorized
- Where data can be found
## 🚀 Challenge
Kaggle is an excellent source of open datasets. Use the [dataset search tool](https://www.kaggle.com/datasets) to find some interesting datasets and classify 3-5 datasets using the following criteria:
- Is the data quantitative or qualitative?
- Is the data structured, unstructured, or semi-structured?
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/5)
## Review & Self Study
- This Microsoft Learn unit, titled [Classify your Data](https://docs.microsoft.com/en-us/learn/modules/choose-storage-approach-in-azure/2-classify-data), provides a detailed breakdown of structured, semi-structured, and unstructured data.
## Assignment
[Classifying Datasets](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,79 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "2e5cacb967c1e9dfd07809bfc441a0b4",
"translation_date": "2025-08-31T11:10:19+00:00",
"source_file": "1-Introduction/03-defining-data/assignment.md",
"language_code": "en"
}
-->
# Classifying Datasets
## Instructions
Follow the prompts in this assignment to identify and classify the data with one of each of the following data types:
**Structure Types**: Structured, Semi-Structured, or Unstructured
**Value Types**: Qualitative or Quantitative
**Source Types**: Primary or Secondary
1. A company has been acquired and now has a parent company. The data scientists have received a spreadsheet of customer phone numbers from the parent company.
Structure Type:
Value Type:
Source Type:
---
2. A smart watch has been collecting heart rate data from its wearer, and the raw data is in JSON format.
Structure Type:
Value Type:
Source Type:
---
3. A workplace survey of employee morale that is stored in a CSV file.
Structure Type:
Value Type:
Source Type:
---
4. Astrophysicists are accessing a database of galaxies that has been collected by a space probe. The data contains the number of planets within each galaxy.
Structure Type:
Value Type:
Source Type:
---
5. A personal finance app uses APIs to connect to a user's financial accounts in order to calculate their net worth. The user can see all of their transactions in a format of rows and columns that looks similar to a spreadsheet.
Structure Type:
Value Type:
Source Type:
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
Correctly identifies all of the structure, value, and source types | Correctly identifies 3 of the structure, value, and source types | Correctly identifies 2 or fewer of the structure, value, and source types |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,277 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b706a07cfa87ba091cbb91e0aa775600",
"translation_date": "2025-08-31T11:08:08+00:00",
"source_file": "1-Introduction/04-stats-and-probability/README.md",
"language_code": "en"
}
-->
# A Brief Introduction to Statistics and Probability
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/04-Statistics-Probability.png)|
|:---:|
| Statistics and Probability - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Statistics and Probability Theory are two closely related branches of Mathematics that are highly relevant to Data Science. While it is possible to work with data without a deep understanding of mathematics, it is still beneficial to grasp some basic concepts. Here, we provide a brief introduction to help you get started.
[![Intro Video](../../../../1-Introduction/04-stats-and-probability/images/video-prob-and-stats.png)](https://youtu.be/Z5Zy85g4Yjw)
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/6)
## Probability and Random Variables
**Probability** is a number between 0 and 1 that indicates how likely an **event** is to occur. It is calculated as the number of favorable outcomes (leading to the event) divided by the total number of possible outcomes, assuming all outcomes are equally likely. For example, when rolling a die, the probability of getting an even number is 3/6 = 0.5.
When discussing events, we use **random variables**. For instance, the random variable representing the number rolled on a die can take values from 1 to 6. This set of numbers (1 to 6) is called the **sample space**. We can calculate the probability of a random variable taking a specific value, such as P(X=3)=1/6.
The random variable in the example above is called **discrete** because its sample space is countable, meaning it consists of distinct values that can be listed. In other cases, the sample space might be a range of real numbers or the entire set of real numbers. Such variables are called **continuous**. A good example is the time a bus arrives.
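You can check these probabilities numerically with a quick simulation. The sketch below is not part of the lesson; it simply rolls a virtual die many times with NumPy:

```python
# A quick simulation of the die probabilities discussed above.
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # uniform integers 1..6

print("P(even) approx.", (rolls % 2 == 0).mean())  # close to 3/6 = 0.5
print("P(X=3) approx.", (rolls == 3).mean())       # close to 1/6, about 0.167
```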
## Probability Distribution
For discrete random variables, it is straightforward to describe the probability of each event using a function P(X). For every value *s* in the sample space *S*, the function assigns a number between 0 and 1, such that the sum of all P(X=s) values for all events equals 1.
The most well-known discrete distribution is the **uniform distribution**, where the sample space consists of N elements, each with an equal probability of 1/N.
Describing the probability distribution of a continuous variable, which may take values from an interval [a, b] or the entire set of real numbers ℝ, is more complex. Consider the example of bus arrival times. The probability of the bus arriving at an exact time *t* is actually 0!
> Now you know that events with 0 probability can and do happen—every time the bus arrives, for instance!
Instead, we talk about the probability of a variable falling within a specific interval, e.g., P(t<sub>1</sub>≤X<t<sub>2</sub>). In this case, the probability distribution is described by a **probability density function** p(x), such that:
![P(t_1\le X<t_2)=\int_{t_1}^{t_2}p(x)dx](../../../../1-Introduction/04-stats-and-probability/images/probability-density.png)
The continuous counterpart of the uniform distribution is called the **continuous uniform distribution**, which is defined over a finite interval. The probability that the value X falls within an interval of length l is proportional to l and can reach up to 1.
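As a quick illustration (with arbitrary values for `a`, `b`, `t1`, and `t2`), the probability of an interval under a continuous uniform distribution can be computed either from SciPy's CDF or directly from the interval length:

```python
# P(t1 <= X < t2) for a continuous uniform variable on [a, b], computed two ways.
from scipy.stats import uniform

a, b = 0.0, 10.0   # e.g., the bus arrives uniformly at random within 10 minutes
t1, t2 = 2.0, 5.0

X = uniform(loc=a, scale=b - a)
prob_cdf = X.cdf(t2) - X.cdf(t1)   # integral of the density p(x) over [t1, t2)
prob_len = (t2 - t1) / (b - a)     # proportional to the interval length

print(prob_cdf, prob_len)          # both 0.3
```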
Another important distribution is the **normal distribution**, which we will explore in more detail later.
## Mean, Variance, and Standard Deviation
Suppose we draw a sequence of n samples from a random variable X: x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>. The **mean** (or **arithmetic average**) of the sequence is calculated as (x<sub>1</sub>+x<sub>2</sub>+...+x<sub>n</sub>)/n. As the sample size increases (n→∞), the mean approaches the **expectation** of the distribution, denoted as **E**(x).
> It can be shown that for any discrete distribution with values {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>N</sub>} and corresponding probabilities p<sub>1</sub>, p<sub>2</sub>, ..., p<sub>N</sub>, the expectation is given by E(X)=x<sub>1</sub>p<sub>1</sub>+x<sub>2</sub>p<sub>2</sub>+...+x<sub>N</sub>p<sub>N</sub>.
To measure how spread out the values are, we calculate the variance σ<sup>2</sup> = ∑(x<sub>i</sub> - μ)<sup>2</sup>/n, where μ is the mean of the sequence. The square root of the variance, σ, is called the **standard deviation**.
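These quantities are one-liners in NumPy. The sketch below uses the first few player weights from the dataset shown later in this lesson:

```python
# Mean, variance, and standard deviation as defined above.
import numpy as np

x = np.array([180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0])

mean = x.mean()                  # (x1 + ... + xn) / n
var = np.mean((x - mean) ** 2)   # sigma^2, same as x.var()
std = np.sqrt(var)               # sigma, same as x.std()

print(mean, var, std)
```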
## Mode, Median, and Quartiles
Sometimes, the mean does not adequately represent the "typical" value of the data, especially when there are extreme outliers. In such cases, the **median**—the value that divides the data into two equal halves—can be a better indicator.
To further understand the data distribution, we use **quartiles**:
* The first quartile (Q1) is the value below which 25% of the data falls.
* The third quartile (Q3) is the value below which 75% of the data falls.
The relationship between the median and quartiles can be visualized using a **box plot**:
<img src="images/boxplot_explanation.png" width="50%"/>
We also calculate the **inter-quartile range** (IQR=Q3-Q1) and identify **outliers**—values outside the range [Q1-1.5*IQR, Q3+1.5*IQR].
For small, finite distributions, the **mode**—the most frequently occurring value—can be a good "typical" value. This is especially useful for categorical data, such as colors. For example, if two groups of people strongly prefer red and blue, the mean of their preferences (if coded numerically) might fall in the orange-green range, which doesn't represent either group's preference. The mode, however, would correctly identify the most popular colors.
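Here is a short sketch of the median, quartiles, IQR, outlier rule, and mode, computed with NumPy and the standard library on a small sample of the player weights used later in this lesson:

```python
# Median, quartiles, IQR, the 1.5*IQR outlier rule, and the mode.
import numpy as np
from collections import Counter

x = np.array([160.0, 176.0, 180.0, 180.0, 180.0, 185.0, 188.0, 209.0, 231.0])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Outliers fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < low) | (x > high)]

# Mode: the most frequent value (more meaningful for categorical data).
mode = Counter(x.tolist()).most_common(1)[0][0]

print(median, q1, q3, iqr, outliers, mode)
```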
## Real-world Data
When analyzing real-world data, the values are often not random variables in the strict sense, as they are not the result of experiments with unknown outcomes. For example, consider the heights, weights, and ages of a baseball team. These values are not truly random, but we can still apply the same mathematical concepts. For instance, the sequence of players' weights can be treated as samples from a random variable. Below is a sequence of weights from actual Major League Baseball players, taken from [this dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights) (only the first 20 values are shown):
```
[180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0, 188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]
```
> **Note**: For an example of working with this dataset, check out the [accompanying notebook](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb). There are also challenges throughout this lesson that you can complete by adding code to the notebook. If you're unsure how to work with data, don't worry—we'll revisit this topic later. If you don't know how to run code in Jupyter Notebook, see [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
Here is a box plot showing the mean, median, and quartiles for the data:
![Weight Box Plot](../../../../1-Introduction/04-stats-and-probability/images/weight-boxplot.png)
Since the dataset includes player **roles**, we can create a box plot by role to see how the values differ across roles. This time, we'll consider height:
![Box plot by role](../../../../1-Introduction/04-stats-and-probability/images/boxplot_byrole.png)
This diagram suggests that, on average, first basemen are taller than second basemen. Later in this lesson, we'll learn how to formally test this hypothesis and determine whether the observed difference is statistically significant.
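A minimal sketch of how such a plot can be produced with Pandas (assuming `df` is a DataFrame loaded from the dataset above, with `Role` and `Height` columns, as in the accompanying notebook):

```python
import matplotlib.pyplot as plt

# One box per player role; the column names assume the MLB dataset layout
df.boxplot(column='Height', by='Role', figsize=(10, 6), rot=90)
plt.show()
```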
> When working with real-world data, we assume that all data points are samples drawn from some probability distribution. This assumption allows us to apply machine learning techniques and build predictive models.
To visualize the data distribution, we can create a **histogram**. The X-axis represents weight intervals (or **bins**), and the Y-axis shows the frequency of values within each interval.
![Histogram of real-world data](../../../../1-Introduction/04-stats-and-probability/images/weight-histogram.png)
From this histogram, we see that most values cluster around a certain mean weight, with fewer values as we move further from the mean. This indicates that extreme weights are less likely. The variance shows how much the weights deviate from the mean.
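A minimal sketch of drawing such a histogram with Matplotlib, using the 20 weight values listed earlier:

```python
import matplotlib.pyplot as plt

weights = [180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0,
           188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]
plt.hist(weights, bins=10)     # X axis: weight bins, Y axis: number of values in each bin
plt.xlabel('Weight')
plt.ylabel('Count')
plt.show()
```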
> If we analyzed weights from a different population (e.g., university students), the distribution might differ in mean and variance, but the overall shape would remain similar. However, a model trained on baseball players might perform poorly on students due to differences in the underlying distribution.
## Normal Distribution
The weight distribution we observed is typical of many real-world measurements, which often follow a similar pattern but with different means and variances. This pattern is called the **normal distribution**, and it plays a crucial role in statistics.
To simulate random weights for potential baseball players, we can use the normal distribution. Given the mean weight `mean` and standard deviation `std`, we can generate 1000 weight samples as follows:
```python
samples = np.random.normal(mean,std,1000)
```
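Putting this together, a minimal runnable sketch (reusing the `weights` list of 20 values defined in the histogram sketch above; the full dataset would give slightly different estimates):

```python
import numpy as np
import matplotlib.pyplot as plt

mean, std = np.mean(weights), np.std(weights)   # estimate parameters from the sample
samples = np.random.normal(mean, std, 1000)     # simulate 1000 "potential players"
plt.hist(samples, bins=30)
plt.show()
```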
If we plot a histogram of the generated samples, it will resemble the earlier histogram. By increasing the number of samples and bins, we can create a graph that closely approximates the ideal normal distribution:
![Normal Distribution with mean=0 and std.dev=1](../../../../1-Introduction/04-stats-and-probability/images/normal-histogram.png)
*Normal Distribution with mean=0 and std.dev=1*
## Confidence Intervals
When analyzing baseball players' weights, we assume there is a **random variable W** representing the ideal probability distribution of all players' weights (the **population**). Our dataset represents a subset of players, or a **sample**. A key question is whether we can determine the population's distribution parameters, such as mean and variance.
The simplest approach is to calculate the sample's mean and variance. However, the sample may not perfectly represent the population. This is where **confidence intervals** come into play.
> A **confidence interval** is an estimate, based on our sample, of the range within which the true mean of the population lies, with a certain probability (or **level of confidence**) of being correct.
Suppose we have a sample X<sub>1</sub>, ..., X<sub>n</sub> from our distribution. Each time we draw a sample from our distribution, we would end up with a different mean value μ. Thus, μ can be considered a random variable. A **confidence interval** with confidence p is a pair of values (L<sub>p</sub>,R<sub>p</sub>), such that **P**(L<sub>p</sub>≤μ≤R<sub>p</sub>) = p, i.e., the probability of the measured mean value falling within the interval equals p.
A detailed derivation of how these confidence intervals are calculated is beyond our brief introduction; more details can be found [on Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval). In short, the distribution of the computed sample mean relative to the true mean of the population (appropriately scaled) is described by the **Student's t-distribution**.
> **Interesting fact**: The Student's t-distribution is named after mathematician William Sealy Gosset, who published his paper under the pseudonym "Student." He worked at the Guinness brewery, and, according to one version, his employer did not want the general public to know they were using statistical tests to determine the quality of raw materials.
If we want to estimate the mean μ of our population with confidence p, we need the value A of the Student's t-distribution at the *(1-p)/2*-th percentile (taken in absolute value), which can either be looked up in tables or computed using built-in functions in statistical software (e.g., Python, R, etc.). The interval for μ is then given by X±A*D/√n, where X is the sample mean and D is the sample standard deviation.
> **Note**: We are also omitting the discussion of an important concept called [degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)), which is relevant to the Student's t-distribution. You can refer to more comprehensive books on statistics to understand this concept in depth.
An example of calculating confidence intervals for weights and heights is provided in the [accompanying notebooks](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb).
| p | Weight mean |
|------|---------------|
| 0.85 | 201.73±0.94 |
| 0.90 | 201.73±1.08 |
| 0.95 | 201.73±1.28 |
Notice that the higher the confidence probability, the wider the confidence interval.
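For reference, here is a minimal sketch of how such an interval can be computed with SciPy, reusing the `weights` list from the histogram sketch earlier (the table above was computed on the full dataset in the notebook, so the numbers differ):

```python
import numpy as np
from scipy import stats

w = np.array(weights)            # `weights`: the 20 sample values defined earlier
m = w.mean()                     # sample mean X
se = stats.sem(w)                # standard error D / sqrt(n)
low, high = stats.t.interval(0.95, len(w) - 1, loc=m, scale=se)
print(f"95% confidence interval for the mean: {low:.2f}..{high:.2f}")
```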
## Hypothesis Testing
In our baseball players dataset, there are different player roles, which can be summarized below (refer to the [accompanying notebook](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb) to see how this table was calculated):
| Role              | Height (in) | Weight (lb) | Count |
|-------------------|-------------|-------------|-------|
| Catcher | 72.723684 | 204.328947 | 76 |
| Designated_Hitter | 74.222222 | 220.888889 | 18 |
| First_Baseman | 74.000000 | 213.109091 | 55 |
| Outfielder | 73.010309 | 199.113402 | 194 |
| Relief_Pitcher | 74.374603 | 203.517460 | 315 |
| Second_Baseman | 71.362069 | 184.344828 | 58 |
| Shortstop | 71.903846 | 182.923077 | 52 |
| Starting_Pitcher | 74.719457 | 205.163636 | 221 |
| Third_Baseman | 73.044444 | 200.955556 | 45 |
We can observe that the mean height of first basemen is greater than that of second basemen. Thus, we might be tempted to conclude that **first basemen are taller than second basemen**.
> This statement is called **a hypothesis**, because we do not know whether the fact is actually true or not.
However, it is not always obvious whether we can make this conclusion. From the discussion above, we know that each mean has an associated confidence interval, and thus this difference could just be a statistical error. We need a more formal way to test our hypothesis.
Let's compute confidence intervals separately for the heights of first and second basemen:
| Confidence | First Basemen | Second Basemen |
|------------|-----------------|-----------------|
| 0.85 | 73.62..74.38 | 71.04..71.69 |
| 0.90 | 73.56..74.44 | 70.99..71.73 |
| 0.95 | 73.47..74.53 | 70.92..71.81 |
We can see that the intervals do not overlap at any of these confidence levels, which strongly supports our hypothesis that first basemen are taller than second basemen.
More formally, the problem we are solving is to determine if **two probability distributions are the same**, or at least have the same parameters. Depending on the distribution, we need to use different tests for this. If we know that our distributions are normal, we can apply the **[Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)**.
In the Student's t-test, we compute the so-called **t-value**, which indicates the difference between means, taking into account the variance. It has been shown that the t-value follows the **Student's t-distribution**, which allows us to get the threshold value for a given confidence level **p** (this can be computed or looked up in numerical tables). We then compare the t-value to this threshold to accept or reject the hypothesis.
In Python, we can use the **SciPy** package, which includes the `ttest_ind` function (along with many other useful statistical functions!). This function computes the t-value for us and also performs the reverse lookup of the confidence p-value, so we can simply look at the confidence to draw a conclusion.
For example, our comparison between the heights of first and second basemen gives us the following results:
```python
from scipy.stats import ttest_ind
# Welch's t-test (unequal variances) comparing the heights of first and second basemen
tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Height']],
                       df.loc[df['Role']=='Second_Baseman',['Height']], equal_var=False)
print(f"T-value = {tval[0]:.2f}\nP-value: {pval[0]}")
```
```
T-value = 7.65
P-value: 9.137321189738925e-12
```
In our case, the p-value is very low, meaning there is strong evidence supporting that first basemen are taller.
There are also other types of hypotheses we might want to test, for example (see the short SciPy sketch after this list):
* To prove that a given sample follows a specific distribution. In our case, we assumed that heights are normally distributed, but this needs formal statistical verification.
* To prove that the mean value of a sample corresponds to some predefined value.
* To compare the means of multiple samples (e.g., differences in happiness levels among different age groups).
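SciPy provides ready-made tests for each of these cases as well; here is a short, hedged sketch, using the `weights` array from earlier purely as an example sample:

```python
import numpy as np
from scipy import stats

w = np.array(weights)                     # the 20 sample weights shown earlier

# 1. Normality check: Shapiro-Wilk test (a small p-value argues against normality)
print(stats.shapiro(w))

# 2. One-sample t-test: is the mean compatible with a predefined value, e.g. 200?
print(stats.ttest_1samp(w, popmean=200))

# 3. Comparing several groups: one-way ANOVA on three arbitrary sub-slices
print(stats.f_oneway(w[:7], w[7:14], w[14:]))
```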
## Law of Large Numbers and Central Limit Theorem
One of the reasons why the normal distribution is so important is the **central limit theorem**. Suppose we have a large sample of N independent values X<sub>1</sub>, ..., X<sub>N</sub>, sampled from any distribution with mean μ and variance σ<sup>2</sup>. Then, for sufficiently large N (in other words, as N→∞), the sample mean (Σ<sub>i</sub>X<sub>i</sub>)/N is approximately normally distributed, with mean μ and variance σ<sup>2</sup>/N.
> Another way to interpret the central limit theorem is to say that regardless of the original distribution, when you average a large number of values of a random variable, you end up with an approximately normal distribution.
From the central limit theorem, it also follows that as N→∞, the sample mean concentrates around μ: the probability that it deviates from μ by more than any fixed amount goes to 0. This is known as **the law of large numbers**.
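A small simulation makes this concrete: averaging values drawn from a clearly non-normal (uniform) distribution produces sample means whose histogram looks normal. A minimal sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# 10,000 experiments, each averaging N=50 values drawn uniformly from [0, 1]
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)
plt.hist(sample_means, bins=40)   # bell-shaped around mu=0.5, with variance ~ sigma^2/N
plt.show()
```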
## Covariance and Correlation
One of the tasks in Data Science is finding relationships between data. We say that two sequences **correlate** when they exhibit similar behavior at the same time, i.e., they either rise/fall simultaneously, or one sequence rises when the other falls and vice versa. In other words, there seems to be some relationship between the two sequences.
> Correlation does not necessarily indicate a causal relationship between two sequences; sometimes both variables can depend on an external cause, or it can be purely by chance that the two sequences correlate. However, strong mathematical correlation is a good indication that two variables are somehow connected.
Mathematically, the main concept that shows the relationship between two random variables is **covariance**, which is computed as: Cov(X,Y) = **E**\[(X-**E**(X))(Y-**E**(Y))\]. We compute the deviation of both variables from their mean values, and then take the product of those deviations. If both variables tend to deviate in the same direction at the same time, the products will mostly be positive, resulting in positive covariance. If they deviate out-of-sync (i.e., one falls below average when the other rises above average), the products will mostly be negative, resulting in negative covariance. If the deviations are unrelated, the products will average out to roughly zero, giving a covariance close to zero.
The absolute value of covariance does not tell us much about the strength of the relationship, as it depends on the magnitude of the actual values. To normalize it, we divide the covariance by the product of the standard deviations of both variables to get **correlation**. The advantage of correlation is that it always lies in the range [-1,1], where 1 indicates strong positive correlation, -1 indicates strong negative correlation, and 0 indicates no linear correlation at all (note that zero correlation does not by itself mean the variables are independent).
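A minimal sketch of these two definitions in code, using two short made-up sequences; the result can be checked against `np.corrcoef`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])            # roughly 2*x, so strongly correlated

cov = np.mean((x - x.mean()) * (y - y.mean()))     # Cov(X,Y) = E[(X-E(X))(Y-E(Y))]
corr = cov / (x.std() * y.std())                   # normalize by both standard deviations
print(cov, corr)                                   # corr is close to 1
print(np.corrcoef(x, y)[0, 1])                     # the same value computed by NumPy
```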
**Example**: We can compute the correlation between the weights and heights of baseball players from the dataset mentioned above:
```python
print(np.corrcoef(weights,heights))
```
As a result, we get a **correlation matrix** like this one:
```
array([[1. , 0.52959196],
[0.52959196, 1. ]])
```
> A correlation matrix C can be computed for any number of input sequences S<sub>1</sub>, ..., S<sub>n</sub>. The value of C<sub>ij</sub> is the correlation between S<sub>i</sub> and S<sub>j</sub>, and diagonal elements are always 1 (which represents the self-correlation of S<sub>i</sub>).
In our case, the value 0.53 indicates that there is some correlation between a person's weight and height. We can also create a scatter plot of one value against the other to visualize the relationship:
![Relationship between weight and height](../../../../1-Introduction/04-stats-and-probability/images/weight-height-relationship.png)
> More examples of correlation and covariance can be found in the [accompanying notebook](../../../../1-Introduction/04-stats-and-probability/notebook.ipynb).
## Conclusion
In this section, we have learned:
* Basic statistical properties of data, such as mean, variance, mode, and quartiles.
* Different distributions of random variables, including the normal distribution.
* How to find correlations between different properties.
* How to use mathematical and statistical tools to prove hypotheses.
* How to compute confidence intervals for random variables given a data sample.
While this is not an exhaustive list of topics in probability and statistics, it should provide a solid foundation for this course.
## 🚀 Challenge
Use the sample code in the notebook to test the following hypotheses:
1. First basemen are older than second basemen.
2. First basemen are taller than third basemen.
3. Shortstops are taller than second basemen.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/7)
## Review & Self Study
Probability and statistics is such a broad topic that it deserves its own course. If you are interested in diving deeper into the theory, you may want to explore the following resources:
1. [Carlos Fernandez-Granda](https://cims.nyu.edu/~cfgranda/) from New York University has excellent lecture notes: [Probability and Statistics for Data Science](https://cims.nyu.edu/~cfgranda/pages/stuff/probability_stats_for_DS.pdf) (available online).
2. [Peter and Andrew Bruce. Practical Statistics for Data Scientists.](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/) [[Sample code in R](https://github.com/andrewgbruce/statistics-for-data-scientists)].
3. [James D. Miller. Statistics for Data Science](https://www.packtpub.com/product/statistics-for-data-science/9781788290678) [[Sample code in R](https://github.com/PacktPublishing/Statistics-for-Data-Science)].
## Assignment
[Small Diabetes Study](assignment.md)
## Credits
This lesson was authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,40 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "01d1b493e8b51a6ebb42524f6b1bcfff",
"translation_date": "2025-08-31T11:09:12+00:00",
"source_file": "1-Introduction/04-stats-and-probability/assignment.md",
"language_code": "en"
}
-->
# Small Diabetes Study
In this assignment, we will work with a small dataset of diabetes patients taken from [here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html).
| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|-----|-----|-----|----|----|----|----|----|----|----|----|
| 0 | 59 | 2 | 32.1 | 101. | 157 | 93.2 | 38.0 | 4. | 4.8598 | 87 | 151 |
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70. | 3. | 3.8918 | 69 | 75 |
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4. | 85 | 141 |
| ... | ... | ... | ... | ...| ...| ...| ...| ...| ...| ...| ... |
## Instructions
* Open the [assignment notebook](../../../../1-Introduction/04-stats-and-probability/assignment.ipynb) in a Jupyter notebook environment
* Complete all tasks listed in the notebook, namely:
* [ ] Calculate the mean values and variance for all variables
* [ ] Create boxplots for BMI, BP, and Y based on gender
* [ ] Analyze the distribution of Age, Sex, BMI, and Y variables
* [ ] Examine the correlation between different variables and disease progression (Y)
* [ ] Test the hypothesis that the progression of diabetes differs between men and women
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
All required tasks are completed, visually represented, and explained | Most tasks are completed, but explanations or insights from graphs and/or calculated values are missing | Only basic tasks like calculating mean/variance and creating simple plots are completed, with no conclusions drawn from the data
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,31 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "696a8474a01054281704cbfb09148949",
"translation_date": "2025-08-31T11:08:01+00:00",
"source_file": "1-Introduction/README.md",
"language_code": "en"
}
-->
# Introduction to Data Science
![data in action](../../../1-Introduction/images/data.jpg)
> Photo by <a href="https://unsplash.com/@dawson2406?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Stephen Dawson</a> on <a href="https://unsplash.com/s/photos/data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In these lessons, you will explore what Data Science is and learn about the ethical responsibilities that a data scientist must keep in mind. You will also understand what data is and get an introduction to statistics and probability, which are the foundational academic fields of Data Science.
### Topics
1. [Defining Data Science](01-defining-data-science/README.md)
2. [Data Science Ethics](02-ethics/README.md)
3. [Defining Data](03-defining-data/README.md)
4. [Introduction to Statistics and Probability](04-stats-and-probability/README.md)
### Credits
These lessons were created with ❤️ by [Nitya Narasimhan](https://twitter.com/nitya) and [Dmitry Soshnikov](https://twitter.com/shwars).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,195 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "870a0086adbc313a8eea5489bdcb2522",
"translation_date": "2025-08-31T10:58:39+00:00",
"source_file": "2-Working-With-Data/05-relational-databases/README.md",
"language_code": "en"
}
-->
# Working with Data: Relational Databases
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/05-RelationalData.png)|
|:---:|
| Working With Data: Relational Databases - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
You've probably used a spreadsheet before to store information. It consists of rows and columns, where the rows hold the data and the columns describe the data (sometimes referred to as metadata). A relational database builds on this concept of rows and columns in tables, enabling you to spread information across multiple tables. This approach allows you to work with more complex data, reduce duplication, and gain flexibility in how you analyze the data. Let's dive into the basics of relational databases.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/8)
## It all starts with tables
At the heart of a relational database are tables. Similar to a spreadsheet, a table is a collection of rows and columns. Rows contain the data you want to work with, such as the name of a city or the amount of rainfall, while columns describe the type of data stored.
Let's start by creating a table to store information about cities. For example, we might want to store their name and country. This could look like the following table:
| City | Country |
| -------- | ------------- |
| Tokyo | Japan |
| Atlanta | United States |
| Auckland | New Zealand |
Notice how the column names **city** and **country** describe the data being stored, and each row contains information about a specific city.
## The shortcomings of a single table approach
The table above might look familiar to you. Now, let's add more data to our growing database—annual rainfall (in millimeters) for the years 2018, 2019, and 2020. If we were to add this data for Tokyo, it might look like this:
| City | Country | Year | Amount |
| ----- | ------- | ---- | ------ |
| Tokyo | Japan | 2020 | 1690 |
| Tokyo | Japan | 2019 | 1874 |
| Tokyo | Japan | 2018 | 1445 |
What do you notice about this table? You might see that we're repeating the name and country of the city multiple times. This repetition can take up unnecessary storage space. After all, Tokyo only has one name and one country.
Let's try a different approach by adding new columns for each year:
| City | Country | 2018 | 2019 | 2020 |
| -------- | ------------- | ---- | ---- | ---- |
| Tokyo | Japan | 1445 | 1874 | 1690 |
| Atlanta | United States | 1779 | 1111 | 1683 |
| Auckland | New Zealand | 1386 | 942 | 1176 |
While this eliminates row duplication, it introduces other challenges. For instance, we'd need to modify the table structure every time a new year is added. Additionally, as the dataset grows, having years as columns makes it harder to retrieve and calculate values.
This is why relational databases use multiple tables and relationships. By breaking data into separate tables, we can avoid duplication and gain more flexibility in how we work with the data.
## The concepts of relationships
Let's revisit our data and decide how to split it into multiple tables. We know we want to store the name and country of each city, so this information can go into one table:
| City | Country |
| -------- | ------------- |
| Tokyo | Japan |
| Atlanta | United States |
| Auckland | New Zealand |
Before creating the next table, we need a way to reference each city. This requires an identifier, often called an ID or, in database terminology, a primary key. A primary key is a unique value used to identify a specific row in a table. While we could use the city name as the identifier, it's better to use a number or another unique value that won't change. Most primary keys are auto-generated numbers.
> ✅ Primary key is often abbreviated as PK
### cities
| city_id | City | Country |
| ------- | -------- | ------------- |
| 1 | Tokyo | Japan |
| 2 | Atlanta | United States |
| 3 | Auckland | New Zealand |
> ✅ Throughout this lesson, you'll notice we use the terms "id" and "primary key" interchangeably. These concepts also apply to DataFrames, which you'll explore later. While DataFrames don't use the term "primary key," they function similarly.
With our cities table created, let's store the rainfall data. Instead of duplicating city information, we can use the city ID. The new table should also have its own ID or primary key.
### rainfall
| rainfall_id | city_id | Year | Amount |
| ----------- | ------- | ---- | ------ |
| 1 | 1 | 2018 | 1445 |
| 2 | 1 | 2019 | 1874 |
| 3 | 1 | 2020 | 1690 |
| 4 | 2 | 2018 | 1779 |
| 5 | 2 | 2019 | 1111 |
| 6 | 2 | 2020 | 1683 |
| 7 | 3 | 2018 | 1386 |
| 8 | 3 | 2019 | 942 |
| 9 | 3 | 2020 | 1176 |
Notice the **city_id** column in the **rainfall** table. This column contains values that reference the IDs in the **cities** table. In relational database terms, this is called a **foreign key**—a primary key from another table. You can think of it as a reference or pointer. For example, **city_id** 1 refers to Tokyo.
> [!NOTE] Foreign key is often abbreviated as FK
## Retrieving the data
With our data split into two tables, you might wonder how to retrieve it. Relational databases like MySQL, SQL Server, or Oracle use a language called Structured Query Language (SQL) for this purpose. SQL (sometimes pronounced "sequel") is a standard language for retrieving and modifying data in relational databases.
To retrieve data, you use the `SELECT` command. Essentially, you **select** the columns you want to view **from** the table they belong to. For example, to display just the names of the cities, you could use the following:
```sql
SELECT city
FROM cities;
-- Output:
-- Tokyo
-- Atlanta
-- Auckland
```
`SELECT` specifies the columns, and `FROM` specifies the table.
> [!NOTE] SQL syntax is case-insensitive, meaning `select` and `SELECT` are treated the same. However, depending on the database, column and table names might be case-sensitive. As a best practice, always treat everything in programming as case-sensitive. In SQL, it's common to write keywords in uppercase.
The query above will display all cities. If you only want to display cities in New Zealand, you can use a filter. The SQL keyword for filtering is `WHERE`, which specifies conditions.
```sql
SELECT city
FROM cities
WHERE country = 'New Zealand';
-- Output:
-- Auckland
```
## Joining data
So far, we've retrieved data from a single table. Now, let's combine data from both **cities** and **rainfall**. This is done by *joining* the tables. Essentially, you create a connection between the two tables by matching values in specific columns.
In our example, we'll match the **city_id** column in **rainfall** with the **city_id** column in **cities**. This will link rainfall data to its corresponding city. The type of join we'll use is called an *inner* join, which only displays rows that have matching values in both tables. Since every city has rainfall data, all rows will be displayed.
Let's retrieve the rainfall data for 2019 for all cities.
We'll do this step by step. First, join the tables by specifying the columns to connect—**city_id** in both tables.
```sql
SELECT cities.city,
    rainfall.amount
FROM cities
INNER JOIN rainfall ON cities.city_id = rainfall.city_id
```
We've highlighted the columns to join and specified the connection using **city_id**. Now, we can add a `WHERE` clause to filter for the year 2019.
```sql
SELECT cities.city,
    rainfall.amount
FROM cities
INNER JOIN rainfall ON cities.city_id = rainfall.city_id
WHERE rainfall.year = 2019
-- Output
-- city | amount
-- -------- | ------
-- Tokyo | 1874
-- Atlanta | 1111
-- Auckland | 942
```
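For reference, the same tables and join can also be reproduced from Python using the standard `sqlite3` module. This is only an optional sketch with a throwaway in-memory database; the SQL queries above are all this lesson requires:

```python
import sqlite3

con = sqlite3.connect(":memory:")   # temporary in-memory database
cur = con.cursor()
cur.execute("CREATE TABLE cities (city_id INTEGER PRIMARY KEY, city TEXT, country TEXT)")
cur.execute("CREATE TABLE rainfall (rainfall_id INTEGER PRIMARY KEY, city_id INTEGER, year INTEGER, amount INTEGER)")
cur.executemany("INSERT INTO cities VALUES (?, ?, ?)",
                [(1, "Tokyo", "Japan"), (2, "Atlanta", "United States"), (3, "Auckland", "New Zealand")])
cur.executemany("INSERT INTO rainfall (city_id, year, amount) VALUES (?, ?, ?)",
                [(1, 2019, 1874), (2, 2019, 1111), (3, 2019, 942)])

query = """
    SELECT cities.city, rainfall.amount
    FROM cities
    INNER JOIN rainfall ON cities.city_id = rainfall.city_id
    WHERE rainfall.year = 2019
"""
for row in cur.execute(query):
    print(row)   # ('Tokyo', 1874), ('Atlanta', 1111), ('Auckland', 942)
```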
## Summary
Relational databases are designed to divide information across multiple tables, which can then be combined for analysis and display. This approach offers flexibility for calculations and data manipulation. You've learned the core concepts of relational databases and how to join data from two tables.
## 🚀 Challenge
There are many relational databases available online. Use the skills you've learned to explore and analyze data.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/9)
## Review & Self Study
Microsoft Learn offers several resources to deepen your understanding of SQL and relational database concepts:
- [Describe concepts of relational data](https://docs.microsoft.com//learn/modules/describe-concepts-of-relational-data?WT.mc_id=academic-77958-bethanycheum)
- [Get Started Querying with Transact-SQL](https://docs.microsoft.com//learn/paths/get-started-querying-with-transact-sql?WT.mc_id=academic-77958-bethanycheum) (Transact-SQL is a version of SQL)
- [SQL content on Microsoft Learn](https://docs.microsoft.com/learn/browse/?products=azure-sql-database%2Csql-server&expanded=azure&WT.mc_id=academic-77958-bethanycheum)
## Assignment
[Displaying airport data](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,73 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "2f2d7693f28e4b2675f275e489dc5aac",
"translation_date": "2025-08-31T10:59:00+00:00",
"source_file": "2-Working-With-Data/05-relational-databases/assignment.md",
"language_code": "en"
}
-->
# Displaying airport data
You have been provided a [database](https://raw.githubusercontent.com/Microsoft/Data-Science-For-Beginners/main/2-Working-With-Data/05-relational-databases/airports.db) built on [SQLite](https://sqlite.org/index.html) which contains information about airports. The schema is shown below. You will use the [SQLite extension](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-77958-bethanycheum) in [Visual Studio Code](https://code.visualstudio.com?WT.mc_id=academic-77958-bethanycheum) to display information about airports in various cities.
## Instructions
To begin the assignment, you'll need to complete a few steps. This involves installing some tools and downloading the sample database.
### Set up your system
You can use Visual Studio Code and the SQLite extension to interact with the database.
1. Go to [code.visualstudio.com](https://code.visualstudio.com?WT.mc_id=academic-77958-bethanycheum) and follow the instructions to install Visual Studio Code.
1. Install the [SQLite extension](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-77958-bethanycheum) as described on the Marketplace page.
### Download and open the database
Next, download and open the database.
1. Download the [database file from GitHub](https://raw.githubusercontent.com/Microsoft/Data-Science-For-Beginners/main/2-Working-With-Data/05-relational-databases/airports.db) and save it to a folder.
1. Open Visual Studio Code.
1. Open the database in the SQLite extension by pressing **Ctrl-Shift-P** (or **Cmd-Shift-P** on a Mac) and typing `SQLite: Open database`.
1. Select **Choose database from file** and open the **airports.db** file you downloaded earlier.
1. After opening the database (you won't see any visible changes on the screen), create a new query window by pressing **Ctrl-Shift-P** (or **Cmd-Shift-P** on a Mac) and typing `SQLite: New query`.
Once the query window is open, you can use it to execute SQL statements against the database. Use the command **Ctrl-Shift-Q** (or **Cmd-Shift-Q** on a Mac) to run queries on the database.
> [!NOTE] For more details about the SQLite extension, refer to the [documentation](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-77958-bethanycheum).
## Database schema
A database's schema defines its table design and structure. The **airports** database contains two tables: `cities`, which lists cities in the United Kingdom and Ireland, and `airports`, which lists all airports. Since some cities may have multiple airports, two separate tables were created to store this information. In this exercise, you will use joins to display data for various cities.
| Cities |
| ---------------- |
| id (PK, integer) |
| city (text) |
| country (text) |
| Airports |
| -------------------------------- |
| id (PK, integer) |
| name (text) |
| code (text) |
| city_id (FK to id in **Cities**) |
## Assignment
Write queries to retrieve the following information:
1. All city names in the `Cities` table.
1. All cities in Ireland from the `Cities` table.
1. All airport names along with their city and country.
1. All airports located in London, United Kingdom.
## Rubric
| Exemplary | Adequate | Needs Improvement |
| --------- | -------- | ----------------- |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,158 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "32ddfef8121650f2ca2f3416fd283c37",
"translation_date": "2025-08-31T10:57:17+00:00",
"source_file": "2-Working-With-Data/06-non-relational/README.md",
"language_code": "en"
}
-->
# Working with Data: Non-Relational Data
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/06-NoSQL.png)|
|:---:|
|Working with NoSQL Data - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/10)
Data isn't limited to relational databases. This lesson focuses on non-relational data and introduces the basics of spreadsheets and NoSQL.
## Spreadsheets
Spreadsheets are a widely used method for storing and analyzing data because they require minimal setup to get started. In this lesson, you'll learn the fundamental components of a spreadsheet, along with formulas and functions. Examples will be demonstrated using Microsoft Excel, but most spreadsheet software will have similar features and steps.
![An empty Microsoft Excel workbook with two worksheets](../../../../2-Working-With-Data/06-non-relational/images/parts-of-spreadsheet.png)
A spreadsheet is a file that can be accessed on a computer, device, or cloud-based file system. The software itself might be browser-based or require installation as an application or app. In Excel, these files are referred to as **workbooks**, and this term will be used throughout the lesson.
A workbook contains one or more **worksheets**, each labeled with tabs. Within a worksheet are rectangles called **cells**, which hold the actual data. A cell is the intersection of a row and column, with columns labeled alphabetically and rows labeled numerically. Some spreadsheets include headers in the first few rows to describe the data in the cells.
Using these basic elements of an Excel workbook, we'll explore an example from [Microsoft Templates](https://templates.office.com/) focused on inventory management to dive deeper into spreadsheet features.
### Managing an Inventory
The spreadsheet file named "InventoryExample" is a formatted inventory spreadsheet containing three worksheets, with tabs labeled "Inventory List," "Inventory Pick List," and "Bin Lookup." Row 4 of the Inventory List worksheet serves as the header, describing the value of each cell in the corresponding column.
![A highlighted formula from an example inventory list in Microsoft Excel](../../../../2-Working-With-Data/06-non-relational/images/formula-excel.png)
Sometimes, a cell's value depends on other cells to calculate its own value. For example, the Inventory List spreadsheet tracks the cost of each item in the inventory, but what if we need to calculate the total value of the inventory? [**Formulas**](https://support.microsoft.com/en-us/office/overview-of-formulas-34519a4e-1e8d-4f4b-84d4-d642c4f63263) perform operations on cell data, and in this case, a formula is used in the Inventory Value column to calculate the value of each item by multiplying the quantity (under the QTY header) by the cost (under the COST header). Double-clicking or highlighting a cell reveals the formula. Formulas always start with an equals sign, followed by the calculation or operation.
![A highlighted function from an example inventory list in Microsoft Excel](../../../../2-Working-With-Data/06-non-relational/images/function-excel.png)
To find the total inventory value, we can use another formula to sum up all the values in the Inventory Value column. While manually adding each cell is possible, it can be tedious. Excel provides [**functions**](https://support.microsoft.com/en-us/office/sum-function-043e1c7d-7726-4e80-8f32-07b23e057f89), which are predefined formulas for performing calculations on cell values. Functions require arguments—the values needed for the calculation. If a function requires multiple arguments, they must be listed in the correct order to ensure accurate results. In this example, the SUM function is used to add up the values in the Inventory Value column, with the total displayed in row 3, column B (B3).
## NoSQL
NoSQL is a broad term encompassing various methods for storing non-relational data. It can be interpreted as "non-SQL," "non-relational," or "not only SQL." These database systems are categorized into four main types.
![Graphical representation of a key-value data store showing 4 unique numerical keys that are associated with 4 various values](../../../../2-Working-With-Data/06-non-relational/images/kv-db.png)
> Source from [Michał Białecki Blog](https://www.michalbialecki.com/2018/03/18/azure-cosmos-db-key-value-database-cloud/)
[Key-value](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#keyvalue-data-stores) databases pair unique keys (identifiers) with associated values. These pairs are stored using a [hash table](https://www.hackerearth.com/practice/data-structures/hash-tables/basics-of-hash-tables/tutorial/) and an appropriate hashing function.
![Graphical representation of a graph data store showing the relationships between people, their interests and locations](../../../../2-Working-With-Data/06-non-relational/images/graph-db.png)
> Source from [Microsoft](https://docs.microsoft.com/en-us/azure/cosmos-db/graph/graph-introduction#graph-database-by-example)
[Graph](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#graph-data-stores) databases represent relationships in data as collections of nodes and edges. Nodes represent entities (e.g., a student or bank statement), while edges represent relationships between entities. Both nodes and edges have properties that provide additional information.
![Graphical representation of a columnar data store showing a customer database with two column families named Identity and Contact Info](../../../../2-Working-With-Data/06-non-relational/images/columnar-db.png)
[Columnar](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#columnar-data-stores) data stores organize data into rows and columns, similar to relational databases, but group columns into column families. All data within a column family is related and can be retrieved or modified as a single unit.
### Document Data Stores with the Azure Cosmos DB
[Document](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#document-data-stores) data stores expand on the concept of key-value stores, consisting of fields and objects. This section explores document databases using the Cosmos DB emulator.
Cosmos DB fits the "Not Only SQL" definition, as its document database uses SQL for querying data. The [previous lesson](../05-relational-databases/README.md) on SQL covers the basics of the language, which can be applied to document databases here. We'll use the Cosmos DB Emulator to create and explore a document database locally. Learn more about the Emulator [here](https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator?tabs=ssl-netstd21).
A document is a collection of fields and object values, where fields describe the object values. Below is an example of a document.
```json
{
"firstname": "Eva",
"age": 44,
"id": "8c74a315-aebf-4a16-bb38-2430a9896ce5",
"_rid": "bHwDAPQz8s0BAAAAAAAAAA==",
"_self": "dbs/bHwDAA==/colls/bHwDAPQz8s0=/docs/bHwDAPQz8s0BAAAAAAAAAA==/",
"_etag": "\"00000000-0000-0000-9f95-010a691e01d7\"",
"_attachments": "attachments/",
"_ts": 1630544034
}
```
Key fields in this document include `firstname`, `id`, and `age`. Other fields with underscores are generated by Cosmos DB.
#### Exploring Data with the Cosmos DB Emulator
You can download and install the emulator [for Windows here](https://aka.ms/cosmosdb-emulator). For macOS and Linux, refer to this [documentation](https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator?tabs=ssl-netstd21#run-on-linux-macos).
The Emulator opens in a browser window, where the Explorer view lets you explore documents.
![The Explorer view of the Cosmos DB Emulator](../../../../2-Working-With-Data/06-non-relational/images/cosmosdb-emulator-explorer.png)
If you're following along, click "Start with Sample" to generate a sample database called SampleDB. Expanding SampleDB reveals a container called `Persons`, which holds a collection of items (documents). You can explore the four individual documents under `Items`.
![Exploring sample data in the Cosmos DB Emulator](../../../../2-Working-With-Data/06-non-relational/images/cosmosdb-emulator-persons.png)
#### Querying Document Data with the Cosmos DB Emulator
You can query the sample data by clicking the "New SQL Query" button (second button from the left).
`SELECT * FROM c` retrieves all documents in the container. Adding a WHERE clause allows filtering, such as finding everyone younger than 40:
`SELECT * FROM c where c.age < 40`
![Running a SELECT query on sample data in the Cosmos DB Emulator to find documents that have an age field value that is less than 40](../../../../2-Working-With-Data/06-non-relational/images/cosmosdb-emulator-persons-query.png)
The query returns two documents, both with age values less than 40.
#### JSON and Documents
If you're familiar with JavaScript Object Notation (JSON), you'll notice that documents resemble JSON. A `PersonsData.json` file in this directory contains additional data that can be uploaded to the Persons container in the Emulator using the `Upload Item` button.
APIs that return JSON data can often be directly stored in document databases. Below is another document, representing tweets from the Microsoft Twitter account retrieved via the Twitter API and inserted into Cosmos DB.
```json
{
"created_at": "2021-08-31T19:03:01.000Z",
"id": "1432780985872142341",
"text": "Blank slate. Like this tweet if youve ever painted in Microsoft Paint before. https://t.co/cFeEs8eOPK",
"_rid": "dhAmAIUsA4oHAAAAAAAAAA==",
"_self": "dbs/dhAmAA==/colls/dhAmAIUsA4o=/docs/dhAmAIUsA4oHAAAAAAAAAA==/",
"_etag": "\"00000000-0000-0000-9f84-a0958ad901d7\"",
"_attachments": "attachments/",
"_ts": 1630537000
```
Key fields in this document include `created_at`, `id`, and `text`.
## 🚀 Challenge
A `TwitterData.json` file can be uploaded to the SampleDB database. It's recommended to add it to a separate container. To do this:
1. Click the "New Container" button in the top right.
2. Select the existing database (SampleDB) and enter an ID for the new container.
3. Set the partition key to `/id`.
4. Click OK (you can ignore the rest of the information since this is a small dataset running locally).
5. Open your new container and upload the Twitter Data file using the `Upload Item` button.
Try running a few SELECT queries to find documents containing "Microsoft" in the text field. Hint: Use the [LIKE keyword](https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-keywords#using-like-with-the--wildcard-character).
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/11)
## Review & Self Study
- This lesson doesn't cover all the formatting and features available in spreadsheets. Microsoft offers a [comprehensive library of documentation and videos](https://support.microsoft.com/excel) for Excel if you'd like to learn more.
- Learn more about the characteristics of different types of non-relational data in this architectural documentation: [Non-relational Data and NoSQL](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data).
- Cosmos DB is a cloud-based non-relational database that supports the NoSQL types discussed in this lesson. Explore these types further in this [Cosmos DB Microsoft Learn Module](https://docs.microsoft.com/en-us/learn/paths/work-with-nosql-data-in-azure-cosmos-db/).
## Assignment
[Soda Profits](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,33 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "f824bfdb8b12d33293913f76f5c787c5",
"translation_date": "2025-08-31T10:57:41+00:00",
"source_file": "2-Working-With-Data/06-non-relational/assignment.md",
"language_code": "en"
}
-->
# Soda Profits
## Instructions
The [Coca Cola Co spreadsheet](../../../../2-Working-With-Data/06-non-relational/CocaColaCo.xlsx) is missing some calculations. Your task is to:
1. Calculate the Gross profits for FY '15, '16, '17, and '18
- Gross Profit = Net Operating revenues - Cost of goods sold
1. Calculate the average of all the gross profits. Try to do this using a function.
- Average = Sum of gross profits divided by the number of fiscal years (10)
- Documentation on the [AVERAGE function](https://support.microsoft.com/en-us/office/average-function-047bac88-d466-426c-a32b-8f33eb960cf6)
1. This is an Excel file, but it should be editable in any spreadsheet platform
[Data source credit to Yiyi Wang](https://www.kaggle.com/yiyiwang0826/cocacola-excel)
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,290 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "116c5d361fbe812e59a73f37ce721d36",
"translation_date": "2025-08-31T10:57:48+00:00",
"source_file": "2-Working-With-Data/07-python/README.md",
"language_code": "en"
}
-->
# Working with Data: Python and the Pandas Library
| ![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/07-WorkWithPython.png) |
| :-------------------------------------------------------------------------------------------------------: |
| Working With Python - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
[![Intro Video](../../../../2-Working-With-Data/07-python/images/video-ds-python.png)](https://youtu.be/dZjWOGbsN4Y)
While databases provide highly efficient methods for storing and querying data using query languages, the most flexible way to process data is to write your own program to manipulate it. In many cases a database query is the more effective choice; however, when more complex data processing is required, it may not be easy to express in SQL.
Data processing can be done in any programming language, but some languages are better suited for working with data. Data scientists often prefer one of the following languages:
* **[Python](https://www.python.org/)**: A general-purpose programming language often considered one of the best options for beginners due to its simplicity. Python has many additional libraries that can help solve practical problems, such as extracting data from ZIP archives or converting images to grayscale. Beyond data science, Python is also widely used for web development.
* **[R](https://www.r-project.org/)**: A traditional toolset designed specifically for statistical data processing. It has a large repository of libraries (CRAN), making it a strong choice for data analysis. However, R is not a general-purpose programming language and is rarely used outside the data science domain.
* **[Julia](https://julialang.org/)**: A language developed specifically for data science, designed to offer better performance than Python, making it an excellent tool for scientific experimentation.
In this lesson, we will focus on using Python for simple data processing. We assume you have basic familiarity with the language. If you'd like a deeper dive into Python, you can explore the following resources:
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse) - A quick introductory course on Python programming hosted on GitHub.
* [Take your First Steps with Python](https://docs.microsoft.com/en-us/learn/paths/python-first-steps/?WT.mc_id=academic-77958-bethanycheum) - A learning path available on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=academic-77958-bethanycheum).
Data can come in various forms. In this lesson, we will focus on three types of data: **tabular data**, **text**, and **images**.
Rather than providing a comprehensive overview of all related libraries, we will focus on a few examples of data processing. This approach will help you grasp the main concepts and equip you with the knowledge to find solutions to your problems when needed.
> **Most useful advice**: When you need to perform a specific operation on data but don't know how, try searching for it online. [Stackoverflow](https://stackoverflow.com/) often contains many useful Python code samples for common tasks.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/12)
## Tabular Data and Dataframes
You've already encountered tabular data when we discussed relational databases. When dealing with large datasets stored in multiple linked tables, SQL is often the best tool for the job. However, there are many situations where you have a single table of data and need to derive **insights** or **understanding** from it, such as analyzing distributions or correlations between values. In data science, it's common to transform the original data and then visualize it. Both steps can be easily accomplished using Python.
Two key libraries in Python are particularly useful for working with tabular data:
* **[Pandas](https://pandas.pydata.org/)**: Enables manipulation of **DataFrames**, which are similar to relational tables. You can work with named columns and perform various operations on rows, columns, and entire DataFrames.
* **[Numpy](https://numpy.org/)**: A library for working with **tensors**, or multi-dimensional **arrays**. Arrays contain values of the same type and are simpler than DataFrames, offering more mathematical operations with less overhead.
Additionally, there are other libraries worth knowing:
* **[Matplotlib](https://matplotlib.org/)**: Used for data visualization and graph plotting.
* **[SciPy](https://www.scipy.org/)**: Provides additional scientific functions. Weve already encountered this library when discussing probability and statistics.
Here's a typical code snippet for importing these libraries at the start of a Python program:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import ... # you need to specify exact sub-packages that you need
```
Pandas revolves around a few fundamental concepts.
### Series
A **Series** is a sequence of values, similar to a list or numpy array. The key difference is that a Series also has an **index**, which is considered during operations (e.g., addition). The index can be as simple as an integer row number (the default when creating a Series from a list or array) or more complex, such as a date range.
> **Note**: Some introductory Pandas code is available in the accompanying notebook [`notebook.ipynb`](../../../../2-Working-With-Data/07-python/notebook.ipynb). We'll outline a few examples here, but feel free to explore the full notebook.
For example, let's analyze sales data for an ice cream shop. We'll generate a Series of sales numbers (items sold each day) over a specific time period:
```python
start_date = "Jan 1, 2020"
end_date = "Mar 31, 2020"
idx = pd.date_range(start_date,end_date)
print(f"Length of index is {len(idx)}")
items_sold = pd.Series(np.random.randint(25,50,size=len(idx)),index=idx)
items_sold.plot()
```
![Time Series Plot](../../../../2-Working-With-Data/07-python/images/timeseries-1.png)
Now, suppose we host a weekly party for friends and take an additional 10 packs of ice cream for the event. We can create another Series, indexed by week, to represent this:
```python
additional_items = pd.Series(10,index=pd.date_range(start_date,end_date,freq="W"))
```
When we add the two Series together, we get the total number:
```python
total_items = items_sold.add(additional_items,fill_value=0)
total_items.plot()
```
![Time Series Plot](../../../../2-Working-With-Data/07-python/images/timeseries-2.png)
> **Note**: We don't use the simple syntax `total_items + additional_items`. If we did, the resulting Series would contain many `NaN` (*Not a Number*) values. This happens because some index points in the `additional_items` Series lack values, and adding `NaN` to anything results in `NaN`. To avoid this, we specify the `fill_value` parameter during addition.
With time series, we can also **resample** the data using different time intervals. For instance, to calculate the average monthly sales volume, we can use the following code:
```python
monthly = total_items.resample("1M").mean()
ax = monthly.plot(kind='bar')
```
![Monthly Time Series Averages](../../../../2-Working-With-Data/07-python/images/timeseries-3.png)
### DataFrame
A DataFrame is essentially a collection of Series with the same index. We can combine multiple Series into a DataFrame:
```python
a = pd.Series(range(1,10))
b = pd.Series(["I","like","to","use","Python","and","Pandas","very","much"],index=range(0,9))
df = pd.DataFrame([a,b])
```
This creates a horizontal table like this:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | ---- | --- | --- | ------ | --- | ------ | ---- | ---- |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 1 | I | like | to | use | Python | and | Pandas | very | much |
We can also use Series as columns and specify column names using a dictionary:
```python
df = pd.DataFrame({ 'A' : a, 'B' : b })
```
This results in the following table:
| | A | B |
| --- | --- | ------ |
| 0 | 1 | I |
| 1 | 2 | like |
| 2 | 3 | to |
| 3 | 4 | use |
| 4 | 5 | Python |
| 5 | 6 | and |
| 6 | 7 | Pandas |
| 7 | 8 | very |
| 8 | 9 | much |
**Note**: We can also achieve this table layout by transposing the previous table using:
```python
df = pd.DataFrame([a,b]).T.rename(columns={ 0 : 'A', 1 : 'B' })
```
Here, `.T` performs the transposition (swapping rows and columns), and the `rename` operation allows us to rename columns to match the previous example.
Here are some key operations you can perform on DataFrames:
**Column selection**: Select individual columns using `df['A']` (returns a Series). To select a subset of columns into another DataFrame, use `df[['B', 'A']]`.
**Filtering rows by criteria**: For example, to keep only rows where column `A` is greater than 5, use `df[df['A'] > 5]`.
> **Note**: Filtering works as follows: The expression `df['A'] > 5` returns a boolean Series indicating whether the condition is `True` or `False` for each element in `df['A']`. When a boolean Series is used as an index, it returns a subset of rows in the DataFrame. You cannot use arbitrary Python boolean expressions like `df[df['A'] > 5 and df['A'] < 7]`. Instead, use the special `&` operator for boolean Series: `df[(df['A'] > 5) & (df['A'] < 7)]` (*brackets are essential*).
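A few of these operations in one place, as a quick sketch (using the `df` built above):

```python
a_col  = df['A']                             # a single column -> Series
subset = df[['B', 'A']]                      # several columns -> a new DataFrame
big    = df[df['A'] > 5]                     # rows where A is greater than 5
both   = df[(df['A'] > 5) & (df['A'] < 7)]   # combined condition (brackets are essential)
```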
**Creating new computed columns**: Easily create new columns using expressions like:
```python
df['DivA'] = df['A']-df['A'].mean()
```
This example calculates the divergence of `A` from its mean value. Here, we compute a Series and assign it to the left-hand side, creating a new column. However, expressions that do not operate element-wise on a Series either raise an error or produce an unexpected result, such as:
```python
# Wrong code -> df['ADescr'] = "Low" if df['A'] < 5 else "Hi"
df['LenB'] = len(df['B']) # <- Wrong result
```
This example, while syntactically correct, produces incorrect results because it assigns the length of Series `B` to all values in the column, rather than the length of individual elements.
For complex expressions, use the `apply` function. The previous example can be rewritten as:
```python
df['LenB'] = df['B'].apply(lambda x : len(x))
# or
df['LenB'] = df['B'].apply(len)
```
After these operations, the resulting DataFrame will look like this:
| | A | B | DivA | LenB |
| --- | --- | ------ | ---- | ---- |
| 0 | 1 | I | -4.0 | 1 |
| 1 | 2 | like | -3.0 | 4 |
| 2 | 3 | to | -2.0 | 2 |
| 3 | 4 | use | -1.0 | 3 |
| 4 | 5 | Python | 0.0 | 6 |
| 5 | 6 | and | 1.0 | 3 |
| 6 | 7 | Pandas | 2.0 | 6 |
| 7 | 8 | very | 3.0 | 4 |
| 8 | 9 | much | 4.0 | 4 |
**Selecting rows by index**: Use the `iloc` construct to select rows by their position. For example, to select the first 5 rows:
```python
df.iloc[:5]
```
**Grouping**: Often used to create results similar to *pivot tables* in Excel. For instance, to compute the mean value of column `A` for each unique value in `LenB`, group the DataFrame by `LenB` and call `mean`:
```python
df.groupby(by='LenB').mean(numeric_only=True)  # numeric_only=True skips the text column B
```
To compute both the mean and the count of elements in each group, use the `aggregate` function:
```python
df.groupby(by='LenB') \
.aggregate({ 'DivA' : len, 'A' : lambda x: x.mean() }) \
.rename(columns={ 'DivA' : 'Count', 'A' : 'Mean'})
```
This produces the following table:
| LenB | Count | Mean |
| ---- | ----- | -------- |
| 1 | 1 | 1.000000 |
| 2 | 1 | 3.000000 |
| 3 | 2 | 5.000000 |
| 4 | 3 | 6.333333 |
| 6 | 2 | 6.000000 |
### Getting Data
We have seen how simple it is to create Series and DataFrames from Python objects. However, data is often stored in text files or Excel tables. Fortunately, Pandas provides an easy way to load data from disk. For example, reading a CSV file is as straightforward as this:
```python
df = pd.read_csv('file.csv')
```
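Pandas has similar readers for other formats; here is a rough sketch (the file names and URL below are placeholders, not files shipped with this lesson):
```python
import pandas as pd

df_from_url   = pd.read_csv('https://example.com/data.csv')   # read a CSV directly from a URL
df_from_excel = pd.read_excel('file.xlsx', sheet_name=0)      # .xlsx files need the openpyxl package
df_from_json  = pd.read_json('records.json')                  # JSON records
```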
We will explore more examples of loading data, including retrieving it from external websites, in the "Challenge" section.
### Printing and Plotting
A Data Scientist frequently needs to explore data, so being able to visualize it is crucial. When working with large DataFrames, we often want to ensure everything is functioning correctly by printing the first few rows. This can be done using `df.head()`. If you run this in Jupyter Notebook, it will display the DataFrame in a neat tabular format.
We've also seen how to use the `plot` function to visualize specific columns. While `plot` is highly versatile and supports various graph types via the `kind=` parameter, you can always use the raw `matplotlib` library for more complex visualizations. We will delve deeper into data visualization in separate course lessons.
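For instance, a quick sketch reusing the `total_items` Series from earlier (the chart choices are purely illustrative):
```python
import matplotlib.pyplot as plt

total_items.plot(kind='area', title='Items per day')                          # same data as an area chart
plt.show()
total_items.plot(kind='hist', bins=20, title='Distribution of daily totals')  # and as a histogram
plt.show()
```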
This overview covers the key concepts of Pandas, but the library is incredibly rich, and the possibilities are endless! Let's now apply this knowledge to solve specific problems.
## 🚀 Challenge 1: Analyzing COVID Spread
The first problem we'll tackle is modeling the spread of the COVID-19 epidemic. To do this, we'll use data on the number of infected individuals in various countries, provided by the [Center for Systems Science and Engineering](https://systems.jhu.edu/) (CSSE) at [Johns Hopkins University](https://jhu.edu/). The dataset is available in [this GitHub Repository](https://github.com/CSSEGISandData/COVID-19).
To demonstrate how to work with data, we encourage you to open [`notebook-covidspread.ipynb`](../../../../2-Working-With-Data/07-python/notebook-covidspread.ipynb) and go through it from start to finish. You can also execute the cells and try out some challenges we've included at the end.
![COVID Spread](../../../../2-Working-With-Data/07-python/images/covidspread.png)
> If you're unfamiliar with running code in Jupyter Notebook, check out [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
## Working with Unstructured Data
While data often comes in tabular form, there are cases where we need to work with less structured data, such as text or images. In these situations, to apply the data processing techniques we've discussed, we need to **extract** structured data. Here are a few examples (a small sketch of the first one follows this list):
* Extracting keywords from text and analyzing their frequency
* Using neural networks to identify objects in images
* Detecting emotions in people from video camera feeds
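As a tiny sketch of the first bullet (keyword frequency), using a hard-coded sample sentence rather than a real corpus:
```python
import re
from collections import Counter

text = "COVID vaccines and COVID treatments are discussed in many COVID papers."
words = re.findall(r"[a-z]+", text.lower())        # naive tokenization
stopwords = {"and", "are", "in", "many"}           # tiny, hand-made stop list
keywords = Counter(w for w in words if w not in stopwords)
print(keywords.most_common(3))                     # [('covid', 3), ('vaccines', 1), ('treatments', 1)]
```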
## 🚀 Challenge 2: Analyzing COVID Papers
In this challenge, well continue exploring the COVID pandemic by focusing on processing scientific papers on the topic. The [CORD-19 Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) contains over 7,000 papers (at the time of writing) on COVID, along with metadata and abstracts (and full text for about half of them).
A complete example of analyzing this dataset using the [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health/?WT.mc_id=academic-77958-bethanycheum) cognitive service is described [in this blog post](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/). We'll discuss a simplified version of this analysis.
> **NOTE**: This repository does not include a copy of the dataset. You may need to download the [`metadata.csv`](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv) file from [this Kaggle dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Registration with Kaggle may be required. Alternatively, you can download the dataset without registration [from here](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html), which includes all full texts in addition to the metadata file.
Open [`notebook-papers.ipynb`](../../../../2-Working-With-Data/07-python/notebook-papers.ipynb) and go through it from start to finish. You can also execute the cells and try out some challenges we've included at the end.
![Covid Medical Treatment](../../../../2-Working-With-Data/07-python/images/covidtreat.png)
## Processing Image Data
Recently, powerful AI models have been developed to analyze images. Many tasks can be accomplished using pre-trained neural networks or cloud services. Examples include:
* **Image Classification**, which categorizes images into predefined classes. You can train your own image classifiers using services like [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum).
* **Object Detection**, which identifies various objects in an image. Services like [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum) can detect common objects, and you can train [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum) models to detect specific objects of interest.
* **Face Detection**, including age, gender, and emotion analysis. This can be achieved using [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum).
These cloud services can be accessed via [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=academic-77958-bethanycheum), making it easy to integrate them into your data exploration workflow.
Here are some examples of working with image data sources:
* In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/), we analyze Instagram photos to understand what makes people like a photo more. We extract information from images using [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum) and use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=academic-77958-bethanycheum) to build an interpretable model.
* In the [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies), we use [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum) to analyze emotions in event photographs to understand what makes people happy.
## Conclusion
Whether you're working with structured or unstructured data, Python allows you to perform all steps related to data processing and analysis. It's one of the most flexible tools for data processing, which is why most data scientists use Python as their primary tool. If you're serious about pursuing data science, learning Python in depth is highly recommended!
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/13)
## Review & Self Study
**Books**
* [Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython](https://www.amazon.com/gp/product/1491957662)
**Online Resources**
* Official [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) tutorial
* [Documentation on Pandas Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
**Learning Python**
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse)
* [Take your First Steps with Python](https://docs.microsoft.com/learn/paths/python-first-steps/?WT.mc_id=academic-77958-bethanycheum) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=academic-77958-bethanycheum)
## Assignment
[Perform more detailed data study for the challenges above](assignment.md)
## Credits
This lesson was created with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,37 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "dc8f035ce92e4eaa078ab19caa68267a",
"translation_date": "2025-08-31T10:58:32+00:00",
"source_file": "2-Working-With-Data/07-python/assignment.md",
"language_code": "en"
}
-->
# Assignment for Data Processing in Python
In this assignment, we will ask you to expand upon the code we started developing in our challenges. The assignment consists of two parts:
## COVID-19 Spread Modeling
- [ ] Plot *R* graphs for 5-6 different countries on one plot for comparison, or using several plots side-by-side.
- [ ] Analyze how the number of deaths and recoveries correlates with the number of infected cases.
- [ ] Determine how long a typical disease lasts by visually correlating infection rates and death rates, and identifying any anomalies. You may need to examine data from different countries to figure this out.
- [ ] Calculate the fatality rate and observe how it changes over time. *You may want to account for the duration of the disease in days to shift one time series before performing calculations.*
## COVID-19 Papers Analysis
- [ ] Build a co-occurrence matrix for different medications and identify which medications are frequently mentioned together (i.e., in the same abstract). You can adapt the code for building a co-occurrence matrix for medications and diagnoses.
- [ ] Visualize this matrix using a heatmap.
- [ ] As a stretch goal, visualize the co-occurrence of medications using [chord diagram](https://en.wikipedia.org/wiki/Chord_diagram). [This library](https://pypi.org/project/chord/) may assist you in creating a chord diagram.
- [ ] As another stretch goal, extract dosages of different medications (e.g., **400mg** in *take 400mg of chloroquine daily*) using regular expressions, and build a dataframe that displays various dosages for different medications. **Note**: Consider numeric values that are located near the medication name in the text.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
All tasks are completed, graphically illustrated, and explained, including at least one of the two stretch goals | More than 5 tasks are completed, no stretch goals are attempted, or the results are unclear | Fewer than 5 (but more than 3) tasks are completed, and visualizations do not effectively demonstrate the point
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,324 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "3ade580a06b5f04d57cc83a768a8fb77",
"translation_date": "2025-08-31T10:59:11+00:00",
"source_file": "2-Working-With-Data/08-data-preparation/README.md",
"language_code": "en"
}
-->
# Working with Data: Data Preparation
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/08-DataPreparation.png)|
|:---:|
|Data Preparation - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/14)
Raw data, depending on its source, may have inconsistencies that make analysis and modeling difficult. This type of data is often referred to as "dirty" and requires cleaning. This lesson focuses on techniques for cleaning and transforming data to address issues like missing, inaccurate, or incomplete data. The topics covered will use Python and the Pandas library and will be [demonstrated in the notebook](../../../../2-Working-With-Data/08-data-preparation/notebook.ipynb) in this directory.
## The importance of cleaning data
- **Ease of use and reuse**: Properly organized and normalized data is easier to search, use, and share with others.
- **Consistency**: Data science often involves working with multiple datasets, which may need to be combined. Ensuring that each dataset follows common standards makes the merged data more useful.
- **Model accuracy**: Clean data improves the accuracy of models that depend on it.
## Common cleaning goals and strategies
- **Exploring a dataset**: Data exploration, covered in a [later lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/4-Data-Science-Lifecycle/15-analyzing), helps identify data that needs cleaning. Observing values visually can set expectations or highlight problems to address. Exploration can involve querying, visualizations, and sampling.
- **Formatting**: Data from different sources may have inconsistencies in presentation, which can affect searches and visualizations. Common formatting issues include whitespace, dates, and data types. Resolving these issues often depends on the user's needs, as standards for dates and numbers vary by region.
- **Duplications**: Duplicate data can lead to inaccurate results and often needs to be removed. However, in some cases, duplicates may contain additional information and should be preserved.
- **Missing Data**: Missing data can lead to inaccuracies or biased results. Solutions include reloading the data, filling in missing values programmatically, or removing the affected data. The approach depends on the reasons behind the missing data.
## Exploring DataFrame information
> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.
Once data is loaded into pandas, it is typically stored in a DataFrame (refer to the previous [lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/2-Working-With-Data/07-python#dataframe) for an overview). If your DataFrame contains 60,000 rows and 400 columns, how do you start understanding it? Fortunately, [pandas](https://pandas.pydata.org/) offers tools to quickly view overall information about a DataFrame, as well as its first and last few rows.
To explore this functionality, we will use the Python scikit-learn library and the well-known **Iris dataset**.
```python
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
```
| |sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|
|----------------------------------------|-----------------|----------------|-----------------|----------------|
|0 |5.1 |3.5 |1.4 |0.2 |
|1 |4.9 |3.0 |1.4 |0.2 |
|2 |4.7 |3.2 |1.3 |0.2 |
|3 |4.6 |3.1 |1.5 |0.2 |
|4 |5.0 |3.6 |1.4 |0.2 |
- **DataFrame.info**: The `info()` method provides a summary of the content in a `DataFrame`. Let's examine this dataset:
```python
iris_df.info()
```
```
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB
```
This tells us that the *Iris* dataset has 150 entries across four columns, with no null values. All data is stored as 64-bit floating-point numbers.
- **DataFrame.head()**: To view the first few rows of the `DataFrame`, use the `head()` method:
```python
iris_df.head()
```
```
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
```
- **DataFrame.tail()**: To view the last few rows of the `DataFrame`, use the `tail()` method:
```python
iris_df.tail()
```
```
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
```
> **Takeaway:** By examining metadata and the first/last few rows of a DataFrame, you can quickly understand its size, structure, and content.
## Dealing with Missing Data
> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.
Datasets often contain missing values. How you handle missing data can impact your analysis and real-world outcomes.
Pandas uses two sentinel values to represent missing data: `NaN` (Not a Number) for floating-point data and the Python `None` object for other types. While this dual approach may seem confusing, it provides flexibility for most use cases. However, both `NaN` and `None` have limitations you should be aware of.
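A minimal illustration of how these two sentinels behave (independent of any particular dataset):
```python
import numpy as np
import pandas as pd

print(pd.Series([1, 2, None]).dtype)      # float64 - None is upcast to NaN
print(pd.Series(['a', 'b', None]).dtype)  # object  - None is stored as-is
print(np.nan == np.nan)                   # False   - NaN never compares equal; use isnull() instead
```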
Learn more about `NaN` and `None` in the [notebook](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb)!
- **Detecting null values**: Use the `isnull()` and `notnull()` methods to detect null data. Both return Boolean masks over your data. We'll use `numpy` for `NaN` values:
```python
import numpy as np
example1 = pd.Series([0, np.nan, '', None])
example1.isnull()
```
```
0 False
1 True
2 False
3 True
dtype: bool
```
Notice the output. While `0` is an arithmetic null, pandas treats it as a valid integer. Similarly, `''` (an empty string) is considered a valid string, not null.
You can use Boolean masks directly as a `Series` or `DataFrame` index to isolate missing or present values.
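For example, reusing `example1` from above:
```python
example1[example1.notnull()]   # keeps only the non-null entries: 0 and ''
```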
> **Takeaway**: The `isnull()` and `notnull()` methods provide results with indices, making it easier to work with your data.
- **Dropping null values**: Pandas offers a convenient way to remove null values from `Series` and `DataFrame`s. For large datasets, removing missing values is often more practical than other approaches. Let's revisit `example1`:
```python
example1 = example1.dropna()
example1
```
```
0 0
2
dtype: object
```
This output matches `example1[example1.notnull()]`, but `dropna` removes missing values directly from the `Series`.
For `DataFrame`s, you can drop entire rows or columns. By default, `dropna()` removes rows with any null values:
```python
example2 = pd.DataFrame([[1, np.nan, 7],
[2, 5, 8],
[np.nan, 6, 9]])
example2
```
| | 0 | 1 | 2 |
|------|---|---|---|
|0 |1.0|NaN|7 |
|1 |2.0|5.0|8 |
|2 |NaN|6.0|9 |
(Pandas converts columns to floats to accommodate `NaN`s.)
Calling `dropna()` with its default settings removes the rows that contain null values (to drop columns instead, pass `axis=1` or `axis='columns'`):
```python
example2.dropna()
```
```
0 1 2
1 2.0 5.0 8
```
You can also drop rows or columns with all null values using `how='all'`. For finer control, use the `thresh` parameter to specify the minimum number of non-null values required to keep a row or column:
```python
example2[3] = np.nan
example2
```
| |0 |1 |2 |3 |
|------|---|---|---|---|
|0 |1.0|NaN|7 |NaN|
|1 |2.0|5.0|8 |NaN|
|2 |NaN|6.0|9 |NaN|
```python
example2.dropna(axis='rows', thresh=3)
```
```
0 1 2 3
1 2.0 5.0 8 NaN
```
Here, rows with fewer than three non-null values are dropped.
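To see `how='all'` on the same data, we could drop only the columns that are entirely null (here, that is column `3`):
```python
example2.dropna(axis='columns', how='all')
```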
- **Filling null values**: Instead of dropping null values, you can replace them with valid ones using `fillna`. This method is more efficient than manually replacing values. Let's create another example `Series`:
```python
example3 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example3
```
```
a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64
```
You can replace all null entries with a single value, like `0`:
```python
example3.fillna(0)
```
```
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64
```
You can **forward-fill** null values using the last valid value:
```python
example3.fillna(method='ffill')
```
```
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64
```
You can also **back-fill** null values using the next valid value:
```python
example3.fillna(method='bfill')
```
```
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
```
This works similarly for `DataFrame`s, where you can specify an `axis` for filling null values. Using `example2` again:
```python
example2.fillna(method='ffill', axis=1)
```
```
0 1 2 3
0 1.0 1.0 7.0 7.0
1 2.0 5.0 8.0 8.0
2 NaN 6.0 9.0 9.0
```
If no previous value exists for forward-filling, the null value remains.
> **Takeaway:** There are several ways to handle missing values in your datasets. The specific approach you choose (removing them, replacing them, or even how you replace them) should depend on the characteristics of the data. The more you work with and explore datasets, the better you'll become at managing missing values.
## Removing duplicate data
> **Learning goal:** By the end of this subsection, you should feel confident identifying and removing duplicate values from DataFrames.
In addition to missing data, real-world datasets often contain duplicate entries. Luckily, `pandas` offers a straightforward way to detect and remove duplicates.
- **Identifying duplicates: `duplicated`**: You can easily identify duplicate values using the `duplicated` method in pandas. This method returns a Boolean mask that indicates whether an entry in a `DataFrame` is a duplicate of a previous one. Let's create another example `DataFrame` to see how this works.
```python
example4 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
'numbers': [1, 2, 1, 3, 3]})
example4
```
| |letters|numbers|
|------|-------|-------|
|0 |A |1 |
|1 |B |2 |
|2 |A |1 |
|3 |B |3 |
|4 |B |3 |
```python
example4.duplicated()
```
```
0 False
1 False
2 True
3 False
4 True
dtype: bool
```
- **Dropping duplicates: `drop_duplicates`:** This method simply returns a copy of the data where all `duplicated` values are `False`:
```python
example4.drop_duplicates()
```
```
letters numbers
0 A 1
1 B 2
3 B 3
```
Both `duplicated` and `drop_duplicates` default to considering all columns, but you can specify that they only examine a subset of columns in your `DataFrame`:
```python
example4.drop_duplicates(['letters'])
```
```
letters numbers
0 A 1
1 B 2
```
> **Takeaway:** Removing duplicate data is a crucial step in almost every data science project. Duplicate data can skew your analysis and lead to inaccurate results!
## 🚀 Challenge
All the materials covered are available as a [Jupyter Notebook](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/2-Working-With-Data/08-data-preparation/notebook.ipynb). Additionally, there are exercises at the end of each section—give them a try!
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/15)
## Review & Self Study
There are many ways to explore and approach preparing your data for analysis and modeling. Cleaning your data is a critical step that requires hands-on practice. Try these Kaggle challenges to learn techniques not covered in this lesson:
- [Data Cleaning Challenge: Parsing Dates](https://www.kaggle.com/rtatman/data-cleaning-challenge-parsing-dates/)
- [Data Cleaning Challenge: Scale and Normalize Data](https://www.kaggle.com/rtatman/data-cleaning-challenge-scale-and-normalize-data)
## Assignment
[Evaluating Data from a Form](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,28 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "f9d5a7275e046223fa6474477674b810",
"translation_date": "2025-08-31T10:59:43+00:00",
"source_file": "2-Working-With-Data/08-data-preparation/assignment.md",
"language_code": "en"
}
-->
# Evaluating Data from a Form
A client has been testing a [small form](../../../../2-Working-With-Data/08-data-preparation/index.html) to collect some basic information about their customer base. They have shared their findings with you to validate the data they have gathered. You can open the `index.html` page in your browser to review the form.
You have been provided with a [dataset of csv records](../../../../data/form.csv) containing entries from the form, along with some basic visualizations. The client has noted that some of the visualizations appear incorrect, but they are unsure how to fix them. You can explore this further in the [assignment notebook](../../../../2-Working-With-Data/08-data-preparation/assignment.ipynb).
## Instructions
Use the techniques covered in this lesson to provide recommendations for improving the form so that it collects accurate and consistent information.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,31 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "abc3309ab41bc5a7846f70ee1a055838",
"translation_date": "2025-08-31T10:57:11+00:00",
"source_file": "2-Working-With-Data/README.md",
"language_code": "en"
}
-->
# Working with Data
![data love](../../../2-Working-With-Data/images/data-love.jpg)
> Photo by <a href="https://unsplash.com/@swimstaralex?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Alexander Sinn</a> on <a href="https://unsplash.com/s/photos/data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In these lessons, you will explore various methods for managing, manipulating, and utilizing data in applications. You'll dive into relational and non-relational databases and learn how data is stored within them. Additionally, you'll gain foundational knowledge of using Python to handle data and uncover numerous ways to leverage Python for data management and analysis.
### Topics
1. [Relational databases](05-relational-databases/README.md)
2. [Non-relational databases](06-non-relational/README.md)
3. [Working with Python](07-python/README.md)
4. [Preparing data](08-data-preparation/README.md)
### Credits
These lessons were created with ❤️ by [Christopher Harrison](https://twitter.com/geektrainer), [Dmitry Soshnikov](https://twitter.com/shwars), and [Jasmine Greenaway](https://twitter.com/paladique)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,222 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "43c402d9d90ae6da55d004519ada5033",
"translation_date": "2025-08-31T11:05:55+00:00",
"source_file": "3-Data-Visualization/09-visualization-quantities/README.md",
"language_code": "en"
}
-->
# Visualizing Quantities
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/09-Visualizing-Quantities.png)|
|:---:|
| Visualizing Quantities - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you'll learn how to use one of the many Python libraries available to create engaging visualizations focused on the concept of quantity. Using a cleaned dataset about the birds of Minnesota, you'll uncover fascinating insights about local wildlife.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/16)
## Observe wingspan with Matplotlib
[Matplotlib](https://matplotlib.org/stable/index.html) is an excellent library for creating both simple and complex plots and charts of various types. Generally, the process of plotting data with these libraries involves identifying the parts of your dataframe to target, performing any necessary transformations, assigning x and y axis values, choosing the type of plot, and displaying the plot. Matplotlib offers a wide range of visualizations, but for this lesson, we'll focus on those best suited for visualizing quantities: line charts, scatterplots, and bar plots.
> ✅ Choose the chart type that best fits your data structure and the story you want to tell.
> - To analyze trends over time: line
> - To compare values: bar, column, pie, scatterplot
> - To show how parts relate to a whole: pie
> - To show data distribution: scatterplot, bar
> - To show trends: line, column
> - To show relationships between values: line, scatterplot, bubble
If you have a dataset and need to determine how much of a specific item is included, one of your first tasks will be to inspect its values.
✅ There are excellent 'cheat sheets' for Matplotlib available [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## Build a line plot about bird wingspan values
Open the `notebook.ipynb` file located at the root of this lesson folder and add a cell.
> Note: The data is stored in the root of this repository in the `/data` folder.
```python
import pandas as pd
import matplotlib.pyplot as plt
birds = pd.read_csv('../../data/birds.csv')
birds.head()
```
This data contains a mix of text and numbers:
| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
| ---: | :--------------------------- | :--------------------- | :-------------------- | :----------- | :------- | :---------- | :----------------- | --------: | --------: | ----------: | ----------: | ----------: | ----------: |
| 0 | Black-bellied whistling-duck | Dendrocygna autumnalis | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 47 | 56 | 652 | 1020 | 76 | 94 |
| 1 | Fulvous whistling-duck | Dendrocygna bicolor | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 45 | 53 | 712 | 1050 | 85 | 93 |
| 2 | Snow goose | Anser caerulescens | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 79 | 2050 | 4050 | 135 | 165 |
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64 | 1066 | 1567 | 113 | 116 |
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 81 | 1930 | 3310 | 130 | 165 |
Let's start by plotting some of the numeric data using a basic line plot. Suppose you wanted to visualize the maximum wingspan of these fascinating birds.
```python
wingspan = birds['MaxWingspan']
wingspan.plot()
```
![Max Wingspan](../../../../3-Data-Visualization/09-visualization-quantities/images/max-wingspan-02.png)
What stands out immediately? There seems to be at least one outlier—what a wingspan! A 2300-centimeter wingspan equals 23 meters—are there Pterodactyls in Minnesota? Let's investigate.
While you could quickly sort the data in Excel to find these outliers (likely typos), continue the visualization process by working directly within the plot.
Add labels to the x-axis to show the types of birds in question:
```python
plt.title('Max Wingspan in Centimeters')
plt.ylabel('Wingspan (CM)')
plt.xlabel('Birds')
plt.xticks(rotation=45)
x = birds['Name']
y = birds['MaxWingspan']
plt.plot(x, y)
plt.show()
```
![Wingspan with labels](../../../../3-Data-Visualization/09-visualization-quantities/images/max-wingspan-labels-02.png)
Even with the labels rotated 45 degrees, there are too many to read. Let's try a different approach: label only the outliers and set the labels within the chart. You can use a scatter chart to make room for the labeling:
```python
plt.title('Max Wingspan in Centimeters')
plt.ylabel('Wingspan (CM)')
plt.tick_params(axis='both',which='both',labelbottom=False,bottom=False)
for i in range(len(birds)):
x = birds['Name'][i]
y = birds['MaxWingspan'][i]
plt.plot(x, y, 'bo')
if birds['MaxWingspan'][i] > 500:
plt.text(x, y * (1 - 0.05), birds['Name'][i], fontsize=12)
plt.show()
```
What's happening here? You used `tick_params` to hide the bottom labels and then created a loop over your birds dataset. By plotting the chart with small round blue dots using `bo`, you checked for any bird with a maximum wingspan over 500 and displayed its label next to the dot. You offset the labels slightly on the y-axis (`y * (1 - 0.05)`) and used the bird name as the label.
What did you discover?
![Outliers](../../../../3-Data-Visualization/09-visualization-quantities/images/labeled-wingspan-02.png)
## Filter your data
Both the Bald Eagle and the Prairie Falcon, while likely large birds, appear to be mislabeled with an extra `0` added to their maximum wingspan. It's unlikely you'll encounter a Bald Eagle with a 25-meter wingspan, but if you do, let us know! Let's create a new dataframe without these two outliers:
```python
plt.title('Max Wingspan in Centimeters')
plt.ylabel('Wingspan (CM)')
plt.xlabel('Birds')
plt.tick_params(axis='both',which='both',labelbottom=False,bottom=False)
for i in range(len(birds)):
x = birds['Name'][i]
y = birds['MaxWingspan'][i]
if birds['Name'][i] not in ['Bald eagle', 'Prairie falcon']:
plt.plot(x, y, 'bo')
plt.show()
```
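If you want an actual new dataframe without those two rows, rather than only skipping them while plotting as above, one possible sketch (the variable name is ours) is:
```python
# Keep every bird except the two suspected outliers
birds_cleaned = birds[~birds['Name'].isin(['Bald eagle', 'Prairie falcon'])].reset_index(drop=True)
birds_cleaned.shape
```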
By filtering out outliers, your data becomes more cohesive and easier to understand.
![Scatterplot of wingspans](../../../../3-Data-Visualization/09-visualization-quantities/images/scatterplot-wingspan-02.png)
Now that we have a cleaner dataset, at least in terms of wingspan, let's explore more about these birds.
While line and scatter plots can display information about data values and their distributions, we want to focus on the quantities inherent in this dataset. You could create visualizations to answer questions like:
> How many categories of birds are there, and what are their counts?
> How many birds are extinct, endangered, rare, or common?
> How many birds belong to various genera and orders in Linnaeus's classification?
## Explore bar charts
Bar charts are useful for showing groupings of data. Let's explore the bird categories in this dataset to see which is the most common.
In the notebook file, create a basic bar chart.
✅ Note: You can either filter out the two outlier birds identified earlier, correct the typo in their wingspan, or leave them in for these exercises, which don't depend on wingspan values.
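If you choose to correct the typo instead, a hedged sketch, assuming the problem really is a single extra trailing zero, could look like this:
```python
# Divide the two suspicious wingspans by 10 to remove the apparent extra zero
typo_rows = birds['Name'].isin(['Bald eagle', 'Prairie falcon'])
birds.loc[typo_rows, 'MaxWingspan'] = birds.loc[typo_rows, 'MaxWingspan'] / 10
```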
To create a bar chart, select the data you want to focus on. Bar charts can be created from raw data:
```python
birds.plot(x='Category',
kind='bar',
stacked=True,
title='Birds of Minnesota')
```
![Full data as a bar chart](../../../../3-Data-Visualization/09-visualization-quantities/images/full-data-bar-02.png)
This bar chart, however, is unreadable due to too much ungrouped data. You need to select only the data you want to plot, so let's examine the length of birds based on their category.
Filter your data to include only the bird's category.
✅ Notice how you use Pandas to manage the data and let Matplotlib handle the charting.
Since there are many categories, display this chart vertically and adjust its height to accommodate all the data:
```python
category_count = birds.value_counts(birds['Category'].values, sort=True)
plt.rcParams['figure.figsize'] = [6, 12]
category_count.plot.barh()
```
![Category and length](../../../../3-Data-Visualization/09-visualization-quantities/images/category-counts-02.png)
This bar chart provides a clear view of the number of birds in each category. At a glance, you can see that the largest number of birds in this region belong to the Ducks/Geese/Waterfowl category. Given Minnesota's nickname as the 'land of 10,000 lakes,' this isn't surprising!
✅ Try counting other aspects of this dataset. Does anything surprise you?
## Comparing data
You can compare grouped data by creating new axes. Try comparing the MaxLength of birds based on their category:
```python
maxlength = birds['MaxLength']
plt.barh(y=birds['Category'], width=maxlength)
plt.rcParams['figure.figsize'] = [6, 12]
plt.show()
```
![Comparing data](../../../../3-Data-Visualization/09-visualization-quantities/images/category-length-02.png)
Nothing surprising here: hummingbirds have the smallest MaxLength compared to pelicans or geese. It's reassuring when data aligns with logic!
You can create more engaging bar chart visualizations by overlaying data. Let's overlay Minimum and Maximum Length for each bird category:
```python
minLength = birds['MinLength']
maxLength = birds['MaxLength']
category = birds['Category']
plt.barh(category, maxLength)
plt.barh(category, minLength)
plt.show()
```
In this plot, you can see the range of Minimum and Maximum Length for each bird category. You can confidently say that, based on this data, larger birds tend to have a wider length range. Fascinating!
![Superimposed values](../../../../3-Data-Visualization/09-visualization-quantities/images/superimposed-02.png)
## 🚀 Challenge
This bird dataset offers a wealth of information about different bird types within a specific ecosystem. Search online for other bird-related datasets. Practice building charts and graphs to uncover facts you didn't know.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/17)
## Review & Self Study
This lesson introduced you to using Matplotlib for visualizing quantities. Research other ways to work with datasets for visualization. [Plotly](https://github.com/plotly/plotly.py) is one library we won't cover in these lessons, so explore what it can offer.
## Assignment
[Lines, Scatters, and Bars](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "ad163c4fda72c8278280b61cad317ff4",
"translation_date": "2025-08-31T11:06:16+00:00",
"source_file": "3-Data-Visualization/09-visualization-quantities/assignment.md",
"language_code": "en"
}
-->
# Lines, Scatters and Bars
## Instructions
In this lesson, you explored line charts, scatterplots, and bar charts to highlight interesting insights from the dataset. For this assignment, delve deeper into the dataset to uncover a fact about a specific type of bird. For instance, create a notebook that visualizes all the fascinating data you can find about Snow Geese. Use the three types of plots mentioned above to craft a compelling narrative in your notebook.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
The notebook includes clear annotations, a strong narrative, and visually appealing graphs | The notebook lacks one of these elements | The notebook lacks two of these elements
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,216 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "87faccac113d772551486a67a607153e",
"translation_date": "2025-08-31T11:07:26+00:00",
"source_file": "3-Data-Visualization/10-visualization-distributions/README.md",
"language_code": "en"
}
-->
# Visualizing Distributions
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/10-Visualizing-Distributions.png)|
|:---:|
| Visualizing Distributions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In the previous lesson, you explored an interesting dataset about the birds of Minnesota. You identified some erroneous data by visualizing outliers and examined the differences between bird categories based on their maximum length.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/18)
## Explore the birds dataset
Another way to analyze data is by examining its distribution, or how the data is spread along an axis. For instance, you might want to understand the general distribution of maximum wingspan or maximum body mass for the birds of Minnesota in this dataset.
Let's uncover some insights about the data distributions in this dataset. In the _notebook.ipynb_ file located in the root of this lesson folder, import Pandas, Matplotlib, and your data:
```python
import pandas as pd
import matplotlib.pyplot as plt
birds = pd.read_csv('../../data/birds.csv')
birds.head()
```
| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
| ---: | :--------------------------- | :--------------------- | :-------------------- | :----------- | :------- | :---------- | :----------------- | --------: | --------: | ----------: | ----------: | ----------: | ----------: |
| 0 | Black-bellied whistling-duck | Dendrocygna autumnalis | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 47 | 56 | 652 | 1020 | 76 | 94 |
| 1 | Fulvous whistling-duck | Dendrocygna bicolor | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 45 | 53 | 712 | 1050 | 85 | 93 |
| 2 | Snow goose | Anser caerulescens | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 79 | 2050 | 4050 | 135 | 165 |
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64 | 1066 | 1567 | 113 | 116 |
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 81 | 1930 | 3310 | 130 | 165 |
In general, you can quickly visualize how data is distributed by using a scatter plot, as demonstrated in the previous lesson:
```python
birds.plot(kind='scatter',x='MaxLength',y='Order',figsize=(12,8))
plt.title('Max Length per Order')
plt.ylabel('Order')
plt.xlabel('Max Length')
plt.show()
```
![max length per order](../../../../3-Data-Visualization/10-visualization-distributions/images/scatter-wb.png)
This provides an overview of the general distribution of body length per bird Order, but it's not the best way to display true distributions. That's where histograms come in.
## Working with histograms
Matplotlib provides excellent tools for visualizing data distributions using histograms. A histogram is similar to a bar chart, but it shows the distribution of data through the rise and fall of the bars. To create a histogram, you need numeric data. You can plot a histogram by setting the chart type to 'hist'. This chart displays the distribution of MaxBodyMass across the dataset's numeric range. By dividing the data into smaller bins, it reveals the distribution of values:
```python
birds['MaxBodyMass'].plot(kind = 'hist', bins = 10, figsize = (12,12))
plt.show()
```
![distribution over the entire dataset](../../../../3-Data-Visualization/10-visualization-distributions/images/dist1-wb.png)
As shown, most of the 400+ birds in this dataset have a Max Body Mass under 2000. You can gain more insight by increasing the `bins` parameter to a higher value, such as 30:
```python
birds['MaxBodyMass'].plot(kind = 'hist', bins = 30, figsize = (12,12))
plt.show()
```
![distribution over the entire dataset with larger bins param](../../../../3-Data-Visualization/10-visualization-distributions/images/dist2-wb.png)
This chart provides a more detailed view of the distribution. To create a chart that's less skewed to the left, you can filter the data to include only birds with a body mass under 60 and set the `bins` parameter to 40:
```python
filteredBirds = birds[(birds['MaxBodyMass'] > 1) & (birds['MaxBodyMass'] < 60)]
filteredBirds['MaxBodyMass'].plot(kind = 'hist',bins = 40,figsize = (12,12))
plt.show()
```
![filtered histogram](../../../../3-Data-Visualization/10-visualization-distributions/images/dist3-wb.png)
✅ Experiment with other filters and data points. To view the full distribution of the data, remove the `['MaxBodyMass']` filter to display labeled distributions.
Histograms also allow for color and labeling enhancements:
Create a 2D histogram to compare the relationship between two distributions. For example, compare `MaxBodyMass` and `MaxLength`. Matplotlib provides a built-in way to show convergence using brighter colors:
```python
x = filteredBirds['MaxBodyMass']
y = filteredBirds['MaxLength']
fig, ax = plt.subplots(tight_layout=True)
hist = ax.hist2d(x, y)
```
There seems to be a clear correlation between these two variables along an expected axis, with one particularly strong point of convergence:
![2D plot](../../../../3-Data-Visualization/10-visualization-distributions/images/2D-wb.png)
Histograms are ideal for numeric data. But what if you want to analyze distributions based on text data?
## Explore the dataset for distributions using text data
This dataset also contains valuable information about bird categories, genus, species, family, and conservation status. Let's explore the conservation status. What is the distribution of birds based on their conservation status?
> ✅ In the dataset, several acronyms are used to describe conservation status. These acronyms are derived from the [IUCN Red List Categories](https://www.iucnredlist.org/), which classify species' statuses:
>
> - CR: Critically Endangered
> - EN: Endangered
> - EX: Extinct
> - LC: Least Concern
> - NT: Near Threatened
> - VU: Vulnerable
Since these are text-based values, you'll need to transform them to create a histogram. Using the `filteredBirds` dataframe, display its conservation status alongside its Minimum Wingspan. What do you observe?
```python
x1 = filteredBirds.loc[filteredBirds.ConservationStatus=='EX', 'MinWingspan']
x2 = filteredBirds.loc[filteredBirds.ConservationStatus=='CR', 'MinWingspan']
x3 = filteredBirds.loc[filteredBirds.ConservationStatus=='EN', 'MinWingspan']
x4 = filteredBirds.loc[filteredBirds.ConservationStatus=='NT', 'MinWingspan']
x5 = filteredBirds.loc[filteredBirds.ConservationStatus=='VU', 'MinWingspan']
x6 = filteredBirds.loc[filteredBirds.ConservationStatus=='LC', 'MinWingspan']
kwargs = dict(alpha=0.5, bins=20)
plt.hist(x1, **kwargs, color='red', label='Extinct')
plt.hist(x2, **kwargs, color='orange', label='Critically Endangered')
plt.hist(x3, **kwargs, color='yellow', label='Endangered')
plt.hist(x4, **kwargs, color='green', label='Near Threatened')
plt.hist(x5, **kwargs, color='blue', label='Vulnerable')
plt.hist(x6, **kwargs, color='gray', label='Least Concern')
plt.gca().set(title='Conservation Status', ylabel='Min Wingspan')
plt.legend();
```
![wingspan and conservation collation](../../../../3-Data-Visualization/10-visualization-distributions/images/histogram-conservation-wb.png)
There doesn't appear to be a strong correlation between minimum wingspan and conservation status. Test other elements of the dataset using this method. Try different filters as well. Do you notice any correlations?
## Density plots
You may have noticed that the histograms we've examined so far are 'stepped' and don't flow smoothly. To create a smoother density chart, you can use a density plot.
To work with density plots, familiarize yourself with a new plotting library, [Seaborn](https://seaborn.pydata.org/generated/seaborn.kdeplot.html).
Load Seaborn and try a basic density plot:
```python
import seaborn as sns
import matplotlib.pyplot as plt
sns.kdeplot(filteredBirds['MinWingspan'])
plt.show()
```
![Density plot](../../../../3-Data-Visualization/10-visualization-distributions/images/density1.png)
This plot mirrors the previous one for Minimum Wingspan data but appears smoother. According to Seaborn's documentation, "Relative to a histogram, KDE can produce a plot that is less cluttered and more interpretable, especially when drawing multiple distributions. But it has the potential to introduce distortions if the underlying distribution is bounded or not smooth. Like a histogram, the quality of the representation also depends on the selection of good smoothing parameters." [source](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) In other words, outliers can still negatively impact your charts.
If you revisit the jagged MaxBodyMass line from the second chart, you can smooth it out using this method:
```python
sns.kdeplot(filteredBirds['MaxBodyMass'])
plt.show()
```
![smooth bodymass line](../../../../3-Data-Visualization/10-visualization-distributions/images/density2.png)
To create a line that's smooth but not overly so, adjust the `bw_adjust` parameter:
```python
sns.kdeplot(filteredBirds['MaxBodyMass'], bw_adjust=.2)
plt.show()
```
![less smooth bodymass line](../../../../3-Data-Visualization/10-visualization-distributions/images/density3.png)
✅ Explore the available parameters for this type of plot and experiment!
This type of chart provides visually appealing and explanatory visualizations. For instance, with just a few lines of code, you can display the max body mass density per bird Order:
```python
sns.kdeplot(
data=filteredBirds, x="MaxBodyMass", hue="Order",
fill=True, common_norm=False, palette="crest",
alpha=.5, linewidth=0,
)
```
![bodymass per order](../../../../3-Data-Visualization/10-visualization-distributions/images/density4.png)
You can also map the density of multiple variables in one chart. Compare the MaxLength and MinLength of a bird to their conservation status:
```python
sns.kdeplot(data=filteredBirds, x="MinLength", y="MaxLength", hue="ConservationStatus")
```
![multiple densities, superimposed](../../../../3-Data-Visualization/10-visualization-distributions/images/multi.png)
It might be worth investigating whether the cluster of 'Vulnerable' birds based on their lengths has any significance.
## 🚀 Challenge
Histograms are a more advanced type of chart compared to basic scatterplots, bar charts, or line charts. Search online for examples of histograms. How are they used, what do they reveal, and in which fields or areas of study are they commonly applied?
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/19)
## Review & Self Study
In this lesson, you used Matplotlib and began working with Seaborn to create more advanced charts. Research `kdeplot` in Seaborn, which generates a "continuous probability density curve in one or more dimensions." Read through [the documentation](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to understand how it works.
## Assignment
[Apply your skills](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "40eeb9b9f94009c537c7811f9f27f037",
"translation_date": "2025-08-31T11:07:55+00:00",
"source_file": "3-Data-Visualization/10-visualization-distributions/assignment.md",
"language_code": "en"
}
-->
# Apply your skills
## Instructions
Up to this point, you have worked with the Minnesota birds dataset to uncover information about bird numbers and population density. Now, practice applying these techniques by exploring a different dataset, perhaps one from [Kaggle](https://www.kaggle.com/). Create a notebook that tells a story about this dataset, and be sure to include histograms in your analysis.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
A notebook is provided with annotations about the dataset, including its source, and uses at least 5 histograms to uncover insights about the data. | A notebook is provided with incomplete annotations or contains errors. | A notebook is provided without annotations and contains errors.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,200 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "af6a12015c6e250e500b570a9fa42593",
"translation_date": "2025-08-31T11:05:02+00:00",
"source_file": "3-Data-Visualization/11-visualization-proportions/README.md",
"language_code": "en"
}
-->
# Visualizing Proportions
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/11-Visualizing-Proportions.png)|
|:---:|
|Visualizing Proportions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you'll work with a nature-focused dataset to visualize proportions, such as the distribution of different types of fungi in a dataset about mushrooms. We'll dive into these fascinating fungi using a dataset from Audubon that provides details about 23 species of gilled mushrooms in the Agaricus and Lepiota families. You'll experiment with fun visualizations like:
- Pie charts 🥧
- Donut charts 🍩
- Waffle charts 🧇
> 💡 Microsoft Research has an interesting project called [Charticulator](https://charticulator.com), which offers a free drag-and-drop interface for creating data visualizations. One of their tutorials uses this mushroom dataset! You can explore the data and learn the library simultaneously: [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html).
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/20)
## Get to know your mushrooms 🍄
Mushrooms are fascinating organisms. Let's import a dataset to study them:
```python
import pandas as pd
import matplotlib.pyplot as plt
mushrooms = pd.read_csv('../../data/mushrooms.csv')
mushrooms.head()
```
A table is displayed with some great data for analysis:
| class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --------- | --------- | ----------- | --------- | ------- | ------- | --------------- | ------------ | --------- | ---------- | ----------- | ---------- | ------------------------ | ------------------------ | ---------------------- | ---------------------- | --------- | ---------- | ----------- | --------- | ----------------- | ---------- | ------- |
| Poisonous | Convex | Smooth | Brown | Bruises | Pungent | Free | Close | Narrow | Black | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
| Edible | Convex | Smooth | Yellow | Bruises | Almond | Free | Close | Broad | Black | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Grasses |
| Edible | Bell | Smooth | White | Bruises | Anise | Free | Close | Broad | Brown | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Meadows |
| Poisonous | Convex | Scaly | White | Bruises | Pungent | Free | Close | Narrow | Brown | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
You'll notice that all the data is textual. To use it in a chart, you'll need to convert it. Most of the data is represented as an object:
```python
print(mushrooms.select_dtypes(["object"]).columns)
```
The output is:
```output
Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
'stalk-surface-below-ring', 'stalk-color-above-ring',
'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
'ring-type', 'spore-print-color', 'population', 'habitat'],
dtype='object')
```
Convert the 'class' column into a category:
```python
cols = mushrooms.select_dtypes(["object"]).columns
mushrooms[cols] = mushrooms[cols].astype('category')
```
```python
edibleclass=mushrooms.groupby(['class']).count()
edibleclass
```
Now, if you print the mushrooms data, you'll see it grouped into categories based on the poisonous/edible class:
| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --------- | --------- | ----------- | --------- | ------- | ---- | --------------- | ------------ | --------- | ---------- | ----------- | --- | ------------------------ | ---------------------- | ---------------------- | --------- | ---------- | ----------- | --------- | ----------------- | ---------- | ------- |
| class | | | | | | | | | | | | | | | | | | | | | |
| Edible | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | ... | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 |
| Poisonous | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | ... | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 |
Using the order in this table to create your class category labels, you can build a pie chart:
## Pie!
```python
labels=['Edible','Poisonous']
plt.pie(edibleclass['population'],labels=labels,autopct='%.1f %%')
plt.title('Edible?')
plt.show()
```
And voilà, a pie chart showing the proportions of the two mushroom classes. It's crucial to get the label order correct, so double-check the array when building the labels!
![pie chart](../../../../3-Data-Visualization/11-visualization-proportions/images/pie1-wb.png)
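To make the ordering mistake-proof, one small variation (a sketch reusing the `edibleclass` DataFrame built above) is to take the labels straight from the grouped index, so they always line up with the values:

```python
# Labels derived from the groupby index, guaranteeing they match the slice order
labels = list(edibleclass.index)
plt.pie(edibleclass['population'], labels=labels, autopct='%.1f %%')
plt.title('Edible?')
plt.show()
```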
## Donuts!
A donut chart is a visually appealing variation of a pie chart, with a hole in the center. Let's use this method to explore the habitats where mushrooms grow:
```python
habitat=mushrooms.groupby(['habitat']).count()
habitat
```
Group the data by habitat. There are seven listed habitats, so use them as labels for your donut chart:
```python
labels=['Grasses','Leaves','Meadows','Paths','Urban','Waste','Wood']
plt.pie(habitat['class'], labels=labels,
autopct='%1.1f%%', pctdistance=0.85)
center_circle = plt.Circle((0, 0), 0.40, fc='white')
fig = plt.gcf()
fig.gca().add_artist(center_circle)
plt.title('Mushroom Habitats')
plt.show()
```
![donut chart](../../../../3-Data-Visualization/11-visualization-proportions/images/donut-wb.png)
This code draws the chart and a center circle, then adds the circle to the chart. You can adjust the width of the center circle by changing `0.40` to another value.
Donut charts can be customized in various ways, especially the labels for better readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut).
Now that you know how to group data and display it as a pie or donut chart, let's explore another type of chart: the waffle chart.
## Waffles!
A waffle chart visualizes quantities as a 2D array of squares. Let's use it to examine the proportions of mushroom cap colors in the dataset. First, install the helper library [PyWaffle](https://pypi.org/project/pywaffle/) and use Matplotlib:
```
pip install pywaffle
```
Select a segment of your data to group:
```python
capcolor=mushrooms.groupby(['cap-color']).count()
capcolor
```
Create a waffle chart by defining labels and grouping your data:
```python
import pandas as pd
import matplotlib.pyplot as plt
from pywaffle import Waffle
data ={'color': ['brown', 'buff', 'cinnamon', 'green', 'pink', 'purple', 'red', 'white', 'yellow'],
'amount': capcolor['class']
}
df = pd.DataFrame(data)
fig = plt.figure(
FigureClass = Waffle,
rows = 100,
values = df.amount,
labels = list(df.color),
figsize = (30,30),
colors=["brown", "tan", "maroon", "green", "pink", "purple", "red", "whitesmoke", "yellow"],
)
```
The waffle chart clearly shows the proportions of cap colors in the mushroom dataset. Interestingly, there are many green-capped mushrooms!
![waffle chart](../../../../3-Data-Visualization/11-visualization-proportions/images/waffle.png)
✅ PyWaffle supports icons within the charts, using any icon available in [Font Awesome](https://fontawesome.com/). Experiment with icons to create even more engaging waffle charts.
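For example, here is a sketch of an icon-based waffle that reuses the `df` built above; it assumes PyWaffle's documented `icons` and `icon_legend` parameters, and any Font Awesome icon name can stand in for `'circle'`:

```python
fig = plt.figure(
    FigureClass = Waffle,
    rows = 100,
    values = df.amount,
    labels = list(df.color),
    figsize = (30,30),
    icons = 'circle',      # a Font Awesome icon name; swap in any icon you prefer
    icon_legend = True     # repeat the icon in the legend entries
)
plt.show()
```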
In this lesson, you learned three ways to visualize proportions. First, group your data into categories, then choose the best visualization method—pie, donut, or waffle. Each offers a quick and intuitive snapshot of the dataset.
## 🚀 Challenge
Try recreating these charts in [Charticulator](https://charticulator.com).
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/21)
## Review & Self Study
Choosing between pie, donut, or waffle charts isn't always straightforward. Here are some articles to help you decide:
https://www.beautiful.ai/blog/battle-of-the-charts-pie-chart-vs-donut-chart
https://medium.com/@hypsypops/pie-chart-vs-donut-chart-showdown-in-the-ring-5d24fd86a9ce
https://www.mit.edu/~mbarker/formula1/f1help/11-ch-c6.htm
https://medium.datadriveninvestor.com/data-visualization-done-the-right-way-with-tableau-waffle-chart-fdf2a19be402
Do some research to learn more about this decision-making process.
## Assignment
[Try it in Excel](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1e00fe6a244c2f8f9a794c862661dd4f",
"translation_date": "2025-08-31T11:05:29+00:00",
"source_file": "3-Data-Visualization/11-visualization-proportions/assignment.md",
"language_code": "en"
}
-->
# Try it in Excel
## Instructions
Did you know you can create donut, pie, and waffle charts in Excel? Using a dataset of your choice, create these three charts directly in an Excel spreadsheet.
## Rubric
| Outstanding | Satisfactory | Needs Improvement |
| ------------------------------------------------------- | ------------------------------------------------- | ------------------------------------------------------ |
| An Excel spreadsheet is provided with all three charts | An Excel spreadsheet is provided with two charts | An Excel spreadsheet is provided with only one chart |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,186 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "cad419b574d5c35eaa417e9abfdcb0c8",
"translation_date": "2025-08-31T11:06:58+00:00",
"source_file": "3-Data-Visualization/12-visualization-relationships/README.md",
"language_code": "en"
}
-->
# Visualizing Relationships: All About Honey 🍯
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/12-Visualizing-Relationships.png)|
|:---:|
|Visualizing Relationships - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Continuing with the nature focus of our research, let's explore fascinating ways to visualize the relationships between different types of honey, based on a dataset from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
This dataset, containing around 600 entries, showcases honey production across various U.S. states. For instance, it includes data on the number of colonies, yield per colony, total production, stocks, price per pound, and the value of honey produced in each state from 1998 to 2012, with one row per year for each state.
It would be intriguing to visualize the relationship between a state's annual production and, for example, the price of honey in that state. Alternatively, you could examine the relationship between honey yield per colony across states. This time frame also includes the emergence of the devastating 'CCD' or 'Colony Collapse Disorder' first identified in 2006 (http://npic.orst.edu/envir/ccd.html), making this dataset particularly significant to study. 🐝
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/22)
In this lesson, you can use Seaborn, a library you've worked with before, to effectively visualize relationships between variables. One particularly useful function in Seaborn is `relplot`, which enables scatter plots and line plots to quickly illustrate '[statistical relationships](https://seaborn.pydata.org/tutorial/relational.html?highlight=relationships)', helping data scientists better understand how variables interact.
## Scatterplots
Use a scatterplot to visualize how the price of honey has changed year over year in each state. Seaborn's `relplot` conveniently organizes state data and displays data points for both categorical and numeric data.
Let's begin by importing the data and Seaborn:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
honey = pd.read_csv('../../data/honey.csv')
honey.head()
```
You'll notice that the honey dataset includes several interesting columns, such as year and price per pound. Let's explore this data, grouped by U.S. state:
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
| ----- | ------ | ----------- | --------- | -------- | ---------- | --------- | ---- |
| AL | 16000 | 71 | 1136000 | 159000 | 0.72 | 818000 | 1998 |
| AZ | 55000 | 60 | 3300000 | 1485000 | 0.64 | 2112000 | 1998 |
| AR | 53000 | 65 | 3445000 | 1688000 | 0.59 | 2033000 | 1998 |
| CA | 450000 | 83 | 37350000 | 12326000 | 0.62 | 23157000 | 1998 |
| CO | 27000 | 72 | 1944000 | 1594000 | 0.7 | 1361000 | 1998 |
Create a basic scatterplot to show the relationship between the price per pound of honey and its U.S. state of origin. Adjust the `y` axis to ensure all states are visible:
```python
sns.relplot(x="priceperlb", y="state", data=honey, height=15, aspect=.5);
```
![scatterplot 1](../../../../3-Data-Visualization/12-visualization-relationships/images/scatter1.png)
Next, use a honey-inspired color scheme to illustrate how the price changes over the years. Add a 'hue' parameter to highlight year-over-year variations:
> ✅ Learn more about the [color palettes you can use in Seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) - try a beautiful rainbow color scheme!
```python
sns.relplot(x="priceperlb", y="state", hue="year", palette="YlOrBr", data=honey, height=15, aspect=.5);
```
![scatterplot 2](../../../../3-Data-Visualization/12-visualization-relationships/images/scatter2.png)
With this color scheme, you can clearly see a strong upward trend in honey prices over the years. If you examine a specific state, such as Arizona, you can observe a consistent pattern of price increases year over year, with only a few exceptions:
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
| ----- | ------ | ----------- | --------- | ------- | ---------- | --------- | ---- |
| AZ | 55000 | 60 | 3300000 | 1485000 | 0.64 | 2112000 | 1998 |
| AZ | 52000 | 62 | 3224000 | 1548000 | 0.62 | 1999000 | 1999 |
| AZ | 40000 | 59 | 2360000 | 1322000 | 0.73 | 1723000 | 2000 |
| AZ | 43000 | 59 | 2537000 | 1142000 | 0.72 | 1827000 | 2001 |
| AZ | 38000 | 63 | 2394000 | 1197000 | 1.08 | 2586000 | 2002 |
| AZ | 35000 | 72 | 2520000 | 983000 | 1.34 | 3377000 | 2003 |
| AZ | 32000 | 55 | 1760000 | 774000 | 1.11 | 1954000 | 2004 |
| AZ | 36000 | 50 | 1800000 | 720000 | 1.04 | 1872000 | 2005 |
| AZ | 30000 | 65 | 1950000 | 839000 | 0.91 | 1775000 | 2006 |
| AZ | 30000 | 64 | 1920000 | 902000 | 1.26 | 2419000 | 2007 |
| AZ | 25000 | 64 | 1600000 | 336000 | 1.26 | 2016000 | 2008 |
| AZ | 20000 | 52 | 1040000 | 562000 | 1.45 | 1508000 | 2009 |
| AZ | 24000 | 77 | 1848000 | 665000 | 1.52 | 2809000 | 2010 |
| AZ | 23000 | 53 | 1219000 | 427000 | 1.55 | 1889000 | 2011 |
| AZ | 22000 | 46 | 1012000 | 253000 | 1.79 | 1811000 | 2012 |
Another way to visualize this trend is by using size instead of color, which can also be a better option for colorblind users. Here the dot size is mapped to the year, so the overall price increase over time shows up as progressively larger dots:
```python
sns.relplot(x="priceperlb", y="state", size="year", data=honey, height=15, aspect=.5);
```
You can observe the dots growing larger over time.
![scatterplot 3](../../../../3-Data-Visualization/12-visualization-relationships/images/scatter3.png)
Is this simply a case of supply and demand? Could factors like climate change and colony collapse be reducing honey availability year over year, thereby driving up prices?
To explore correlations between variables in this dataset, let's examine some line charts.
## Line charts
Question: Is there a clear upward trend in honey prices per pound year over year? The easiest way to determine this is by creating a single line chart:
```python
sns.relplot(x="year", y="priceperlb", kind="line", data=honey);
```
Answer: Yes, although there are some exceptions around 2003:
![line chart 1](../../../../3-Data-Visualization/12-visualization-relationships/images/line1.png)
✅ Seaborn aggregates data into one line by plotting the mean and a 95% confidence interval around the mean. [Source](https://seaborn.pydata.org/tutorial/relational.html). You can disable this behavior by adding `ci=None`.
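For instance, a minimal variation of the chart above without the shaded band (in Seaborn 0.12 and later the equivalent parameter is `errorbar=None`):

```python
# Mean price per pound by year, without the 95% confidence band
sns.relplot(x="year", y="priceperlb", kind="line", data=honey, ci=None);
```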
Question: In 2003, was there also a spike in honey supply? What happens if you examine total production year over year?
```python
sns.relplot(x="year", y="totalprod", kind="line", data=honey);
```
![line chart 2](../../../../3-Data-Visualization/12-visualization-relationships/images/line2.png)
Answer: Not really. Total production appears to have increased in 2003, even though overall honey production has been declining during these years.
Question: In that case, what might have caused the spike in honey prices around 2003?
To investigate further, you can use a facet grid.
## Facet grids
Facet grids allow you to focus on one aspect of your dataset (e.g., 'year') and create a plot for each facet using your chosen x and y coordinates. This makes comparisons easier. Does 2003 stand out in this type of visualization?
Create a facet grid using `relplot`, as recommended by [Seaborn's documentation](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html?highlight=facetgrid#seaborn.FacetGrid).
```python
sns.relplot(
data=honey,
x="yieldpercol", y="numcol",
col="year",
col_wrap=3,
kind="line"
)
```
In this visualization, you can compare yield per colony and number of colonies year over year, side by side, with a column wrap set to 3:
![facet grid](../../../../3-Data-Visualization/12-visualization-relationships/images/facet.png)
For this dataset, nothing particularly stands out regarding the number of colonies and their yield year over year or state by state. Is there another way to explore correlations between these variables?
## Dual-line Plots
Try a multiline plot by overlaying two line plots, using Seaborn's 'despine' to remove the top and right spines, and `ax.twinx` [from Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.twinx.html). Twinx allows a chart to share the x-axis while displaying two y-axes. Superimpose yield per colony and number of colonies:
```python
fig, ax = plt.subplots(figsize=(12,6))
lineplot = sns.lineplot(x=honey['year'], y=honey['numcol'], data=honey,
label = 'Number of bee colonies', legend=False)
sns.despine()
plt.ylabel('# colonies')
plt.title('Honey Production Year over Year');
ax2 = ax.twinx()
lineplot2 = sns.lineplot(x=honey['year'], y=honey['yieldpercol'], ax=ax2, color="r",
label ='Yield per colony', legend=False)
sns.despine(right=False)
plt.ylabel('colony yield')
ax.figure.legend();
```
![superimposed plots](../../../../3-Data-Visualization/12-visualization-relationships/images/dual-line.png)
While nothing particularly stands out around 2003, this visualization ends the lesson on a slightly positive note: although the number of colonies is declining overall, it seems to be stabilizing, even if their yield per colony is decreasing.
Go, bees, go!
🐝❤️
## 🚀 Challenge
In this lesson, you learned more about scatterplots and line grids, including facet grids. Challenge yourself to create a facet grid using a different dataset, perhaps one you've used in previous lessons. Note how long it takes to generate and consider how many grids are practical to draw using these techniques.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/23)
## Review & Self Study
Line plots can range from simple to complex. Spend some time reading the [Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.lineplot.html) to learn about the various ways to build them. Try enhancing the line charts you created in this lesson using methods described in the documentation.
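For instance, one small enhancement straight from those docs (a sketch reusing the `honey` DataFrame) is to add point markers and a thicker line to the price-per-pound chart:

```python
# Mean price per pound per year, with point markers and a thicker line
sns.lineplot(data=honey, x="year", y="priceperlb", marker="o", linewidth=2)
plt.show()
```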
## Assignment
[Dive into the beehive](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "680419753c086eef51be86607c623945",
"translation_date": "2025-08-31T11:07:22+00:00",
"source_file": "3-Data-Visualization/12-visualization-relationships/assignment.md",
"language_code": "en"
}
-->
# Explore the Beehive
## Instructions
In this lesson, you began examining a dataset about bees and their honey production over a period marked by overall declines in bee colony populations. Dive deeper into this dataset and create a notebook that narrates the story of the bee population's health, broken down by state and year. Do you uncover anything intriguing in this dataset?
## Rubric
| Outstanding | Satisfactory | Needs Improvement |
| ------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------- | ---------------------------------------- |
| A notebook is provided with a narrative supported by at least three distinct charts illustrating aspects of the dataset, comparing states and years | The notebook is missing one of these elements | The notebook is missing two of these elements |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,182 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "4ec4747a9f4f7d194248ea29903ae165",
"translation_date": "2025-08-31T11:06:21+00:00",
"source_file": "3-Data-Visualization/13-meaningful-visualizations/README.md",
"language_code": "en"
}
-->
# Creating Meaningful Visualizations
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/13-MeaningfulViz.png)|
|:---:|
| Meaningful Visualizations - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase)
One of the essential skills for a data scientist is the ability to create meaningful data visualizations that help answer specific questions. Before visualizing your data, you need to ensure it has been cleaned and prepared, as covered in previous lessons. Once that's done, you can start deciding how best to present the data.
In this lesson, you will explore:
1. How to select the appropriate chart type
2. How to avoid misleading visualizations
3. How to use color effectively
4. How to style charts for better readability
5. How to create animated or 3D visualizations
6. How to design creative visualizations
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/24)
## Selecting the appropriate chart type
In earlier lessons, you experimented with creating various types of data visualizations using Matplotlib and Seaborn. Generally, you can choose the [appropriate chart type](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) based on the question you're trying to answer using the following table:
| Task | Recommended Chart Type |
| -------------------------- | ------------------------------- |
| Show data trends over time | Line |
| Compare categories | Bar, Pie |
| Compare totals | Pie, Stacked Bar |
| Show relationships | Scatter, Line, Facet, Dual Line |
| Show distributions | Scatter, Histogram, Box |
| Show proportions | Pie, Donut, Waffle |
> ✅ Depending on the structure of your data, you may need to convert it from text to numeric format to make certain charts work.
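As a minimal pandas sketch of that conversion (the `df` and `habitat` column here are hypothetical stand-ins for whatever text column your chart needs as numbers):

```python
import pandas as pd

# Hypothetical frame with a text column
df = pd.DataFrame({'habitat': ['Grasses', 'Urban', 'Meadows', 'Urban']})

# Convert the text values to a category, then to numeric codes a chart can use
df['habitat_code'] = df['habitat'].astype('category').cat.codes
print(df)
```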
## Avoiding misleading visualizations
Even when a data scientist carefully selects the right chart for the data, there are still ways to present data in a misleading manner, often to support a specific narrative at the expense of accuracy. There are numerous examples of deceptive charts and infographics!
[![How Charts Lie by Alberto Cairo](../../../../3-Data-Visualization/13-meaningful-visualizations/images/tornado.png)](https://www.youtube.com/watch?v=oX74Nge8Wkw "How charts lie")
> 🎥 Click the image above to watch a conference talk about misleading charts.
This chart flips the X-axis to present the opposite of the truth based on dates:
![bad chart 1](../../../../3-Data-Visualization/13-meaningful-visualizations/images/bad-chart-1.png)
[This chart](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg) is even more misleading. At first glance, it appears that COVID cases have declined over time in various counties. However, upon closer inspection, the dates have been rearranged to create a deceptive downward trend.
![bad chart 2](../../../../3-Data-Visualization/13-meaningful-visualizations/images/bad-chart-2.jpg)
This infamous example uses both color and a flipped Y-axis to mislead viewers. Instead of showing that gun deaths increased after the passage of gun-friendly legislation, the chart tricks the eye into believing the opposite:
![bad chart 3](../../../../3-Data-Visualization/13-meaningful-visualizations/images/bad-chart-3.jpg)
This peculiar chart manipulates proportions to a comical degree:
![bad chart 4](../../../../3-Data-Visualization/13-meaningful-visualizations/images/bad-chart-4.jpg)
Another deceptive tactic is comparing things that are not truly comparable. A [fascinating website](https://tylervigen.com/spurious-correlations) showcases 'spurious correlations,' such as the divorce rate in Maine being linked to margarine consumption. A Reddit group also collects [examples of poor data usage](https://www.reddit.com/r/dataisugly/top/?t=all).
Understanding how easily the eye can be tricked by misleading charts is crucial. Even with good intentions, a poorly chosen chart type—like a pie chart with too many categories—can lead to confusion.
## Using color effectively
The 'Florida gun violence' chart above demonstrates how color can add another layer of meaning to visualizations. Libraries like Matplotlib and Seaborn come with pre-designed color palettes, but if you're creating a chart manually, it's worth studying [color theory](https://colormatters.com/color-and-design/basic-color-theory).
> ✅ Keep accessibility in mind when designing charts. Some users may be colorblind—does your chart work well for those with visual impairments?
Be cautious when selecting colors for your chart, as they can convey unintended meanings. For example, the 'pink ladies' in the 'height' chart above add a gendered implication that makes the chart even more bizarre.
While [color meanings](https://colormatters.com/color-symbolism/the-meanings-of-colors) can vary across cultures and change depending on the shade, general associations include:
| Color | Meaning |
| ------ | ------------------- |
| red | power |
| blue | trust, loyalty |
| yellow | happiness, caution |
| green | ecology, luck, envy |
| purple | happiness |
| orange | vibrance |
If you're tasked with creating a chart with custom colors, ensure that your choices align with the intended message and that the chart remains accessible.
## Styling charts for better readability
Charts lose their value if they are difficult to read. Take time to adjust the width and height of your chart to ensure it scales well with your data. For example, if you need to display all 50 states, consider showing them vertically on the Y-axis to avoid horizontal scrolling.
Label your axes, include a legend if necessary, and provide tooltips for better data comprehension.
If your data includes verbose text on the X-axis, you can angle the text for improved readability. [Matplotlib](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html) also supports 3D plotting if your data warrants it. Advanced visualizations can be created using `mpl_toolkits.mplot3d`.
![3d plots](../../../../3-Data-Visualization/13-meaningful-visualizations/images/3d.png)
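As a quick illustration (with made-up data, since this lesson does not ship a dataset of its own), here is a minimal 3D scatter built on `mpl_toolkits.mplot3d`:

```python
import numpy as np
import matplotlib.pyplot as plt  # the mplot3d toolkit ships with Matplotlib

# Illustrative data only
rng = np.random.default_rng(0)
x, y = rng.random(100), rng.random(100)
z = x * y

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # registers a 3D axes from mpl_toolkits.mplot3d
ax.scatter(x, y, z)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
```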
## Animation and 3D visualizations
Some of the most engaging visualizations today are animated. Shirley Wu has created stunning examples using D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/),' where each flower represents a movie. Another example is 'Bussed Out,' an interactive experience for the Guardian that combines visualizations with Greensock and D3, along with a scrollytelling article format, to illustrate how NYC addresses homelessness by bussing people out of the city.
![busing](../../../../3-Data-Visualization/13-meaningful-visualizations/images/busing.png)
> "Bussed Out: How America Moves its Homeless" from [the Guardian](https://www.theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study). Visualizations by Nadieh Bremer & Shirley Wu
While this lesson doesn't delve deeply into these powerful visualization libraries, you can experiment with D3 in a Vue.js app to create an animated visualization of the book "Dangerous Liaisons" as a social network.
> "Les Liaisons Dangereuses" is an epistolary novel, presented as a series of letters. Written in 1782 by Choderlos de Laclos, it tells the story of the morally corrupt social maneuvers of two French aristocrats, the Vicomte de Valmont and the Marquise de Merteuil. Both meet their downfall, but not before causing significant social damage. The novel unfolds through letters written to various individuals in their circles, plotting revenge or simply creating chaos. Create a visualization of these letters to identify the key players in the narrative.
You will complete a web app that displays an animated view of this social network. It uses a library designed to create a [network visualization](https://github.com/emiliorizzo/vue-d3-network) with Vue.js and D3. Once the app is running, you can drag nodes around the screen to rearrange the data.
![liaisons](../../../../3-Data-Visualization/13-meaningful-visualizations/images/liaisons.png)
## Project: Build a network chart using D3.js
> This lesson folder includes a `solution` folder with the completed project for reference.
1. Follow the instructions in the README.md file located in the starter folder's root. Ensure you have NPM and Node.js installed on your machine before setting up the project's dependencies.
2. Open the `starter/src` folder. Inside, you'll find an `assets` folder containing a .json file with all the letters from the novel, annotated with 'to' and 'from' fields.
3. Complete the code in `components/Nodes.vue` to enable the visualization. Locate the method called `createLinks()` and add the following nested loop.
Loop through the .json object to extract the 'to' and 'from' data for the letters and build the `links` object for the visualization library:
```javascript
// Loop through the letters; for each one, find the index of the sender ('from')
// and the recipient ('to') in the characters array
let f = 0;
let t = 0;
for (var i = 0; i < letters.length; i++) {
  for (var j = 0; j < characters.length; j++) {
    if (characters[j] == letters[i].from) {
      f = j; // index of the sender
    }
    if (characters[j] == letters[i].to) {
      t = j; // index of the recipient
    }
  }
  // Each link connects a source node (sid) to a target node (tid)
  this.links.push({ sid: f, tid: t });
}
```
Run your app from the terminal (npm run serve) and enjoy the visualization!
## 🚀 Challenge
Explore the internet to find examples of misleading visualizations. How does the author mislead the audience, and is it intentional? Try correcting the visualizations to show how they should appear.
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/25)
## Review & Self Study
Here are some articles about misleading data visualizations:
https://gizmodo.com/how-to-lie-with-data-visualization-1563576606
http://ixd.prattsi.org/2017/12/visual-lies-usability-in-deceptive-data-visualizations/
Explore these interesting visualizations of historical assets and artifacts:
https://handbook.pubpub.org/
Read this article on how animation can enhance visualizations:
https://medium.com/@EvanSinar/use-animation-to-supercharge-data-visualization-cd905a882ad4
## Assignment
[Create your own custom visualization](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "e56df4c0f49357e30ac8fc77aa439dd4",
"translation_date": "2025-08-31T11:06:43+00:00",
"source_file": "3-Data-Visualization/13-meaningful-visualizations/assignment.md",
"language_code": "en"
}
-->
# Build your own custom vis
## Instructions
Using the code sample provided in this project, create a social network by mocking up data based on your own social interactions. You could map out your social media usage or design a diagram of your family members. Build an engaging web app that showcases a unique visualization of a social network.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
A GitHub repository is provided with code that functions correctly (consider deploying it as a static web app) and includes a well-annotated README explaining the project | The repository either does not function correctly or lacks proper documentation | The repository neither functions correctly nor includes proper documentation
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,40 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "5c51a54dd89075a7a362890117b7ed9e",
"translation_date": "2025-08-31T11:06:48+00:00",
"source_file": "3-Data-Visualization/13-meaningful-visualizations/solution/README.md",
"language_code": "en"
}
-->
# Dangerous Liaisons data visualization project
To begin, make sure NPM and Node are installed and running on your computer. Install the dependencies (npm install) and then launch the project locally (npm run serve):
## Project setup
```
npm install
```
### Compiles and hot-reloads for development
```
npm run serve
```
### Compiles and minifies for production
```
npm run build
```
### Lints and fixes files
```
npm run lint
```
### Customize configuration
Refer to [Configuration Reference](https://cli.vuejs.org/config/).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,40 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "5c51a54dd89075a7a362890117b7ed9e",
"translation_date": "2025-08-31T11:06:52+00:00",
"source_file": "3-Data-Visualization/13-meaningful-visualizations/starter/README.md",
"language_code": "en"
}
-->
# Dangerous Liaisons data visualization project
To begin, make sure NPM and Node are installed and running on your computer. Install the dependencies (npm install) and then launch the project locally (npm run serve):
## Project setup
```
npm install
```
### Compiles and hot-reloads for development
```
npm run serve
```
### Compiles and minifies for production
```
npm run build
```
### Lints and fixes files
```
npm run lint
```
### Customize configuration
Refer to [Configuration Reference](https://cli.vuejs.org/config/).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,234 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "22acf28f518a4769ea14fa42f4734b9f",
"translation_date": "2025-08-31T11:03:10+00:00",
"source_file": "3-Data-Visualization/R/09-visualization-quantities/README.md",
"language_code": "en"
}
-->
# Visualizing Quantities
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/sketchnotes/09-Visualizing-Quantities.png)|
|:---:|
| Visualizing Quantities - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you'll learn how to use some of the many R packages and libraries to create engaging visualizations focused on the concept of quantity. Using a cleaned dataset about the birds of Minnesota, you can uncover fascinating insights about local wildlife.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/16)
## Observing Wingspan with ggplot2
An excellent library for creating both simple and complex plots and charts is [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html). Generally, the process of plotting data with these libraries involves identifying the parts of your dataframe to target, performing any necessary transformations, assigning x and y axis values, choosing the type of plot, and then displaying it.
`ggplot2` is a system for declaratively creating graphics, based on The Grammar of Graphics. The [Grammar of Graphics](https://en.wikipedia.org/wiki/Ggplot2) is a general framework for data visualization that breaks graphs into semantic components like scales and layers. In simpler terms, the ease of creating plots and graphs for univariate or multivariate data with minimal code makes `ggplot2` the most popular visualization package in R. The user specifies how to map variables to aesthetics, chooses graphical primitives, and `ggplot2` handles the rest.
> ✅ Plot = Data + Aesthetics + Geometry
> - Data refers to the dataset
> - Aesthetics indicate the variables to study (x and y variables)
> - Geometry refers to the type of plot (line plot, bar plot, etc.)
Choose the best geometry (type of plot) based on your data and the story you want to tell through the visualization.
> - To analyze trends: line, column
> - To compare values: bar, column, pie, scatterplot
> - To show how parts relate to a whole: pie
> - To show data distribution: scatterplot, bar
> - To show relationships between values: line, scatterplot, bubble
✅ You can also check out this helpful [cheatsheet](https://nyu-cdsc.github.io/learningr/assets/data-visualization-2.1.pdf) for ggplot2.
## Build a Line Plot for Bird Wingspan Values
Open the R console and import the dataset.
> Note: The dataset is stored in the root of this repo in the `/data` folder.
Let's import the dataset and view the first five rows of the data.
```r
birds <- read.csv("../../data/birds.csv",fileEncoding="UTF-8-BOM")
head(birds)
```
The first few rows of the data contain a mix of text and numbers:
| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
| ---: | :--------------------------- | :--------------------- | :-------------------- | :----------- | :------- | :---------- | :----------------- | --------: | --------: | ----------: | ----------: | ----------: | ----------: |
| 0 | Black-bellied whistling-duck | Dendrocygna autumnalis | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 47 | 56 | 652 | 1020 | 76 | 94 |
| 1 | Fulvous whistling-duck | Dendrocygna bicolor | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 45 | 53 | 712 | 1050 | 85 | 93 |
| 2 | Snow goose | Anser caerulescens | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 79 | 2050 | 4050 | 135 | 165 |
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64 | 1066 | 1567 | 113 | 116 |
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 81 | 1930 | 3310 | 130 | 165 |
Let's start by plotting some of the numeric data using a basic line plot. Suppose you want to visualize the maximum wingspan of these birds.
```r
install.packages("ggplot2")
library("ggplot2")
ggplot(data=birds, aes(x=Name, y=MaxWingspan,group=1)) +
geom_line()
```
Here, you install the `ggplot2` package and import it into the workspace using the `library("ggplot2")` command. To create any plot in ggplot, the `ggplot()` function is used, where you specify the dataset, x and y variables as attributes. In this case, we use the `geom_line()` function to create a line plot.
![MaxWingspan-lineplot](../../../../../3-Data-Visualization/R/09-visualization-quantities/images/MaxWingspan-lineplot.png)
What do you notice right away? There seems to be at least one outlier—what a wingspan! A 2000+ centimeter wingspan equals more than 20 meters—are there Pterodactyls in Minnesota? Let's investigate.
While you could sort the data in Excel to find these outliers (likely typos), let's continue the visualization process directly within the plot.
Add labels to the x-axis to show the bird species:
```r
ggplot(data=birds, aes(x=Name, y=MaxWingspan,group=1)) +
geom_line() +
theme(axis.text.x = element_text(angle = 45, hjust=1))+
xlab("Birds") +
ylab("Wingspan (CM)") +
ggtitle("Max Wingspan in Centimeters")
```
We specify the angle in the `theme` and set the x and y axis labels using `xlab()` and `ylab()`. The `ggtitle()` adds a title to the graph.
![MaxWingspan-lineplot-improved](../../../../../3-Data-Visualization/R/09-visualization-quantities/images/MaxWingspan-lineplot-improved.png)
Even with the labels rotated 45 degrees, there are too many to read. Let's try a different approach: label only the outliers and place the labels within the chart. You can use a scatter plot to make room for the labels:
```r
ggplot(data=birds, aes(x=Name, y=MaxWingspan, group=1)) +
  geom_point() +
  geom_text(aes(label=ifelse(MaxWingspan>500, as.character(Name), '')), hjust=0, vjust=0) +
  ylab("Wingspan (CM)") +
  ggtitle("Max Wingspan in Centimeters") +
  theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())
```
What happens here? You use the `geom_point()` function to plot scatter points. You also add labels for birds with `MaxWingspan > 500` and hide the x-axis labels to declutter the plot.
What do you discover?
![MaxWingspan-scatterplot](../../../../../3-Data-Visualization/R/09-visualization-quantities/images/MaxWingspan-scatterplot.png)
## Filter Your Data
Both the Bald Eagle and the Prairie Falcon, while likely large birds, seem to have been mislabeled with an extra 0 in their maximum wingspan. A Bald Eagle with a 25-meter wingspan is unlikely, but if you see one, let us know! Let's create a new dataframe without these two outliers:
```r
birds_filtered <- subset(birds, MaxWingspan < 500)
ggplot(data=birds_filtered, aes(x=Name, y=MaxWingspan,group=1)) +
geom_point() +
ylab("Wingspan (CM)") +
xlab("Birds") +
ggtitle("Max Wingspan in Centimeters") +
geom_text(aes(label=ifelse(MaxWingspan>500,as.character(Name),'')),hjust=0,vjust=0) +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
```
We create a new dataframe `birds_filtered` and plot a scatter plot. By filtering out outliers, your data becomes more cohesive and easier to interpret.
![MaxWingspan-scatterplot-improved](../../../../../3-Data-Visualization/R/09-visualization-quantities/images/MaxWingspan-scatterplot-improved.png)
Now that we have a cleaner dataset in terms of wingspan, let's explore more about these birds.
While line and scatter plots can display data values and distributions, we want to think about the quantities in this dataset. You could create visualizations to answer questions like:
> How many categories of birds are there, and how many birds are in each?
> How many birds are extinct, endangered, rare, or common?
> How many birds belong to various genera and orders in Linnaeus's classification?
## Explore Bar Charts
Bar charts are useful for showing groupings of data. Let's explore the bird categories in this dataset to see which is the most common.
Let's create a bar chart using the filtered data.
```r
install.packages("dplyr")
install.packages("tidyverse")
library(lubridate)
library(scales)
library(dplyr)
library(ggplot2)
library(tidyverse)
birds_filtered %>% group_by(Category) %>%
summarise(n=n(),
MinLength = mean(MinLength),
MaxLength = mean(MaxLength),
MinBodyMass = mean(MinBodyMass),
MaxBodyMass = mean(MaxBodyMass),
MinWingspan=mean(MinWingspan),
MaxWingspan=mean(MaxWingspan)) %>%
gather("key", "value", - c(Category, n)) %>%
ggplot(aes(x = Category, y = value, group = key, fill = key)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("#D62728", "#FF7F0E", "#8C564B","#2CA02C", "#1F77B4", "#9467BD")) +
xlab("Category")+ggtitle("Birds of Minnesota")
```
In this snippet, we install the [dplyr](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8) and [lubridate](https://www.rdocumentation.org/packages/lubridate/versions/1.8.0) packages to help manipulate and group data for a stacked bar chart. First, we group the data by the `Category` of bird and summarize columns like `MinLength`, `MaxLength`, `MinBodyMass`, `MaxBodyMass`, `MinWingspan`, and `MaxWingspan`. Then, we use the `ggplot2` package to plot the bar chart, specifying colors and labels for the categories.
![Stacked bar chart](../../../../../3-Data-Visualization/R/09-visualization-quantities/images/stacked-bar-chart.png)
This bar chart is hard to read because there's too much ungrouped data. Let's focus on the length of birds based on their category.
Filter the data to include only the bird categories.
Since there are many categories, display the chart vertically and adjust its height to fit all the data:
```r
birds_count<-dplyr::count(birds_filtered, Category, sort = TRUE)
birds_count$Category <- factor(birds_count$Category, levels = birds_count$Category)
ggplot(birds_count,aes(Category,n))+geom_bar(stat="identity")+coord_flip()
```
We count the unique values in the `Category` column and sort them into a new dataframe, `birds_count`. The categories are then converted to a factor whose levels follow that sorted order, so the bars are plotted in order. Finally, `ggplot2` draws the bar chart, and `coord_flip()` turns the bars horizontal.
![category-length](../../../../../3-Data-Visualization/R/09-visualization-quantities/images/category-length.png)
This bar chart provides a clear view of the number of birds in each category. At a glance, you can see that the Ducks/Geese/Waterfowl category has the most birds. Given that Minnesota is the "land of 10,000 lakes," this makes sense!
✅ Try counting other attributes in this dataset. Do any results surprise you?
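For example, here is a quick sketch that counts birds by conservation status, following the same pattern as the category chart above:

```r
# Count birds per conservation status and plot as horizontal bars
status_count <- dplyr::count(birds_filtered, ConservationStatus, sort = TRUE)
status_count$ConservationStatus <- factor(status_count$ConservationStatus,
                                          levels = status_count$ConservationStatus)
ggplot(status_count, aes(ConservationStatus, n)) +
  geom_bar(stat = "identity") +
  coord_flip()
```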
## Comparing Data
You can compare grouped data by creating new axes. For example, compare the MaxLength of birds based on their category:
```r
birds_grouped <- birds_filtered %>%
group_by(Category) %>%
summarise(
MaxLength = max(MaxLength, na.rm = T),
MinLength = max(MinLength, na.rm = T)
) %>%
arrange(Category)
ggplot(birds_grouped,aes(Category,MaxLength))+geom_bar(stat="identity")+coord_flip()
```
We group the `birds_filtered` data by `Category` and plot a bar graph.
![comparing data](../../../../../3-Data-Visualization/R/09-visualization-quantities/images/comparingdata.png)
No surprises here: hummingbirds have the smallest MaxLength compared to pelicans or geese. It's reassuring when data aligns with logic!
You can make bar charts more interesting by superimposing data. For example, compare the Minimum and Maximum Length of birds within each category:
```r
ggplot(data=birds_grouped, aes(x=Category)) +
geom_bar(aes(y=MaxLength), stat="identity", position ="identity", fill='blue') +
geom_bar(aes(y=MinLength), stat="identity", position="identity", fill='orange')+
coord_flip()
```
![super-imposed values](../../../../../3-Data-Visualization/R/09-visualization-quantities/images/superimposed-values.png)
## 🚀 Challenge
This bird dataset offers a wealth of information about different bird species in a specific ecosystem. Search online for other bird-related datasets. Practice creating charts and graphs to uncover new insights about birds.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/17)
## Review & Self-Study
This lesson introduced you to using `ggplot2` for visualizing quantities. Research other ways to work with datasets for visualization. Look for datasets you can visualize using other packages like [Lattice](https://stat.ethz.ch/R-manual/R-devel/library/lattice/html/Lattice.html) and [Plotly](https://github.com/plotly/plotly.R#readme).
## Assignment
[Lines, Scatters, and Bars](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "0ea21b6513df5ade7419c6b7d65f10b1",
"translation_date": "2025-08-31T11:03:42+00:00",
"source_file": "3-Data-Visualization/R/09-visualization-quantities/assignment.md",
"language_code": "en"
}
-->
# Lines, Scatters and Bars
## Instructions
In this lesson, you explored line charts, scatterplots, and bar charts to highlight interesting insights from this dataset. For this assignment, dive deeper into the dataset to uncover a fact about a specific type of bird. For instance, create a script that visualizes all the intriguing data you can find about Snow Geese. Use the three types of plots mentioned above to craft a narrative in your notebook.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
The script includes clear annotations, a compelling narrative, and visually appealing graphs | The script lacks one of these elements | The script lacks two of these elements
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,183 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "ea67c0c40808fd723594de6896c37ccf",
"translation_date": "2025-08-31T11:04:13+00:00",
"source_file": "3-Data-Visualization/R/10-visualization-distributions/README.md",
"language_code": "en"
}
-->
# Visualizing Distributions
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](https://github.com/microsoft/Data-Science-For-Beginners/blob/main/sketchnotes/10-Visualizing-Distributions.png)|
|:---:|
| Visualizing Distributions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In the previous lesson, you explored a dataset about the birds of Minnesota. You identified some erroneous data by visualizing outliers and examined differences between bird categories based on their maximum length.
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/18)
## Explore the birds dataset
Another way to analyze data is by examining its distribution, or how the data is spread along an axis. For instance, you might want to understand the general distribution of maximum wingspan or maximum body mass for the birds of Minnesota in this dataset.
Let's uncover some insights about the distributions in this dataset. In your R console, import `ggplot2` and the dataset, then remove the outliers as you did in the previous lesson.
```r
library(ggplot2)
birds <- read.csv("../../data/birds.csv",fileEncoding="UTF-8-BOM")
birds_filtered <- subset(birds, MaxWingspan < 500)
head(birds_filtered)
```
| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
| ---: | :--------------------------- | :--------------------- | :-------------------- | :----------- | :------- | :---------- | :----------------- | --------: | --------: | ----------: | ----------: | ----------: | ----------: |
| 0 | Black-bellied whistling-duck | Dendrocygna autumnalis | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 47 | 56 | 652 | 1020 | 76 | 94 |
| 1 | Fulvous whistling-duck | Dendrocygna bicolor | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 45 | 53 | 712 | 1050 | 85 | 93 |
| 2 | Snow goose | Anser caerulescens | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 79 | 2050 | 4050 | 135 | 165 |
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64 | 1066 | 1567 | 113 | 116 |
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 81 | 1930 | 3310 | 130 | 165 |
You can quickly visualize how data is distributed by using a scatter plot, as demonstrated in the previous lesson:
```r
ggplot(data=birds_filtered, aes(x=Order, y=MaxLength,group=1)) +
geom_point() +
ggtitle("Max Length per order") + coord_flip()
```
![max length per order](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/max-length-per-order.png)
This provides a general overview of body length distribution per bird Order, but it's not the best way to display true distributions. For that, histograms are typically used.
## Working with histograms
`ggplot2` offers excellent tools for visualizing data distributions using histograms. A histogram is similar to a bar chart, but it shows the distribution of the data through the rise and fall of the bars. To create a histogram, you need numeric data; in `ggplot2`, you build one with `geom_histogram()`. The following chart displays the distribution of MaxBodyMass across the dataset's numeric range. By dividing the data into smaller bins, it reveals how the values are distributed:
```r
ggplot(data = birds_filtered, aes(x = MaxBodyMass)) +
geom_histogram(bins=10)+ylab('Frequency')
```
![distribution over entire dataset](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/distribution-over-the-entire-dataset.png)
As shown, most of the 400+ birds in this dataset have a Max Body Mass under 2000. You can gain more detailed insights by increasing the `bins` parameter to a higher value, such as 30:
```r
ggplot(data = birds_filtered, aes(x = MaxBodyMass)) + geom_histogram(bins=30)+ylab('Frequency')
```
![distribution-30bins](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/distribution-30bins.png)
This chart provides a more granular view of the distribution. To create a chart that isn't dominated by the long tail of very heavy birds, you can filter the data to include only birds with a body mass under 60 and display 30 `bins`:
```r
birds_filtered_1 <- subset(birds_filtered, MaxBodyMass > 1 & MaxBodyMass < 60)
ggplot(data = birds_filtered_1, aes(x = MaxBodyMass)) +
geom_histogram(bins=30)+ylab('Frequency')
```
![filtered histogram](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/filtered-histogram.png)
✅ Experiment with other filters and data points. To view the full distribution, remove the `MaxBodyMass` filter and display labeled distributions.
Histograms also allow for color and labeling enhancements:
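For instance, here is a quick sketch that reuses `birds_filtered` from earlier in this lesson: it plots the full MaxBodyMass distribution with the bars colored, and labeled in the legend, by the `Category` column.
```r
# A sketch: the unfiltered MaxBodyMass distribution, colored by bird Category
ggplot(data = birds_filtered, aes(x = MaxBodyMass, fill = Category)) +
  geom_histogram(bins = 30) + ylab('Frequency')
```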
Create a 2D histogram to compare the relationship between two distributions. For example, compare `MaxBodyMass` vs. `MaxLength`. `ggplot2` provides a built-in method to show convergence using brighter colors:
```r
ggplot(data=birds_filtered_1, aes(x=MaxBodyMass, y=MaxLength) ) +
geom_bin2d() +scale_fill_continuous(type = "viridis")
```
There seems to be a predictable correlation between these two elements along a specific axis, with one particularly strong point of convergence:
![2d plot](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/2d-plot.png)
Histograms work well for numeric data by default. But what if you need to analyze distributions based on text data?
## Explore the dataset for distributions using text data
This dataset also contains valuable information about bird categories, genus, species, family, and conservation status. Let's examine the conservation status distribution. What does it look like?
> ✅ In the dataset, several acronyms are used to describe conservation status. These acronyms are derived from the [IUCN Red List Categories](https://www.iucnredlist.org/), which catalog species' statuses.
>
> - CR: Critically Endangered
> - EN: Endangered
> - EX: Extinct
> - LC: Least Concern
> - NT: Near Threatened
> - VU: Vulnerable
Since these are text-based values, you'll need to transform the data to create a histogram. Using the `birds_filtered_1` dataframe, display conservation status alongside Minimum Wingspan. What do you observe?
```r
birds_filtered_1$ConservationStatus[birds_filtered_1$ConservationStatus == 'EX'] <- 'x1'
birds_filtered_1$ConservationStatus[birds_filtered_1$ConservationStatus == 'CR'] <- 'x2'
birds_filtered_1$ConservationStatus[birds_filtered_1$ConservationStatus == 'EN'] <- 'x3'
birds_filtered_1$ConservationStatus[birds_filtered_1$ConservationStatus == 'NT'] <- 'x4'
birds_filtered_1$ConservationStatus[birds_filtered_1$ConservationStatus == 'VU'] <- 'x5'
birds_filtered_1$ConservationStatus[birds_filtered_1$ConservationStatus == 'LC'] <- 'x6'
ggplot(data=birds_filtered_1, aes(x = MinWingspan, fill = ConservationStatus)) +
geom_histogram(position = "identity", alpha = 0.4, bins = 20) +
  scale_fill_manual(name="Conservation Status",values=c("red","green","blue","pink"),labels=c("Endangered","Near Threatened","Vulnerable","Least Concern"))
```
![wingspan and conservation collation](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/wingspan-conservation-collation.png)
There doesn't appear to be a strong correlation between minimum wingspan and conservation status. Test other elements of the dataset using this method. Try different filters as well. Do you find any correlations?
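For example, one variation worth trying (a sketch that reuses the recoded `birds_filtered_1` dataframe from above) swaps MaxLength in for MinWingspan:
```r
# Same layered histogram, but comparing maximum length across conservation statuses
ggplot(data = birds_filtered_1, aes(x = MaxLength, fill = ConservationStatus)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 20)
```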
## Density plots
You may have noticed that the histograms we've explored so far are stepped and don't flow smoothly in an arc. To create a smoother chart, you can use a density plot.
Let's explore density plots now!
```r
ggplot(data = birds_filtered_1, aes(x = MinWingspan)) +
geom_density()
```
![density plot](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/density-plot.png)
This plot mirrors the previous one for Minimum Wingspan data but is smoother. If you want to revisit the jagged MaxBodyMass line from the second chart, you can smooth it out using this method:
```r
ggplot(data = birds_filtered_1, aes(x = MaxBodyMass)) +
geom_density()
```
![bodymass density](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/bodymass-smooth.png)
If you prefer a smooth but not overly smooth line, adjust the `adjust` parameter:
```r
ggplot(data = birds_filtered_1, aes(x = MaxBodyMass)) +
geom_density(adjust = 1/5)
```
![less smooth bodymass](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/less-smooth-bodymass.png)
✅ Explore the available parameters for this type of plot and experiment!
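Here is a small sketch of two parameters worth experimenting with: a larger `adjust` value smooths the curve more aggressively, while `fill` and `alpha` shade the area underneath it.
```r
# Heavier smoothing plus a shaded area under the curve
ggplot(data = birds_filtered_1, aes(x = MaxBodyMass)) +
  geom_density(adjust = 2, fill = "lightblue", alpha = 0.5)
```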
This type of chart provides visually compelling explanations. For instance, with just a few lines of code, you can display the max body mass density per bird Order:
```r
ggplot(data=birds_filtered_1,aes(x = MaxBodyMass, fill = Order)) +
geom_density(alpha=0.5)
```
![bodymass per order](../../../../../3-Data-Visualization/R/10-visualization-distributions/images/bodymass-per-order.png)
## 🚀 Challenge
Histograms are more advanced than basic scatterplots, bar charts, or line charts. Search online for examples of histograms. How are they used, what do they reveal, and in which fields or areas of study are they commonly applied?
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/19)
## Review & Self Study
In this lesson, you used `ggplot2` to create more advanced charts. Research `geom_density_2d()`, which generates "continuous probability density curves in one or more dimensions." Read through [the documentation](https://ggplot2.tidyverse.org/reference/geom_density_2d.html) to understand its functionality.
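As a starting point, here is a minimal sketch of `geom_density_2d()` that reuses `birds_filtered_1` from this lesson and overlays density contours on a scatterplot of body mass against length:
```r
# 2D density contours drawn on top of the underlying points
ggplot(data = birds_filtered_1, aes(x = MaxBodyMass, y = MaxLength)) +
  geom_point(alpha = 0.3) +
  geom_density_2d()
```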
## Assignment
[Apply your skills](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a233d542512136c4dd29aad38ca0175f",
"translation_date": "2025-08-31T11:04:33+00:00",
"source_file": "3-Data-Visualization/R/10-visualization-distributions/assignment.md",
"language_code": "en"
}
-->
# Apply your skills
## Instructions
Up to this point, you've worked with the Minnesota birds dataset to uncover insights about bird counts and population density. Now, put these techniques into practice by exploring a different dataset, perhaps one from [Kaggle](https://www.kaggle.com/). Create an R script that tells a story about this dataset, and be sure to incorporate histograms in your analysis.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
A script is provided with detailed annotations about the dataset, including its source, and utilizes at least 5 histograms to extract meaningful insights from the data. | A script is provided with incomplete annotations or contains errors. | A script is provided without annotations and includes errors.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,197 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "47028abaaafa2bcb1079702d20569066",
"translation_date": "2025-08-31T11:02:18+00:00",
"source_file": "3-Data-Visualization/R/11-visualization-proportions/README.md",
"language_code": "en"
}
-->
# Visualizing Proportions
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../../sketchnotes/11-Visualizing-Proportions.png)|
|:---:|
|Visualizing Proportions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you will work with a nature-focused dataset to visualize proportions, such as the distribution of different types of fungi in a dataset about mushrooms. We'll dive into the fascinating world of fungi using a dataset from Audubon that details 23 species of gilled mushrooms in the Agaricus and Lepiota families. You'll experiment with some fun visualizations, including:
- Pie charts 🥧
- Donut charts 🍩
- Waffle charts 🧇
> 💡 A fascinating project called [Charticulator](https://charticulator.com) by Microsoft Research provides a free drag-and-drop interface for creating data visualizations. One of their tutorials even uses this mushroom dataset! This means you can explore the data and learn the library simultaneously: [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html).
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/20)
## Get to know your mushrooms 🍄
Mushrooms are incredibly interesting. Let's import a dataset to study them:
```r
mushrooms = read.csv('../../data/mushrooms.csv')
head(mushrooms)
```
A table is displayed with some great data for analysis:
| class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --------- | --------- | ----------- | --------- | ------- | ------- | --------------- | ------------ | --------- | ---------- | ----------- | ---------- | ------------------------ | ------------------------ | ---------------------- | ---------------------- | --------- | ---------- | ----------- | --------- | ----------------- | ---------- | ------- |
| Poisonous | Convex | Smooth | Brown | Bruises | Pungent | Free | Close | Narrow | Black | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
| Edible | Convex | Smooth | Yellow | Bruises | Almond | Free | Close | Broad | Black | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Grasses |
| Edible | Bell | Smooth | White | Bruises | Anise | Free | Close | Broad | Brown | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Meadows |
| Poisonous | Convex | Scaly | White | Bruises | Pungent | Free | Close | Narrow | Brown | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
| Edible | Convex | Smooth | Green | No Bruises | None | Free | Crowded | Broad | Black | Tapering | Equal | Smooth | Smooth | White | White | Partial | White | One | Evanescent | Brown | Abundant | Grasses |
| Edible | Convex | Scaly | Yellow | Bruises | Almond | Free | Close | Broad | Brown | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Numerous | Grasses |
Immediately, you notice that all the data is textual. To use it in a chart, you'll need to convert it. Start by listing the column names:
```r
names(mushrooms)
```
The output is:
```output
[1] "class" "cap.shape"
[3] "cap.surface" "cap.color"
[5] "bruises" "odor"
[7] "gill.attachment" "gill.spacing"
[9] "gill.size" "gill.color"
[11] "stalk.shape" "stalk.root"
[13] "stalk.surface.above.ring" "stalk.surface.below.ring"
[15] "stalk.color.above.ring" "stalk.color.below.ring"
[17] "veil.type" "veil.color"
[19] "ring.number" "ring.type"
[21] "spore.print.color" "population"
[23] "habitat"
```
Now, group the data by the `class` column and count how many mushrooms fall into each class:
```r
library(dplyr)
grouped=mushrooms %>%
group_by(class) %>%
summarise(count=n())
```
When you view the grouped data, you'll see the mushrooms split into categories based on whether they are poisonous or edible:
```r
View(grouped)
```
| class | count |
| --------- | ----- |
| Edible | 4208 |
| Poisonous | 3916 |
Using the order in this table to create your class category labels, you can build a pie chart.
## Pie!
```r
pie(grouped$count,grouped$class, main="Edible?")
```
And voilà, a pie chart showing the proportions of edible and poisonous mushrooms. It's crucial to get the order of the labels correct, so double-check the label array's order!
![pie chart](../../../../../3-Data-Visualization/R/11-visualization-proportions/images/pie1-wb.png)
## Donuts!
A slightly more visually appealing version of a pie chart is a donut chart, which is essentially a pie chart with a hole in the middle. Let's use this method to examine our data.
Take a look at the various habitats where mushrooms grow:
```r
library(dplyr)
habitat=mushrooms %>%
group_by(habitat) %>%
summarise(count=n())
View(habitat)
```
The output is:
| habitat | count |
| ------- | ----- |
| Grasses | 2148 |
| Leaves | 832 |
| Meadows | 292 |
| Paths | 1144 |
| Urban | 368 |
| Waste | 192 |
| Wood | 3148 |
Here, the data is grouped by habitat. There are seven habitats listed, so use these as labels for your donut chart:
```r
library(ggplot2)
library(webr)
PieDonut(habitat, aes(habitat, count=count))
```
![donut chart](../../../../../3-Data-Visualization/R/11-visualization-proportions/images/donut-wb.png)
This code uses two libraries: ggplot2 and webr. The PieDonut function from the webr library makes it easy to create a donut chart!
You can also create donut charts in R using only the ggplot2 library. Learn more about it [here](https://www.r-graph-gallery.com/128-ring-or-donut-plot.html) and give it a try.
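For example, here is a minimal ggplot2-only sketch along the lines of that tutorial, assuming the `habitat` summary dataframe created above:
```r
library(ggplot2)
library(dplyr)
# Compute each habitat's share of the total and its position around the ring
habitat_donut <- habitat %>%
  mutate(fraction = count / sum(count),
         ymax = cumsum(fraction),
         ymin = lag(ymax, default = 0))
# Rectangles bent into a ring; the empty x-range from 1 to 3 creates the hole
ggplot(habitat_donut, aes(ymin = ymin, ymax = ymax, xmin = 3, xmax = 4, fill = habitat)) +
  geom_rect() +
  coord_polar(theta = "y") +
  xlim(c(1, 4)) +
  theme_void()
```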
Now that you know how to group your data and display it as a pie or donut chart, you can explore other chart types. For example, try a waffle chart, which offers a unique way to visualize quantities.
## Waffles!
A waffle chart is a 2D array of squares that visualizes quantities. Let's use it to explore the different mushroom cap colors in this dataset. First, install a helper library called [waffle](https://cran.r-project.org/web/packages/waffle/waffle.pdf) and use it to create your visualization:
```r
install.packages("waffle", repos = "https://cinc.rud.is")
```
Select a segment of your data to group:
```r
library(dplyr)
cap_color=mushrooms %>%
group_by(cap.color) %>%
summarise(count=n())
View(cap_color)
```
Create a waffle chart by defining labels and grouping your data:
```r
library(waffle)
names(cap_color$count) = paste0(cap_color$cap.color)
waffle((cap_color$count/10), rows = 7, title = "Waffle Chart")+scale_fill_manual(values=c("brown", "#F0DC82", "#D2691E", "green",
"pink", "purple", "red", "grey",
"yellow","white"))
```
The waffle chart clearly shows the proportions of cap colors in the mushroom dataset. Interestingly, there are quite a few green-capped mushrooms!
![waffle chart](../../../../../3-Data-Visualization/R/11-visualization-proportions/images/waffle.png)
In this lesson, you learned three ways to visualize proportions. First, group your data into categories, then decide the best way to display it—pie, donut, or waffle. Each method provides a quick and engaging snapshot of the dataset.
## 🚀 Challenge
Try recreating these fun charts in [Charticulator](https://charticulator.com).
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/21)
## Review & Self Study
It's not always clear when to use a pie, donut, or waffle chart. Here are some articles to help you decide:
- https://www.beautiful.ai/blog/battle-of-the-charts-pie-chart-vs-donut-chart
- https://medium.com/@hypsypops/pie-chart-vs-donut-chart-showdown-in-the-ring-5d24fd86a9ce
- https://www.mit.edu/~mbarker/formula1/f1help/11-ch-c6.htm
- https://medium.datadriveninvestor.com/data-visualization-done-the-right-way-with-tableau-waffle-chart-fdf2a19be402
Do some research to learn more about this tricky decision.
## Assignment
[Try it in Excel](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,179 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a33c5d4b4156a2b41788d8720b6f724c",
"translation_date": "2025-08-31T11:03:47+00:00",
"source_file": "3-Data-Visualization/R/12-visualization-relationships/README.md",
"language_code": "en"
}
-->
# Visualizing Relationships: All About Honey 🍯
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../../sketchnotes/12-Visualizing-Relationships.png)|
|:---:|
|Visualizing Relationships - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Continuing with the nature focus of our research, let's explore fascinating ways to visualize the relationships between different types of honey, based on a dataset from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
This dataset, containing around 600 entries, showcases honey production across various U.S. states. For instance, it includes data on the number of colonies, yield per colony, total production, stocks, price per pound, and the value of honey produced in each state from 1998 to 2012, with one row per year for each state.
It would be intriguing to visualize the relationship between a state's annual production and, for example, the price of honey in that state. Alternatively, you could examine the relationship between honey yield per colony across states. This time period also includes the emergence of the devastating 'CCD' or 'Colony Collapse Disorder' first observed in 2006 (http://npic.orst.edu/envir/ccd.html), making this dataset particularly meaningful to study. 🐝
## [Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/22)
In this lesson, you'll use ggplot2, a library you've worked with before, to visualize relationships between variables. One of the highlights of ggplot2 is its `geom_point` and `qplot` functions, which allow you to create scatter plots and line plots to quickly visualize '[statistical relationships](https://ggplot2.tidyverse.org/)'. These tools help data scientists better understand how variables interact with one another.
## Scatterplots
Use a scatterplot to illustrate how the price of honey has changed year over year in each state. ggplot2, with its `ggplot` and `geom_point` functions, makes it easy to group state data and display data points for both categorical and numeric variables.
Let's begin by importing the data:
```r
honey=read.csv('../../data/honey.csv')
head(honey)
```
You'll notice that the honey dataset contains several interesting columns, including year and price per pound. Let's explore this data, grouped by U.S. state:
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
| ----- | ------ | ----------- | --------- | -------- | ---------- | --------- | ---- |
| AL | 16000 | 71 | 1136000 | 159000 | 0.72 | 818000 | 1998 |
| AZ | 55000 | 60 | 3300000 | 1485000 | 0.64 | 2112000 | 1998 |
| AR | 53000 | 65 | 3445000 | 1688000 | 0.59 | 2033000 | 1998 |
| CA | 450000 | 83 | 37350000 | 12326000 | 0.62 | 23157000 | 1998 |
| CO | 27000 | 72 | 1944000 | 1594000 | 0.7 | 1361000 | 1998 |
| FL | 230000 | 98 | 22540000 | 4508000 | 0.64 | 14426000 | 1998 |
Create a basic scatterplot to show the relationship between the price per pound of honey and its U.S. state of origin. Adjust the `y` axis to ensure all states are visible:
```r
library(ggplot2)
ggplot(honey, aes(x = priceperlb, y = state)) +
geom_point(colour = "blue")
```
![scatterplot 1](../../../../../3-Data-Visualization/R/12-visualization-relationships/images/scatter1.png)
Next, use a honey-inspired color scheme to visualize how the price changes over the years. You can achieve this by adding the `scale_color_gradientn` scale to highlight year-over-year changes:
> ✅ Learn more about the [scale_color_gradientn](https://www.rdocumentation.org/packages/ggplot2/versions/0.9.1/topics/scale_colour_gradientn) - try a beautiful rainbow color scheme!
```r
ggplot(honey, aes(x = priceperlb, y = state, color=year)) +
geom_point()+scale_color_gradientn(colours = colorspace::heat_hcl(7))
```
![scatterplot 2](../../../../../3-Data-Visualization/R/12-visualization-relationships/images/scatter2.png)
With this color scheme, you can clearly see a strong upward trend in honey prices over the years. If you examine a specific state, such as Arizona, you'll notice a consistent pattern of price increases year over year, with only a few exceptions:
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
| ----- | ------ | ----------- | --------- | ------- | ---------- | --------- | ---- |
| AZ | 55000 | 60 | 3300000 | 1485000 | 0.64 | 2112000 | 1998 |
| AZ | 52000 | 62 | 3224000 | 1548000 | 0.62 | 1999000 | 1999 |
| AZ | 40000 | 59 | 2360000 | 1322000 | 0.73 | 1723000 | 2000 |
| AZ | 43000 | 59 | 2537000 | 1142000 | 0.72 | 1827000 | 2001 |
| AZ | 38000 | 63 | 2394000 | 1197000 | 1.08 | 2586000 | 2002 |
| AZ | 35000 | 72 | 2520000 | 983000 | 1.34 | 3377000 | 2003 |
| AZ | 32000 | 55 | 1760000 | 774000 | 1.11 | 1954000 | 2004 |
| AZ | 36000 | 50 | 1800000 | 720000 | 1.04 | 1872000 | 2005 |
| AZ | 30000 | 65 | 1950000 | 839000 | 0.91 | 1775000 | 2006 |
| AZ | 30000 | 64 | 1920000 | 902000 | 1.26 | 2419000 | 2007 |
| AZ | 25000 | 64 | 1600000 | 336000 | 1.26 | 2016000 | 2008 |
| AZ | 20000 | 52 | 1040000 | 562000 | 1.45 | 1508000 | 2009 |
| AZ | 24000 | 77 | 1848000 | 665000 | 1.52 | 2809000 | 2010 |
| AZ | 23000 | 53 | 1219000 | 427000 | 1.55 | 1889000 | 2011 |
| AZ | 22000 | 46 | 1012000 | 253000 | 1.79 | 1811000 | 2012 |
Another way to visualize this trend is by using size instead of color. For colorblind users, this might be a better option. Modify your visualization to represent price increases with larger dot sizes:
```r
ggplot(honey, aes(x = priceperlb, y = state)) +
geom_point(aes(size = year),colour = "blue") +
scale_size_continuous(range = c(0.25, 3))
```
You can observe the dots growing larger over time.
![scatterplot 3](../../../../../3-Data-Visualization/R/12-visualization-relationships/images/scatter3.png)
Is this simply a case of supply and demand? Could factors like climate change and colony collapse be reducing honey availability year over year, leading to price increases?
To investigate correlations between variables in this dataset, let's explore line charts.
## Line charts
Question: Is there a clear upward trend in honey prices per pound year over year? The simplest way to find out is by creating a single line chart:
```r
qplot(honey$year,honey$priceperlb, geom='smooth', span =0.5, xlab = "year",ylab = "priceperlb")
```
Answer: Yes, although there are some exceptions around 2003:
![line chart 1](../../../../../3-Data-Visualization/R/12-visualization-relationships/images/line1.png)
Question: In 2003, can we also observe a spike in honey supply? What happens if you examine total production year over year?
```r
qplot(honey$year,honey$totalprod, geom='smooth', span =0.5, xlab = "year",ylab = "totalprod")
```
![line chart 2](../../../../../3-Data-Visualization/R/12-visualization-relationships/images/line2.png)
Answer: Not really. Total production seems to have increased in 2003, even though overall honey production appears to be declining during these years.
Question: In that case, what might have caused the spike in honey prices around 2003?
To explore this, let's use a facet grid.
## Facet grids
Facet grids allow you to focus on one aspect of your dataset (e.g., 'year') and create a plot for each facet based on your chosen x and y coordinates. This makes comparisons easier. Does 2003 stand out in this type of visualization?
Create a facet grid using `facet_wrap` as recommended by [ggplot2's documentation](https://ggplot2.tidyverse.org/reference/facet_wrap.html).
```r
ggplot(honey, aes(x=yieldpercol, y = numcol,group = 1)) +
geom_line() + facet_wrap(vars(year))
```
In this visualization, you can compare yield per colony and number of colonies year over year, with a wrap set at 3 columns:
![facet grid](../../../../../3-Data-Visualization/R/12-visualization-relationships/images/facet.png)
For this dataset, nothing particularly stands out regarding the number of colonies and their yield year over year or state by state. Is there another way to identify correlations between these two variables?
## Dual-line Plots
Try a multiline plot by overlaying two line plots using R's `par` and `plot` functions. Plot the year on the x-axis and display two y-axes: yield per colony and number of colonies, superimposed:
```r
par(mar = c(5, 4, 4, 4) + 0.3)   # leave room on the right for the second y-axis
plot(honey$year, honey$numcol, pch = 16, col = 2, type = "l")
par(new = TRUE)                  # draw the next plot on top of the first
plot(honey$year, honey$yieldpercol, pch = 17, col = 3,
     axes = FALSE, xlab = "", ylab = "", type = "l")
axis(side = 4, at = pretty(range(honey$yieldpercol)))   # right-hand axis for yield
mtext("colony yield", side = 4, line = 3)
```
![superimposed plots](../../../../../3-Data-Visualization/R/12-visualization-relationships/images/dual-line.png)
While nothing significant stands out around 2003, this visualization ends the lesson on a slightly positive note: although the number of colonies is declining overall, it appears to be stabilizing, even if their yield per colony is decreasing.
Go, bees, go!
🐝❤️
## 🚀 Challenge
In this lesson, you learned more about scatterplots and line grids, including facet grids. Challenge yourself to create a facet grid using a different dataset, perhaps one you've used in previous lessons. Note how long it takes to create and consider how many grids are practical to draw using these techniques.
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/23)
## Review & Self Study
Line plots can range from simple to complex. Spend some time reading the [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/geom_path.html#:~:text=geom_line()%20connects%20them%20in,which%20cases%20are%20connected%20together) to learn about the various ways to build them. Try enhancing the line charts you created in this lesson using other methods described in the documentation.
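For example, one possible enhancement (a sketch that assumes the `honey` dataframe loaded earlier) redraws the yearly price trend with explicit ggplot2 layers instead of `qplot`, averaging the price across states for each year:
```r
library(ggplot2)
library(dplyr)
# Average price per pound across states, one value per year
honey_by_year <- honey %>%
  group_by(year) %>%
  summarise(avg_price = mean(priceperlb))
# A line layer plus a point layer, with labeled axes
ggplot(honey_by_year, aes(x = year, y = avg_price)) +
  geom_line(colour = "orange") +
  geom_point() +
  labs(x = "year", y = "average price per lb")
```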
## Assignment
[Dive into the beehive](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,182 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b4039f1c76548d144a0aee0bf28304ec",
"translation_date": "2025-08-31T11:04:38+00:00",
"source_file": "3-Data-Visualization/R/13-meaningful-vizualizations/README.md",
"language_code": "en"
}
-->
# Creating Meaningful Visualizations
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../../sketchnotes/13-MeaningfulViz.png)|
|:---:|
| Meaningful Visualizations - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase)
One of the essential skills for a data scientist is the ability to create meaningful data visualizations that help answer questions. Before visualizing your data, you need to ensure it has been cleaned and prepared, as covered in previous lessons. Once that's done, you can start deciding how best to present the data.
In this lesson, you will explore:
1. How to select the appropriate chart type
2. How to avoid misleading visualizations
3. How to use color effectively
4. How to style charts for better readability
5. How to create animated or 3D visualizations
6. How to design creative visualizations
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/24)
## Selecting the appropriate chart type
In earlier lessons, you experimented with creating various data visualizations using ggplot2 and base R plotting functions. Generally, you can choose the [appropriate chart type](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) based on the question you're trying to answer using this table:
| Task | Recommended Chart Type |
| -------------------------- | ------------------------------- |
| Show data trends over time | Line |
| Compare categories | Bar, Pie |
| Compare totals | Pie, Stacked Bar |
| Show relationships | Scatter, Line, Facet, Dual Line |
| Show distributions | Scatter, Histogram, Box |
| Show proportions | Pie, Donut, Waffle |
> ✅ Depending on the structure of your data, you may need to convert it from text to numeric format to make certain charts work.
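As an illustration, here is a small sketch with a hypothetical dataframe `df` (not part of this lesson's data) showing one way to handle that conversion in R:
```r
# A hypothetical dataframe with a text column
df <- data.frame(category = c("low", "medium", "high", "medium", "low"))
# Treat the text column as categorical for bar or pie charts
df$category <- factor(df$category, levels = c("low", "medium", "high"))
# Derive numeric codes when a chart type expects numbers
df$category_code <- as.numeric(df$category)
df
```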
## Avoiding misleading visualizations
Even when a data scientist carefully selects the right chart for the data, there are many ways data can be presented to support a specific narrative—sometimes at the expense of accuracy. There are countless examples of misleading charts and infographics!
[![How Charts Lie by Alberto Cairo](../../../../../3-Data-Visualization/R/13-meaningful-vizualizations/images/tornado.png)](https://www.youtube.com/watch?v=oX74Nge8Wkw "How charts lie")
> 🎥 Click the image above to watch a conference talk about misleading charts.
This chart flips the X-axis to present the opposite of the truth based on dates:
![bad chart 1](../../../../../3-Data-Visualization/R/13-meaningful-vizualizations/images/bad-chart-1.png)
[This chart](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg) is even more misleading. At first glance, it appears that COVID cases have declined over time in various counties. However, upon closer inspection, the dates have been rearranged to create a deceptive downward trend.
![bad chart 2](../../../../../3-Data-Visualization/R/13-meaningful-vizualizations/images/bad-chart-2.jpg)
This infamous example uses both color and a flipped Y-axis to mislead viewers. Instead of showing that gun deaths increased after the passage of gun-friendly legislation, the chart tricks the eye into believing the opposite:
![bad chart 3](../../../../../3-Data-Visualization/R/13-meaningful-vizualizations/images/bad-chart-3.jpg)
This peculiar chart demonstrates how proportions can be manipulated, often to humorous effect:
![bad chart 4](../../../../../3-Data-Visualization/R/13-meaningful-vizualizations/images/bad-chart-4.jpg)
Another deceptive tactic is comparing things that aren't comparable. A [fascinating website](https://tylervigen.com/spurious-correlations) showcases 'spurious correlations,' such as the divorce rate in Maine being linked to margarine consumption. A Reddit group also collects [examples of poor data usage](https://www.reddit.com/r/dataisugly/top/?t=all).
It's crucial to understand how easily the eye can be tricked by misleading charts. Even with good intentions, a poorly chosen chart type—like a pie chart with too many categories—can lead to confusion.
## Using color effectively
The 'Florida gun violence' chart above illustrates how color can add another layer of meaning to visualizations. Libraries like ggplot2 and RColorBrewer come with pre-designed color palettes, but if you're creating a chart manually, it's worth studying [color theory](https://colormatters.com/color-and-design/basic-color-theory).
> ✅ Keep accessibility in mind when designing charts. Some users may be colorblind—does your chart work well for those with visual impairments?
Be cautious when selecting colors for your chart, as they can convey unintended meanings. For example, the 'pink ladies' in the 'height' chart above add a distinctly 'feminine' connotation, which contributes to the chart's oddness.
While [color meanings](https://colormatters.com/color-symbolism/the-meanings-of-colors) can vary across cultures and change depending on the shade, general associations include:
| Color | Meaning |
| ------ | ------------------- |
| red | power |
| blue | trust, loyalty |
| yellow | happiness, caution |
| green | ecology, luck, envy |
| purple | happiness |
| orange | vibrance |
If you're tasked with creating a chart with custom colors, ensure that your choices align with the intended message and that the chart remains accessible.
## Styling charts for better readability
Charts lose their value if they're hard to read! Take time to adjust the width and height of your chart to fit the data appropriately. For example, if you're displaying all 50 states, consider showing them vertically on the Y-axis to avoid horizontal scrolling.
Label your axes, include a legend if needed, and provide tooltips for better data comprehension.
If your data includes verbose text on the X-axis, you can angle the text for improved readability. [plot3D](https://cran.r-project.org/web/packages/plot3D/index.html) offers 3D plotting capabilities if your data supports it. This can help create more sophisticated visualizations.
![3d plots](../../../../../3-Data-Visualization/R/13-meaningful-vizualizations/images/3d.png)
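As a small styling sketch of the angled-text tip above (assuming the Minnesota birds data from earlier lessons is available at this path), verbose Order names on the X-axis become readable once rotated:
```r
library(ggplot2)
# Hypothetical relative path to the birds dataset used in earlier lessons
birds <- read.csv("../../data/birds.csv", fileEncoding = "UTF-8-BOM")
ggplot(birds, aes(x = Order)) +
  geom_bar() +
  labs(y = "Number of species") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```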
## Animation and 3D visualizations
Some of the most engaging data visualizations today are animated. Shirley Wu has created stunning examples using D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/),' where each flower represents a movie. Another example is 'Bussed Out,' an interactive experience combining visualizations with Greensock and D3, paired with a scrollytelling article format to explore how NYC addresses homelessness by bussing people out of the city.
![busing](../../../../../3-Data-Visualization/R/13-meaningful-vizualizations/images/busing.png)
> "Bussed Out: How America Moves its Homeless" from [the Guardian](https://www.theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study). Visualizations by Nadieh Bremer & Shirley Wu
While this lesson doesn't delve deeply into these advanced visualization libraries, you can experiment with D3 in a Vue.js app to create an animated social network visualization based on the book "Dangerous Liaisons."
> "Les Liaisons Dangereuses" is an epistolary novel, meaning it is presented as a series of letters. Written in 1782 by Choderlos de Laclos, it tells the story of two morally corrupt French aristocrats, the Vicomte de Valmont and the Marquise de Merteuil, who engage in manipulative social schemes. Both ultimately meet their downfall, but not before causing significant social damage. The novel unfolds through letters exchanged among various characters, revealing plots for revenge and mischief. Create a visualization of these letters to identify the key players in the narrative.
You will build a web app that displays an animated view of this social network. The app uses a library designed to create a [network visualization](https://github.com/emiliorizzo/vue-d3-network) with Vue.js and D3. Once the app is running, you can drag nodes around the screen to rearrange the data.
![liaisons](../../../../../3-Data-Visualization/R/13-meaningful-vizualizations/images/liaisons.png)
## Project: Create a network visualization using D3.js
> The lesson folder includes a `solution` folder with the completed project for reference.
1. Follow the instructions in the README.md file located in the starter folder's root. Ensure you have NPM and Node.js installed on your machine before setting up the project's dependencies.
2. Open the `starter/src` folder. Inside, you'll find an `assets` folder containing a .json file with all the letters from the novel, annotated with 'to' and 'from' fields.
3. Complete the code in `components/Nodes.vue` to enable the visualization. Locate the method called `createLinks()` and add the following nested loop.
Loop through the .json object to extract the 'to' and 'from' data for the letters, building the `links` object for the visualization library:
```javascript
//loop through letters
let f = 0;
let t = 0;
for (var i = 0; i < letters.length; i++) {
for (var j = 0; j < characters.length; j++) {
if (characters[j] == letters[i].from) {
f = j;
}
if (characters[j] == letters[i].to) {
t = j;
}
}
this.links.push({ sid: f, tid: t });
}
```
Run your app from the terminal (npm run serve) and enjoy the visualization!
## 🚀 Challenge
Explore the internet to find examples of misleading visualizations. How does the author mislead the audience, and is it intentional? Try correcting the visualizations to show how they should appear.
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/25)
## Review & Self Study
Here are some articles about misleading data visualizations:
https://gizmodo.com/how-to-lie-with-data-visualization-1563576606
http://ixd.prattsi.org/2017/12/visual-lies-usability-in-deceptive-data-visualizations/
Check out these interesting visualizations of historical assets and artifacts:
https://handbook.pubpub.org/
Read this article on how animation can enhance visualizations:
https://medium.com/@EvanSinar/use-animation-to-supercharge-data-visualization-cd905a882ad4
## Assignment
[Create your own custom visualization](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,42 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1441550a0d789796b2821e04f7f4cc94",
"translation_date": "2025-08-31T11:02:05+00:00",
"source_file": "3-Data-Visualization/README.md",
"language_code": "en"
}
-->
# Visualizations
![a bee on a lavender flower](../../../3-Data-Visualization/images/bee.jpg)
> Photo by <a href="https://unsplash.com/@jenna2980?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Jenna Lee</a> on <a href="https://unsplash.com/s/photos/bees-in-a-meadow?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
Visualizing data is one of the most essential tasks for a data scientist. A picture is worth a thousand words, and a visualization can help you uncover various interesting aspects of your data, such as spikes, anomalies, clusters, trends, and more, enabling you to understand the story your data is telling.
In these five lessons, you will work with data sourced from nature and create engaging and visually appealing visualizations using different techniques.
| Topic Number | Topic | Linked Lesson | Author |
| :-----------: | :--: | :-----------: | :----: |
| 1. | Visualizing quantities | <ul> <li> [Python](09-visualization-quantities/README.md)</li> <li>[R](../../../3-Data-Visualization/R/09-visualization-quantities) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
| 2. | Visualizing distribution | <ul> <li> [Python](10-visualization-distributions/README.md)</li> <li>[R](../../../3-Data-Visualization/R/10-visualization-distributions) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
| 3. | Visualizing proportions | <ul> <li> [Python](11-visualization-proportions/README.md)</li> <li>[R](../../../3-Data-Visualization) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
| 4. | Visualizing relationships | <ul> <li> [Python](12-visualization-relationships/README.md)</li> <li>[R](../../../3-Data-Visualization) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
| 5. | Making Meaningful Visualizations | <ul> <li> [Python](13-meaningful-visualizations/README.md)</li> <li>[R](../../../3-Data-Visualization) </li> </ul>|<ul> <li> [Jen Looper](https://twitter.com/jenlooper)</li><li> [Vidushi Gupta](https://github.com/Vidushi-Gupta)</li> <li>[Jasleen Sondhi](https://github.com/jasleen101010)</li></ul> |
### Credits
These visualization lessons were created with 🌸 by [Jen Looper](https://twitter.com/jenlooper), [Jasleen Sondhi](https://github.com/jasleen101010), and [Vidushi Gupta](https://github.com/Vidushi-Gupta).
🍯 Data on US Honey Production is sourced from Jessica Li's project on [Kaggle](https://www.kaggle.com/jessicali9530/honey-production). The [data](https://usda.library.cornell.edu/concern/publications/rn301137d) originates from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
🍄 Data on mushrooms is also sourced from [Kaggle](https://www.kaggle.com/hatterasdunton/mushroom-classification-updated-dataset), revised by Hatteras Dunton. This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. The mushrooms are drawn from *The Audubon Society Field Guide to North American Mushrooms* (1981). This dataset was donated to UCI ML 27 in 1987.
🦆 Data on Minnesota Birds is sourced from [Kaggle](https://www.kaggle.com/hannahcollins/minnesota-birds), scraped from [Wikipedia](https://en.wikipedia.org/wiki/List_of_birds_of_Minnesota) by Hannah Collins.
All these datasets are licensed under [CC0: Creative Commons](https://creativecommons.org/publicdomain/zero/1.0/).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,123 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "c368f8f2506fe56bca0f7be05c4eb71d",
"translation_date": "2025-08-31T11:00:47+00:00",
"source_file": "4-Data-Science-Lifecycle/14-Introduction/README.md",
"language_code": "en"
}
-->
# Introduction to the Data Science Lifecycle
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/14-DataScience-Lifecycle.png)|
|:---:|
| Introduction to the Data Science Lifecycle - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/26)
By now, you've likely realized that data science is a process. This process can be divided into five stages:
- Capturing
- Processing
- Analysis
- Communication
- Maintenance
This lesson focuses on three parts of the lifecycle: capturing, processing, and maintenance.
![Diagram of the data science lifecycle](../../../../4-Data-Science-Lifecycle/14-Introduction/images/data-science-lifecycle.jpg)
> Image by [Berkeley School of Information](https://ischoolonline.berkeley.edu/data-science/what-is-data-science/)
## Capturing
The first stage of the lifecycle is crucial because the subsequent stages depend on it. It essentially combines two steps: acquiring the data and defining the purpose and problems to be addressed.
Defining the project's goals requires a deeper understanding of the problem or question. First, we need to identify and engage with those whose problem needs solving. These could be stakeholders in a business or project sponsors who can clarify who or what will benefit from the project, as well as what they need and why. A well-defined goal should be measurable and quantifiable to determine an acceptable outcome.
Questions a data scientist might ask:
- Has this problem been tackled before? What was discovered?
- Do all involved parties understand the purpose and goal?
- Is there any ambiguity, and how can it be reduced?
- What are the constraints?
- What might the end result look like?
- What resources (time, people, computational) are available?
Next, the focus shifts to identifying, collecting, and exploring the data needed to achieve these defined goals. During the acquisition step, data scientists must also evaluate the quantity and quality of the data. This involves some data exploration to ensure that the acquired data will support achieving the desired outcome.
Questions a data scientist might ask about the data:
- What data is already available to me?
- Who owns this data?
- What are the privacy concerns?
- Do I have enough data to solve this problem?
- Is the data of sufficient quality for this problem?
- If additional insights are uncovered through this data, should we consider revising or redefining the goals?
## Processing
The processing stage of the lifecycle focuses on uncovering patterns in the data and building models. Some techniques used in this stage involve statistical methods to identify patterns. For large datasets, this task would be too time-consuming for a human, so computers are used to speed up the process. This stage is also where data science and machine learning intersect. As you learned in the first lesson, machine learning involves building models to understand the data. Models represent the relationships between variables in the data and help predict outcomes.
Common techniques used in this stage are covered in the ML for Beginners curriculum. Follow the links to learn more about them:
- [Classification](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification): Organizing data into categories for more efficient use.
- [Clustering](https://github.com/microsoft/ML-For-Beginners/tree/main/5-Clustering): Grouping data into similar clusters.
- [Regression](https://github.com/microsoft/ML-For-Beginners/tree/main/2-Regression): Determining relationships between variables to predict or forecast values.
## Maintaining
In the lifecycle diagram, you may have noticed that maintenance sits between capturing and processing. Maintenance is an ongoing process of managing, storing, and securing the data throughout the project and should be considered throughout the project's duration.
### Storing Data
Decisions about how and where data is stored can impact storage costs and the performance of data access. These decisions are unlikely to be made by a data scientist alone, but they may influence how the data scientist works with the data based on its storage method.
Here are some aspects of modern data storage systems that can affect these decisions:
**On-premise vs. off-premise vs. public or private cloud**
On-premise refers to hosting and managing data on your own equipment, such as owning a server with hard drives to store the data. Off-premise relies on equipment you don't own, such as a data center. The public cloud is a popular choice for storing data, requiring no knowledge of how or where the data is stored. Public refers to a shared underlying infrastructure used by all cloud users. Some organizations have strict security policies requiring complete control over the equipment where the data is hosted, so they use a private cloud that provides dedicated cloud services. You'll learn more about data in the cloud in [later lessons](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/5-Data-Science-In-Cloud).
**Cold vs. hot data**
When training models, you may need more training data. Once your model is finalized, additional data will still arrive for the model to fulfill its purpose. In either case, the cost of storing and accessing data increases as more data accumulates. Separating rarely used data (cold data) from frequently accessed data (hot data) can be a more cost-effective storage solution using hardware or software services. Accessing cold data may take longer compared to hot data.
### Managing Data
As you work with data, you may find that some of it needs cleaning using techniques covered in the lesson on [data preparation](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/2-Working-With-Data/08-data-preparation) to build accurate models. When new data arrives, it will require similar processing to maintain quality consistency. Some projects use automated tools for cleansing, aggregation, and compression before moving the data to its final location. Azure Data Factory is an example of such a tool.
### Securing the Data
One of the main goals of securing data is ensuring that those working with it have control over what is collected and how it is used. Keeping data secure involves limiting access to only those who need it, complying with local laws and regulations, and maintaining ethical standards, as discussed in the [ethics lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/1-Introduction/02-ethics).
Here are some steps a team might take to ensure security:
- Ensure all data is encrypted.
- Provide customers with information about how their data is used.
- Remove data access for individuals who have left the project.
- Restrict data modification to specific project members.
## 🚀 Challenge
There are many versions of the Data Science Lifecycle, with different names and numbers of stages, but they all include the processes discussed in this lesson.
Explore the [Team Data Science Process lifecycle](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle) and the [Cross-industry standard process for data mining](https://www.datascience-pm.com/crisp-dm-2/). Identify three similarities and differences between the two.
|Team Data Science Process (TDSP)|Cross-industry standard process for data mining (CRISP-DM)|
|--|--|
|![Team Data Science Lifecycle](../../../../4-Data-Science-Lifecycle/14-Introduction/images/tdsp-lifecycle2.png) | ![Data Science Process Alliance Image](../../../../4-Data-Science-Lifecycle/14-Introduction/images/CRISP-DM.png) |
| Image by [Microsoft](https://docs.microsoft.com/azure/architecture/data-science-process/lifecycle) | Image by [Data Science Process Alliance](https://www.datascience-pm.com/crisp-dm-2/) |
## [Post-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/27)
## Review & Self Study
Applying the Data Science Lifecycle involves multiple roles and tasks, with some focusing on specific parts of each stage. The Team Data Science Process provides resources explaining the roles and tasks involved in a project.
* [Team Data Science Process roles and tasks](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/roles-tasks)
* [Execute data science tasks: exploration, modeling, and deployment](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/execute-data-science-tasks)
## Assignment
[Assessing a Dataset](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,37 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "564445c39ad29a491abcb9356fc4d47d",
"translation_date": "2025-08-31T11:01:12+00:00",
"source_file": "4-Data-Science-Lifecycle/14-Introduction/assignment.md",
"language_code": "en"
}
-->
# Assessing a Dataset
A client has reached out to your team for assistance in analyzing the seasonal spending habits of taxi customers in New York City.
They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is currently in the [Capturing](Readme.md#Capturing) phase of the Data Science Lifecycle, and you are responsible for managing the dataset. You have been provided with a notebook and [data](../../../../data/taxi.csv) to examine.
In this directory, there is a [notebook](../../../../4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb) that uses Python to load yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets).
You can also open the taxi data file using a text editor or spreadsheet software like Excel.
## Instructions
- Evaluate whether the data in this dataset is sufficient to answer the question.
- Explore the [NYC Open Data catalog](https://data.cityofnewyork.us/browse?sortBy=most_accessed&utf8=%E2%9C%93). Identify an additional dataset that might be useful in addressing the client's question.
- Formulate 3 questions to ask the client for further clarification and a deeper understanding of the problem.
Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more details about the data.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,61 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "d92f57eb110dc7f765c05cbf0f837c77",
"translation_date": "2025-08-31T11:00:04+00:00",
"source_file": "4-Data-Science-Lifecycle/15-analyzing/README.md",
"language_code": "en"
}
-->
# The Data Science Lifecycle: Analyzing
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/15-Analyzing.png)|
|:---:|
| Data Science Lifecycle: Analyzing - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/28)
The "Analyzing" phase in the data lifecycle ensures that the data can address the questions posed or solve a specific problem. This step may also involve verifying that a model is effectively tackling these questions and issues. This lesson focuses on Exploratory Data Analysis (EDA), which includes techniques for identifying features and relationships within the data, as well as preparing the data for modeling.
We'll use an example dataset from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv/version/1) to demonstrate how this can be done using Python and the Pandas library. This dataset contains counts of common words found in emails, with the sources of these emails anonymized. Use the [notebook](../../../../4-Data-Science-Lifecycle/15-analyzing/notebook.ipynb) in this directory to follow along.
## Exploratory Data Analysis
The "Capture" phase of the lifecycle involves acquiring data and defining the problems and questions at hand. But how can we confirm that the data will support the desired outcomes?
A data scientist might ask the following questions when working with acquired data:
- Do I have enough data to solve this problem?
- Is the data of sufficient quality for this problem?
- If new insights emerge from the data, should we consider revising or redefining the goals?
Exploratory Data Analysis is the process of familiarizing yourself with the data and can help answer these questions, as well as identify challenges associated with the dataset. Let's explore some techniques used to achieve this.
## Data Profiling, Descriptive Statistics, and Pandas
How can we determine if we have enough data to solve the problem? Data profiling provides a summary and general overview of the dataset using descriptive statistics techniques. Data profiling helps us understand what is available, while descriptive statistics help us understand how much is available.
In previous lessons, we used Pandas to generate descriptive statistics with the [`describe()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html). This function provides the count, maximum and minimum values, mean, standard deviation, and quantiles for numerical data. Using descriptive statistics like `describe()` can help you evaluate whether you have sufficient data or need more.
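As a quick illustration, the sketch below loads a CSV into a DataFrame and calls `describe()`; the file name `emails.csv` is only an assumption for the email dataset mentioned above, so substitute the file you actually downloaded.

```python
import pandas as pd

# File name is an assumption; use the CSV you downloaded from the Kaggle link above
df = pd.read_csv("emails.csv")

# describe() summarizes numerical columns: count, mean, std, min, quartiles, and max
print(df.describe())
```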
## Sampling and Querying
Exploring every detail in a large dataset can be time-consuming and is often left to computers. However, sampling is a useful technique for gaining a better understanding of the data and what it represents. By working with a sample, you can apply probability and statistics to draw general conclusions about the dataset. While there's no strict rule for how much data to sample, it's worth noting that larger samples lead to more accurate generalizations about the data.
Pandas includes the [`sample()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), which allows you to specify the number of random samples you want to extract and use.
General querying of the data can help answer specific questions or test theories you may have. Unlike sampling, queries allow you to focus on particular parts of the dataset that are relevant to your questions. The [`query()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library lets you select columns and retrieve rows to answer specific questions about the data.
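The sketch below shows both functions on the same DataFrame; the file name and the word-count column `the` are assumptions about the email dataset, so replace them with names present in your data.

```python
import pandas as pd

df = pd.read_csv("emails.csv")  # assumed file name, as above

# sample() draws random rows; a fixed random_state keeps the draw reproducible
print(df.sample(n=10, random_state=42))

# query() keeps only rows matching a condition; the column name 'the' is an assumption
print(len(df.query("the > 5")), "emails use the word 'the' more than five times")
```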
## Exploring with Visualizations
You don't need to wait until the data is fully cleaned and analyzed to start creating visualizations. In fact, visualizations can be helpful during exploration, as they can reveal patterns, relationships, and issues within the data. Additionally, visualizations provide a way to communicate findings to people who aren't directly involved in managing the data, offering an opportunity to share and refine questions that may not have been addressed during the "Capture" phase. Refer to the [section on Visualizations](../../../../../../../../../3-Data-Visualization) to learn more about popular methods for visual exploration.
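As a hedged example of quick exploratory plotting with pandas (which draws on matplotlib), assuming the same email dataset as above:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("emails.csv")  # assumed file name, as above

# Histograms of the first few numerical columns give a fast sense of distributions and outliers
df.select_dtypes("number").iloc[:, :6].hist(figsize=(10, 6))
plt.tight_layout()
plt.show()
```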
## Exploring to Identify Inconsistencies
The techniques covered in this lesson can help identify missing or inconsistent values, but Pandas also offers specific functions for detecting these issues. [isna() or isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) can be used to check for missing values. An important part of exploring these values is understanding why they are missing in the first place. This insight can guide you in deciding what [actions to take to resolve them](../../../../../../../../../2-Working-With-Data/08-data-preparation/notebook.ipynb).
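For example, a minimal sketch of checking for missing values with pandas, again assuming the email dataset file name used above:

```python
import pandas as pd

df = pd.read_csv("emails.csv")  # assumed file name, as above

# isna() returns a boolean mask; summing it counts missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])

# The overall share of missing cells hints at how serious the problem is
print(f"{df.isna().to_numpy().mean():.2%} of all cells are missing")
```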
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/27)
## Assignment
[Exploring for answers](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,36 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "fcc7547171f4530f159676dd73ed772e",
"translation_date": "2025-08-31T11:00:18+00:00",
"source_file": "4-Data-Science-Lifecycle/15-analyzing/assignment.md",
"language_code": "en"
}
-->
# Exploring for answers
This is a continuation of the previous lesson's [assignment](../14-Introduction/assignment.md), where we briefly examined the dataset. Now, we will dive deeper into the data.
Once again, the question the client wants answered is: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is currently in the [Analyzing](README.md) stage of the Data Science Lifecycle, where you are tasked with performing exploratory data analysis (EDA) on the dataset. You have been provided with a notebook and a dataset containing 200 taxi transactions from January and July 2019.
## Instructions
In this directory, you will find a [notebook](../../../../4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb) and data from the [Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets). For more details about the data, refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf).
Use some of the techniques covered in this lesson to conduct your own EDA in the notebook (feel free to add cells if needed) and answer the following questions:
- What other factors in the data might influence the tip amount?
- Which columns are likely unnecessary for answering the client's question?
- Based on the data provided so far, does it appear to show any evidence of seasonal tipping patterns?
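To get started, here is a minimal sketch of one possible comparison; the column names `tip_amount` and `tpep_pickup_datetime` follow the TLC data dictionary linked above, but verify them against the provided file before relying on them.

```python
import pandas as pd

# Path and column names are assumptions based on the TLC data dictionary; verify against the provided file
taxi = pd.read_csv("taxi.csv", parse_dates=["tpep_pickup_datetime"])

# Compare tips between January (winter) and July (summer) trips
taxi["month"] = taxi["tpep_pickup_datetime"].dt.month
print(taxi.groupby("month")["tip_amount"].agg(["count", "mean", "median"]))
```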
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,221 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1ac43023e78bfe76481a32c878ace516",
"translation_date": "2025-08-31T11:01:18+00:00",
"source_file": "4-Data-Science-Lifecycle/16-communication/README.md",
"language_code": "en"
}
-->
# The Data Science Lifecycle: Communication
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev)](../../sketchnotes/16-Communicating.png)|
|:---:|
| Data Science Lifecycle: Communication - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/30)
Test your knowledge of the upcoming content with the Pre-Lecture Quiz above!
# Introduction
### What is Communication?
Let's begin this lesson by defining communication. **To communicate is to convey or exchange information.** Information can include ideas, thoughts, feelings, messages, subtle signals, data—anything that a **_sender_** (someone sharing information) wants a **_receiver_** (someone receiving information) to understand. In this lesson, we'll refer to senders as communicators and receivers as the audience.
### Data Communication & Storytelling
When communicating, the goal is to convey or exchange information. However, when communicating data, your goal shouldn't just be to pass along numbers. Instead, you should aim to tell a story informed by your data—effective data communication and storytelling go hand-in-hand. Your audience is more likely to remember a story you tell than a number you share. Later in this lesson, we'll explore ways to use storytelling to communicate your data more effectively.
### Types of Communication
This lesson will cover two types of communication: One-Way Communication and Two-Way Communication.
**One-way communication** occurs when a sender shares information with a receiver without expecting feedback or a response. Examples of one-way communication include mass emails, news broadcasts, or TV commercials that inform you about a product. In these cases, the sender's goal is to deliver information, not to engage in an exchange.
**Two-way communication** happens when all parties act as both senders and receivers. A sender begins by sharing information, and the receiver provides feedback or a response. This is the type of communication we typically think of, such as conversations in person, over the phone, on social media, or via text messages.
When communicating data, you may use one-way communication (e.g., presenting at a conference or to a large group where questions won't be asked immediately) or two-way communication (e.g., persuading stakeholders for buy-in or convincing a teammate to invest time and effort in a new project).
# Effective Communication
### Your Responsibilities as a Communicator
As a communicator, it's your responsibility to ensure that your audience takes away the information you want them to understand. When communicating data, you don't just want your audience to remember numbers—you want them to grasp a story informed by your data. A good data communicator is also a good storyteller.
How do you tell a story with data? There are countless ways, but here are five strategies we'll discuss in this lesson:
1. Understand Your Audience, Your Medium, & Your Communication Method
2. Begin with the End in Mind
3. Approach it Like an Actual Story
4. Use Meaningful Words & Phrases
5. Use Emotion
Each of these strategies is explained in detail below.
### 1. Understand Your Audience, Your Channel & Your Communication Method
The way you communicate with family members is likely different from how you communicate with friends. You probably use different words and phrases tailored to the people you're speaking to. The same principle applies when communicating data. Consider who your audience is, their goals, and the context they have about the situation you're explaining.
You can often categorize your audience. In a _Harvard Business Review_ article, “[How to Tell a Story with Data](http://blogs.hbr.org/2013/04/how-to-tell-a-story-with-data/),” Dell Executive Strategist Jim Stikeleather identifies five audience categories:
- **Novice**: First exposure to the subject but doesn't want oversimplification.
- **Generalist**: Aware of the topic but looking for an overview and major themes.
- **Managerial**: Seeks an in-depth, actionable understanding of intricacies and interrelationships, with access to details.
- **Expert**: Prefers exploration and discovery over storytelling, with a focus on great detail.
- **Executive**: Has limited time and wants to understand the significance and conclusions of weighted probabilities.
These categories can guide how you present data to your audience.
Additionally, consider the channel you're using to communicate. Your approach will differ if you're writing a memo or email versus presenting at a meeting or conference.
Understanding whether you'll use one-way or two-way communication is also critical. For example:
- If your audience is mostly Novices and you're using one-way communication, you'll need to educate them and provide context before presenting your data and explaining its significance. Clarity is key since they can't ask direct questions.
- If your audience is mostly Managerial and you're using two-way communication, you can likely skip the context and dive into the data and its implications. However, you'll need to manage timing and keep the discussion on track, as questions may arise that could derail your story.
### 2. Begin With The End In Mind
Starting with the end in mind means knowing your intended takeaways for the audience before you begin communicating. Being clear about what you want your audience to learn helps you craft a coherent story. This approach works for both one-way and two-way communication.
How do you start with the end in mind? Before communicating your data, write down your key takeaways. As you prepare your story, continually ask yourself, “How does this fit into the story I'm telling?”
**Caution**: While starting with the end in mind is ideal, avoid cherry-picking data—only sharing data that supports your point while ignoring other data. If some of your data contradicts your takeaways, share it honestly and explain why you're sticking with your conclusions despite the conflicting data.
### 3. Approach it Like an Actual Story
Traditional stories often follow five phases: Exposition, Rising Action, Climax, Falling Action, and Denouement. Or, more simply: Context, Conflict, Climax, Closure, and Conclusion. You can use a similar structure when communicating data.
- **Context**: Set the stage and ensure everyone is on the same page.
- **Conflict**: Explain why you collected the data and the problem youre addressing.
- **Climax**: Present the data, its meaning, and the solutions it suggests.
- **Closure**: Reiterate the problem and proposed solutions.
- **Conclusion**: Summarize key takeaways and recommend next steps.
### 4. Use Meaningful Words & Phrases
If I told you, “Our users take a long time to onboard onto our platform,” how long would you think “a long time” is? An hour? A week? It's unclear. Now imagine I said this to an audience—each person might interpret “a long time” differently.
Instead, what if I said, “Our users take, on average, 3 minutes to sign up and onboard onto our platform”? That's much clearer.
When communicating data, don't assume your audience thinks like you. Clarity is your responsibility. If your data or story isn't clear, your audience may struggle to follow and miss your key takeaways.
Use specific, meaningful words and phrases instead of vague ones. For example:
- “We had an *impressive* year!” (What does “impressive” mean? A 2% increase? A 50% increase?)
- “Our users' success rates increased *dramatically*.” (How much is “dramatic”?)
- “This project will require *significant* effort.” (What does “significant” mean?)
While vague words can be useful for introductions or summaries, ensure the rest of your presentation is precise and clear.
### 5. Use Emotion
Emotion is a powerful tool in storytelling, especially when communicating data. It helps your audience empathize, makes them more likely to take action, and increases the chances they'll remember your message.
You've likely seen this in TV commercials. Some use somber tones to evoke sadness and emphasize their message, while others are upbeat and associate their data with happiness.
Here are a few ways to use emotion when communicating data:
- **Testimonials and Personal Stories**: Collect both quantitative and qualitative data. If your data is mostly quantitative, gather personal stories to add depth and context.
- **Imagery**: Use images to help your audience visualize the situation and feel the emotion you want to convey.
- **Color**: Colors evoke different emotions. For example:
- Blue: Peace and trust
- Green: Nature and environment
- Red: Passion and excitement
- Yellow: Optimism and happiness
Be mindful that colors can have different meanings in different cultures.
# Communication Case Study
Emerson is a Product Manager for a mobile app. Emerson notices that customers submit 42% more complaints and bug reports on weekends. Additionally, customers who don't receive a response to their complaints within 48 hours are 32% more likely to rate the app 1 or 2 stars in the app store.
After researching, Emerson identifies two solutions to address the issue. Emerson schedules a 30-minute meeting with the three company leads to present the data and proposed solutions.
The goal of the meeting is to help the company leads understand that the following two solutions can improve the app's rating, which could lead to higher revenue:
**Solution 1.** Hire customer service reps to work on weekends.
**Solution 2.** Purchase a new customer service ticketing system that helps reps prioritize complaints based on how long they've been in the queue.
In the meeting, Emerson spends 5 minutes explaining why having a low rating on the app store is problematic, 10 minutes discussing the research process and how trends were identified, 10 minutes reviewing recent customer complaints, and the final 5 minutes briefly covering two potential solutions.
Was this an effective way for Emerson to communicate during this meeting?
During the meeting, one company lead became fixated on the 10 minutes of customer complaints Emerson presented. After the meeting, these complaints were the only thing this team lead remembered. Another company lead primarily focused on Emerson's explanation of the research process. The third company lead did recall the solutions Emerson proposed but wasn't sure how those solutions could be implemented.
In the situation above, it's clear there was a significant gap between what Emerson intended for the team leads to take away and what they actually took away from the meeting. Below is an alternative approach Emerson could consider.
How could Emerson improve this approach?
Context, Conflict, Climax, Closure, Conclusion
**Context** - Emerson could spend the first 5 minutes introducing the overall situation and ensuring the team leads understand how the problems impact key company metrics, such as revenue.
It could be framed like this: "Currently, our app's rating in the app store is 2.5. App store ratings are crucial for App Store Optimization, which affects how many users discover our app in search results and how potential users perceive it. Naturally, the number of users we attract is directly tied to revenue."
**Conflict** - Emerson could then dedicate the next 5 minutes to discussing the conflict.
It could be presented like this: “Users submit 42% more complaints and bug reports on weekends. Customers who submit a complaint that remains unanswered for over 48 hours are 32% less likely to rate our app above a 2 in the app store. Improving our app's rating to a 4 would boost visibility by 20-30%, which I estimate could increase revenue by 10%." Emerson should be ready to back up these figures with evidence.
**Climax** - After establishing the context and conflict, Emerson could move to the climax for about 5 minutes.
Here, Emerson could introduce the proposed solutions, explain how they address the outlined issues, detail how they could be integrated into current workflows, provide cost estimates, discuss the ROI, and perhaps even share screenshots or wireframes illustrating how the solutions would look in practice. Emerson could also include testimonials from users whose complaints took over 48 hours to resolve, as well as feedback from a current customer service representative about the existing ticketing system.
**Closure** - Emerson could then spend 5 minutes summarizing the company's challenges, revisiting the proposed solutions, and reinforcing why these solutions are the right choice.
**Conclusion** - Since this is a meeting with a few stakeholders involving two-way communication, Emerson could allocate 10 minutes for questions to ensure any confusion among the team leads is addressed before the meeting concludes.
If Emerson adopted approach #2, it's far more likely the team leads would leave the meeting with the intended takeaways: that the handling of complaints and bugs needs improvement, and there are two actionable solutions to achieve that improvement. This approach would be a much more effective way to communicate the data and the narrative Emerson wants to convey.
---
# Conclusion
### Summary of main points
- Communication is the act of conveying or exchanging information.
- When communicating data, the goal isn't just to share numbers—it's to tell a story informed by the data.
- There are two types of communication: One-Way Communication (information is shared without expecting a response) and Two-Way Communication (information is exchanged interactively).
- There are various strategies for telling a story with data. The five strategies discussed are:
- Understand Your Audience, Your Medium, & Your Communication Method
- Begin with the End in Mind
- Approach it Like an Actual Story
- Use Meaningful Words & Phrases
- Use Emotion
### Recommended Resources for Self Study
[The Five C's of Storytelling - Articulate Persuasion](http://articulatepersuasion.com/the-five-cs-of-storytelling/)
[1.4 Your Responsibilities as a Communicator - Business Communication for Success (umn.edu)](https://open.lib.umn.edu/businesscommunication/chapter/1-4-your-responsibilities-as-a-communicator/)
[How to Tell a Story with Data (hbr.org)](https://hbr.org/2013/04/how-to-tell-a-story-with-data)
[Two-Way Communication: 4 Tips for a More Engaged Workplace (yourthoughtpartner.com)](https://www.yourthoughtpartner.com/blog/bid/59576/4-steps-to-increase-employee-engagement-through-two-way-communication)
[6 succinct steps to great data storytelling - BarnRaisers, LLC (barnraisersllc.com)](https://barnraisersllc.com/2021/05/02/6-succinct-steps-to-great-data-storytelling/)
[How to Tell a Story With Data | Lucidchart Blog](https://www.lucidchart.com/blog/how-to-tell-a-story-with-data)
[6 Cs of Effective Storytelling on Social Media | Cooler Insights](https://coolerinsights.com/2018/06/effective-storytelling-social-media/)
[The Importance of Emotions In Presentations | Ethos3 - A Presentation Training and Design Agency](https://ethos3.com/2015/02/the-importance-of-emotions-in-presentations/)
[Data storytelling: linking emotions and rational decisions (toucantoco.com)](https://www.toucantoco.com/en/blog/data-storytelling-dataviz)
[Emotional Advertising: How Brands Use Feelings to Get People to Buy (hubspot.com)](https://blog.hubspot.com/marketing/emotions-in-advertising-examples)
[Choosing Colors for Your Presentation Slides | Think Outside The Slide](https://www.thinkoutsidetheslide.com/choosing-colors-for-your-presentation-slides/)
[How To Present Data [10 Expert Tips] | ObservePoint](https://resources.observepoint.com/blog/10-tips-for-presenting-data)
[Microsoft Word - Persuasive Instructions.doc (tpsnva.org)](https://www.tpsnva.org/teach/lq/016/persinstr.pdf)
[The Power of Story for Your Data (thinkhdi.com)](https://www.thinkhdi.com/library/supportworld/2019/power-story-your-data.aspx)
[Common Mistakes in Data Presentation (perceptualedge.com)](https://www.perceptualedge.com/articles/ie/data_presentation.pdf)
[Infographic: Here are 15 Common Data Fallacies to Avoid (visualcapitalist.com)](https://www.visualcapitalist.com/here-are-15-common-data-fallacies-to-avoid/)
[Cherry Picking: When People Ignore Evidence that They Dislike - Effectiviology](https://effectiviology.com/cherry-picking/#How_to_avoid_cherry_picking)
[Tell Stories with Data: Communication in Data Science | by Sonali Verghese | Towards Data Science](https://towardsdatascience.com/tell-stories-with-data-communication-in-data-science-5266f7671d7)
[1. Communicating Data - Communicating Data with Tableau [Book] (oreilly.com)](https://www.oreilly.com/library/view/communicating-data-with/9781449372019/ch01.html)
---
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/31)
Review what you've just learned with the Post-Lecture Quiz above!
---
## Assignment
[Market Research](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,26 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "8980d7efd101c82d6d6ffc3458214120",
"translation_date": "2025-08-31T11:02:00+00:00",
"source_file": "4-Data-Science-Lifecycle/16-communication/assignment.md",
"language_code": "en"
}
-->
# Tell a story
## Instructions
Data Science is all about storytelling. Choose any dataset and write a brief paper about the story it can tell. What insights do you hope to uncover from your dataset? How will you handle it if the findings are challenging or unexpected? What if the data doesn't easily reveal its patterns? Consider the different scenarios your dataset might present and document them.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
A one-page essay is submitted in .doc format, with the dataset clearly explained, documented, and credited, and a well-structured story developed using detailed examples from the data. | A shorter essay is submitted with less detail. | The essay is missing one or more of the required elements.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,30 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "dd173fd30fc039a7a299898920680723",
"translation_date": "2025-08-31T10:59:58+00:00",
"source_file": "4-Data-Science-Lifecycle/README.md",
"language_code": "en"
}
-->
# The Data Science Lifecycle
![communication](../../../4-Data-Science-Lifecycle/images/communication.jpg)
> Photo by <a href="https://unsplash.com/@headwayio?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Headway</a> on <a href="https://unsplash.com/s/photos/communication?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In these lessons, you'll dive into various aspects of the Data Science lifecycle, including data analysis and effective communication.
### Topics
1. [Introduction](14-Introduction/README.md)
2. [Analyzing](15-analyzing/README.md)
3. [Communication](16-communication/README.md)
### Credits
These lessons were created with ❤️ by [Jalen McGee](https://twitter.com/JalenMCG) and [Jasmine Greenaway](https://twitter.com/paladique)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,116 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "408c55cab2880daa4e78616308bd5db7",
"translation_date": "2025-08-31T10:56:02+00:00",
"source_file": "5-Data-Science-In-Cloud/17-Introduction/README.md",
"language_code": "en"
}
-->
# Introduction to Data Science in the Cloud
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/17-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Introduction - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you will learn the basic principles of the Cloud, understand why using Cloud services can be beneficial for your data science projects, and explore examples of data science projects implemented in the Cloud.
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/32)
## What is the Cloud?
The Cloud, or Cloud Computing, refers to the delivery of a variety of pay-as-you-go computing services hosted on infrastructure over the internet. These services include storage, databases, networking, software, analytics, and intelligent services.
We typically distinguish between Public, Private, and Hybrid clouds as follows:
* Public cloud: A public cloud is owned and operated by a third-party cloud service provider that delivers its computing resources over the Internet to the public.
* Private cloud: Refers to cloud computing resources used exclusively by a single business or organization, with services and infrastructure maintained on a private network.
* Hybrid cloud: A hybrid cloud combines public and private clouds. Users can maintain an on-premises datacenter while running data and applications on one or more public clouds.
Most cloud computing services fall into three main categories: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
* Infrastructure as a Service (IaaS): Users rent IT infrastructure such as servers, virtual machines (VMs), storage, networks, and operating systems.
* Platform as a Service (PaaS): Users rent an environment for developing, testing, delivering, and managing software applications without worrying about the underlying infrastructure.
* Software as a Service (SaaS): Users access software applications over the Internet, typically on a subscription basis, without managing hosting, infrastructure, or maintenance tasks like updates and security patches.
Some of the largest Cloud providers include Amazon Web Services, Google Cloud Platform, and Microsoft Azure.
## Why Choose the Cloud for Data Science?
Developers and IT professionals choose to work with the Cloud for several reasons, including:
* Innovation: Integrate innovative services provided by Cloud providers directly into your applications.
* Flexibility: Pay only for the services you need, with a wide range of options. You can adapt services as your needs evolve.
* Budget: Avoid upfront investments in hardware and software, and pay only for what you use.
* Scalability: Scale resources up or down based on project needs, allowing applications to adjust computing power, storage, and bandwidth dynamically.
* Productivity: Focus on your business instead of managing datacenters and other infrastructure tasks.
* Reliability: Benefit from continuous data backups and disaster recovery plans to ensure business continuity during crises.
* Security: Leverage policies, technologies, and controls to enhance the security of your projects.
These are some of the most common reasons for using Cloud services. Now that we understand what the Cloud is and its benefits, let's explore how it can help data scientists and developers address challenges such as:
* Storing large amounts of data: Instead of managing large servers, store data in the Cloud using solutions like Azure Cosmos DB, Azure SQL Database, and Azure Data Lake Storage.
* Performing Data Integration: Transition from data collection to actionable insights using Cloud-based data integration services like Data Factory.
* Processing data: Harness the Cloud's computing power to process large datasets without needing powerful local machines.
* Using data analytics services: Turn data into actionable insights with services like Azure Synapse Analytics, Azure Stream Analytics, and Azure Databricks.
* Using Machine Learning and data intelligence services: Leverage pre-built machine learning algorithms and cognitive services like speech-to-text, text-to-speech, and computer vision with services like AzureML.
## Examples of Data Science in the Cloud
Let's make this more concrete by exploring a couple of scenarios.
### Real-time social media sentiment analysis
A common beginner project in machine learning is real-time sentiment analysis of social media data.
Imagine you run a news media website and want to use live data to understand what content your readers might be interested in. You could build a program to analyze the sentiment of Twitter posts in real time on topics relevant to your audience.
Key indicators include the volume of tweets on specific topics (hashtags) and sentiment, determined using analytics tools.
Steps to create this project:
* Create an event hub to collect streaming input from Twitter.
* Configure and start a Twitter client application to call the Twitter Streaming APIs.
* Create a Stream Analytics job.
* Specify the job input and query.
* Create an output sink and specify the job output.
* Start the job.
For the full process, refer to the [documentation](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?WT.mc_id=academic-77958-bethanycheum&ocid=AID30411099).
### Scientific papers analysis
Here's another example: a project by [Dmitry Soshnikov](http://soshnikov.com), one of the authors of this curriculum.
Dmitry created a tool to analyze COVID-related scientific papers. This project demonstrates how to extract knowledge from scientific papers, gain insights, and help researchers navigate large collections of papers efficiently.
Steps involved:
* Extract and preprocess information using [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
* Use [Azure ML](https://azure.microsoft.com/services/machine-learning?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) to parallelize processing.
* Store and query information with [Cosmos DB](https://azure.microsoft.com/services/cosmos-db?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
* Create an interactive dashboard for data exploration and visualization using Power BI.
For the full process, visit [Dmitry's blog](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/).
As you can see, Cloud services offer numerous ways to perform Data Science.
## Footnote
Sources:
* https://azure.microsoft.com/overview/what-is-cloud-computing?ocid=AID3041109
* https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?ocid=AID3041109
* https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/
## Post-Lecture Quiz
[Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/33)
## Assignment
[Market Research](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "96f3696153d9ed54b19a1bb65438c104",
"translation_date": "2025-08-31T10:56:22+00:00",
"source_file": "5-Data-Science-In-Cloud/17-Introduction/assignment.md",
"language_code": "en"
}
-->
# Market Research
## Instructions
In this lesson, you learned that there are several major cloud providers. Conduct market research to explore what each one offers to Data Scientists. Are their offerings similar? Write a paper describing the services provided by three or more of these cloud providers.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | --- |
A one-page paper thoroughly describes the data science offerings of three cloud providers and highlights the differences between them. | A shorter paper is provided. | A paper is submitted but lacks a complete analysis.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,348 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "14b2a7f1c63202920bd98eeb913f5614",
"translation_date": "2025-08-31T10:54:46+00:00",
"source_file": "5-Data-Science-In-Cloud/18-Low-Code/README.md",
"language_code": "en"
}
-->
# Data Science in the Cloud: The "Low code/No code" way
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/18-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Low Code - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Table of contents:
- [Data Science in the Cloud: The "Low code/No code" way](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [Pre-Lecture quiz](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [1. Introduction](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [1.1 What is Azure Machine Learning?](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [1.2 The Heart Failure Prediction Project:](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [1.3 The Heart Failure Dataset:](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [2. Low code/No code training of a model in Azure ML Studio](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [2.1 Create an Azure ML workspace](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [2.2 Compute Resources](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [2.2.1 Choosing the right options for your compute resources](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [2.2.2 Creating a compute cluster](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [2.3 Loading the Dataset](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [2.4 Low code/No Code training with AutoML](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [3. Low code/No Code model deployment and endpoint consumption](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [3.1 Model deployment](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [3.2 Endpoint consumption](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [🚀 Challenge](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [Post-Lecture Quiz](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [Review & Self Study](../../../../5-Data-Science-In-Cloud/18-Low-Code)
- [Assignment](../../../../5-Data-Science-In-Cloud/18-Low-Code)
## [Pre-Lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/34)
## 1. Introduction
### 1.1 What is Azure Machine Learning?
The Azure cloud platform offers over 200 products and services designed to help you create innovative solutions. Data scientists spend a significant amount of time exploring and preparing data, as well as testing various model-training algorithms to achieve accurate results. These tasks can be time-intensive and often lead to inefficient use of costly compute resources.
[Azure ML](https://docs.microsoft.com/azure/machine-learning/overview-what-is-azure-machine-learning?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) is a cloud-based platform for developing and managing machine learning solutions in Azure. It provides a variety of tools and features to help data scientists prepare data, train models, deploy predictive services, and monitor their usage. Most importantly, it enhances efficiency by automating many of the repetitive tasks involved in training models. It also allows the use of scalable cloud-based compute resources, enabling the handling of large datasets while incurring costs only when resources are actively used.
Azure ML offers a comprehensive suite of tools for developers and data scientists, including:
- **Azure Machine Learning Studio**: A web-based portal for low-code and no-code options for model training, deployment, automation, tracking, and asset management. It integrates seamlessly with the Azure Machine Learning SDK.
- **Jupyter Notebooks**: For quickly prototyping and testing machine learning models.
- **Azure Machine Learning Designer**: A drag-and-drop interface for building experiments and deploying pipelines in a low-code environment.
- **Automated machine learning UI (AutoML)**: Automates repetitive tasks in model development, enabling the creation of scalable, efficient, and high-quality machine learning models.
- **Data Labeling**: A tool that assists in automatically labeling data.
- **Machine learning extension for Visual Studio Code**: A full-featured development environment for managing machine learning projects.
- **Machine learning CLI**: Command-line tools for managing Azure ML resources.
- **Integration with open-source frameworks**: Compatibility with PyTorch, TensorFlow, Scikit-learn, and other frameworks for end-to-end machine learning processes.
- **MLflow**: An open-source library for managing the lifecycle of machine learning experiments. **MLflow Tracking** logs and tracks metrics and artifacts from training runs, regardless of the experiment's environment.
### 1.2 The Heart Failure Prediction Project:
Building projects is one of the best ways to test your skills and knowledge. In this lesson, we will explore two approaches to creating a data science project for predicting heart failure in Azure ML Studio: the Low code/No code method and the Azure ML SDK method, as illustrated in the following diagram:
![project-schema](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/project-schema.PNG)
Each approach has its advantages and disadvantages. The Low code/No code method is easier to get started with, as it involves interacting with a graphical user interface (GUI) and requires no prior coding knowledge. This method is ideal for quickly testing a project's feasibility and creating a proof of concept (POC). However, as the project scales and needs to be production-ready, relying solely on the GUI becomes impractical. Automating tasks programmatically, from resource creation to model deployment, becomes essential. This is where the Azure ML SDK comes into play.
| | Low code/No code | Azure ML SDK |
|-------------------|------------------|---------------------------|
| Expertise in code | Not required | Required |
| Time to develop | Fast and easy | Depends on code expertise |
| Production ready | No | Yes |
### 1.3 The Heart Failure Dataset:
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for 31% of all global deaths. Factors such as tobacco use, unhealthy diets, obesity, physical inactivity, and excessive alcohol consumption can serve as features for predictive models. Estimating the likelihood of developing CVDs can be invaluable in preventing heart attacks in high-risk individuals.
Kaggle provides a publicly available [Heart Failure dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data), which we will use for this project. You can download the dataset now. It is a tabular dataset with 13 columns (12 features and 1 target variable) and 299 rows.
| | Variable name | Type | Description | Example |
|----|---------------------------|-----------------|-----------------------------------------------------------|-------------------|
| 1 | age | numerical | Age of the patient | 25 |
| 2 | anaemia | boolean | Decrease in red blood cells or hemoglobin | 0 or 1 |
| 3 | creatinine_phosphokinase | numerical | Level of CPK enzyme in the blood | 542 |
| 4 | diabetes | boolean | Whether the patient has diabetes | 0 or 1 |
| 5 | ejection_fraction | numerical | Percentage of blood leaving the heart per contraction | 45 |
| 6 | high_blood_pressure | boolean | Whether the patient has hypertension | 0 or 1 |
| 7 | platelets | numerical | Platelet count in the blood | 149000 |
| 8 | serum_creatinine | numerical | Level of serum creatinine in the blood | 0.5 |
| 9 | serum_sodium | numerical | Level of serum sodium in the blood | 137 |
| 10 | sex | boolean | Gender (0 for female, 1 for male) | 0 or 1 |
| 11 | smoking | boolean | Whether the patient smokes | 0 or 1 |
| 12 | time | numerical | Follow-up period (in days) | 4 |
| 13 | DEATH_EVENT [Target] | boolean | Whether the patient died during the follow-up period | 0 or 1 |
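If you would like to sanity-check the file locally before uploading it, here is one possible sketch using pandas; the file name is an assumption based on the Kaggle download, so adjust it if yours differs.

```python
import pandas as pd

# File name is an assumption based on the Kaggle download; adjust it if yours differs
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Expect 299 rows and 13 columns (12 features plus the DEATH_EVENT target)
print(df.shape)
print(df.dtypes)
print(df["DEATH_EVENT"].value_counts())
```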
Once you have the dataset, we can begin the project in Azure.
## 2. Low code/No code training of a model in Azure ML Studio
### 2.1 Create an Azure ML workspace
To train a model in Azure ML, you first need to create an Azure ML workspace. The workspace is the central resource for Azure Machine Learning, where you can manage all the artifacts created during your machine learning workflows. It keeps a record of all training runs, including logs, metrics, outputs, and snapshots of your scripts. This information helps you identify which training run produced the best model. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-workspace?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109)
It is recommended to use the latest version of a browser compatible with your operating system. Supported browsers include:
- Microsoft Edge (latest version, not the legacy version)
- Safari (latest version, Mac only)
- Chrome (latest version)
- Firefox (latest version)
To use Azure Machine Learning, create a workspace in your Azure subscription. This workspace will allow you to manage data, compute resources, code, models, and other artifacts related to your machine learning projects.
> **_NOTE:_** Your Azure subscription will incur a small charge for data storage as long as the Azure Machine Learning workspace exists. It is recommended to delete the workspace when it is no longer needed.
1. Sign in to the [Azure portal](https://ms.portal.azure.com/) using the Microsoft credentials associated with your Azure subscription.
2. Select **Create a resource**.
![workspace-1](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-1.PNG)
Search for Machine Learning and select the Machine Learning tile.
![workspace-2](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-2.PNG)
Click the create button.
![workspace-3](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-3.PNG)
Fill in the settings as follows:
- Subscription: Your Azure subscription
- Resource group: Create or select a resource group
- Workspace name: Enter a unique name for your workspace
- Region: Select the geographical region closest to you
- Storage account: Note the default new storage account that will be created for your workspace
- Key vault: Note the default new key vault that will be created for your workspace
- Application insights: Note the default new application insights resource that will be created for your workspace
- Container registry: None (one will be created automatically the first time you deploy a model to a container)
![workspace-4](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-4.PNG)
- Click the Review + create button, then click the Create button.
3. Wait for your workspace to be created (this may take a few minutes). Once created, navigate to it in the portal. You can find it under the Machine Learning Azure service.
4. On the Overview page for your workspace, launch Azure Machine Learning Studio (or open a new browser tab and go to https://ml.azure.com). Sign in to Azure Machine Learning Studio using your Microsoft account. If prompted, select your Azure directory, subscription, and Azure Machine Learning workspace.
![workspace-5](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-5.PNG)
5. In Azure Machine Learning Studio, click the ☰ icon at the top left to explore the various pages in the interface. These pages allow you to manage the resources in your workspace.
![workspace-6](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/workspace-6.PNG)
You can manage your workspace through the Azure portal, but Azure Machine Learning Studio provides a more user-friendly interface tailored for data scientists and machine learning engineers.
### 2.2 Compute Resources
Compute Resources are cloud-based resources used for running model training and data exploration processes. There are four types of compute resources you can create:
- **Compute Instances**: Development workstations for data scientists to work with data and models. This involves creating a Virtual Machine (VM) and launching a notebook instance. Models can then be trained by calling a compute cluster from the notebook.
- **Compute Clusters**: Scalable clusters of VMs for on-demand processing of experiment code. These are essential for training models and can include specialized GPU or CPU resources.
- **Inference Clusters**: Deployment targets for predictive services using trained models.
- **Attached Compute**: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.
#### 2.2.1 Choosing the right options for your compute resources
When creating a compute resource, there are several important factors to consider, as these choices can significantly impact your project.
**Do you need CPU or GPU?**
A CPU (Central Processing Unit) is the main electronic circuitry that executes instructions in a computer program. A GPU (Graphics Processing Unit) is a specialized electronic circuit designed to process graphics-related tasks at high speeds.
The key difference between CPU and GPU architecture is that CPUs are optimized for handling a wide range of tasks quickly (measured by clock speed) but are limited in the number of tasks they can run concurrently. GPUs, on the other hand, excel at parallel computing, making them ideal for deep learning tasks.
| CPU | GPU |
|-----------------------------------------|-----------------------------|
| Less expensive | More expensive |
| Lower level of concurrency | Higher level of concurrency |
| Slower in training deep learning models | Optimal for deep learning |
**Cluster Size**
Larger clusters are more expensive but offer better responsiveness. If you have more time but a limited budget, start with a small cluster. If you have a larger budget but limited time, opt for a larger cluster.
**VM Size**
You can adjust the size of your RAM, disk, number of cores, and clock speed based on your time and budget constraints. Increasing these parameters will improve performance but also increase costs.
**Dedicated or Low-Priority Instances?**
Low-priority instances are interruptible, meaning Microsoft Azure can reassign these resources to other tasks, potentially interrupting your job. Dedicated instances, which are non-interruptible, ensure your job won't be terminated without your permission. This decision also comes down to time versus cost, as interruptible instances are cheaper than dedicated ones.
#### 2.2.2 Creating a compute cluster
In the [Azure ML workspace](https://ml.azure.com/) created earlier, navigate to the "Compute" section to view the various compute resources discussed (e.g., compute instances, compute clusters, inference clusters, and attached compute). For this project, you'll need a compute cluster for model training. In the Studio, click on the "Compute" menu, then the "Compute cluster" tab, and click the "+ New" button to create a compute cluster.
![22](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/cluster-1.PNG)
1. Select your options: Dedicated vs Low priority, CPU or GPU, VM size, and core number (default settings can be used for this project).
2. Click the "Next" button.
![23](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/cluster-2.PNG)
3. Assign a name to the cluster.
4. Configure options such as the minimum/maximum number of nodes, idle seconds before scale-down, and SSH access. Note that setting the minimum number of nodes to 0 saves money when the cluster is idle. A higher maximum number of nodes shortens training time, with 3 nodes being the recommended maximum.
5. Click the "Create" button. This process may take a few minutes.
![29](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/cluster-3.PNG)
Great! Now that the compute cluster is ready, the next step is to load the data into Azure ML Studio.
### 2.3 Loading the Dataset
1. In the [Azure ML workspace](https://ml.azure.com/) created earlier, click on "Datasets" in the left menu and then "+ Create dataset" to create a new dataset. Select the "From local files" option and upload the Kaggle dataset downloaded earlier.
![24](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/dataset-1.PNG)
2. Provide a name, type, and description for your dataset. Click "Next." Upload the data files and click "Next."
![25](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/dataset-2.PNG)
3. In the Schema section, change the data type to Boolean for the following features: anaemia, diabetes, high blood pressure, sex, smoking, and DEATH_EVENT. Click "Next" and then "Create."
![26](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/dataset-3.PNG)
Fantastic! With the dataset uploaded and the compute cluster created, you're ready to start training the model.
### 2.4 Low code/No Code training with AutoML
Developing traditional machine learning models is resource-intensive, requiring significant domain expertise and time to compare multiple models. Automated machine learning (AutoML) simplifies this process by automating repetitive tasks in model development. AutoML enables data scientists, analysts, and developers to efficiently build high-quality ML models, reducing the time needed to create production-ready models. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
1. In the [Azure ML workspace](https://ml.azure.com/) created earlier, click on "Automated ML" in the left menu and select the dataset you uploaded. Click "Next."
![27](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/aml-1.PNG)
2. Enter a new experiment name, specify the target column (DEATH_EVENT), and select the compute cluster created earlier. Click "Next."
![28](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/aml-2.PNG)
3. Choose "Classification" and click "Finish." This process may take 30 minutes to 1 hour, depending on your compute cluster size.
![30](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/aml-3.PNG)
4. Once the run is complete, go to the "Automated ML" tab, select your run, and click on the algorithm listed in the "Best model summary" card.
![31](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/aml-4.PNG)
Here, you'll find detailed information about the best model generated by AutoML. You can also explore other models in the "Models" tab. Spend some time reviewing the explanations (preview button) for the models. Once you've selected the model to use (in this case, the best model chosen by AutoML), you'll proceed to deploy it.
## 3. Low code/No Code model deployment and endpoint consumption
### 3.1 Model deployment
The AutoML interface allows you to deploy the best model as a web service in just a few steps. Deployment integrates the model so it can make predictions based on new data, enabling applications to identify opportunities. For this project, deploying the model as a web service allows medical applications to make live predictions about patients' heart attack risks.
In the best model description, click the "Deploy" button.
![deploy-1](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/deploy-1.PNG)
1. Provide a name, description, compute type (Azure Container Instance), enable authentication, and click "Deploy." This process may take about 20 minutes. Deployment involves registering the model, generating resources, and configuring them for the web service. A status message will appear under "Deploy status." Refresh periodically to check the status. The deployment is complete and running when the status is "Healthy."
![deploy-2](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/deploy-2.PNG)
2. Once deployed, go to the "Endpoint" tab and select the endpoint you just deployed. Here, you'll find all the details about the endpoint.
![deploy-3](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/deploy-3.PNG)
Excellent! With the model deployed, you can now consume the endpoint.
### 3.2 Endpoint consumption
Click on the "Consume" tab. Here, you'll find the REST endpoint and a Python script for consumption. Take some time to review the Python code.
This script can be run directly from your local machine to consume the endpoint.
![35](../../../../5-Data-Science-In-Cloud/18-Low-Code/images/consumption-1.PNG)
Pay attention to these two lines of code:
```python
url = 'http://98e3715f-xxxx-xxxx-xxxx-9ec22d57b796.centralus.azurecontainer.io/score'
api_key = '' # Replace this with the API key for the web service
```
The `url` variable contains the REST endpoint from the "Consume" tab, and the `api_key` variable contains the primary key (if authentication is enabled). These elements allow the script to consume the endpoint.
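If you prefer not to copy the entire generated script, the request itself boils down to an authenticated JSON POST. Below is a minimal sketch using only the Python standard library; the placeholder URL, key, and payload are illustrative, and the script generated on the "Consume" tab remains the authoritative version.
```python
import json
import urllib.request

# Illustrative placeholders: copy the real values from the "Consume" tab
url = 'http://<your-endpoint>.azurecontainer.io/score'
api_key = '<your-primary-key>'  # only needed if authentication is enabled

# One patient record, using the same feature names as the training dataset
data = {"data": [{"age": "60", "anaemia": "false", "creatinine_phosphokinase": "500",
                  "diabetes": "false", "ejection_fraction": "38", "high_blood_pressure": "false",
                  "platelets": "260000", "serum_creatinine": "1.40", "serum_sodium": "137",
                  "sex": "false", "smoking": "false", "time": "130"}]}

body = str.encode(json.dumps(data))
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + api_key}

request = urllib.request.Request(url, body, headers)
response = urllib.request.urlopen(request)
print(response.read())
```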
Running the script should produce the following output:
```python
b'"{\\"result\\": [true]}"'
```
This indicates that the prediction for heart failure based on the provided data is true. This makes sense because the default data in the script has all values set to 0 or false. You can modify the data using the following sample input:
```python
data = {
    "data":
    [
        {
            'age': "0",
            'anaemia': "false",
            'creatinine_phosphokinase': "0",
            'diabetes': "false",
            'ejection_fraction': "0",
            'high_blood_pressure': "false",
            'platelets': "0",
            'serum_creatinine': "0",
            'serum_sodium': "0",
            'sex': "false",
            'smoking': "false",
            'time': "0",
        },
        {
            'age': "60",
            'anaemia': "false",
            'creatinine_phosphokinase': "500",
            'diabetes': "false",
            'ejection_fraction': "38",
            'high_blood_pressure': "false",
            'platelets': "260000",
            'serum_creatinine': "1.40",
            'serum_sodium': "137",
            'sex': "false",
            'smoking': "false",
            'time': "130",
        },
    ],
}
```
The script should return:
```python
b'"{\\"result\\": [true, false]}"'
```
Congratulations! You've successfully trained, deployed, and consumed a model on Azure ML!
> **_NOTE:_** Once you've completed the project, remember to delete all resources.
## 🚀 Challenge
Examine the model explanations and details generated by AutoML for the top models. Try to understand why the best model outperformed the others. What algorithms were compared? What distinguishes them? Why does the best model perform better in this scenario?
## [Post-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/35)
## Review & Self Study
In this lesson, you learned how to train, deploy, and consume a model to predict heart failure risk using a Low code/No code approach in the cloud. If you haven't already, explore the model explanations generated by AutoML for the top models and understand why the best model is superior.
For further exploration of Low code/No code AutoML, refer to this [documentation](https://docs.microsoft.com/azure/machine-learning/tutorial-first-experiment-automated-ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
## Assignment
[Low code/No code Data Science project on Azure ML](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "8fdc4a5fd9bc27a8d2ebef995dfbf73f",
"translation_date": "2025-08-31T10:55:35+00:00",
"source_file": "5-Data-Science-In-Cloud/18-Low-Code/assignment.md",
"language_code": "en"
}
-->
# Low code/No code Data Science project on Azure ML
## Instructions
We explored how to use the Azure ML platform to train, deploy, and consume a model in a Low code/No code manner. Now, find some data that you can use to train another model, deploy it, and consume it. You can search for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
## Rubric
| Exemplary | Adequate | Needs Improvement |
|-----------|----------|-------------------|
|When uploading the data, you changed feature types where necessary and cleaned the data if needed. You trained a model on the dataset using AutoML and reviewed the model explanations. You deployed the best model and successfully consumed it. | When uploading the data, you changed feature types where necessary. You trained a model on the dataset using AutoML, deployed the best model, and successfully consumed it. | You deployed the best model trained by AutoML and successfully consumed it. |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,312 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "73dead89dc2ddda4d6ec0232814a191e",
"translation_date": "2025-08-31T10:56:26+00:00",
"source_file": "5-Data-Science-In-Cloud/19-Azure/README.md",
"language_code": "en"
}
-->
# Data Science in the Cloud: The "Azure ML SDK" way
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/19-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Azure ML SDK - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Table of contents:
- [Data Science in the Cloud: The "Azure ML SDK" way](../../../../5-Data-Science-In-Cloud/19-Azure)
- [Pre-Lecture Quiz](../../../../5-Data-Science-In-Cloud/19-Azure)
- [1. Introduction](../../../../5-Data-Science-In-Cloud/19-Azure)
- [1.1 What is Azure ML SDK?](../../../../5-Data-Science-In-Cloud/19-Azure)
- [1.2 Heart failure prediction project and dataset introduction](../../../../5-Data-Science-In-Cloud/19-Azure)
- [2. Training a model with the Azure ML SDK](../../../../5-Data-Science-In-Cloud/19-Azure)
- [2.1 Create an Azure ML workspace](../../../../5-Data-Science-In-Cloud/19-Azure)
- [2.2 Create a compute instance](../../../../5-Data-Science-In-Cloud/19-Azure)
- [2.3 Loading the Dataset](../../../../5-Data-Science-In-Cloud/19-Azure)
- [2.4 Creating Notebooks](../../../../5-Data-Science-In-Cloud/19-Azure)
- [2.5 Training a model](../../../../5-Data-Science-In-Cloud/19-Azure)
- [2.5.1 Setup Workspace, experiment, compute cluster and dataset](../../../../5-Data-Science-In-Cloud/19-Azure)
- [2.5.2 AutoML Configuration and training](../../../../5-Data-Science-In-Cloud/19-Azure)
- [3. Model deployment and endpoint consumption with the Azure ML SDK](../../../../5-Data-Science-In-Cloud/19-Azure)
- [3.1 Saving the best model](../../../../5-Data-Science-In-Cloud/19-Azure)
- [3.2 Model Deployment](../../../../5-Data-Science-In-Cloud/19-Azure)
- [3.3 Endpoint consumption](../../../../5-Data-Science-In-Cloud/19-Azure)
- [🚀 Challenge](../../../../5-Data-Science-In-Cloud/19-Azure)
- [Post-lecture quiz](../../../../5-Data-Science-In-Cloud/19-Azure)
- [Review & Self Study](../../../../5-Data-Science-In-Cloud/19-Azure)
- [Assignment](../../../../5-Data-Science-In-Cloud/19-Azure)
## [Pre-Lecture Quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/36)
## 1. Introduction
### 1.1 What is Azure ML SDK?
Data scientists and AI developers use the Azure Machine Learning SDK to create and manage machine learning workflows with the Azure Machine Learning service. You can interact with the service in any Python environment, such as Jupyter Notebooks, Visual Studio Code, or your preferred Python IDE.
Key features of the SDK include:
- Explore, prepare, and manage the lifecycle of datasets used in machine learning experiments.
- Manage cloud resources for monitoring, logging, and organizing machine learning experiments.
- Train models locally or using cloud resources, including GPU-accelerated training.
- Use automated machine learning, which takes configuration parameters and training data, and automatically tests algorithms and hyperparameter settings to find the best model for predictions.
- Deploy web services to turn trained models into RESTful services that can be used in any application.
[Learn more about the Azure Machine Learning SDK](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109)
In the [previous lesson](../18-Low-Code/README.md), we learned how to train, deploy, and consume a model using a Low code/No code approach. We used the Heart Failure dataset to create a heart failure prediction model. In this lesson, we will achieve the same goal but using the Azure Machine Learning SDK.
![project-schema](../../../../5-Data-Science-In-Cloud/19-Azure/images/project-schema.PNG)
### 1.2 Heart failure prediction project and dataset introduction
Refer to [this section](../18-Low-Code/README.md) for an introduction to the Heart Failure prediction project and dataset.
## 2. Training a model with the Azure ML SDK
### 2.1 Create an Azure ML workspace
For simplicity, we will work in a Jupyter notebook. This assumes you already have a Workspace and a compute instance. If you already have a Workspace, you can skip ahead to section 2.2 Create a compute instance.
If not, follow the instructions in the section **2.1 Create an Azure ML workspace** in the [previous lesson](../18-Low-Code/README.md) to create a workspace.
### 2.2 Create a compute instance
In the [Azure ML workspace](https://ml.azure.com/) we created earlier, go to the Compute menu to see the available compute resources.
![compute-instance-1](../../../../5-Data-Science-In-Cloud/19-Azure/images/compute-instance-1.PNG)
Let's create a compute instance to host a Jupyter notebook.
1. Click the + New button.
2. Name your compute instance.
3. Choose your options: CPU or GPU, VM size, and core count.
4. Click the Create button.
Congratulations! You've created a compute instance. We'll use this instance to create a Notebook in the [Creating Notebooks section](../../../../5-Data-Science-In-Cloud/19-Azure).
### 2.3 Loading the Dataset
If you haven't uploaded the dataset yet, refer to the [previous lesson](../18-Low-Code/README.md) in the section **2.3 Loading the Dataset**.
### 2.4 Creating Notebooks
> **_NOTE:_** For the next step, you can either create a new notebook from scratch or upload the [notebook we created](../../../../5-Data-Science-In-Cloud/19-Azure/notebook.ipynb) into your Azure ML Studio. To upload it, click on the "Notebook" menu and upload the file.
Notebooks are a crucial part of the data science process. They can be used for Exploratory Data Analysis (EDA), training models on a compute cluster, or deploying endpoints on an inference cluster.
To create a Notebook, we need a compute node running the Jupyter notebook instance. Go back to the [Azure ML workspace](https://ml.azure.com/) and click on Compute instances. In the list of compute instances, you should see the [compute instance we created earlier](../../../../5-Data-Science-In-Cloud/19-Azure).
1. In the Applications section, click on the Jupyter option.
2. Tick the "Yes, I understand" box and click Continue.
![notebook-1](../../../../5-Data-Science-In-Cloud/19-Azure/images/notebook-1.PNG)
3. This will open a new browser tab with your Jupyter notebook instance. Click the "New" button to create a notebook.
![notebook-2](../../../../5-Data-Science-In-Cloud/19-Azure/images/notebook-2.PNG)
Now that we have a Notebook, we can start training the model with the Azure ML SDK.
### 2.5 Training a model
If you have any doubts, refer to the [Azure ML SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109). It contains all the necessary information about the modules we'll use in this lesson.
#### 2.5.1 Setup Workspace, experiment, compute cluster, and dataset
Load the `workspace` from the configuration file using the following code:
```python
from azureml.core import Workspace
ws = Workspace.from_config()
```
This returns a `Workspace` object representing the workspace. Next, create an `experiment` using the following code:
```python
from azureml.core import Experiment
experiment_name = 'aml-experiment'
experiment = Experiment(ws, experiment_name)
```
To get or create an experiment in a workspace, request the experiment by name. Experiment names must be 3-36 characters long, start with a letter or a number, and contain only letters, numbers, underscores, and dashes. If the experiment doesn't exist in the workspace, a new one is created.
Now, create a compute cluster for training using the following code. Note that this step may take a few minutes.
```python
from azureml.core.compute import AmlCompute

aml_name = "heart-f-cluster"
try:
    # Reuse the cluster if it already exists in the workspace
    aml_compute = AmlCompute(ws, aml_name)
    print('Found existing AML compute context.')
except:
    # Otherwise provision a new cluster (1 to 3 Standard_D2_v2 nodes)
    print('Creating new AML compute context.')
    aml_config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_v2", min_nodes=1, max_nodes=3)
    aml_compute = AmlCompute.create(ws, name=aml_name, provisioning_configuration=aml_config)
    aml_compute.wait_for_completion(show_output=True)

cts = ws.compute_targets
compute_target = cts[aml_name]
```
Retrieve the dataset from the workspace using its name:
```python
dataset = ws.datasets['heart-failure-records']
df = dataset.to_pandas_dataframe()
df.describe()
```
#### 2.5.2 AutoML Configuration and training
Set the AutoML configuration using the [AutoMLConfig class](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig(class)?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
As described in the documentation, there are many parameters you can configure. For this project, we'll use the following:
- `experiment_timeout_minutes`: Maximum time (in minutes) the experiment can run before stopping automatically.
- `max_concurrent_iterations`: Maximum number of concurrent training iterations allowed.
- `primary_metric`: The primary metric used to evaluate the experiment.
- `compute_target`: The Azure Machine Learning compute target for the experiment.
- `task`: The type of task to perform ('classification', 'regression', or 'forecasting').
- `training_data`: The training data, including features and a label column (optionally a sample weights column).
- `label_column_name`: The name of the label column.
- `path`: Full path to the Azure Machine Learning project folder.
- `enable_early_stopping`: Whether to stop early if the score doesn't improve in the short term.
- `featurization`: Whether to perform automatic featurization or use custom settings.
- `debug_log`: Log file for debug information.
```python
from azureml.train.automl import AutoMLConfig

project_folder = './aml-project'

automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 3,
    "primary_metric": 'AUC_weighted'
}

automl_config = AutoMLConfig(compute_target=compute_target,
                             task="classification",
                             training_data=dataset,
                             label_column_name="DEATH_EVENT",
                             path=project_folder,
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log="automl_errors.log",
                             **automl_settings)
```
Once configured, train the model using the following code. This step may take up to an hour, depending on your cluster size.
```python
remote_run = experiment.submit(automl_config)
```
Run the RunDetails widget to display the different experiments.
```python
from azureml.widgets import RunDetails
RunDetails(remote_run).show()
```
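If you are running the notebook unattended, you can also block until the run finishes instead of (or in addition to) using the widget. This is a small optional step using the run object's standard `wait_for_completion` method:
```python
# Block until the AutoML run finishes, streaming progress to the notebook output
remote_run.wait_for_completion(show_output=True)
```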
## 3. Model deployment and endpoint consumption with the Azure ML SDK
### 3.1 Saving the best model
The `remote_run` is an object of type [AutoMLRun](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109). This object has a `get_output()` method that returns the best run and the corresponding fitted model.
```python
best_run, fitted_model = remote_run.get_output()
```
You can view the parameters of the best model by printing `fitted_model`, and check its properties using the [get_properties()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#azureml_core_Run_get_properties?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method.
```python
best_run.get_properties()
```
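To inspect the winning pipeline itself (an optional quick check), you can simply print the fitted model returned by `get_output()`:
```python
# Shows the preprocessing steps and the algorithm AutoML selected
print(fitted_model)
```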
Register the model using the [register_model](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?view=azure-ml-py#register-model-model-name-none--description-none--tags-none--iteration-none--metric-none-?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method.
```python
model_name = best_run.properties['model_name']
script_file_name = 'inference/score.py'

# Download the scoring script generated by AutoML for use at deployment time
best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'inference/score.py')

description = "aml heart failure project sdk"

model = best_run.register_model(model_name=model_name,
                                model_path='./outputs/',
                                description=description,
                                tags=None)
```
### 3.2 Model Deployment
Once the best model is saved, deploy it using the [InferenceConfig](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model.inferenceconfig?view=azure-ml-py?ocid=AID3041109) class. InferenceConfig defines the settings for a custom environment used for deployment. The [AciWebservice](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.aciwebservice?view=azure-ml-py) class represents a machine learning model deployed as a web service endpoint on Azure Container Instances. A deployed service is a load-balanced HTTP endpoint with a REST API. You can send data to this API and receive predictions from the model.
Deploy the model using the [deploy](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py#deploy-workspace--name--models--inference-config-none--deployment-config-none--deployment-target-none--overwrite-false--show-output-false-?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method.
```python
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

inference_config = InferenceConfig(entry_script=script_file_name, environment=best_run.get_environment())

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1,
                                               memory_gb=1,
                                               tags={'type': "automl-heart-failure-prediction"},
                                               description='Sample service for AutoML Heart Failure Prediction')

aci_service_name = 'automl-hf-sdk'
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)
```
This step may take a few minutes.
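As an optional check once the deployment is done, you can print the scoring URL of the service directly from the `aci_service` object:
```python
# REST endpoint of the deployed web service
print(aci_service.scoring_uri)
```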
### 3.3 Endpoint consumption
Consume your endpoint by creating a sample input:
```python
import json

data = {
    "data":
    [
        {
            'age': "60",
            'anaemia': "false",
            'creatinine_phosphokinase': "500",
            'diabetes': "false",
            'ejection_fraction': "38",
            'high_blood_pressure': "false",
            'platelets': "260000",
            'serum_creatinine': "1.40",
            'serum_sodium': "137",
            'sex': "false",
            'smoking': "false",
            'time': "130",
        },
    ],
}

test_sample = str.encode(json.dumps(data))
```
Then send this input to your model for prediction:
```python
response = aci_service.run(input_data=test_sample)
response
```
This should output `'{"result": [false]}'`. This means that the patient data we sent to the endpoint produced the prediction `false`, indicating that this person is unlikely to experience heart failure.
Congratulations! You have successfully used the model deployed and trained on Azure ML with the Azure ML SDK!
> **_NOTE:_** Once you finish the project, remember to delete all the resources.
## 🚀 Challenge
There are many other things you can do with the SDK, but unfortunately, we can't cover them all in this lesson. The good news is that learning how to navigate the SDK documentation can help you explore further on your own. Check out the Azure ML SDK documentation and look for the `Pipeline` class, which allows you to create pipelines. A pipeline is a sequence of steps that can be executed as a workflow.
**HINT:** Visit the [SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) and use keywords like "Pipeline" in the search bar. You should find the `azureml.pipeline.core.Pipeline` class in the search results.
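To give you a head start, here is a very rough sketch of what a minimal pipeline could look like. The `train.py` script and the step configuration are purely illustrative assumptions, so treat this as a starting point for your own exploration rather than a finished recipe:
```python
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# A single, hypothetical step that runs a training script on the compute cluster
# created earlier in this lesson
train_step = PythonScriptStep(
    name="train",
    script_name="train.py",        # illustrative script name
    source_directory=".",
    compute_target="heart-f-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[train_step])
run = Experiment(ws, "pipeline-experiment").submit(pipeline)
run.wait_for_completion(show_output=True)
```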
## [Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/37)
## Review & Self Study
In this lesson, you learned how to train, deploy, and consume a model to predict heart failure risk using the Azure ML SDK in the cloud. Refer to this [documentation](https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) for more details about the Azure ML SDK. Try creating your own model using the Azure ML SDK.
## Assignment
[Data Science project using Azure ML SDK](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "386efdbc19786951341f6956247ee990",
"translation_date": "2025-08-31T10:57:04+00:00",
"source_file": "5-Data-Science-In-Cloud/19-Azure/assignment.md",
"language_code": "en"
}
-->
# Data Science project using Azure ML SDK
## Instructions
We explored how to use the Azure ML platform to train, deploy, and consume a model with the Azure ML SDK. Now, find some data that you can use to train another model, deploy it, and consume it. You can search for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
## Rubric
| Exemplary | Adequate | Needs Improvement |
|-----------|----------|-------------------|
|When configuring AutoML, you referred to the SDK documentation to explore the parameters you could use. You trained a dataset using AutoML with the Azure ML SDK, reviewed the model explanations, deployed the best model, and successfully consumed it using the Azure ML SDK. | You trained a dataset using AutoML with the Azure ML SDK, reviewed the model explanations, deployed the best model, and successfully consumed it using the Azure ML SDK. | You trained a dataset using AutoML with the Azure ML SDK, deployed the best model, and successfully consumed it using the Azure ML SDK. |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,35 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "8dfe141a0f46f7d253e07f74913c7f44",
"translation_date": "2025-08-31T10:54:38+00:00",
"source_file": "5-Data-Science-In-Cloud/README.md",
"language_code": "en"
}
-->
# Data Science in the Cloud
![cloud-picture](../../../5-Data-Science-In-Cloud/images/cloud-picture.jpg)
> Photo by [Jelleke Vanooteghem](https://unsplash.com/@ilumire) from [Unsplash](https://unsplash.com/s/photos/cloud?orientation=landscape)
When working with big data in data science, the cloud can be a game changer. In the next three lessons, we will explore what the cloud is and why it can be incredibly useful. We will also analyze a heart failure dataset and build a model to estimate the likelihood of someone experiencing heart failure. Using the cloud's capabilities, we will train, deploy, and use the model in two different ways: first using only the user interface in a Low code/No code approach, and then using the Azure Machine Learning Software Development Kit (Azure ML SDK).
![project-schema](../../../5-Data-Science-In-Cloud/19-Azure/images/project-schema.PNG)
### Topics
1. [Why use Cloud for Data Science?](17-Introduction/README.md)
2. [Data Science in the Cloud: The "Low code/No code" way](18-Low-Code/README.md)
3. [Data Science in the Cloud: The "Azure ML SDK" way](19-Azure/README.md)
### Credits
These lessons were created with ☁️ and 💕 by [Maud Levy](https://twitter.com/maudstweets) and [Tiffany Souterre](https://twitter.com/TiffanySouterre).
The data for the Heart Failure Prediction project comes from [Larxel](https://www.kaggle.com/andrewmvd) on [Kaggle](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). It is licensed under the [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,155 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "67076ed50f54e7d26ba1ba378d6078f1",
"translation_date": "2025-08-31T11:11:55+00:00",
"source_file": "6-Data-Science-In-Wild/20-Real-World-Examples/README.md",
"language_code": "en"
}
-->
# Data Science in the Real World
| ![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/20-DataScience-RealWorld.png) |
| :--------------------------------------------------------------------------------------------------------------: |
| Data Science In The Real World - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
We're nearing the end of this learning journey!
We began by defining data science and ethics, explored tools and techniques for data analysis and visualization, reviewed the data science lifecycle, and examined how to scale and automate workflows using cloud computing services. Now, you might be wondering: _"How do I apply all these learnings to real-world scenarios?"_
In this lesson, we'll delve into real-world applications of data science across industries and explore specific examples in research, digital humanities, and sustainability. We'll also discuss student project opportunities and wrap up with resources to help you continue your learning journey.
## Pre-Lecture Quiz
[Pre-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/38)
## Data Science + Industry
The democratization of AI has made it easier for developers to design and integrate AI-driven decision-making and data-driven insights into user experiences and development workflows. Here are some examples of how data science is applied in real-world industry scenarios:
* [Google Flu Trends](https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/) used data science to correlate search terms with flu trends. Although the approach had flaws, it highlighted the potential (and challenges) of data-driven healthcare predictions.
* [UPS Routing Predictions](https://www.technologyreview.com/2018/11/21/139000/how-ups-uses-ai-to-outsmart-bad-weather/) - explains how UPS uses data science and machine learning to predict optimal delivery routes, factoring in weather, traffic, deadlines, and more.
* [NYC Taxicab Route Visualization](http://chriswhong.github.io/nyctaxi/) - data obtained through [Freedom Of Information Laws](https://chriswhong.com/open-data/foil_nyc_taxi/) was used to visualize a day in the life of NYC cabs, providing insights into navigation, earnings, and trip durations over a 24-hour period.
* [Uber Data Science Workbench](https://eng.uber.com/dsw/) - leverages data from millions of daily Uber trips (pickup/dropoff locations, trip durations, preferred routes, etc.) to build analytics tools for pricing, safety, fraud detection, and navigation decisions.
* [Sports Analytics](https://towardsdatascience.com/scope-of-analytics-in-sports-world-37ed09c39860) - focuses on _predictive analytics_ (team and player analysis, e.g., [Moneyball](https://datasciencedegree.wisconsin.edu/blog/moneyball-proves-importance-big-data-big-ideas/)) and _data visualization_ (team dashboards, fan engagement, etc.) with applications like talent scouting, sports betting, and venue management.
* [Data Science in Banking](https://data-flair.training/blogs/data-science-in-banking/) - showcases the role of data science in finance, including risk modeling, fraud detection, customer segmentation, real-time predictions, and recommender systems. Predictive analytics also support critical measures like [credit scores](https://dzone.com/articles/using-big-data-and-predictive-analytics-for-credit).
* [Data Science in Healthcare](https://data-flair.training/blogs/data-science-in-healthcare/) - highlights applications such as medical imaging (MRI, X-Ray, CT-Scan), genomics (DNA sequencing), drug development (risk assessment, success prediction), predictive analytics (patient care and logistics), and disease tracking/prevention.
![Data Science Applications in The Real World](../../../../6-Data-Science-In-Wild/20-Real-World-Examples/images/data-science-applications.png) Image Credit: [Data Flair: 6 Amazing Data Science Applications](https://data-flair.training/blogs/data-science-applications/)
The figure illustrates other domains and examples of data science applications. Interested in exploring more? Check out the [Review & Self Study](../../../../6-Data-Science-In-Wild/20-Real-World-Examples) section below.
## Data Science + Research
| ![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/20-DataScience-Research.png) |
| :---------------------------------------------------------------------------------------------------------------: |
| Data Science & Research - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
While industry applications often focus on large-scale use cases, research projects can provide valuable insights in two key areas:
* _Innovation opportunities_ - rapid prototyping of advanced concepts and testing user experiences for next-generation applications.
* _Deployment challenges_ - identifying potential harms or unintended consequences of data science technologies in real-world contexts.
For students, research projects offer learning and collaboration opportunities that deepen understanding and foster connections with experts in areas of interest. What do research projects look like, and how can they make an impact?
Consider the [MIT Gender Shades Study](http://gendershades.org/overview.html) by Joy Buolamwini (MIT Media Labs), co-authored with Timnit Gebru (then at Microsoft Research). This study focused on:
* **What:** Evaluating bias in automated facial analysis algorithms and datasets based on gender and skin type.
* **Why:** Facial analysis is used in critical areas like law enforcement, airport security, and hiring systems, where inaccuracies (e.g., due to bias) can lead to economic and social harm. Addressing bias is essential for fairness.
* **How:** Researchers noted that existing benchmarks predominantly featured lighter-skinned subjects. They curated a new dataset (1000+ images) balanced by gender and skin type, which was used to evaluate the accuracy of three gender classification products (Microsoft, IBM, Face++).
Results revealed that while overall accuracy was good, error rates varied significantly across subgroups, with **misgendering** being higher for females and individuals with darker skin tones, indicating bias.
**Key Outcomes:** The study emphasized the need for more _representative datasets_ (balanced subgroups) and _inclusive teams_ (diverse backgrounds) to identify and address biases early in AI solutions. Such research has influenced organizations to adopt principles and practices for _responsible AI_ to enhance fairness in their AI products and processes.
**Interested in Microsoft research efforts?**
* Explore [Microsoft Research Projects](https://www.microsoft.com/research/research-area/artificial-intelligence/?facet%5Btax%5D%5Bmsr-research-area%5D%5B%5D=13556&facet%5Btax%5D%5Bmsr-content-type%5D%5B%5D=msr-project) on Artificial Intelligence.
* Check out student projects from [Microsoft Research Data Science Summer School](https://www.microsoft.com/en-us/research/academic-program/data-science-summer-school/).
* Learn about the [Fairlearn](https://fairlearn.org/) project and [Responsible AI](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1%3aprimaryr6) initiatives.
## Data Science + Humanities
| ![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/20-DataScience-Humanities.png) |
| :---------------------------------------------------------------------------------------------------------------: |
| Data Science & Digital Humanities - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Digital Humanities [is defined](https://digitalhumanities.stanford.edu/about-dh-stanford) as "a collection of practices and approaches combining computational methods with humanistic inquiry." [Stanford projects](https://digitalhumanities.stanford.edu/projects) like _"rebooting history"_ and _"poetic thinking"_ demonstrate the connection between [Digital Humanities and Data Science](https://digitalhumanities.stanford.edu/digital-humanities-and-data-science), using techniques like network analysis, information visualization, spatial analysis, and text analysis to revisit historical and literary datasets for new insights.
*Want to explore a project in this field?*
Check out ["Emily Dickinson and the Meter of Mood"](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671) by [Jen Looper](https://twitter.com/jenlooper). This project examines how data science can reinterpret familiar poetry and reevaluate its meaning and the author's contributions. For example, _can we predict the season in which a poem was written by analyzing its tone or sentiment?_ What does this reveal about the author's mindset during that time?
To explore this, follow the data science lifecycle:
* [`Data Acquisition`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#acquiring-the-dataset) - collect relevant datasets using APIs (e.g., [Poetry DB API](https://poetrydb.org/index.html)) or web scraping tools (e.g., [Project Gutenberg](https://www.gutenberg.org/files/12242/12242-h/12242-h.htm)).
* [`Data Cleaning`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#clean-the-data) - format and sanitize text using tools like Visual Studio Code and Microsoft Excel.
* [`Data Analysis`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#working-with-the-data-in-a-notebook) - import the dataset into "Notebooks" for analysis using Python packages (e.g., pandas, numpy, matplotlib) to organize and visualize the data.
* [`Sentiment Analysis`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#sentiment-analysis-using-cognitive-services) - integrate cloud services like Text Analytics and use low-code tools like [Power Automate](https://flow.microsoft.com/en-us/) for automated workflows.
This workflow allows you to explore seasonal impacts on poem sentiment and develop your own interpretations of the author. Try it out, then extend the notebook to ask new questions or visualize the data differently!
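If you would like to experiment outside the original notebook, here is a small sketch of the acquisition and analysis steps using the Poetry DB API and pandas. The endpoint path, the response fields, and the naive season-word count are assumptions made purely for illustration:
```python
import requests
import pandas as pd

# Fetch Emily Dickinson's poems (assumes the API returns a list of poems
# with "title" and "lines" fields)
poems = requests.get("https://poetrydb.org/author/Emily%20Dickinson").json()
df = pd.DataFrame(poems)

# Very rough proxy for seasonal tone: count season words per poem
season_words = ["spring", "summer", "autumn", "winter"]
for word in season_words:
    df[word] = df["lines"].apply(lambda lines: sum(word in line.lower() for line in lines))

print(df[["title"] + season_words].head())
```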
> Use tools from the [Digital Humanities toolkit](https://github.com/Digital-Humanities-Toolkit) to pursue similar inquiries.
## Data Science + Sustainability
| ![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/20-DataScience-Sustainability.png) |
| :---------------------------------------------------------------------------------------------------------------: |
| Data Science & Sustainability - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
The [2030 Agenda For Sustainable Development](https://sdgs.un.org/2030agenda), adopted by all United Nations members in 2015, outlines 17 goals, including those aimed at **Protecting the Planet** from degradation and climate change. The [Microsoft Sustainability](https://www.microsoft.com/en-us/sustainability) initiative supports these goals by leveraging technology to build a more sustainable future, focusing on 4 key objectives: being carbon negative, water positive, zero waste, and bio-diverse by 2030.
Addressing these challenges requires large-scale data and cloud-based solutions. The [Planetary Computer](https://planetarycomputer.microsoft.com/) initiative provides four components to assist data scientists and developers:
* [Data Catalog](https://planetarycomputer.microsoft.com/catalog) - offers petabytes of Earth Systems data (free and Azure-hosted).
* [Planetary API](https://planetarycomputer.microsoft.com/docs/reference/stac/) - enables users to search for relevant data across space and time.
* [Hub](https://planetarycomputer.microsoft.com/docs/overview/environment/) - provides a managed environment for processing massive geospatial datasets.
* [Applications](https://planetarycomputer.microsoft.com/applications) - showcases use cases and tools for sustainability insights.
**The Planetary Computer Project is currently in preview (as of Sep 2021)** - here's how you can start contributing to sustainability solutions using data science.
* [Request access](https://planetarycomputer.microsoft.com/account/request) to begin exploring and connect with others.
* [Explore documentation](https://planetarycomputer.microsoft.com/docs/overview/about) to learn about supported datasets and APIs.
* Check out applications like [Ecosystem Monitoring](https://analytics-lab.org/ecosystemmonitoring/) for inspiration on project ideas.
Consider how you can use data visualization to highlight or amplify insights into issues like climate change and deforestation. Or think about how these insights can be leveraged to design new user experiences that encourage behavioral changes for more sustainable living.
## Data Science + Students
We've discussed real-world applications in industry and research, and looked at examples of data science applications in digital humanities and sustainability. So how can you develop your skills and share your knowledge as data science beginners?
Here are some examples of student data science projects to inspire you:
* [MSR Data Science Summer School](https://www.microsoft.com/en-us/research/academic-program/data-science-summer-school/#!projects) with GitHub [projects](https://github.com/msr-ds3) exploring topics such as:
- [Racial Bias in Police Use of Force](https://www.microsoft.com/en-us/research/video/data-science-summer-school-2019-replicating-an-empirical-analysis-of-racial-differences-in-police-use-of-force/) | [Github](https://github.com/msr-ds3/stop-question-frisk)
- [Reliability of NYC Subway System](https://www.microsoft.com/en-us/research/video/data-science-summer-school-2018-exploring-the-reliability-of-the-nyc-subway-system/) | [Github](https://github.com/msr-ds3/nyctransit)
* [Digitizing Material Culture: Exploring socio-economic distributions in Sirkap](https://claremont.maps.arcgis.com/apps/Cascade/index.html?appid=bdf2aef0f45a4674ba41cd373fa23afc) - by [Ornella Altunyan](https://twitter.com/ornelladotcom) and her team at Claremont, using [ArcGIS StoryMaps](https://storymaps.arcgis.com/).
## 🚀 Challenge
Look for articles that suggest beginner-friendly data science projects - like [these 50 topic areas](https://www.upgrad.com/blog/data-science-project-ideas-topics-beginners/), [these 21 project ideas](https://www.intellspot.com/data-science-project-ideas), or [these 16 projects with source code](https://data-flair.training/blogs/data-science-project-ideas/) that you can analyze and remix. And don't forget to blog about your learning experiences and share your insights with the community.
## Post-Lecture Quiz
[Post-lecture quiz](https://purple-hill-04aebfb03.1.azurestaticapps.net/quiz/39)
## Review & Self Study
Want to dive deeper into use cases? Here are some relevant articles:
* [17 Data Science Applications and Examples](https://builtin.com/data-science/data-science-applications-examples) - Jul 2021
* [11 Breathtaking Data Science Applications in Real World](https://myblindbird.com/data-science-applications-real-world/) - May 2021
* [Data Science In The Real World](https://towardsdatascience.com/data-science-in-the-real-world/home) - Article Collection
* Data Science In: [Education](https://data-flair.training/blogs/data-science-in-education/), [Agriculture](https://data-flair.training/blogs/data-science-in-agriculture/), [Finance](https://data-flair.training/blogs/data-science-in-finance/), [Movies](https://data-flair.training/blogs/data-science-at-movies/) & more.
## Assignment
[Explore A Planetary Computer Dataset](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,50 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "d1e05715f9d97de6c4f1fb0c5a4702c0",
"translation_date": "2025-08-31T11:12:29+00:00",
"source_file": "6-Data-Science-In-Wild/20-Real-World-Examples/assignment.md",
"language_code": "en"
}
-->
# Explore a Planetary Computer Dataset
## Instructions
In this lesson, we discussed various domains of data science applications, diving deeper into examples related to research, sustainability, and digital humanities. In this assignment, you'll explore one of these examples in greater detail and apply your knowledge of data visualizations and analysis to extract insights about sustainability data.
The [Planetary Computer](https://planetarycomputer.microsoft.com/) project offers datasets and APIs that can be accessed with an account—request one if you'd like to try the bonus step of the assignment. The site also includes an [Explorer](https://planetarycomputer.microsoft.com/explore) feature that you can use without needing to create an account.
`Steps:`
The Explorer interface (shown in the screenshot below) allows you to select a dataset (from the available options), a preset query (to filter the data), and a rendering option (to generate a relevant visualization). For this assignment, your task is to:
1. Read the [Explorer documentation](https://planetarycomputer.microsoft.com/docs/overview/explorer/) to understand the available options.
2. Explore the dataset [Catalog](https://planetarycomputer.microsoft.com/catalog) to learn the purpose of each dataset.
3. Use the Explorer to choose a dataset of interest, select a relevant query, and pick a rendering option.
![The Planetary Computer Explorer](../../../../6-Data-Science-In-Wild/20-Real-World-Examples/images/planetary-computer-explorer.png)
`Your Task:`
Once you've studied the visualization rendered in the browser, answer the following questions:
* What _features_ does the dataset include?
* What _insights_ or results does the visualization reveal?
* What are the _implications_ of those insights for the sustainability goals of the project?
* What are the _limitations_ of the visualization (i.e., what insights were not provided)?
* If you had access to the raw data, what _alternative visualizations_ would you create, and why?
`Bonus Points:`
Apply for an account and log in once accepted.
* Use the _Launch Hub_ option to open the raw data in a Notebook.
* Explore the data interactively and implement the alternative visualizations you envisioned.
* Analyze your custom visualizations—were you able to uncover the insights you missed earlier?
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | ---
All five core questions were answered. The student clearly identified how current and alternative visualizations could provide insights into sustainability objectives or outcomes. | The student answered at least the top three questions in detail, demonstrating practical experience with the Explorer. | The student failed to answer multiple questions or provided insufficient detail, indicating that no meaningful attempt was made for the project.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "07faf02ff163e609edf0b0308dc5d4e6",
"translation_date": "2025-08-31T11:11:50+00:00",
"source_file": "6-Data-Science-In-Wild/README.md",
"language_code": "en"
}
-->
# Data Science in the Wild
Practical applications of data science across various industries.
### Topics
1. [Data Science in the Real World](20-Real-World-Examples/README.md)
### Credits
Written with ❤️ by [Nitya Narasimhan](https://twitter.com/nitya)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,23 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "c06b12caf3c901eb3156e3dd5b0aea56",
"translation_date": "2025-08-31T10:54:34+00:00",
"source_file": "CODE_OF_CONDUCT.md",
"language_code": "en"
}
-->
# Microsoft Open Source Code of Conduct
This project follows the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- For questions or concerns, contact [opencode@microsoft.com](mailto:opencode@microsoft.com)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,21 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "61aff2b3273d4ab66709493b43f91ca1",
"translation_date": "2025-08-31T10:54:09+00:00",
"source_file": "CONTRIBUTING.md",
"language_code": "en"
}
-->
# Contributing
This project encourages contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA), which confirms that you have the rights to, and are granting us the rights to, use your contribution. For more details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically check whether you need to provide a CLA and will update the PR accordingly (e.g., with a label or comment). Just follow the instructions provided by the bot. You only need to complete this process once for all repositories that use our CLA.
This project follows the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information, refer to the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or reach out to [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or feedback.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,154 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a5443b88ba402d2ec7b000e4de6cecb8",
"translation_date": "2025-08-31T10:53:09+00:00",
"source_file": "README.md",
"language_code": "en"
}
-->
# Data Science for Beginners - A Curriculum
Azure Cloud Advocates at Microsoft are excited to present a 10-week, 20-lesson curriculum focused on Data Science. Each lesson includes pre-lesson and post-lesson quizzes, step-by-step instructions, solutions, and assignments. This project-based approach helps you learn by doing, ensuring the skills you gain are long-lasting.
**A big thank you to our authors:** [Jasmine Greenaway](https://www.twitter.com/paladique), [Dmitry Soshnikov](http://soshnikov.com), [Nitya Narasimhan](https://twitter.com/nitya), [Jalen McGee](https://twitter.com/JalenMcG), [Jen Looper](https://twitter.com/jenlooper), [Maud Levy](https://twitter.com/maudstweets), [Tiffany Souterre](https://twitter.com/TiffanySouterre), [Christopher Harrison](https://www.twitter.com/geektrainer).
**🙏 Special thanks 🙏 to our [Microsoft Student Ambassador](https://studentambassadors.microsoft.com/) authors, reviewers, and contributors,** including Aaryan Arora, [Aditya Garg](https://github.com/AdityaGarg00), [Alondra Sanchez](https://www.linkedin.com/in/alondra-sanchez-molina/), [Ankita Singh](https://www.linkedin.com/in/ankitasingh007), [Anupam Mishra](https://www.linkedin.com/in/anupam--mishra/), [Arpita Das](https://www.linkedin.com/in/arpitadas01/), ChhailBihari Dubey, [Dibri Nsofor](https://www.linkedin.com/in/dibrinsofor), [Dishita Bhasin](https://www.linkedin.com/in/dishita-bhasin-7065281bb), [Majd Safi](https://www.linkedin.com/in/majd-s/), [Max Blum](https://www.linkedin.com/in/max-blum-6036a1186/), [Miguel Correa](https://www.linkedin.com/in/miguelmque/), [Mohamma Iftekher (Iftu) Ebne Jalal](https://twitter.com/iftu119), [Nawrin Tabassum](https://www.linkedin.com/in/nawrin-tabassum), [Raymond Wangsa Putra](https://www.linkedin.com/in/raymond-wp/), [Rohit Yadav](https://www.linkedin.com/in/rty2423), Samridhi Sharma, [Sanya Sinha](https://www.linkedin.com/mwlite/in/sanya-sinha-13aab1200), [Sheena Narula](https://www.linkedin.com/in/sheena-narua-n/), [Tauqeer Ahmad](https://www.linkedin.com/in/tauqeerahmad5201/), Yogendrasingh Pawar, [Vidushi Gupta](https://www.linkedin.com/in/vidushi-gupta07/), [Jasleen Sondhi](https://www.linkedin.com/in/jasleen-sondhi/).
|![Sketchnote by @sketchthedocs https://sketchthedocs.dev](../../sketchnotes/00-Title.png)|
|:---:|
| Data Science For Beginners - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
### 🌐 Multi-Language Support
#### Supported via GitHub Action (Automated & Always Up-to-Date)
[French](../fr/README.md) | [Spanish](../es/README.md) | [German](../de/README.md) | [Russian](../ru/README.md) | [Arabic](../ar/README.md) | [Persian (Farsi)](../fa/README.md) | [Urdu](../ur/README.md) | [Chinese (Simplified)](../zh/README.md) | [Chinese (Traditional, Macau)](../mo/README.md) | [Chinese (Traditional, Hong Kong)](../hk/README.md) | [Chinese (Traditional, Taiwan)](../tw/README.md) | [Japanese](../ja/README.md) | [Korean](../ko/README.md) | [Hindi](../hi/README.md) | [Bengali](../bn/README.md) | [Marathi](../mr/README.md) | [Nepali](../ne/README.md) | [Punjabi (Gurmukhi)](../pa/README.md) | [Portuguese (Portugal)](../pt/README.md) | [Portuguese (Brazil)](../br/README.md) | [Italian](../it/README.md) | [Polish](../pl/README.md) | [Turkish](../tr/README.md) | [Greek](../el/README.md) | [Thai](../th/README.md) | [Swedish](../sv/README.md) | [Danish](../da/README.md) | [Norwegian](../no/README.md) | [Finnish](../fi/README.md) | [Dutch](../nl/README.md) | [Hebrew](../he/README.md) | [Vietnamese](../vi/README.md) | [Indonesian](../id/README.md) | [Malay](../ms/README.md) | [Tagalog (Filipino)](../tl/README.md) | [Swahili](../sw/README.md) | [Hungarian](../hu/README.md) | [Czech](../cs/README.md) | [Slovak](../sk/README.md) | [Romanian](../ro/README.md) | [Bulgarian](../bg/README.md) | [Serbian (Cyrillic)](../sr/README.md) | [Croatian](../hr/README.md) | [Slovenian](../sl/README.md) | [Ukrainian](../uk/README.md) | [Burmese (Myanmar)](../my/README.md)
**If you'd like additional translations, supported languages are listed [here](https://github.com/Azure/co-op-translator/blob/main/getting_started/supported-languages.md)**
#### Join Our Community
[![Azure AI Discord](https://dcbadge.limes.pink/api/server/kzRShWzttr)](https://discord.gg/kzRShWzttr)
# Are you a student?
Start with these resources:
- [Student Hub page](https://docs.microsoft.com/en-gb/learn/student-hub?WT.mc_id=academic-77958-bethanycheum) This page offers beginner resources, student packs, and even opportunities to get a free certification voucher. Bookmark it and check back regularly as content is updated monthly.
- [Microsoft Learn Student Ambassadors](https://studentambassadors.microsoft.com?WT.mc_id=academic-77958-bethanycheum) Join a global community of student ambassadors—this could be your gateway to Microsoft.
# Getting Started
> **Teachers**: We've [included some suggestions](for-teachers.md) on how to use this curriculum. Share your feedback [in our discussion forum](https://github.com/microsoft/Data-Science-For-Beginners/discussions)!
> **[Students](https://aka.ms/student-page)**: To use this curriculum independently, fork the repository and complete the exercises, starting with the pre-lesson quiz. Then, read the lesson and complete the activities. Try to build the projects by understanding the lessons rather than copying the solution code (though solutions are available in the /solutions folders for each project-based lesson). Another option is to form a study group with friends and go through the content together. For further learning, we recommend [Microsoft Learn](https://docs.microsoft.com/en-us/users/jenlooper-2911/collections/qprpajyoy3x0g7?WT.mc_id=academic-77958-bethanycheum).
## Meet the Team
[![Promo video](../../ds-for-beginners.gif)](https://youtu.be/8mzavjQSMM4 "Promo video")
**Gif by** [Mohit Jaisal](https://www.linkedin.com/in/mohitjaisal)
> 🎥 Click the image above to watch a video about the project and the team behind it!
## Pedagogy
This curriculum is built on two key principles: project-based learning and frequent quizzes. By the end of the series, students will understand fundamental concepts of data science, including ethical considerations, data preparation, working with data, data visualization, data analysis, real-world applications, and more.
Additionally, a low-pressure quiz before each class helps students focus on the topic, while a post-class quiz reinforces retention. The curriculum is designed to be flexible and enjoyable, allowing students to complete it in full or in part. Projects start small and gradually become more complex over the 10-week cycle.
Find our [Code of Conduct](CODE_OF_CONDUCT.md), [Contributing](CONTRIBUTING.md), and [Translation](TRANSLATIONS.md) guidelines. We appreciate your constructive feedback!
## Each lesson includes:
- Optional sketchnote
- Optional supplemental video
- Pre-lesson warmup quiz
- Written lesson
- For project-based lessons, step-by-step guides on how to build the project
- Knowledge checks
- A challenge
- Supplemental reading
- Assignment
- [Post-lesson quiz](https://ff-quizzes.netlify.app/en/)
> **A note about quizzes**: All quizzes are located in the `quiz-app` folder, with a total of 40 quizzes, each containing three questions. They are linked within the lessons, but the quiz app can also be run locally or deployed to Azure; follow the instructions in the `quiz-app` folder. Localization is ongoing.
## Lessons
|![Sketchnote by @sketchthedocs - https://sketchthedocs.dev](../../sketchnotes/00-Roadmap.png)|
|:---:|
| Data Science For Beginners: Roadmap - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
| Lesson Number | Topic | Lesson Grouping | Learning Objectives | Linked Lesson | Author |
| :-----------: | :----------------------------------------: | :--------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------: | :----: |
| 01 | Defining Data Science | [Introduction](1-Introduction/README.md) | Learn the basic concepts behind data science and how it relates to artificial intelligence, machine learning, and big data. | [lesson](1-Introduction/01-defining-data-science/README.md) [video](https://youtu.be/beZ7Mb_oz9I) | [Dmitry](http://soshnikov.com) |
| 02 | Data Science Ethics | [Introduction](1-Introduction/README.md) | Data Ethics Concepts, Challenges & Frameworks. | [lesson](1-Introduction/02-ethics/README.md) | [Nitya](https://twitter.com/nitya) |
| 03 | Defining Data | [Introduction](1-Introduction/README.md) | How data is classified and its common sources. | [lesson](1-Introduction/03-defining-data/README.md) | [Jasmine](https://www.twitter.com/paladique) |
| 04 | Introduction to Statistics & Probability | [Introduction](1-Introduction/README.md) | The mathematical techniques of probability and statistics to understand data. | [lesson](1-Introduction/04-stats-and-probability/README.md) [video](https://youtu.be/Z5Zy85g4Yjw) | [Dmitry](http://soshnikov.com) |
| 05 | Working with Relational Data | [Working With Data](2-Working-With-Data/README.md) | Introduction to relational data and the basics of exploring and analyzing relational data with the Structured Query Language, also known as SQL (pronounced “see-quell”). | [lesson](2-Working-With-Data/05-relational-databases/README.md) | [Christopher](https://www.twitter.com/geektrainer) |
| 06 | Working with NoSQL Data | [Working With Data](2-Working-With-Data/README.md) | Introduction to non-relational data, its various types and the basics of exploring and analyzing document databases. | [lesson](2-Working-With-Data/06-non-relational/README.md) | [Jasmine](https://twitter.com/paladique)|
| 07 | Working with Python | [Working With Data](2-Working-With-Data/README.md) | Basics of using Python for data exploration with libraries such as Pandas. Foundational understanding of Python programming is recommended. | [lesson](2-Working-With-Data/07-python/README.md) [video](https://youtu.be/dZjWOGbsN4Y) | [Dmitry](http://soshnikov.com) |
| 08 | Data Preparation | [Working With Data](2-Working-With-Data/README.md) | Topics on data techniques for cleaning and transforming the data to handle challenges of missing, inaccurate, or incomplete data. | [lesson](2-Working-With-Data/08-data-preparation/README.md) | [Jasmine](https://www.twitter.com/paladique) |
| 09 | Visualizing Quantities | [Data Visualization](3-Data-Visualization/README.md) | Learn how to use Matplotlib to visualize bird data 🦆 | [lesson](3-Data-Visualization/09-visualization-quantities/README.md) | [Jen](https://twitter.com/jenlooper) |
| 10 | Visualizing Distributions of Data | [Data Visualization](3-Data-Visualization/README.md) | Visualizing observations and trends within an interval. | [lesson](3-Data-Visualization/10-visualization-distributions/README.md) | [Jen](https://twitter.com/jenlooper) |
| 11 | Visualizing Proportions | [Data Visualization](3-Data-Visualization/README.md) | Visualizing discrete and grouped percentages. | [lesson](3-Data-Visualization/11-visualization-proportions/README.md) | [Jen](https://twitter.com/jenlooper) |
| 12 | Visualizing Relationships | [Data Visualization](3-Data-Visualization/README.md) | Visualizing connections and correlations between sets of data and their variables. | [lesson](3-Data-Visualization/12-visualization-relationships/README.md) | [Jen](https://twitter.com/jenlooper) |
| 13 | Meaningful Visualizations | [Data Visualization](3-Data-Visualization/README.md) | Techniques and guidance for making your visualizations valuable for effective problem solving and insights. | [lesson](3-Data-Visualization/13-meaningful-visualizations/README.md) | [Jen](https://twitter.com/jenlooper) |
| 14 | Introduction to the Data Science lifecycle | [Lifecycle](4-Data-Science-Lifecycle/README.md) | Introduction to the data science lifecycle and its first step of acquiring and extracting data. | [lesson](4-Data-Science-Lifecycle/14-Introduction/README.md) | [Jasmine](https://twitter.com/paladique) |
| 15 | Analyzing | [Lifecycle](4-Data-Science-Lifecycle/README.md) | This phase of the data science lifecycle focuses on techniques to analyze data. | [lesson](4-Data-Science-Lifecycle/15-analyzing/README.md) | [Jasmine](https://twitter.com/paladique) |
| 16 | Communication | [Lifecycle](4-Data-Science-Lifecycle/README.md) | This phase of the data science lifecycle focuses on presenting the insights from the data in a way that makes it easier for decision makers to understand. | [lesson](4-Data-Science-Lifecycle/16-communication/README.md) | [Jalen](https://twitter.com/JalenMcG) |
| 17 | Data Science in the Cloud | [Cloud Data](5-Data-Science-In-Cloud/README.md) | This series of lessons introduces data science in the cloud and its benefits. | [lesson](5-Data-Science-In-Cloud/17-Introduction/README.md) | [Tiffany](https://twitter.com/TiffanySouterre) and [Maud](https://twitter.com/maudstweets) |
| 18 | Data Science in the Cloud | [Cloud Data](5-Data-Science-In-Cloud/README.md) | Training models using Low Code tools. |[lesson](5-Data-Science-In-Cloud/18-Low-Code/README.md) | [Tiffany](https://twitter.com/TiffanySouterre) and [Maud](https://twitter.com/maudstweets) |
| 19 | Data Science in the Cloud | [Cloud Data](5-Data-Science-In-Cloud/README.md) | Deploying models with Azure Machine Learning Studio. | [lesson](5-Data-Science-In-Cloud/19-Azure/README.md)| [Tiffany](https://twitter.com/TiffanySouterre) and [Maud](https://twitter.com/maudstweets) |
| 20 | Data Science in the Wild | [In the Wild](6-Data-Science-In-Wild/README.md) | Data science driven projects in the real world. | [lesson](6-Data-Science-In-Wild/20-Real-World-Examples/README.md) | [Nitya](https://twitter.com/nitya) |
## GitHub Codespaces
Follow these steps to open this sample in a Codespace:
1. Click the Code drop-down menu and select the Open with Codespaces option.
2. Select + New codespace at the bottom of the pane.
For more info, check out the [GitHub documentation](https://docs.github.com/en/codespaces/developing-in-codespaces/creating-a-codespace-for-a-repository#creating-a-codespace).
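If you prefer the terminal, the GitHub CLI can create a codespace as well. This is only a sketch, assuming `gh` is installed and authenticated; check `gh codespace create --help` if the flags differ in your version:
```
# Create a codespace for this repository or your fork (repository shown as an example)
gh codespace create -R microsoft/Data-Science-For-Beginners
# Open the new codespace in VS Code
gh codespace code
```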
## VSCode Remote - Containers
Follow these steps to open this repo in a container using your local machine and VSCode using the VS Code Remote - Containers extension:
1. If this is your first time using a development container, please ensure your system meets the pre-reqs (i.e. have Docker installed) in [the getting started documentation](https://code.visualstudio.com/docs/devcontainers/containers#_getting-started).
To use this repository, you can either open it in an isolated Docker volume:
**Note**: Under the hood, this will use the Remote-Containers: **Clone Repository in Container Volume...** command to clone the source code in a Docker volume instead of the local filesystem. [Volumes](https://docs.docker.com/storage/volumes/) are the preferred mechanism for persisting container data.
Or open a locally cloned or downloaded version of the repository:
- Clone this repository to your local filesystem.
- Press F1 and select the **Remote-Containers: Open Folder in Container...** command.
- Select the cloned copy of this folder, wait for the container to start, and try things out.
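The locally cloned option above roughly corresponds to the following commands, assuming `git` and the VS Code `code` launcher are on your PATH:
```
# Clone the repository to your local filesystem
git clone https://github.com/microsoft/Data-Science-For-Beginners.git
cd Data-Science-For-Beginners
# Open the folder in VS Code, then run "Remote-Containers: Open Folder in Container..." from the Command Palette (F1)
code .
```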
## Offline access
You can run this documentation offline by using [Docsify](https://docsify.js.org/#/). Fork this repo, [install Docsify](https://docsify.js.org/#/quickstart) on your local machine, then in the root folder of this repo, type `docsify serve`. The website will be served on port 3000 on your localhost: `localhost:3000`.
> Note: notebooks will not be rendered via Docsify, so when you need to run a notebook, do that separately in VS Code with a Python kernel.
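As a rough sketch, the setup described above amounts to the following commands (assuming Node.js and npm are already installed):
```
# Install the Docsify CLI globally (see the Docsify quickstart linked above)
npm i docsify-cli -g
# Serve the documentation from the root folder of your local copy
docsify serve
# The site is now available at http://localhost:3000
```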
## Other Curricula
Our team produces other curricula! Check out:
- [Generative AI for Beginners](https://aka.ms/genai-beginners)
- [Generative AI for Beginners .NET](https://github.com/microsoft/Generative-AI-for-beginners-dotnet)
- [Generative AI with JavaScript](https://github.com/microsoft/generative-ai-with-javascript)
- [Generative AI with Java](https://aka.ms/genaijava)
- [AI for Beginners](https://aka.ms/ai-beginners)
- [Data Science for Beginners](https://aka.ms/datascience-beginners)
- [ML for Beginners](https://aka.ms/ml-beginners)
- [Cybersecurity for Beginners](https://github.com/microsoft/Security-101)
- [Web Dev for Beginners](https://aka.ms/webdev-beginners)
- [IoT for Beginners](https://aka.ms/iot-beginners)
- [XR Development for Beginners](https://github.com/microsoft/xr-development-for-beginners)
- [Mastering GitHub Copilot for Paired Programming](https://github.com/microsoft/Mastering-GitHub-Copilot-for-Paired-Programming)
- [Mastering GitHub Copilot for C#/.NET Developers](https://github.com/microsoft/mastering-github-copilot-for-dotnet-csharp-developers)
- [Choose Your Own Copilot Adventure](https://github.com/microsoft/CopilotAdventures)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,51 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "0d575483100c332b2dbaefef915bb3c4",
"translation_date": "2025-08-31T10:54:14+00:00",
"source_file": "SECURITY.md",
"language_code": "en"
}
-->
## Security
Microsoft prioritizes the security of its software products and services, including all source code repositories managed through our GitHub organizations, such as [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
If you believe you've identified a security vulnerability in any Microsoft-owned repository that aligns with [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us using the instructions below.
## Reporting Security Issues
**Do not report security vulnerabilities through public GitHub issues.**
Instead, report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
If you'd prefer to submit without logging in, you can email [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message using our PGP key, which can be downloaded from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
You should receive a response within 24 hours. If you don't, please follow up via email to ensure we received your original message. Additional details can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the following information (as much as you can provide) to help us better understand the nature and scope of the issue:
* Type of issue (e.g., buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration needed to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if available)
* Impact of the issue, including how an attacker might exploit it
Providing this information will help us process your report more efficiently.
If you're submitting a report for a bug bounty, more detailed reports may result in a higher bounty award. Visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more information about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft adheres to the principles of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,24 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "872be8bc1b93ef1dd9ac3d6e8f99f6ab",
"translation_date": "2025-08-31T10:54:04+00:00",
"source_file": "SUPPORT.md",
"language_code": "en"
}
-->
# Support
## How to report issues and seek assistance
This project utilizes GitHub Issues to manage bug reports and feature requests. Before submitting a new issue, please check the existing ones to avoid duplicates. If you need to report a new issue, submit your bug or feature request as a new Issue.
For assistance or questions regarding the use of this project, please create an issue.
## Microsoft Support Policy
Support for this repository is restricted to the resources mentioned above.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,40 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "3767555b3cc28a2865c79202f4374204",
"translation_date": "2025-08-31T10:59:49+00:00",
"source_file": "docs/_sidebar.md",
"language_code": "en"
}
-->
- Introduction
  - [Defining Data Science](../1-Introduction/01-defining-data-science/README.md)
  - [Ethics of Data Science](../1-Introduction/02-ethics/README.md)
  - [Defining Data](../1-Introduction/03-defining-data/README.md)
  - [Probability and Stats](../1-Introduction/04-stats-and-probability/README.md)
- Working With Data
  - [Relational Databases](../2-Working-With-Data/05-relational-databases/README.md)
  - [Nonrelational Databases](../2-Working-With-Data/06-non-relational/README.md)
  - [Python](../2-Working-With-Data/07-python/README.md)
  - [Data Preparation](../2-Working-With-Data/08-data-preparation/README.md)
- Data Visualization
  - [Visualizing Quantities](../3-Data-Visualization/09-visualization-quantities/README.md)
  - [Visualizing Distributions](../3-Data-Visualization/10-visualization-distributions/README.md)
  - [Visualizing Proportions](../3-Data-Visualization/11-visualization-proportions/README.md)
  - [Visualizing Relationships](../3-Data-Visualization/12-visualization-relationships/README.md)
  - [Meaningful Visualizations](../3-Data-Visualization/13-meaningful-visualizations/README.md)
- Data Science Lifecycle
  - [Introduction](../4-Data-Science-Lifecycle/14-Introduction/README.md)
  - [Analyzing](../4-Data-Science-Lifecycle/15-analyzing/README.md)
  - [Communication](../4-Data-Science-Lifecycle/16-communication/README.md)
- Data Science in the Cloud
  - [Introduction](../5-Data-Science-In-Cloud/17-Introduction/README.md)
  - [Low Code](../5-Data-Science-In-Cloud/18-Low-Code/README.md)
  - [Azure](../5-Data-Science-In-Cloud/19-Azure/README.md)
- Data Science in the Wild
  - [DS In The Wild](../6-Data-Science-In-Wild/README.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,78 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "87f157ea00d36c1d12c14390d9852b50",
"translation_date": "2025-08-31T10:54:23+00:00",
"source_file": "for-teachers.md",
"language_code": "en"
}
-->
## For Educators
Would you like to use this curriculum in your classroom? Go ahead!
In fact, you can use it directly on GitHub by leveraging GitHub Classroom.
To do this, fork this repository. You'll need to create a separate repository for each lesson, so you'll have to extract each folder into its own repository. This way, [GitHub Classroom](https://classroom.github.com/classrooms) can handle each lesson individually.
These [detailed instructions](https://github.blog/2020-03-18-set-up-your-digital-classroom-with-github-classroom/) will guide you on how to set up your classroom.
## Using the repository as is
If you'd prefer to use this repository as it currently exists, without GitHub Classroom, that's also possible. You'll need to coordinate with your students on which lesson to work through together.
In an online format (Zoom, Teams, or similar), you could create breakout rooms for quizzes and mentor students to prepare them for learning. Then, invite students to complete the quizzes and submit their answers as 'issues' at a designated time. You could follow the same approach for assignments if you want students to collaborate openly.
If you'd rather use a more private format, ask your students to fork the curriculum, lesson by lesson, into their own private GitHub repositories and grant you access. This way, they can complete quizzes and assignments privately and submit them to you via issues on your classroom repository.
There are many ways to adapt this for an online classroom setting. Let us know what works best for you!
## Included in this curriculum:
20 lessons, 40 quizzes, and 20 assignments. Sketchnotes are included to support visual learners. Many lessons are available in both Python and R and can be completed using Jupyter notebooks in VS Code. Learn more about setting up your classroom to use this tech stack: https://code.visualstudio.com/docs/datascience/jupyter-notebooks.
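As a minimal sketch of one possible setup for the Python lessons (the package list and extension IDs below are our assumptions, not part of the curriculum), a classroom machine could be prepared like this:
```
# Install Jupyter and libraries used in the Python lessons (assumes Python and pip are installed)
pip install jupyter pandas matplotlib
# Install the VS Code extensions for Python and Jupyter notebooks
code --install-extension ms-python.python
code --install-extension ms-toolsai.jupyter
```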
All sketchnotes, including a large-format poster, are located in [this folder](../../sketchnotes).
The entire curriculum is available [as a PDF](../../pdf/readme.pdf).
You can also run this curriculum as a standalone, offline-friendly website using [Docsify](https://docsify.js.org/#/). [Install Docsify](https://docsify.js.org/#/quickstart) on your local machine, then in the root folder of your local copy of this repository, type `docsify serve`. The website will be served on port 3000 on your localhost: `localhost:3000`.
An offline-friendly version of the curriculum will open as a standalone web page at `localhost:3000`.
Lessons are organized into six parts:
- 1: Introduction
  - 1: Defining Data Science
  - 2: Ethics
  - 3: Defining Data
  - 4: Probability and Statistics Overview
- 2: Working with Data
  - 5: Relational Databases
  - 6: Non-Relational Databases
  - 7: Python
  - 8: Data Preparation
- 3: Data Visualization
  - 9: Visualization of Quantities
  - 10: Visualization of Distributions
  - 11: Visualization of Proportions
  - 12: Visualization of Relationships
  - 13: Meaningful Visualizations
- 4: Data Science Lifecycle
  - 14: Introduction
  - 15: Analyzing
  - 16: Communication
- 5: Data Science in the Cloud
  - 17: Introduction
  - 18: Low-Code Options
  - 19: Azure
- 6: Data Science in the Wild
  - 20: Overview
## Please give us your thoughts!
We want this curriculum to work for you and your students. Share your feedback in the discussion boards! Feel free to create a classroom section on the discussion boards for your students.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,139 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "e92c33ea498915a13c9aec162616db18",
"translation_date": "2025-08-31T11:11:39+00:00",
"source_file": "quiz-app/README.md",
"language_code": "en"
}
-->
# Quizzes
These quizzes are the pre- and post-lecture quizzes for the data science curriculum at https://aka.ms/datascience-beginners.
## Adding a translated quiz set
To add a quiz translation, create corresponding quiz structures in the `assets/translations` folders. The original quizzes are located in `assets/translations/en`. The quizzes are divided into several sections. Ensure the numbering matches the correct quiz section. There are 40 quizzes in total in this curriculum, starting from 0.
After updating the translations, modify the `index.js` file in the translation folder to import all the files, following the conventions in `en`.
Update the `index.js` file in `assets/translations` to import the newly translated files.
Next, update the dropdown in `App.vue` in this app to include your language. Match the localized abbreviation to the folder name for your language.
Finally, update all the quiz links in the translated lessons, if applicable, to include the localization as a query parameter: `?loc=fr`, for example.
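A convenient way to start a new translation is to copy the English quiz set and translate it in place. This is only a sketch; `fr` is an example language code, and the command assumes you run it from the `quiz-app` folder:
```
# Copy the original English quizzes as the starting point for a new translation (French shown as an example)
cp -r assets/translations/en assets/translations/fr
# Then translate the copied quiz files and update index.js and App.vue as described above
```
The folder name you choose must match the localized abbreviation you add to the dropdown in `App.vue`.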
## Project setup
```
npm install
```
### Compiles and hot-reloads for development
```
npm run serve
```
### Compiles and minifies for production
```
npm run build
```
### Lints and fixes files
```
npm run lint
```
### Customize configuration
Refer to [Configuration Reference](https://cli.vuejs.org/config/).
Credits: Thanks to the original version of this quiz app: https://github.com/arpan45/simple-quiz-vue
## Deploying to Azure
Here's a step-by-step guide to help you get started:
1. Fork the GitHub Repository
Ensure your static web app code is in your GitHub repository. Fork this repository.
2. Create an Azure Static Web App
- Create an [Azure account](http://azure.microsoft.com)
- Go to the [Azure portal](https://portal.azure.com)
- Click on “Create a resource” and search for “Static Web App.”
- Click “Create.”
3. Configure the Static Web App
   - Basics:
     - Subscription: Select your Azure subscription.
     - Resource Group: Create a new resource group or use an existing one.
     - Name: Provide a name for your static web app.
     - Region: Choose the region closest to your users.
   - Deployment Details:
     - Source: Select “GitHub.”
     - GitHub Account: Authorize Azure to access your GitHub account.
     - Organization: Select your GitHub organization.
     - Repository: Choose the repository containing your static web app.
     - Branch: Select the branch you want to deploy from.
   - Build Details:
     - Build Presets: Choose the framework your app is built with (e.g., React, Angular, Vue, etc.).
     - App Location: Specify the folder containing your app code (e.g., `/` if it's in the root).
     - API Location: If you have an API, specify its location (optional).
     - Output Location: Specify the folder where the build output is generated (e.g., `build` or `dist`).
4. Review and Create
Review your settings and click “Create.” Azure will set up the necessary resources and create a GitHub Actions workflow in your repository.
5. GitHub Actions Workflow
Azure will automatically create a GitHub Actions workflow file in your repository (`.github/workflows/azure-static-web-apps-<name>.yml`). This workflow will handle the build and deployment process.
6. Monitor the Deployment
Go to the “Actions” tab in your GitHub repository.
You should see a workflow running. This workflow will build and deploy your static web app to Azure.
Once the workflow completes, your app will be live on the provided Azure URL.
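If you would rather script the resource creation than use the portal, the Azure CLI offers a comparable flow. This is a hedged sketch rather than part of the official steps; the names are placeholders, and you should verify the flags against `az staticwebapp create --help`:
```
# Create the Static Web App from your fork (assumes the Azure CLI is installed and you are logged in)
az staticwebapp create \
  --name my-quiz-app \
  --resource-group my-resource-group \
  --source https://github.com/<your-account>/Data-Science-For-Beginners \
  --location "westus2" \
  --branch main \
  --app-location "quiz-app" \
  --output-location "dist" \
  --login-with-github
```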
### Example Workflow File
Here's an example of what the GitHub Actions workflow file might look like:
```
name: Azure Static Web Apps CI/CD

on:
  push:
    branches:
      - main
  pull_request:
    types: [opened, synchronize, reopened, closed]
    branches:
      - main

jobs:
  build_and_deploy_job:
    runs-on: ubuntu-latest
    name: Build and Deploy Job
    steps:
      - uses: actions/checkout@v2
      - name: Build And Deploy
        id: builddeploy
        uses: Azure/static-web-apps-deploy@v1
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN }}
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          action: "upload"
          app_location: "quiz-app" # App source code path
          api_location: "" # API source code path - optional
          output_location: "dist" # Built app content directory - optional
```
### Additional Resources
- [Azure Static Web Apps Documentation](https://learn.microsoft.com/azure/static-web-apps/getting-started)
- [GitHub Actions Documentation](https://docs.github.com/actions/use-cases-and-examples/deploying/deploying-to-azure-static-web-app)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,21 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "3a848466cb63aff1a93411affb152c2a",
"translation_date": "2025-08-31T11:12:38+00:00",
"source_file": "sketchnotes/README.md",
"language_code": "en"
}
-->
Find all sketchnotes here!
## Credits
Nitya Narasimhan, artist
![roadmap sketchnote](../../../sketchnotes/00-Roadmap.png)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.