---
name: Review Checklist
about: Reviewing curriculum lessons
title: '[Review]'
labels: ''
assignees: ''
---
# This lesson has been reviewed and cleared of the following issues

- [ ] Typos
- [ ] Grammar errors
- [ ] Missing links
- [ ] Broken images
- [ ] Checked for completeness
- [ ] Quiz (if there is no quiz, assign to @paladique)

.gitignore

paket-files/
# Python Tools for Visual Studio (PTVS)
__pycache__/
*.pyc
venv/
# Cake - Uncomment if you are using it
# tools/**
MigrationBackup/
# Ionide (cross platform F# VS Code tools) working folder
.ionide/
4-Data-Science-Lifecycle/14-Introduction/README.md
.vscode/settings.json
Data/Taxi/*

# Defining Data Science
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/01-Definitions.png)|
|:---:|
|Defining Data Science - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |

---

[![Defining Data Science Video](images/video-def-ds.png)](https://youtu.be/pqqsm5reGvs)

## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/0)

## What is Data?

In our everyday life, we are constantly surrounded by data. The text you are reading now is data, the list of phone numbers of your friends in your smartphone is data, as well as the current time displayed on your watch. As human beings, we naturally operate with data by counting the money we have or writing letters to our friends.

However, data became much more critical with the creation of computers. The primary role of computers is to perform computations, but they need data to operate on. Thus, we need to understand how computers store and process data.

With the emergence of the Internet, the role of computers as data handling devices increased. If you think about it, we now use computers more and more for data processing and communication, rather than for actual computations. When we write an e-mail to a friend or search for some information on the Internet, we are essentially creating, storing, transmitting, and manipulating data.
> Can you remember the last time you used a computer to actually compute something?
## What is Data Science?
In [Wikipedia](https://en.wikipedia.org/wiki/Data_science), **Data Science** is defined as *a scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains*.

This definition highlights the following important aspects of data science:

* The main goal of data science is to **extract knowledge** from data, in other words - to **understand** data, find hidden relationships, and build a **model**.
* Data science uses **scientific methods**, such as probability and statistics. In fact, when the term *data science* was first introduced, some people argued that data science was just a new fancy name for statistics. Nowadays it has become evident that the field is much broader.
* The knowledge obtained should be applied to produce **actionable insights**.
* We should be able to operate on both **structured** and **unstructured** data. We will come back to discuss different types of data later in the course.
* **Application domain** is an important concept, and a data scientist often needs at least some degree of expertise in the problem domain.
## Other Related Fields
Since data is a pervasive concept, data science itself is also a broad field, touching many other related disciplines.
<dl>
<dt>Databases</dt>
## Types of Data
As we have already mentioned - data is everywhere, we just need to capture it in the right way! It is useful to distinguish between **structured** and **unstructured** data. The former is typically represented in some well-structured form, often as a table or a number of tables, while the latter is just a collection of files. Sometimes we can also talk about **semi-structured** data, which has some sort of structure that may vary greatly.

| Structured | Semi-structured | Unstructured |
|----------- |-----------------|--------------|
| List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopaedia Britannica |
| Temperature in all rooms of a building at every minute for the last 20 years | Collection of scientific papers in JSON format with authors, date of publication, and abstract | File share with corporate documents |
| Data for age and gender of all people entering the building | Internet pages | Raw video feed from surveillance camera |
## Where to get Data
There are many possible sources of data, and it would be impossible to list them all! However, let's mention some of the typical places where you can get data:
* **Structured**
- **Internet of Things**, including data from different sensors, such as temperature or pressure sensors, provides a lot of useful data. For example, if an office building is equipped with IoT sensors, we can automatically control heating and lighting in order to minimize costs.
  - **Surveys** that we ask users to complete after a purchase, or after visiting a web site.
- **Analysis of behavior** can, for example, help us understand how deeply a user goes into a site, and what is the typical reason for leaving the site.
* **Unstructured**
  - **Texts** can be a rich source of insights, from an overall **sentiment score** to extracting keywords and even some semantic meaning (see the sketch after this list).
  - **Images** or **Video**. A video from a surveillance camera can be used to estimate traffic on the road, and to inform people about potential traffic jams.
- Web server **Logs** can be used to understand which pages of our site are most visited, and for how long.
* **Semi-structured**
  - A **Social Network** graph can be a great source of data about user personality and potential effectiveness in spreading information around.
- When we have a bunch of photographs from a party, we can try to extract **Group Dynamics** data by building a graph of people taking pictures with each other.
By knowing different possible sources of data, you can try to think about different scenarios where data science techniques can be applied to understand the situation better and to improve business processes.
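
For a taste of how unstructured text can yield a structured signal, here is a minimal sketch of a **sentiment score** based on tiny hand-made wordlists; a real project would use an NLP library or a trained model, and the wordlists here are purely illustrative.

```python
# A toy sentiment scorer: count positive vs. negative words.
# The wordlists are made up for illustration; real systems use
# trained models or curated lexicons.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "sad"}

def sentiment_score(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this great product"))      # 2 (positive)
print(sentiment_score("terrible service and sad staff")) # -2 (negative)
```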
## What you can do with Data
In Data Science, we focus on the following steps of the data journey:
<dl>
<dt>1) Data Acquisition</dt>
<dd>
The first step is to collect the data. While in many cases it can be a straightforward process, like data coming into a database from a web application, sometimes we need to use special techniques. For example, data from IoT sensors can be overwhelming, and it is a good practice to use buffering endpoints such as IoT Hub to collect all the data before further processing.
</dd>
<dt>2) Data Storage</dt>
<dd>
Storing the data can be challenging, especially if we are talking about big data. When deciding how to store data, it makes sense to anticipate the way you will want to query it later on. There are several ways data can be stored:
<ul>
<li>A relational database stores a collection of tables, and uses a special language called SQL to query them. Typically, tables are connected to each other using some schema. In many cases we need to convert the data from its original form to fit the schema (see the sketch after these steps).</li>
<li><a href="https://en.wikipedia.org/wiki/NoSQL">NoSQL</a> database, such as <a href="https://azure.microsoft.com/services/cosmos-db/?WT.mc_id=acad-31812-dmitryso">CosmosDB</a>, does not enforce schema on data, and allows storing more complex data, for example, hierarchical JSON documents or graphs. However, NoSQL database does not have rich querying capabilities of SQL, and cannot enforce referential integrity between data.</li>
<li><a href="https://en.wikipedia.org/wiki/Data_lake">Data Lake</a> storage is used for large collections of data in raw form. Data lakes are often used with big data, where all data cannot fit into one machine, and has to be stored and processed by a cluster. <a href="https://en.wikipedia.org/wiki/Apache_Parquet">Parquet</a> is the data format that is often used in conjunction with big data.</li>
</ul>
</dd>
<dt>3) Data Processing</dt>
<dd>
This is the most exciting part of the data journey, which involves processing the data from its original form into a form that can be used for visualization or model training. When dealing with unstructured data such as text or images, we may need to use some AI techniques to extract **features** from the data, thus converting it to structured form.
</dd>
<dt>4) Visualization / Human Insights</dt>
<dd>
Often, to understand the data, we need to visualize it. With many different visualization techniques in our toolbox, we can find the right view to gain an insight. Often, a data scientist needs to "play with data", visualizing it many times and looking for relationships. We may also use techniques from statistics to test hypotheses or prove a correlation between different pieces of data.
</dd>
<dt>5) Training predictive model</dt>
<dd>
Because the ultimate goal of data science is to be able to make decisions based on data, we may want to use the techniques of <a href="http://github.com/microsoft/ml-for-beginners">Machine Learning</a> to build a predictive model that will be able to solve our problem.
</dd>
</dl>
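
To make step 2 concrete, here is a minimal sketch of the relational option using Python's built-in `sqlite3` module; the `readings` table and its values are made up for illustration.

```python
# A minimal sketch of relational storage and SQL querying, using
# Python's built-in sqlite3 module; the table and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE readings (room TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("lobby", 21.5), ("office", 23.0), ("lobby", 22.1)],
)

# SQL lets us ask structured questions about the stored data
for room, avg_t in conn.execute(
    "SELECT room, AVG(temperature) FROM readings GROUP BY room"
):
    print(room, round(avg_t, 2))
```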
Of course, depending on the actual data, some steps might be missing (e.g., when we already have the data in the database, or when we do not need model training), or some steps might be repeated several times (such as data processing).
## Digitalization and Digital Transformation
In this challenge, we will try to find concepts relevant to the field of Data Science by looking at texts. We will take the Wikipedia article on Data Science, download and process the text, and then build a word cloud like this one:
![Word Cloud for Data Science](images/ds_wordcloud.png)
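
The general shape of the solution is quite short. Here is a minimal sketch, assuming the third-party `wikipedia` and `wordcloud` packages are installed; the full, annotated version lives in the notebook referenced below.

```python
# Download the article text and render a word cloud.
# Assumes: pip install wikipedia wordcloud matplotlib
import wikipedia
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = wikipedia.page("Data science").content            # fetch article text
cloud = WordCloud(width=800, height=400).generate(text)  # compute word frequencies

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```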
Visit [`notebook.ipynb`](notebook.ipynb) to read through the code. You can also run the code, and see how it performs all data transformations in real time.
> If you do not know how to run code in Jupyter Notebook, have a look at [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/1)
## Assignments

* **Task 1**: Modify the code above to find related concepts for the fields of **Big Data** and **Machine Learning**
* **Task 2**: [Think About Data Science Scenarios](assignment.md)

## Credits

This lesson has been authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)

# Assignment: Data Science Scenarios
In this first assignment, we ask you to think about some real-life process or problem in different problem domains, and how you can improve it using the Data Science process. Think about the following:
1. Which data can you collect?
1. How would you collect it?
1. How would you store the data? How large is the data likely to be?
1. Which insights might you be able to get from this data? Which decisions would we be able to take based on the data?
Try to think about 3 different problems/processes and describe each of the points above for each problem domain.
Here are some of the problem domains and problems that can get you started thinking:
1. How can you use data to improve the education process for children in schools?
1. How can you use data to control vaccination during the pandemic?
1. How can you use data to make sure you are being productive at work?
## Instructions
Fill in the following table (replace the suggested problem domains with your own if needed):
| Problem Domain | Problem | Which data to collect | How to store the data | Which insights/decisions we can make |
|----------------|---------|-----------------------|-----------------------|--------------------------------------|
| Education | | | | |
| Vaccination | | | | |
| Productivity | | | | |
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | ---
One was able to identify reasonable data sources, ways of storing data and possible decisions/insights for all problem domains | Some of the aspects of the solution are not detailed, data storage is not discussed, at least 2 problem domains are described | Only parts of the data solution are described, only one problem domain is considered.


# Assignment: Data Science Scenarios
In this first assignment, we ask you to think about some real-life process or problem in different problem domains, and how you can improve it using the Data Science process. Think about the following:
1. Which data can you collect?
1. How would you collect it?
1. How would you store the data? How large is the data likely to be?
1. Which insights might you be able to get from this data? Which decisions would we be able to take based on the data?
Try to think about 3 different problems/processes and describe each of the points above for each problem domain.
Here are some of the problem domains and problems that can get you started thinking:
1. How can you use data to improve the education process for children in schools?
1. How can you use data to control vaccination during the pandemic?
1. How can you use data to make sure you are being productive at work?
## Instructions
Fill in the following table (replace the suggested problem domains with your own if needed):
| Problem Domain | Problem | Which data to collect | How to store the data | Which insights/decisions we can make |
|----------------|---------|-----------------------|-----------------------|--------------------------------------|
| Education | In university, we typically have low attendance at lectures, and we have the hypothesis that students who attend lectures on average do better during exams. We want to stimulate attendance and test the hypothesis. | We can track attendance through pictures taken by the security camera in class, or by tracking bluetooth/wifi addresses of student mobile phones in class. Exam data is already available in the university database. | In case we track security camera images - we need to store a few (5-10) photographs during class (unstructured data), and then use AI to identify faces of students (convert data to structured form). | We can compute average attendance data for each student, and see if there is any correlation with exam grades. We will talk more about correlation in the [probability and statistics](../../04-stats-and-probability/README.md) section. In order to stimulate student attendance, we can publish the weekly attendance rating on the school portal, and draw prizes among those with the highest attendance. |
| Vaccination | | | | |
| Productivity | | | | |
> *We provide just one answer as an example, so that you can get an idea of what is expected in this assignment.*
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | ---
One was able to identify reasonable data sources, ways of storing data and possible decisions/insights for all problem domains | Some of the aspects of the solution are not detailed, data storage is not discussed, at least 2 problem domains are described | Only parts of the data solution are described, only one problem domain is considered.


# Introduction to Data Ethics

|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/02-Ethics.png)|
|:---:|
| Data Science Ethics - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |

---
We are all data citizens living in a datafied world.
Market trends tell us that by 2022, 1-in-3 large organizations will buy and sell their data through online [Marketplaces and Exchanges](https://www.gartner.com/smarterwithgartner/gartner-top-10-trends-in-data-and-analytics-for-2020/). As **App Developers**, we'll find it easier and cheaper to integrate data-driven insights and algorithm-driven automation into daily user experiences. But as AI becomes pervasive, we'll also need to understand the potential harms caused by the [weaponization](https://www.youtube.com/watch?v=TQHs8SA1qpk) of such algorithms at scale.
Trends also indicate that we will create and consume over [180 zettabytes](https://www.statista.com/statistics/871513/worldwide-data-created/) of data by 2025. As **Data Scientists**, this gives us unprecedented levels of access to personal data. This means we can build behavioral profiles of users and influence decision-making in ways that create an [illusion of free choice](https://www.datasciencecentral.com/profiles/blogs/the-illusion-of-choice) while potentially nudging users towards outcomes we prefer. It also raises broader questions on data privacy and user protections.
Data ethics are now _necessary guardrails_ for data science and engineering, helping us minimize potential harms and unintended consequences from our data-driven actions. The [Gartner Hype Cycle for AI](https://www.gartner.com/smarterwithgartner/2-megatrends-dominate-the-gartner-hype-cycle-for-artificial-intelligence-2020/) identifies relevant trends in digital ethics, responsible AI, and AI governance as key drivers for larger megatrends around _democratization_ and _industrialization_ of AI.
![Gartner's Hype Cycle for AI - 2020](https://images-cdn.newscred.com/Zz1mOWJhNzlkNDA2ZTMxMWViYjRiOGFiM2IyMjQ1YmMwZQ==)
In this lesson, we'll explore the fascinating area of data ethics - from core concepts and challenges, to case studies and applied AI concepts like governance - that help establish an ethics culture in teams and organizations that work with data and AI.
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/2) 🎯
## Basic Definitions
Let's start by understanding the basic terminology.
The word "ethics" comes from the [Greek word "ethikos"](https://en.wikipedia.org/wiki/Ethics) (and its root "ethos") meaning _character or moral nature_.
**Ethics** is about the shared values and moral principles that govern our behavior in society. Ethics is based not on laws but on widely accepted norms of what is "right vs. wrong". However, ethical considerations can influence corporate governance initiatives and government regulations that create more incentives for compliance.
**Data Ethics** is a [new branch of ethics](https://royalsocietypublishing.org/doi/full/10.1098/rsta.2016.0360#sec-1) that "studies and evaluates moral problems related to _data, algorithms and corresponding practices_". Here, **"data"** focuses on actions related to generation, recording, curation, processing, dissemination, sharing, and usage, **"algorithms"** focuses on AI, agents, machine learning, and robots, and **"practices"** focuses on topics like responsible innovation, programming, hacking, and ethics codes.
**Applied Ethics** is the [practical application of moral considerations](https://en.wikipedia.org/wiki/Applied_ethics). It's the process of actively investigating ethical issues in the context of _real-world actions, products and processes_, and taking corrective measures to make sure that these remain aligned with our defined ethical values.
**Ethics Culture** is about [_operationalizing_ applied ethics](https://hbr.org/2019/05/how-to-design-an-ethical-organization) to make sure that our ethical principles and practices are adopted in a consistent and scalable manner across the entire organization. Successful ethics cultures define organization-wide ethical principles, provide meaningful incentives for compliance, and reinforce ethics norms by encouraging and amplifying desired behaviors at every level of the organization.
## Ethics Concepts
In this section, we'll discuss concepts like **shared values** (principles) and **ethical challenges** (problems) for data ethics - and explore **case studies** that help you understand these concepts in real-world contexts.
### 1. Ethics Principles
Every data ethics strategy begins by defining _ethical principles_ - the "shared values" that describe acceptable behaviors, and guide compliant actions, in our data & AI projects. You can define these at an individual or team level. However, most large organizations outline these in an _ethical AI_ mission statement or framework that is defined at corporate levels and enforced consistently across all teams.
**Example:** Microsoft's [Responsible AI](https://www.microsoft.com/en-us/ai/responsible-ai) mission statement reads: _"We are committed to the advancement of AI driven by ethical principles that put people first"_ - identifying 6 ethical principles in the framework below:
![Responsible AI at Microsoft](https://docs.microsoft.com/en-gb/azure/cognitive-services/personalizer/media/ethics-and-responsible-use/ai-values-future-computed.png)
Let's briefly explore these principles. _Transparency_ and _accountability_ are foundational values that other principles build upon - so let's begin there:
* [**Accountability**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) makes practitioners _responsible_ for their data & AI operations, and compliance with these ethical principles.
* [**Transparency**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) ensures that data and AI actions are _understandable_ (interpretable) to users, explaining the what and why behind decisions.
* [**Fairness**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1%3aprimaryr6) - focuses on ensuring AI treats _all people_ fairly, addressing any systemic or implicit socio-technical biases in data and systems.
* [**Reliability & Safety**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) - ensures that AI behaves _consistently_ with defined values, minimizing potential harms or unintended consequences.
* [**Privacy & Security**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) - is about understanding data lineage, and providing _data privacy and related protections_ to users.
* [**Inclusiveness**](https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1:primaryr6) - is about designing AI solutions with intention, adapting them to meet a _broad range of human needs_ & capabilities.
> 🚨 Think about what your data ethics mission statement could be. Explore ethical AI frameworks from other organizations - here are examples from [IBM](https://www.ibm.com/cloud/learn/ai-ethics), [Google](https://ai.google/principles), and [Facebook](https://ai.facebook.com/blog/facebooks-five-pillars-of-responsible-ai/). What shared values do they have in common? How do these principles relate to the AI product or industry they operate in?
### 2. Ethics Challenges
Once we have ethical principles defined, the next step is to evaluate our data and AI actions to see if they align with those shared values. Think about your actions in two categories: _data collection_ and _algorithm design_.
With data collection, actions will likely involve **personal data** or personally identifiable information (PII) for identifiable living individuals. This includes [diverse items of non-personal data](https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en) that _collectively_ identify an individual. Ethical challenges can relate to _data privacy_, _data ownership_, and related topics like _informed consent_ and _intellectual property rights_ for users.
With algorithm design, actions will involve collecting & curating **datasets**, then using them to train & deploy **data models** that predict outcomes or automate decisions in real-world contexts. Ethical challenges can arise from _dataset bias_, _data quality_ issues, _unfairness_, and _misrepresentation_ in algorithms - including some issues that are systemic in nature.
In both cases, ethics challenges highlight areas where our actions may conflict with our shared values. To detect, mitigate, minimize, or eliminate these concerns, we need to ask moral "yes/no" questions related to our actions, then take corrective actions as needed. Let's take a look at some ethical challenges and the moral questions they raise:
#### 2.1 Data Ownership
Data collection often involves personal data that can identify the data subjects. [Data ownership](https://permission.io/blog/data-ownership) is about _control_ and [_user rights_](https://permission.io/blog/data-ownership) related to the creation, processing, and dissemination of data.
The moral questions we need to ask are:
* Who owns the data? (user or organization)
* What rights do data subjects have? (ex: access, erasure, portability)
* What rights do organizations have? (ex: rectify malicious user reviews)
#### 2.2 Informed Consent
[Informed consent](https://legaldictionary.net/informed-consent/) defines the act of users agreeing to an action (like data collection) with a _full understanding_ of relevant facts including the purpose, potential risks, and alternatives.
Questions to explore here are:
* Did the user (data subject) give permission for data capture and usage?
* Did the user understand the purpose for which that data was captured?
* Did the user understand the potential risks from their participation?
#### 2.3 Intellectual Property
[Intellectual property](https://en.wikipedia.org/wiki/Intellectual_property) refers to intangible creations resulting from human initiative that may _have economic value_ to individuals or businesses.
Questions to explore here are:
* Did the collected data have economic value to a user or business?
* Does the **user** have intellectual property here?
* Does the **organization** have intellectual property here?
* If these rights exist, how are we protecting them?
#### 2.4 Data Privacy
[Data privacy](https://www.northeastern.edu/graduate/blog/what-is-data-privacy/) or information privacy refers to the preservation of user privacy and protection of user identity with respect to personally identifiable information.
Questions to explore here are:
* Is users' (personal) data secured against hacks and leaks?
* Is users' data accessible only to authorized users and contexts?
* Is users' anonymity preserved when data is shared or disseminated?
* Can a user be de-identified from anonymized datasets?
#### 2.5 Right To Be Forgotten
The [Right To Be Forgotten](https://en.wikipedia.org/wiki/Right_to_be_forgotten) or [Right to Erasure](https://www.gdpreu.org/right-to-be-forgotten/) provides additional personal data protection to users. Specifically, it gives users the right to request deletion or removal of personal data from Internet searches and other locations, _under specific circumstances_ - allowing them a fresh start online without past actions being held against them.
Questions to explore here are:
* Does the system allow data subjects to request erasure?
* Should the withdrawal of user consent trigger automated erasure?
* Was data collected without consent or by unlawful means?
* Are we compliant with government regulations for data privacy?
#### 2.6 Dataset Bias
Dataset or [Collection Bias](http://researcharticles.com/index.php/bias-in-data-collection-in-research/) is about selecting a _non-representative_ subset of data for algorithm development, creating potential unfairness in result outcomes for diverse groups. Types of bias include selection or sampling bias, volunteer bias, and instrument bias.
Questions to explore here are:
* Did we recruit a representative set of data subjects?
* Did we test our collected or curated dataset for various biases?
* Can we mitigate or remove any discovered biases?
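
As a tiny illustration of sampling bias, this sketch (with made-up numbers) shows how a collection procedure that systematically misses part of the population skews even a simple estimate:

```python
# Made-up population: ages 18-70, roughly uniform.
import random

random.seed(0)
population = [random.randint(18, 70) for _ in range(10_000)]

# Biased collection: suppose mostly younger people answer our survey.
biased_sample = [age for age in population if age < 40][:500]

print(f"True mean age:    {sum(population) / len(population):.1f}")
print(f"Sampled mean age: {sum(biased_sample) / len(biased_sample):.1f}")
```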
#### 2.7 Data Quality
[Data Quality](https://lakefs.io/data-quality-testing/) looks at the validity of the curated dataset used to develop our algorithms, checking to see if features and records meet requirements for the level of accuracy and consistency needed for our AI purpose.
Questions to explore here are:
* Did we capture valid _features_ for our use case?
* Was data captured _consistently_ across diverse data sources?
* Is the dataset _complete_ for diverse conditions or scenarios?
* Is information captured _accurately_ in reflecting reality?
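
A minimal sketch of automating such checks, using made-up sensor records with two typical quality problems:

```python
# Made-up temperature records illustrating completeness and
# validity checks; the plausible range is an assumption.
records = [
    {"room": "lobby",  "temperature": 21.5},
    {"room": "office", "temperature": None},   # incomplete record
    {"room": "lab",    "temperature": 999.0},  # implausible reading
]

missing = [r for r in records if r["temperature"] is None]
invalid = [r for r in records
           if r["temperature"] is not None
           and not (-30 <= r["temperature"] <= 50)]

print(f"{len(missing)} incomplete record(s), {len(invalid)} out-of-range record(s)")
```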
#### 2.8 Algorithm Fairness
[Algorithm Fairness](https://towardsdatascience.com/what-is-algorithm-fairness-3182e161cf9f) checks to see if the algorithm design systematically discriminates against specific subgroups of data subjects leading to [potential harms](https://docs.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml) in _allocation_ (where resources are denied or withheld from that group) and _quality of service_ (where AI is not as accurate for some subgroups as it is for others).
Questions to explore here are:
* Did we evaluate model accuracy for diverse subgroups and conditions?
* Did we scrutinize the system for potential harms (e.g., stereotyping)?
* Can we revise data or retrain models to mitigate identified harms?
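
As a sketch of the first question, accuracy can be computed per subgroup rather than only overall; the groups, labels, and predictions below are made up:

```python
# Per-group accuracy check: a model that looks fine overall
# can still underperform for one subgroup (made-up data).
groups    = ["A", "A", "A", "B", "B", "B"]
actual    = [1, 0, 1, 1, 1, 0]
predicted = [1, 0, 1, 0, 1, 1]

for g in sorted(set(groups)):
    idx = [i for i, grp in enumerate(groups) if grp == g]
    acc = sum(actual[i] == predicted[i] for i in idx) / len(idx)
    print(f"group {g}: accuracy = {acc:.2f}")   # A: 1.00, B: 0.33
```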
Explore resources like [AI Fairness checklists](https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4t6dA) to learn more.
#### 2.9 Misrepresentation
[Data Misrepresentation](https://www.sciencedirect.com/topics/computer-science/misrepresentation) is about asking whether we are communicating insights from honestly reported data in a deceptive manner to support a desired narrative.
Questions to explore here are:
* Are we reporting incomplete or inaccurate data?
* Are we visualizing data in a manner that drives misleading conclusions?
* Are we using selective statistical techniques to manipulate outcomes?
* Are there alternative explanations that may offer a different conclusion?
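
A truncated y-axis is one of the most common tricks behind the second question; here is a minimal sketch with made-up numbers:

```python
# The same made-up data shown two ways: truncating the y-axis
# makes a ~4% difference look dramatic.
import matplotlib.pyplot as plt

labels, values = ["Product A", "Product B"], [52, 54]

fig, (honest, misleading) = plt.subplots(1, 2, figsize=(8, 3))
honest.bar(labels, values)
honest.set_title("Axis from zero")
misleading.bar(labels, values)
misleading.set_ylim(50, 55)   # exaggerates the gap
misleading.set_title("Truncated axis")
plt.show()
```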
#### 2.10 Free Choice
The [Illusion of Free Choice](https://www.datasciencecentral.com/profiles/blogs/the-illusion-of-choice) occurs when system "choice architectures" use decision-making algorithms to nudge people towards a preferred outcome while seeming to give them options and control. These [dark patterns](https://www.darkpatterns.org/) can cause social and economic harm to users. Because user decisions impact behavior profiles, these actions potentially drive future choices that can amplify or extend the impact of these harms.
Questions to explore here are:
* Did the user understand the implications of making that choice?
* Was the user aware of (alternative) choices and the pros & cons of each?
* Can the user reverse an automated or influenced choice later?
### 3. Case Studies
To put these ethical challenges in real-world contexts, it helps to look at case studies that highlight the potential harms and consequences to individuals and society, when such ethics violations are overlooked.
Here are a few examples:
| Ethics Challenge | Case Study |
|--- |--- |
| **Informed Consent** | 1972 - [Tuskegee Syphilis Study](https://en.wikipedia.org/wiki/Tuskegee_Syphilis_Study) - African American men who participated in the study were promised free medical care _but deceived_ by researchers who failed to inform subjects of their diagnosis or about availability of treatment. Many subjects died & partners or children were affected; the study lasted 40 years. |
| **Data Privacy** | 2007 - The [Netflix data prize](https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/) provided researchers with _10M anonymized movie rankings from 50K customers_ to help improve recommendation algorithms. However, researchers were able to correlate anonymized data with personally-identifiable data in _external datasets_ (e.g., IMDb comments) - effectively "de-anonymizing" some Netflix subscribers.|
| **Collection Bias** | 2013 - The City of Boston [developed Street Bump](https://www.boston.gov/transportation/street-bump), an app that let citizens report potholes, giving the city better roadway data to find and fix issues. However, [people in lower income groups had less access to cars and phones](https://hbr.org/2013/04/the-hidden-biases-in-big-data), making their roadway issues invisible in this app. Developers worked with academics to address _equitable access and digital divide_ issues for fairness. |
| **Algorithmic Fairness** | 2018 - The MIT [Gender Shades Study](http://gendershades.org/overview.html) evaluated the accuracy of gender classification AI products, exposing gaps in accuracy for women and persons of color. A [2019 Apple Card](https://www.wired.com/story/the-apple-card-didnt-see-genderand-thats-the-problem/) seemed to offer less credit to women than men. Both illustrated issues in algorithmic bias leading to socio-economic harms.|
| **Data Misrepresentation** | 2020 - The [Georgia Department of Public Health released COVID-19 charts](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) that appeared to mislead citizens about trends in confirmed cases with non-chronological ordering on the x-axis. This illustrates misrepresentation through visualization tricks. |
| **Illusion of free choice** | 2020 - Learning app [ABCmouse paid $10M to settle an FTC complaint](https://www.washingtonpost.com/business/2020/09/04/abcmouse-10-million-ftc-settlement/) where parents were trapped into paying for subscriptions they couldn't cancel. This illustrates dark patterns in choice architectures, where users were nudged towards potentially harmful choices. |
| **Data Privacy & User Rights** | 2021 - Facebook [Data Breach](https://www.npr.org/2021/04/09/986005820/after-data-breach-exposes-530-million-facebook-says-it-will-not-notify-users) exposed data from 530M users, resulting in a $5B settlement to the FTC. It however refused to notify users of the breach, violating user rights around data transparency and access. |
Want to explore more case studies? Check out these resources:
* [Ethics Unwrapped](https://ethicsunwrapped.utexas.edu/case-studies) - ethics dilemmas across diverse industries.
* [Data Science Ethics course](https://www.coursera.org/learn/data-science-ethics#syllabus) - landmark case studies explored.
* [Where things have gone wrong](https://deon.drivendata.org/examples/) - deon checklist with examples
> 🚨 Think about the case studies you've seen - have you experienced, or been affected by, a similar ethical challenge in your life? Can you think of at least one other case study that illustrates one of the ethical challenges we've discussed in this section?
## Applied Ethics
We've talked about ethics concepts, challenges, and case studies in real-world contexts. But how do we get started _applying_ ethical principles and practices in our projects? And how do we _operationalize_ these practices for better governance? Let's explore some real-world solutions:
### 1. Professional Codes
Professional Codes offer one option for organizations to "incentivize" members to support their ethical principles and mission statement. Codes are _moral guidelines_ for professional behavior, helping employees or members make decisions that align with their organization's principles. They are only as good as the voluntary compliance from members; however, many organizations offer additional rewards and penalties to motivate compliance from members.
Examples include:
* [Data Science Association](http://datascienceassn.org/code-of-conduct.html) Code of Conduct (created 2013)
* [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics) (since 1993)
> 🚨 Do you belong to a professional engineering or data science organization? Explore their site to see if they define a professional code of ethics. What does this say about their ethical principles? How are they "incentivizing" members to follow the code?
### 2. Ethics Checklists
While professional codes define required _ethical behavior_ from practitioners, they [have known limitations](https://resources.oreilly.com/examples/0636920203964/blob/master/of_oaths_and_checklists.md) in enforcement, particularly in large-scale projects. Instead, many data science experts [advocate for checklists](https://resources.oreilly.com/examples/0636920203964/blob/master/of_oaths_and_checklists.md) that can **connect principles to practices** in more deterministic and actionable ways.
Checklists convert questions into "yes/no" tasks that can be operationalized, allowing them to be tracked as part of standard product release workflows.
Examples include:
* [Deon](https://deon.drivendata.org/) - a general-purpose data ethics checklist created from [industry recommendations](https://deon.drivendata.org/#checklist-citations) with a command-line tool for easy integration.
* [Privacy Audit Checklist](https://cyber.harvard.edu/ecommerce/privacyaudit.html) - provides general guidance for information handling practices from legal and social exposure perspectives.
* [AI Fairness Checklist](https://www.microsoft.com/en-us/research/project/ai-fairness-checklist/) - created by AI practitioners to support the adoption and integration of fairness checks into AI development cycles.
* [22 questions for ethics in data and AI](https://medium.com/the-organization/22-questions-for-ethics-in-data-and-ai-efb68fd19429) - a more open-ended framework, structured for initial exploration of ethical issues in design, implementation, and organizational contexts.
Checklists convert questions into "yes/no" tasks that can be tracked and validated before product release. Tools like [deon](https://deon.drivendata.org/) make this frictionless, creating default checklists aligned to [industry recommendations](https://deon.drivendata.org/#checklist-citations) and enabling users to customize and integrate them into workflows using a command-line tool. Deon also provides [real-world examples](ttps://deon.drivendata.org/examples/) of ethical challenges to provide context for these decisions.
### 3. Ethics Regulations
Ethics is about defining shared values and doing the right thing _voluntarily_. **Compliance** is about _following the law_ if and where defined. **Governance** broadly covers all the ways in which organizations operate to enforce ethical principles and comply with established laws.
Today, governance takes two forms within organizations. First, it's about defining **ethical AI** principles and establishing practices to operationalize adoption across all AI-related projects in the organization. Second, it's about complying with all government-mandated **data protection regulations** for regions it operates in.
Companies are creating their own ethics frameworks (e.g., [Microsoft](https://www.microsoft.com/en-us/ai/responsible-ai), [IBM](https://www.ibm.com/cloud/learn/ai-ethics), [Google](https://ai.google/principles), [Facebook](https://ai.facebook.com/blog/facebooks-five-pillars-of-responsible-ai/), [Accenture](https://www.accenture.com/_acnmedia/PDF-149/Accenture-Responsible-AI-Final.pdf#zoom=50)) for governance, while state and national governments tend to focus on regulations that protect the data privacy and rights of their citizens.
Here are some landmark data privacy regulations to know:
* `1974`, [US Privacy Act](https://www.justice.gov/opcl/privacy-act-1974) - regulates _federal govt._ collection, use, and disclosure of personal information.
* `1996`, [US Health Insurance Portability & Accountability Act (HIPAA)](https://www.cdc.gov/phlp/publications/topic/hipaa.html) - protects personal health data.
* `1998`, [US Children's Online Privacy Protection Act (COPPA)](https://www.ftc.gov/enforcement/rules/rulemaking-regulatory-reform-proceedings/childrens-online-privacy-protection-rule) - protects data privacy of children under 13.
* `2018`, [General Data Protection Regulation (GDPR)](https://gdpr-info.eu/) - provides user rights, data protection, and privacy.
* `2018`, [California Consumer Privacy Act (CCPA)](https://www.oag.ca.gov/privacy/ccpa) - gives consumers more _rights_ over their personal data.
* `2021`, China's [Personal Information Protection Law](https://www.reuters.com/world/china/china-passes-new-personal-data-privacy-law-take-effect-nov-1-2021-08-20/) - passed in Aug 2021 (effective Nov 1) and, together with the Data Security Law, creates one of the strongest online data privacy regulations in the world.
> 🚨 The European Union's GDPR (General Data Protection Regulation) remains one of the most influential data privacy regulations today. Did you know it also defines [8 user rights](https://www.freeprivacypolicy.com/blog/8-user-rights-gdpr) to protect citizens' digital privacy and personal data? Learn about what these are, and why they matter.
### 3.5 Establish Ethics Culture
Note that there remains an intangible gap between _compliance_ (doing enough to meet "the letter of the law") and addressing [systemic issues](https://www.coursera.org/learn/data-science-ethics/home/week/4) (like ossification, information asymmetry, and distributional unfairness) that can speed up the weaponization of AI.

The latter requires [collaborative approaches to defining ethics cultures](https://towardsdatascience.com/why-ai-ethics-requires-a-culture-driven-approach-26f451afa29f) that build emotional connections and consistent shared values _across organizations_ in the industry. This calls for more [formalized data ethics cultures](https://www.codeforamerica.org/news/formalizing-an-ethical-data-culture/) in organizations - allowing _anyone_ to [pull the Andon cord](https://en.wikipedia.org/wiki/Andon_(manufacturing)) to raise ethics concerns early in the process, and making _ethical assessments_ (e.g., in hiring) a core criterion for team formation in AI projects.
---
## Challenge 🚀
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/3) 🎯
## Review & Self Study
Courses and books help with understanding core ethics concepts and challenges, while case studies and tools help with applied ethics practices in real-world contexts. Here are a few resources to start with.
* [Machine Learning For Beginners](https://github.com/microsoft/ML-For-Beginners/blob/main/1-Introduction/3-fairness/README.md) - lesson on Fairness, from Microsoft.
* [Principles of Responsible AI](https://docs.microsoft.com/en-us/learn/modules/responsible-ai-principles/) - free learning path from Microsoft Learn.
* [Ethics and Data Science](https://resources.oreilly.com/examples/0636920203964) - O'Reilly EBook (M. Loukides, H. Mason, et al.)
* [Data Science Ethics](https://www.coursera.org/learn/data-science-ethics#syllabus) - online course from the University of Michigan.
* [Ethics Unwrapped](https://ethicsunwrapped.utexas.edu/case-studies) - case studies from the University of Texas.
---
# Assignment
[Assignment Title](assignment.md)


@ -1,3 +0,0 @@
## Courses
## Articles

@ -1,16 +1,22 @@
# Defining Data
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/03-DefiningData.png)|
|:---:|
|Defining Data - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Data are facts, information, observations and measurements that are used to make discoveries and to support informed decisions. A data point is a single unit of data within a dataset, which is a collection of data points. Datasets may come in different formats and structures, usually based on their source, or where the data came from. For example, a company's monthly earnings might be in a spreadsheet, while hourly heart rate data from a smartwatch may be in [JSON](https://stackoverflow.com/a/383699) format. It's common for data scientists to work with different types of data within a dataset.
This lesson focuses on identifying and classifying data by its characteristics and its sources.
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/4)
![Image of numerical data, also known as quantitative data](mika-baumeister-Wpnoqo2plFA-unsplash.jpg)
> Source: [Mika Baumeister](https://unsplash.com/@mbaumi) via [Unsplash](https://unsplash.com/photos/Wpnoqo2plFA)
## How Data is Described
**Raw data** is data that has come from its source in its initial state and has not been analyzed or organized. In order to make sense of what is happening with a dataset, it needs to be organized into a format that can be understood by humans as well as the technology they may use to analyze it further. The structure of a dataset describes how it's organized and can be classified as structured, unstructured, and semi-structured. These types of structure will vary depending on the source, but will ultimately fit into these three categories.
### Quantitative Data
Quantitative data are numerical observations within a dataset and can typically be analyzed, measured and used mathematically. Some examples of quantitative data are: a country's population, a person's height or a company's quarterly earnings. With some additional analysis, quantitative data could be used to discover seasonal trends of the Air Quality Index (AQI) or estimate the probability of rush hour traffic on a typical work day.
@ -39,18 +45,31 @@ Examples of unstructured data: HTML, CSV files, JavaScript Object Notation (JSO
A data source is the initial location of where the data was generated, or where it "lives", and will vary based on how and when it was collected. Data generated by its user(s) is known as primary data, while secondary data comes from a source that has collected data for general use. For example, a group of scientists collecting observations in a rainforest would be considered primary, and if they decide to share it with other scientists it would be considered secondary to those that use it.
Databases are a common source and rely on a database management system to host and maintain the data, where users use commands called queries to explore the data. Files as data sources can be audio, image, and video files as well as spreadsheets like Excel. Internet sources are a common location for hosting data, where databases as well as files can be found. Application programming interfaces, also known as APIs, allow programmers to create ways to share data with external users through the internet, while the process of web scraping extracts data from a web page. The [lessons in Working with Data](/2-Working-With-Data) focus on how to use various data sources.
## Conclusion
In this lesson we have learned:
- What data is
- How data is described
- How data is classified and categorized
- Where data can be found
## 🚀 Challenge
Kaggle is an excellent source of open datasets. Use the [dataset search tool](https://www.kaggle.com/datasets) to find some datasets of interest and classify 3-5 of them using these criteria:
- Is the data quantitative or qualitative?
- Is the data structured, unstructured, or semi-structured?
## [Post-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/5)
## Review & Self Study
- This Microsoft Learn unit, titled [Classify your Data](https://docs.microsoft.com/en-us/learn/modules/choose-storage-approach-in-azure/2-classify-data) has a detailed breakdown of structured, semi-structured, and unstructured data.
## Assignment

@ -4,40 +4,58 @@
Follow the prompts in this assignment to identify and classify the data with one of each of the following data types:
**Structure Types**: Structured, Semi-Structured, or Unstructured
**Value Types**: Qualitative or Quantitative
**Source Types**: Primary or Secondary
1. A company has been acquired and now has a parent company. The data scientists have received a spreadsheet of customer phone numbers from the parent company.
Structure Type:
Value Type:
Source Type:
---
2. A smart watch has been collecting heart rate data from its wearer, and the raw data is in JSON format.
Structure Type:
Value Type:
Source Type:
---
3. A workplace survey of employee morale that is stored in a CSV file.
Structure Type:
Value Type:
Source Type:
---
4. Astrophysicists are accessing a database of galaxies that has been collected by a space probe. The data contains the number of planets within each galaxy.
Structure Type:
Value Type:
Source Type:
---
5. A personal finance app uses APIs to connect to a user's financial accounts in order to calculate their net worth. They can see all of their transactions in a format of rows and columns that looks similar to a spreadsheet.
Structure Type:
Value Type:
Source Type:
## Rubric


@ -1,18 +1,23 @@
# A Brief Introduction to Statistics and Probability
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/04-Statistics-Probability.png)|
|:---:|
| Statistics and Probability - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Statistics and Probability Theory are two closely related areas of Mathematics that are highly relevant to Data Science. It is possible to operate with data without deep knowledge of mathematics, but it is still better to know at least some basic concepts. Here we will present a short introduction that will help you get started.
[![Intro Video](images/video-prob-and-stats.png)](https://youtu.be/Z5Zy85g4Yjw)
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/6)
## Probability and Random Variables
**Probability** is a number between 0 and 1 that expresses how probable an **event** is. It is defined as the number of favorable outcomes (those that lead to the event), divided by the total number of outcomes, given that all outcomes are equally probable. For example, when we roll a die, the probability that we get an even number is 3/6 = 0.5.
When we talk about events, we use **random variables**. For example, the random variable that represents the number obtained when rolling a die would take values from 1 to 6. The set of numbers from 1 to 6 is called the **sample space**. We can talk about the probability of a random variable taking a certain value, for example P(X=3)=1/6.
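As a quick sanity check on this definition, here is a minimal Python sketch that estimates P(even) for a fair die by simulation:

```python
import random

# Estimate the probability of rolling an even number on a fair six-sided die.
trials = 100_000
evens = sum(1 for _ in range(trials) if random.randint(1, 6) % 2 == 0)
print(evens / trials)  # should be close to the theoretical 3/6 = 0.5
```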
The random variable in the previous example is called **discrete**, because it has a countable sample space, i.e. there are separate values that can be enumerated. There are cases when the sample space is a range of real numbers, or the whole set of real numbers. Such variables are called **continuous**. A good example is the time when the bus arrives.
## Probability Distribution
@ -28,7 +33,7 @@ We can only talk about the probability of a variable falling in a given interval
<img src="http://www.sciweavers.org/tex2img.php?eq=P%28t_1%5Cle%20X%3Ct_2%29%3D%5Cint_%7Bt_1%7D%5E%7Bt_2%7Dp%28x%29dx&bc=White&fc=Black&im=jpg&fs=12&ff=arev&edit=0" align="center" border="0" alt="P(t_1\le X<t_2)=\int_{t_1}^{t_2}p(x)dx" width="228" height="51" >
A continuous analog of the uniform distribution is called **continuous uniform**, which is defined on a finite interval. The probability that the value X falls into an interval of length l is proportional to l, and rises up to 1.
Another important distribution is **normal distribution**, which we will talk about in more detail below.
@ -53,9 +58,9 @@ Graphically we can represent relationship between median and quartiles in a diag
<img src="images/boxplot_explanation.png" width="50%"/>
Here we also compute the **inter-quartile range** IQR=Q3-Q1, and so-called **outliers** - values that lie outside the boundaries [Q1-1.5*IQR,Q3+1.5*IQR].
For a finite distribution that contains a small number of possible values, a good "typical" value is the one that appears the most frequently, which is called the **mode**. It is often applied to categorical data, such as colors. Consider a situation when we have two groups of people - some that strongly prefer red, and others who prefer blue. If we code colors by numbers, the mean value for a favorite color would be somewhere in the orange-green spectrum, which does not indicate the actual preference of either group. However, the mode would be either one of the colors, or both colors, if the number of people voting for them is equal (in this case we call the sample **multimodal**).
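Here is a minimal Python sketch (on made-up values) that computes the quartiles, inter-quartile range, outliers, and mode(s) described above:

```python
import numpy as np
from statistics import multimode

sample = [41, 45, 46, 48, 49, 50, 52, 54, 97]  # made-up values; 97 looks suspicious

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # inter-quartile range
outliers = [x for x in sample if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(np.mean(sample), median)  # mean and median
print(iqr, outliers)            # 97 lies outside [Q1-1.5*IQR, Q3+1.5*IQR]
print(multimode(["red", "blue", "red", "blue"]))  # two modes -> a multimodal sample
```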
## Real-world Data
When we analyze data from real life, they are often not random variables as such, in the sense that we do not perform experiments with an unknown result. For example, consider a team of baseball players, and their body data, such as height, weight and age. Those numbers are not exactly random, but we can still apply the same mathematical concepts. For example, a sequence of people's weights can be considered to be a sequence of values drawn from some random variable. Below is the sequence of weights of actual baseball players from [Major League Baseball](http://mlb.mlb.com/index.jsp), taken from [this dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights) (for your convenience, only the first 20 values are shown):
@ -64,7 +69,7 @@ When we analyze data from real life, they often are not random variables as such
[180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0, 188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]
```
> **Note**: To see an example of working with this dataset, have a look at the [accompanying notebook](notebook.ipynb). There are also a number of challenges throughout this lesson, and you may complete them by adding some code to that notebook. If you are not sure how to operate on data, do not worry - we will come back to working with data using Python at a later time. If you do not know how to run code in Jupyter Notebook, have a look at [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
Here is the box plot showing mean, median and quartiles for our data:
@ -78,16 +83,16 @@ This diagram suggests that, on average, height of first basemen is higher that h
> When working with real-world data, we assume that all data points are samples drawn from some probability distribution. This assumption allows us to apply machine learning techniques and build working predictive models.
To see what the distribution of our data is, we can plot a graph called a **histogram**. The X-axis will contain a number of different weight intervals (so-called **bins**), and the vertical axis will show the number of times our random variable sample was inside a given interval.
![Histogram of real world data](images/weight-histogram.png)
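Here is a minimal sketch of producing such a plot with Matplotlib, using the 20 weights shown above (the full dataset gives the smoother shape in the figure):

```python
import matplotlib.pyplot as plt

weights = [180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0,
           188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]

plt.hist(weights, bins=10)  # bins are the weight intervals on the X-axis
plt.xlabel("Weight")
plt.ylabel("Count")
plt.show()
```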
From this histogram you can see that all values are centered around a certain mean weight, and the further we go from that weight - the fewer weights of that value are encountered. I.e., it is very improbable that the weight of a baseball player would be very different from the mean weight. The variance of weights shows the extent to which weights are likely to differ from the mean.
> If we take weights of other people, not from the baseball league, the distribution is likely to be different. However, the shape of the distribution will be the same, but mean and variance would change. So, if we train our model on baseball players, it is likely to give wrong results when applied to students of a university, because the underlying distribution is different.
## Normal Distribution
The distribution of weights that we have seen above is very typical, and many measurements from the real world follow the same type of distribution, but with different mean and variance. This distribution is called the **normal distribution**, and it plays a very important role in statistics.
Using the normal distribution is a correct way to generate random weights of potential baseball players. Once we know the mean weight `mean` and standard deviation `std`, we can generate 1000 weight samples in the following way:
```python
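# One possible way (assuming numpy is imported as np):
samples = np.random.normal(mean, std, 1000)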
@ -112,7 +117,7 @@ Suppose we have a sample X<sub>1</sub>, ..., X<sub>n</sub> from our distribution
It goes beyond our short intro to discuss in detail how those confidence intervals are calculated. Some more details can be found [on Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval). In short, we define the distribution of the computed sample mean relative to the true mean of the population, which is called the **student distribution**.
> **Interesting fact**: The Student distribution is named after mathematician William Sealy Gosset, who published his paper under the pseudonym "Student". He worked in the Guinness brewery, and, according to one version, his employer did not want the general public to know that they were using statistical tests to determine the quality of raw materials.
If we want to estimate the mean &mu; of our population with confidence p, we need to take the *(1-p)/2-th percentile* of a Student distribution A, which can either be taken from tables, or computed using some built-in functions of statistical software (e.g. Python, R, etc.). Then the interval for &mu; would be given by X&pm;A*D/&radic;n, where X is the obtained mean of the sample, and D is the standard deviation.
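A minimal sketch of this computation with SciPy, on a made-up list of observed values:

```python
import numpy as np
from scipy import stats

def confidence_interval(samples, p=0.95):
    n = len(samples)
    x, d = np.mean(samples), np.std(samples, ddof=1)  # sample mean and std deviation
    a = stats.t.ppf((1 + p) / 2, df=n - 1)            # Student distribution percentile
    return x - a * d / np.sqrt(n), x + a * d / np.sqrt(n)

print(confidence_interval([180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0]))
```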
@ -144,7 +149,7 @@ In our baseball players dataset, there are different player roles, that can be s
| Starting_Pitcher | 74.719457 | 205.163636 | 221 |
| Third_Baseman | 73.044444 | 200.955556 | 45 |
We can notice that the mean height of first basemen is higher than that of second basemen. Thus, we may be tempted to conclude that **first basemen are taller than second basemen**.
> This statement is called **a hypothesis**, because we do not know whether the fact is actually true or not.
@ -164,7 +169,7 @@ More formally, the problem we are solving is to see if **two probability distrib
In the Student t-test, we compute the so-called **t-value**, which indicates the difference between means, taking into account the variance. It is demonstrated that the t-value follows the **student distribution**, which allows us to get the threshold value for a given confidence level **p** (this can be computed, or looked up in numerical tables). We then compare the t-value to this threshold to accept or reject the hypothesis.
In Python, we can use the **SciPy** package, which includes the `ttest_ind` function (in addition to many other useful statistical functions!). It computes the t-value for us, and also does the reverse lookup of the confidence p-value, so that we can just look at the confidence to draw the conclusion.
For example, our comparison between heights of first and second basemen gives us the following results:
```python
@ -179,20 +184,18 @@ P-value: 9.137321189738925e-12
```
In our case, the p-value is very low, meaning that there is strong evidence supporting that first basemen are taller.
There are also other types of hypotheses that we might want to test, for example:
* To prove that a given sample follows some distribution. In our case we have assumed that heights are normally distributed, but that needs formal statistical verification.
* To prove that a mean value of a sample corresponds to some predefined value
* To compare means of a number of samples (e.g. what is the difference in happiness levels among different age groups)
## Law of Large Numbers and Central Limit Theorem
One of the reasons why the normal distribution is so important is the so-called **central limit theorem**. Suppose we have a large sample of N independent values X<sub>1</sub>, ..., X<sub>N</sub>, sampled from any distribution with mean &mu; and variance &sigma;<sup>2</sup>. Then, for sufficiently large N (in other words, when N&rarr;&infin;), the sample mean (&Sigma;<sub>i</sub>X<sub>i</sub>)/N would be normally distributed, with mean &mu; and variance &sigma;<sup>2</sup>/N.
> Another way to interpret the central limit theorem is to say that regardless of the distribution, when you compute the mean of a sum of any random variable values you end up with a normal distribution.
From the central limit theorem it also follows that, when N&rarr;&infin;, the probability of the sample mean being equal to &mu; becomes 1. This is known as **the law of large numbers**.
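Here is a minimal NumPy sketch illustrating both statements: means of samples drawn from a decidedly non-normal (uniform) distribution still concentrate around &mu; with variance &sigma;<sup>2</sup>/N:

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 experiments, each averaging N=50 values from a uniform distribution on [0,1].
means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

print(means.mean())  # close to the true mean mu = 0.5
print(means.std())   # close to sigma/sqrt(N) = sqrt(1/12)/sqrt(50), about 0.041
```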
## Covariance and Correlation
@ -200,7 +203,7 @@ One of the things Data Science does is finding relations between data. We say th
> Correlation does not necessarily indicate a causal relationship between two sequences; sometimes both variables can depend on some external cause, or the two sequences can correlate purely by chance. However, strong mathematical correlation is a good indication that two variables are somehow connected.
Mathematically, the main concept that shows the relation between two random variables is **covariance**, which is computed like this: Cov(X,Y) = **E**\[(X-**E**(X))(Y-**E**(Y))\]. We compute the deviation of both variables from their mean values, and then the product of those deviations. If both variables deviate together, the product will always be a positive value that adds up to positive covariance. If both variables deviate out-of-sync (i.e. one falls below average when the other rises above average), we will always get negative numbers that add up to negative covariance. If the deviations are independent, they will add up to roughly zero.
The absolute value of covariance does not tell us much about how strong the correlation is, because it depends on the magnitude of the actual values. To normalize it, we can divide covariance by the standard deviations of both variables, to get **correlation**. The good thing is that correlation is always in the range of [-1,1], where 1 indicates strong positive correlation between values, -1 - strong negative correlation, and 0 - no correlation at all (variables are independent).
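A minimal NumPy sketch of both computations, on two made-up sequences where one roughly tracks the other:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x

cov = np.mean((x - x.mean()) * (y - y.mean()))  # covariance, as defined above
corr = cov / (x.std() * y.std())                # normalize to get correlation
print(cov, corr)                                # correlation is close to 1
print(np.corrcoef(x, y)[0, 1])                  # the same, via NumPy's built-in
```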
@ -234,9 +237,14 @@ In this section, we have learnt:
While this is definitely not an exhaustive list of topics that exist within probability and statistics, it should be enough to give you a good start in this course.
## 🚀 Challenge

Use the sample code in the notebook to test these other hypotheses:
1. First basemen are older than second basemen
2. First basemen are taller than third basemen
3. Shortstops are taller than second basemen
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/7)
## Review & Self Study
@ -250,3 +258,6 @@ Probability and statistics is such a broad topic that it deserves its own course
[Small Diabetes Study](assignment.md)
## Credits
This lesson has been authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)


@ -1,7 +1,9 @@
# Introduction to Data Science
![data in action](images/data.jpg)
>Photo by <a href="https://unsplash.com/@dawson2406?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Stephen Dawson</a> on <a href="https://unsplash.com/s/photos/data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In these lessons, you will discover how Data Science is defined, and learn about the ethical considerations a data scientist must keep in mind. You will also learn how data is defined, and learn a bit about statistics and probability, the core academic domains of Data Science.
### Topics
1. [Defining Data Science](01-defining-data-science/README.md)
@ -11,4 +13,4 @@
### Credits
These lessons were written with ❤️ by [Nitya Narasimhan](https://twitter.com/nitya) and [Dmitry Soshnikov](https://twitter.com/shwars).


@ -1,8 +1,12 @@
# Working with Data: Relational Databases
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/05-RelationalData.png)|
|:---:|
| Working With Data: Relational Databases - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Chances are you have used a spreadsheet in the past to store information. You had a set of rows and columns, where the rows contained the information (or data), and the columns described the information (sometimes called metadata). A relational database is built upon this core principle of columns and rows in tables, allowing you to have information spread across multiple tables. This allows you to work with more complex data, avoid duplication, and have flexibility in the way you explore the data. Let's explore the concepts of a relational database.
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/8)
## It all starts with tables
@ -10,7 +14,7 @@ A relational database has at its core tables. Just as with the spreadsheet, a ta
Let's begin our exploration by starting a table to store information about cities. We might start with their name and country. You could store this in a table as follows:
| City | Country |
| -------- | ------------- |
| Tokyo | Japan |
| Atlanta | United States |
@ -22,7 +26,7 @@ Notice the column names of **city**, **country** and **population** to describe
Chances are, the table above seems relatively familiar to you. Let's start to add some additional data to our burgeoning database - annual rainfall (in millimeters). We'll focus on the years 2018, 2019 and 2020. If we were to add it for Tokyo, it might look something like this:
| City | Country | Year | Amount |
| ----- | ------- | ---- | ------ |
| Tokyo | Japan | 2020 | 1690 |
| Tokyo | Japan | 2019 | 1874 |
@ -32,7 +36,7 @@ What do you notice about our table? You might notice we're duplicating the name
OK, let's try something else. Let's add new columns for each year:
| City | Country | 2018 | 2019 | 2020 |
| -------- | ------------- | ---- | ---- | ---- |
| Tokyo | Japan | 1445 | 1874 | 1690 |
| Atlanta | United States | 1779 | 1111 | 1683 |
@ -46,7 +50,7 @@ This is why we need multiple tables and relationships. By breaking apart our dat
Let's return to our data and determine how we want to split things up. We know we want to store the name and country for our cities, so this will probably work best in one table.
| City | Country |
| -------- | ------------- |
| Tokyo | Japan |
| Atlanta | United States |
@ -58,19 +62,19 @@ But before we create the next table, we need to figure out how to reference each
### cities
| city_id | City | Country |
| ------- | -------- | ------------- |
| 1 | Tokyo | Japan |
| 2 | Atlanta | United States |
| 3 | Auckland | New Zealand |
> [!NOTE] You will notice we use the terms "id" and "primary key" interchangeably during this lesson. The concepts here apply to DataFrames, which you will explore later. DataFrames don't use the terminology of "primary key", however you will notice they behave much in the same way.
With our cities table created, let's store the rainfall. Rather than duplicating the full information about the city, we can use the id. We should also ensure the newly created table has an *id* column as well, as all tables should have an id or primary key.
### rainfall
| rainfall_id | city_id | Year | Amount |
| ----------- | ------- | ---- | ------ |
| 1 | 1 | 2018 | 1445 |
| 2 | 1 | 2019 | 1874 |
@ -104,7 +108,7 @@ FROM cities;
`SELECT` is where you list the columns, and `FROM` is where you list the tables.
> [!NOTE] SQL syntax is case-insensitive, meaning `select` and `SELECT` mean the same thing. However, depending on the type of database you are using, the columns and tables might be case sensitive. As a result, it's a best practice to always treat everything in programming like it's case sensitive. When writing SQL queries, common convention is to put the keywords in all upper-case letters.
The query above will display all cities. Let's imagine we only wanted to display cities in New Zealand. We need some form of a filter. The SQL keyword for this is `WHERE`, or "where something is true".
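If you want to experiment without installing a database server, here is a minimal sketch using Python's built-in `sqlite3` module to build the **cities** table above and filter it with `WHERE`:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a throwaway in-memory database
con.execute("CREATE TABLE cities (city_id INTEGER PRIMARY KEY, city TEXT, country TEXT)")
con.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [(1, "Tokyo", "Japan"), (2, "Atlanta", "United States"), (3, "Auckland", "New Zealand")],
)

# WHERE keeps only the rows for which the condition is true.
for row in con.execute("SELECT city, country FROM cities WHERE country = 'New Zealand'"):
    print(row)  # ('Auckland', 'New Zealand')
```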
@ -162,7 +166,7 @@ There are numerous relational databases available on the internet. You can explo
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/9)
## Review & Self Study
@ -170,6 +174,7 @@ There are several resources available on [Microsoft Learn](https://docs.microsof
- [Describe concepts of relational data](https://docs.microsoft.com//learn/modules/describe-concepts-of-relational-data?WT.mc_id=academic-40229-cxa)
- [Get Started Querying with Transact-SQL](https://docs.microsoft.com//learn/paths/get-started-querying-with-transact-sql?WT.mc_id=academic-40229-cxa) (Transact-SQL is a version of SQL)
- [SQL content on Microsoft Learn](https://docs.microsoft.com/learn/browse/?products=azure-sql-database%2Csql-server&expanded=azure&WT.mc_id=academic-40229-cxa)
## Assignment

@ -0,0 +1,27 @@
[
{
"firstname": "Christophe",
"age": 32
},
{
"firstname": "Prema",
"age": 20
},
{
"firstname": "Arthur",
"age": 15
},
{
"firstname": "Zoe",
"age": 7
},
{
"firstname": "Keisha",
"age": 84
},
{
"firstname": "Jackie",
"age": 45
}
]

@ -1,12 +1,18 @@
# Working with Data: Non-Relational Data
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/06-NoSQL.png)|
|:---:|
|Working with NoSQL Data - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/10)
Data is not limited to relational databases. This lesson focuses on non-relational data and will cover the basics of spreadsheets and NoSQL.
## Spreadsheets
Spreadsheets are a popular way to store and explore data because they require less work to set up and get started. In this lesson you'll learn the basic components of a spreadsheet, as well as formulas and functions. The examples will be illustrated with Microsoft Excel, but most of the parts and topics will have similar names and steps in comparison to other spreadsheet software.
![An empty Microsoft Excel workbook with two worksheets](images/parts-of-spreadsheet.png)
A spreadsheet is a file and will be accessible in the file system of a computer, device, or cloud-based file system. The software itself may be browser-based or an application that must be installed on a computer or downloaded as an app. In Excel these files are also defined as **workbooks** and this terminology will be used for the remainder of this lesson.
@ -16,52 +22,102 @@ With these basic elements of an Excel workbook, we'll use and an example from [M
### Managing an Inventory
The spreadsheet file named "Inventory Example" is a formatted spreadsheet of items within an inventory that contains three worksheets, where the tabs are labeled "Inventory List", "Inventory Pick List" and "Bin Lookup". Row 4 of the Inventory List worksheet is the header, which describes the value of each cell in the header column.
The spreadsheet file named "InventoryExample" is a formatted spreadsheet of items within an inventory that contains three worksheets, where the tabs are labeled "Inventory List", "Inventory Pick List" and "Bin Lookup". Row 4 of the Inventory List worksheet is the header, which describes the value of each cell in the header column.
There are instances where a cell is dependent on the values of other cells to generate its value. The Inventory List spreadsheet keeps track of the cost of every item in its inventory, but what if we need to know the value of everything in the inventory? [**Formulas**](https://support.microsoft.com/en-us/office/overview-of-formulas-34519a4e-1e8d-4f4b-84d4-d642c4f63263) perform actions on cell data and are used to calculate the cost of the inventory in this example. This spreadsheet used a formula in the Inventory Value column to calculate the value of each item by multiplying the quantity under the QTY header by its cost under the COST header. Double clicking or highlighting a cell will show the formula. You'll notice that formulas start with an equals sign, followed by the calculation or operation.

![A highlighted formula from an example inventory list in Microsoft Excel](images/formula-excel.png)
We can use another formula to add all the values of Inventory Value together to get its total value. This could be calculated by adding each cell to generate the sum, but that can be a tedious task. Excel has [**functions**](https://support.microsoft.com/en-us/office/sum-function-043e1c7d-7726-4e80-8f32-07b23e057f89), or predefined formulas, to perform calculations on cell values. Functions require arguments, which are the required values used to perform these calculations. When functions require more than one argument, they will need to be listed in a particular order or the function may not calculate the correct value. This example uses the SUM function, with the values of Inventory Value as the argument, to generate the total listed under row 3, column B (also referred to as B3).

![A highlighted function from an example inventory list in Microsoft Excel](images/function-excel.png)
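The same formula-and-function pattern translates directly to code. Here is a minimal pandas sketch (with made-up inventory rows) mirroring the per-row QTY times COST formula and the SUM function:

```python
import pandas as pd

inventory = pd.DataFrame({"QTY": [10, 4, 25], "COST": [2.50, 12.00, 0.99]})

# Like the per-row formula: Inventory Value = QTY * COST
inventory["Inventory Value"] = inventory["QTY"] * inventory["COST"]

# Like the SUM function over the Inventory Value column
print(inventory["Inventory Value"].sum())
```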
## NoSQL
NoSQL is an umbrella term for the different ways to store non-relational data and can be interpreted as "non-SQL", "non-relational" or "not only SQL". These types of database systems can be categorized into 4 types.
![Graphical representation of a key-value data store showing 4 unique numerical keys that are associated with 4 various values](images/kv-db.png)
> Source from [Michał Białecki Blog](https://www.michalbialecki.com/2018/03/18/azure-cosmos-db-key-value-database-cloud/)
[Key-value](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#keyvalue-data-stores) databases pair unique keys (unique identifiers) with associated values. These pairs are stored using a [hash table](https://www.hackerearth.com/practice/data-structures/hash-tables/basics-of-hash-tables/tutorial/) with an appropriate hashing function.
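Python's built-in `dict` is itself a hash-table-backed key-value store, so a minimal sketch of the idea (with made-up keys and values) looks like this:

```python
# Keys are unique identifiers; values can be any object.
store = {
    "user:1001": {"name": "Prema", "age": 20},
    "user:1002": {"name": "Arthur", "age": 15},
}

print(store["user:1001"])              # constant-time lookup by key via the hash table
store["user:1003"] = {"name": "Zoe"}   # insert a new key-value pair
```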
![Graphical representation of a graph data store showing the relationships between people, their interests and locations](images/graph-db.png)
> Source from [Microsoft](https://docs.microsoft.com/en-us/azure/cosmos-db/graph/graph-introduction#graph-database-by-example)
[Graph](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#graph-data-stores) databases describe relationships in data and are represented as a collection of nodes and edges. A node represents an entity, something that exists in the real world such as a student or bank statement. Edges represent the relationship between two entities. Each node and edge has properties that provide additional information.
![Graphical representation of a columnar data store showing a customer database with two column families named Identity and Contact Info](images/columnar-db.png)
[Columnar](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#columnar-data-stores) data stores organize data into columns and rows like a relational data structure, but each column is divided into groups called column families, where all the data under one column is related and can be retrieved and changed in one unit.
### Document Data Stores with the Azure Cosmos DB
[Document](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#document-data-stores) data stores build on the concept of a key-value data store and are made up of a series of fields and objects. This section will explore document databases with the Cosmos DB emulator.
A Cosmos DB database fits the definition of "Not Only SQL", where Cosmos DB's document database relies on SQL to query the data. The [previous lesson](../05-relational-databases) on SQL covers the basics of the language, and we'll be able to apply some of the same queries to a document database here. We'll be using the Cosmos DB Emulator, which allows us to create and explore a document database locally on a computer. Read more about the Emulator [here](https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator?tabs=ssl-netstd21).
A document is a collection of fields and object values, where the fields describe what the object value represents. Below is an example of a document.
```json
{
"firstname": "Eva",
"age": 44,
"id": "8c74a315-aebf-4a16-bb38-2430a9896ce5",
"_rid": "bHwDAPQz8s0BAAAAAAAAAA==",
"_self": "dbs/bHwDAA==/colls/bHwDAPQz8s0=/docs/bHwDAPQz8s0BAAAAAAAAAA==/",
"_etag": "\"00000000-0000-0000-9f95-010a691e01d7\"",
"_attachments": "attachments/",
"_ts": 1630544034
}
```
The fields of interest in this document are: `firstname`, `id`, and `age`. The rest of the fields with the underscores were generated by Cosmos DB.
#### Exploring Data with the Cosmos DB Emulator
You can download and install the emulator [for Windows here](https://aka.ms/cosmosdb-emulator). Refer to this [documentation](https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator?tabs=ssl-netstd21#run-on-linux-macos) for options on how to run the Emulator for macOS and Linux.
The Emulator launches a browser window, where the Explorer view allows you to explore documents.
![The Explorer view of the Cosmos DB Emulator](images/cosmosdb-emulator-explorer.png)
If you're following along, click on "Start with Sample" to generate a sample database called SampleDB. If you expand SampleDB by clicking on the arrow, you'll find a container called `Persons`. A container holds a collection of items, which are the documents within the container. You can explore the four individual documents under `Items`.
![Exploring sample data in the Cosmos DB Emulator](images/cosmosdb-emulator-persons.png)
#### Querying Document Data with the Cosmos DB Emulator
We can also query the sample data by clicking on the new SQL Query button (second button from the left).
`SELECT * FROM c` returns all the documents in the container. Let's add a where clause and find everyone younger than 40.
`SELECT * FROM c where c.age < 40`
![Running a SELECT query on sample data in the Cosmos DB Emulator to find documents that have an age field value that is less than 40](images/cosmosdb-emulator-persons-query.png)
The query returns two documents; notice the age value for each document is less than 40.
#### JSON and Documents
If you're familiar with JavaScript Object Notation (JSON) you'll notice that documents look similar to JSON. There is a `PersonsData.json` file in this directory with more data that you may upload to the Persons container in the Emulator via the `Upload Item` button.
In most instances, APIs that return JSON data can be directly transferred and stored in document databases. Below is another document; it represents a tweet from the Microsoft Twitter account that was retrieved using the Twitter API, then inserted into Cosmos DB.
```json
{
"created_at": "2021-08-31T19:03:01.000Z",
"id": "1432780985872142341",
"text": "Blank slate. Like this tweet if youve ever painted in Microsoft Paint before. https://t.co/cFeEs8eOPK",
"_rid": "dhAmAIUsA4oHAAAAAAAAAA==",
"_self": "dbs/dhAmAA==/colls/dhAmAIUsA4o=/docs/dhAmAIUsA4oHAAAAAAAAAA==/",
"_etag": "\"00000000-0000-0000-9f84-a0958ad901d7\"",
"_attachments": "attachments/",
"_ts": 1630537000
}
```
The fields of interest in this document are: `created_at`, `id`, and `text`.
@ -69,13 +125,29 @@ https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-rela
## 🚀 Challenge
There is a `TwitterData.json` file that you can upload to the SampleDB database. It's recommended that you add it to a separate container. This can be done by:
1. Clicking the new container button in the top right
1. Selecting the existing database (SampleDB) and creating a container id for the container
1. Setting the partition key to `/id`
1. Clicking OK (you can ignore the rest of the information in this view, as this is a small dataset running locally on your machine)
1. Opening your new container and uploading the Twitter Data file with the `Upload Item` button
Try to run a few select queries to find the documents that have Microsoft in the text field. Hint: try to use the [LIKE keyword](https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-keywords#using-like-with-the--wildcard-character)
## [Post-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/11)
## Review & Self Study
- There are some additional formatting and features added to this spreadsheet that this lesson does not cover. Microsoft has a [large library of documentation and videos](https://support.microsoft.com/excel) on Excel if you're interested in learning more.
- This architectural documentation details the characteristics in the different types of non-relational data: [Non-relational Data and NoSQL](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data)
- Cosmos DB is a cloud based non-relational database that can also store the different NoSQL types mentioned in this lesson. Learn more about these types in this [Cosmos DB Microsoft Learn Module](https://docs.microsoft.com/en-us/learn/paths/work-with-nosql-data-in-azure-cosmos-db/)
## Assignment
[Soda Profits](assignment.md)

@ -0,0 +1,53 @@
[
{
"created_at": "2021-09-01T20:43:00.000Z",
"id": "1433168535371464704",
"text": "Word? Word."
},
{
"created_at": "2021-09-01T19:30:00.000Z",
"id": "1433150165251149826",
"text": "The pandemic left thousands at home with no dependable way to shop for food and other necessities. \n\nHere's how Humana's routine customer check-ins became a call to action to feed those in need, using Microsoft solutions. https://t.co/4CGLwhUZdi 🧡"
},
{
"created_at": "2021-09-01T18:03:00.000Z",
"id": "1433128271248609287",
"text": "You: *joins meeting 2 minutes late*\n\nMicrosoft Teams: “There are 13 people here already, so we muted your mic. Youre welcome.”"
},
{
"created_at": "2021-09-01T15:00:18.000Z",
"id": "1433082292663209993",
"text": "Youre invited.\n\nLearn more about the #MicrosoftEvent: https://t.co/tpK3TB8Xxb"
},
{
"created_at": "2021-08-31T22:00:00.000Z",
"id": "1432825525626675206",
"text": "Latino-led businesses have increased by 34% in the past decade. \n\nGet to know four organizations that used technology to drive change in their community: https://t.co/sEydYaY80X 🌎"
},
{
"created_at": "2021-08-31T19:03:01.000Z",
"id": "1432780985872142341",
"text": "Blank slate. Like this tweet if youve ever painted in Microsoft Paint before. https://t.co/cFeEs8eOPK"
},
{
"created_at": "2021-08-31T16:00:00.000Z",
"id": "1432734928593096709",
"text": "Public health officials used technology to work, help, and heal during the pandemic with greater accuracy and speed.\n \nCheck out the Microsoft tools used to meet today's challenges: https://t.co/ptz7BOLxpH"
},
{
"created_at": "2021-08-31T13:15:32.000Z",
"id": "1432693540665106438",
"text": "RT @Windows: As perfect as 11.11 *would* be, we just couldn't wait any longer to make #Windows11 available. Get it October 5th, and read al…"
},
{
"created_at": "2021-08-30T20:15:01.000Z",
"id": "1432436715797553154",
"text": "Live your best hybrid work life by showing empathy, practicing active listening, and staying flexible.\n\nFind more tips from six career experts: https://t.co/BfJykUoaGH"
},
{
"created_at": "2021-08-30T17:15:00.000Z",
"id": "1432391416823570437",
"text": "Are you performing under pressure? Microsoft Teams has new built-in tools that promote productivity and balance in learning. \n\nHere are the top five: https://t.co/yn8OTPgfd2"
}
]

@ -1,7 +1,18 @@
# Soda Profits
## Instructions
The [Coca Cola Co spreadsheet](CocaColaCo.xlsx) is missing some calculations. Your task is to:
1. Calculate the gross profits of FY '15, '16, '17, and '18
- Gross Profit = Net Operating revenues - Cost of goods sold
1. Calculate the average of all the gross profits. Try to do this with a function.
- Average = Sum of gross profits divided by the number of fiscal years (10)
- Documentation on the [AVERAGE function](https://support.microsoft.com/en-us/office/average-function-047bac88-d466-426c-a32b-8f33eb960cf6)
1. This is an Excel file, but it should be editable in any spreadsheet platform.
[Data source credit to Yiyi Wang](https://www.kaggle.com/yiyiwang0826/cocacola-excel)
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | ---

Binary file not shown.

After

Width:  |  Height:  |  Size: 8.3 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 159 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 246 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 168 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 76 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 41 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 27 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 139 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 26 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 119 KiB

@ -1,14 +1,19 @@
# Working with Data: Python and the Pandas Library
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/07-WorkWithPython.png)|
|:---:|
|Working With Python - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
[![Intro Video](images/video-ds-python.png)](https://youtu.be/dZjWOGbsN4Y)
While databases offer very efficient ways to store data and query them using query languages, the most flexible way of processing data is writing your own program to manipulate it. In many cases a database query is the more effective approach; however, when more complex data processing is needed, it cannot easily be expressed in SQL.
Data processing can be programmed in any programming language, but there are certain languages that are higher level with respect to working with data. Data scientists typically prefer one of the following languages:
* **[Python](https://www.python.org/)**, a general-purpose programming language, which is often considered one of the best options for beginners due to its simplicity. Python has a lot of additional libraries that can help you solve many practical problems, such as extracting data from a ZIP archive or converting a picture to grayscale. In addition to data science, Python is also often used for web development.
* **[R](https://www.r-project.org/)** is a traditional toolbox developed with statistical data processing in mind. It also contains a large repository of libraries (CRAN), making it a good choice for data processing. However, R is not a general-purpose programming language and is rarely used outside of the data science domain.
* **[Julia](https://julialang.org/)** is another language developed specifically for data science. It is intended to give better performance than Python, making it a great tool for scientific experimentation.
In this lesson, we will focus on using Python for simple data processing. We will assume basic familiarity with the language. If you want a deeper tour of Python, you can refer to one of the following resources:
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse) - GitHub-based quick intro course into Python Programming
* [Take your First Steps with Python](https://docs.microsoft.com/en-us/learn/paths/python-first-steps/?WT.mc_id=acad-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=acad-31812-dmitryso)
@ -19,9 +24,9 @@ We will focus on a few examples of data processing, instead of giving you full o
> **Most useful advice**. When you need to perform a certain operation on data and you do not know how to do it, try searching for it on the internet. [Stackoverflow](https://stackoverflow.com/) usually contains a lot of useful Python code samples for many typical tasks.
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/12)
## Tabular Data and Dataframes
@ -33,7 +38,7 @@ There are two most useful libraries in Python that can help you deal with tabula
There are also a couple of other libraries you should know about:
* **[Matplotlib](https://matplotlib.org/)** is a library used for data visualization and plotting graphs
* **[SciPy](https://www.scipy.org/)** is a library with some additional scientific functions. We have already come across this library when talking about probability and statistics
Here is a piece of code that you would typically use to import those libraries at the beginning of your Python program:
```python
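# Typical imports for NumPy, Pandas and Matplotlib, as discussed above
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```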
@ -54,7 +59,7 @@ Pandas is centered around a few basic concepts.
Consider an example: we want to analyze sales of our ice-cream spot. Let's generate a series of sales numbers (number of items sold each day) for some time period:
```python
start_date = "Jan 1, 2020"
end_date = "Mar 31, 2020"
idx = pd.date_range(start_date,end_date)
print(f"Length of index is {len(idx)}")
@ -114,22 +119,164 @@ This will give us a table like this:
| 6 | 7 | Pandas |
| 7 | 8 | very |
| 8 | 9 | much |
**Note** that we can also get this table layout by transposing the previous table, e.g. by writing
```python
df = pd.DataFrame([a,b]).T.rename(columns={ 0 : 'A', 1 : 'B' })
```
Here `.T` means the operation of transposing the DataFrame, i.e. swapping rows and columns, and the `rename` operation allows us to rename the columns to match the previous example.
Here are a few most important operations we can perform on DataFrames:
**Column selection**. We can select individual columns by writing `df['A']` - this operation returns a Series. We can also select a subset of columns into another DataFrame by writing `df[['B','A']]` - this returns another DataFrame.
**Filtering** only certain rows by criteria. For example, to leave only rows with column `A` greater than 5, we can write `df[df['A']>5]`.
> **Note**: The way filtering works is the following. The expression `df['A']>5` returns a boolean series, which indicates whether the expression is `True` or `False` for each element of the original series `df['A']`. When a boolean series is used as an index, it returns a subset of rows in the DataFrame. Thus it is not possible to use an arbitrary Python boolean expression; for example, writing `df[df['A']>5 and df['A']<7]` would be wrong. Instead, you should use the special `&` operation on boolean series, writing `df[(df['A']>5) & (df['A']<7)]` (*brackets are important here*).
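Here is a small self-contained sketch that puts the selection and filtering rules above together (the columns `A` and `B` mirror the example DataFrame used in this lesson):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': ['I', 'like', 'to', 'use', 'Python',
                         'and', 'Pandas', 'very', 'much']})

print(df['A'])                            # single column -> Series
print(df[['B', 'A']])                     # list of columns -> DataFrame
print(df[df['A'] > 5])                    # boolean-series filtering
print(df[(df['A'] > 5) & (df['A'] < 7)])  # combine conditions with & and brackets
```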
**Creating new computable columns**. We can easily create new computable columns for our DataFrame by using intuitive expression like this:
```python
df['DivA'] = df['A']-df['A'].mean()
```
This example calculates the divergence of A from its mean value. What actually happens here is we are computing a series, and then assigning this series to the left-hand side, creating another column. Thus, we cannot use any operations that are not compatible with series; for example, the code below is wrong:
```python
# Wrong code -> df['ADescr'] = "Low" if df['A'] < 5 else "Hi"
df['LenB'] = len(df['B']) # <- Wrong result
```
The latter example, while syntactically correct, gives us the wrong result, because it assigns the length of the series `B` to all values in the column, and not the length of individual elements as we intended.
If we need to compute complex expressions like this, we can use the `apply` function. The last example can be written as follows:
```python
df['LenB'] = df['B'].apply(lambda x : len(x))
# or
df['LenB'] = df['B'].apply(len)
```
After operations above, we will end up with the following DataFrame:
| | A | B | DivA | LenB |
|---|---|---|---|---|
| 0 | 1 | I | -4.0 | 1 |
| 1 | 2 | like | -3.0 | 4 |
| 2 | 3 | to | -2.0 | 2 |
| 3 | 4 | use | -1.0 | 3 |
| 4 | 5 | Python | 0.0 | 6 |
| 5 | 6 | and | 1.0 | 3 |
| 6 | 7 | Pandas | 2.0 | 6 |
| 7 | 8 | very | 3.0 | 4 |
| 8 | 9 | much | 4.0 | 4 |
**Selecting rows based on numbers** can be done using the `iloc` construct. For example, to select the first 5 rows from the DataFrame:
```python
df.iloc[:5]
```
**Grouping** is often used to get a result similar to *pivot tables* in Excel. Suppose that we want to compute the mean value of column `A` for each given value of `LenB`. Then we can group our DataFrame by `LenB` and call `mean`:
```python
df.groupby(by='LenB').mean()
```
If we need to compute the mean and the number of elements in each group, then we can use the more complex `aggregate` function:
```python
df.groupby(by='LenB') \
.aggregate({ 'DivA' : len, 'A' : lambda x: x.mean() }) \
.rename(columns={ 'DivA' : 'Count', 'A' : 'Mean'})
```
This gives us the following table:
| LenB | Count | Mean |
|------|-------|------|
| 1 | 1 | 1.000000 |
| 2 | 1 | 3.000000 |
| 3 | 2 | 5.000000 |
| 4 | 3 | 6.333333 |
| 6 | 2 | 6.000000 |
### Getting Data
We have seen how easy it is to construct Series and DataFrames from Python objects. However, data usually comes in the form of a text file or an Excel table. Luckily, Pandas offers us a simple way to load data from disk. For example, reading a CSV file is as simple as this:
```python
df = pd.read_csv('file.csv')
```
We will see more examples of loading data, including fetching it from external websites, in the "Challenge" section.
### Printing and Plotting
A data scientist often has to explore the data, so it is important to be able to visualize it. When a DataFrame is big, we often just want to make sure we are doing everything correctly by printing out the first few rows. This can be done by calling `df.head()`. If you are running it from a Jupyter Notebook, it will print out the DataFrame in a nice tabular form.
We have also seen the usage of the `plot` function to visualize some columns. While `plot` is very useful for many tasks and supports many different graph types via the `kind=` parameter, you can always use the raw `matplotlib` library to plot something more complex. We will cover data visualization in detail in separate course lessons.
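As a quick sketch of both tools (the DataFrame here is a stand-in for whatever data you have loaded):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'A': range(1, 10)})
print(df.head())           # inspect the first 5 rows
df['A'].plot(kind='hist')  # quick histogram; other kinds: 'line', 'bar', 'box', ...
plt.show()
```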
This overview covers the most important concepts of Pandas. However, the library is very rich, and there is no limit to what you can do with it! Let's now apply this knowledge to solving a specific problem.
## 🚀 Challenge 1: Analyzing COVID Spread
The first problem we will focus on is modelling the epidemic spread of COVID-19. In order to do that, we will use the data on the number of infected individuals in different countries, provided by the [Center for Systems Science and Engineering](https://systems.jhu.edu/) (CSSE) at [Johns Hopkins University](https://jhu.edu/). The dataset is available in [this GitHub Repository](https://github.com/CSSEGISandData/COVID-19).
Since we want to demonstrate how to deal with data, we invite you to open [`notebook-covidspread.ipynb`](notebook-covidspread.ipynb) and read it from top to bottom. You can also execute cells, and do some challenges that we have left for you at the end.
![COVID Spread](images/covidspread.png)
> If you do not know how to run code in Jupyter Notebook, have a look at [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
## Working with Unstructured Data
While data very often comes in tabular form, in some cases we need to deal with less structured data, for example, text or images. In this case, to apply the data processing techniques we have seen above, we need to somehow **extract** structured data. Here are a few examples (the first is sketched in code after this list):
* Extracting keywords from text, and seeing how often those keywords appear
* Using neural networks to extract information about objects on the picture
* Getting information on emotions of people on video camera feed
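For instance, here is a minimal sketch of the first idea in this list, counting keyword frequencies in raw text, using only the Python standard library (the sample text is made up):

```python
import re
from collections import Counter

text = """Pandas makes data processing in Python simple.
Python and Pandas are widely used in data science."""

# Naive keyword extraction: lowercase the text, split it into words,
# and drop very short words before counting
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(w for w in words if len(w) > 3)
print(counts.most_common(5))
```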
## 🚀 Challenge 2: Analyzing COVID Papers
In this challenge, we will continue with the topic of the COVID pandemic and focus on processing scientific papers on the subject. There is the [CORD-19 Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) with more than 7000 (at the time of writing) papers on COVID, available with metadata and abstracts (and for about half of them, full text is also provided).
A full example of analyzing this dataset using the [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health/?WT.mc_id=acad-31812-dmitryso) cognitive service is described [in this blog post](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/). We will discuss a simplified version of this analysis.
> **NOTE**: We do not provide a copy of the dataset as part of this repository. You may first need to download the [`metadata.csv`](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv) file from [this dataset on Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Registration with Kaggle may be required. You may also download the dataset without registration [from here](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html), but it will include all full texts in addition to metadata file.
Open [`notebook-papers.ipynb`](notebook-papers.ipynb) and read it from top to bottom. You can also execute cells, and do some challenges that we have left for you at the end.
![Covid Medical Treatment](images/covidtreat.png)
## Processing Image Data
Recently, very powerful AI models have been developed that allow us to understand images. There are many tasks that can be solved using pre-trained neural networks, or cloud services. Some examples include:
* **Image Classification**, which can help you categorize the image into one of the pre-defined classes. You can easily train your own image classifiers using services such as [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=acad-31812-dmitryso)
* **Object Detection** to detect different objects in the image. Services such as [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=acad-31812-dmitryso) can detect a number of common objects, and you can train a [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=acad-31812-dmitryso) model to detect some specific objects of interest.
* **Face Detection**, including Age, Gender and Emotion detection. This can be done via [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=acad-31812-dmitryso).
All those cloud services can be called using [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=acad-31812-dmitryso), and thus can be easily incorporated into your data exploration workflow.
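As a minimal sketch of that workflow (assuming the `azure-cognitiveservices-vision-computervision` package; the endpoint, key, and image URL are placeholders):

```python
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
from msrest.authentication import CognitiveServicesCredentials

# Placeholders: substitute your own Computer Vision resource and image
client = ComputerVisionClient(
    "https://<your-resource>.cognitiveservices.azure.com/",
    CognitiveServicesCredentials("<your-key>"))

# Ask the service to tag the contents of an image by URL
analysis = client.analyze_image("https://example.com/photo.jpg",
                                visual_features=[VisualFeatureTypes.tags])
for tag in analysis.tags:
    print(tag.name, tag.confidence)
```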
Here are some examples of exploring data from Image data sources:
* In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/) we explore Instagram photos, trying to understand what makes people give more likes to a photo. We first extract as much information from pictures as possible using [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=acad-31812-dmitryso), and then use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=acad-31812-dmitryso) to build an interpretable model.
* In the [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies) we use [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=acad-31812-dmitryso) to extract emotions of people in photographs from events, in order to try to understand what makes people happy.
## Conclusion
Whether you already have structured or unstructured data, using Python you can perform all the steps related to data processing and understanding. It is probably the most flexible way of data processing, and that is the reason the majority of data scientists use Python as their primary tool. Learning Python in depth is probably a good idea if you are serious about your data science journey!
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/13)
## Review & Self Study
**Books**
* [Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython](https://www.amazon.com/gp/product/1491957662)
**Online Resources**
* Official [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) tutorial
* [Documentation on Pandas Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
**Learning Python**
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse)
* [Take your First Steps with Python](https://docs.microsoft.com/learn/paths/python-first-steps/?WT.mc_id=acad-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=acad-31812-dmitryso)
## Assignment
[Perform more detailed data study for the challenges above](assignment.md)
## Credits
This lesson has been authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)

@ -0,0 +1,23 @@
# Assignment for Data Processing in Python
In this assignment, we will ask you to elaborate on the code we have started developing in our challenges. The assignment consists of two parts:
## COVID-19 Spread Modelling
- [ ] Plot $R_t$ graphs for 5-6 different countries on one plot for comparison, or use several plots side-by-side.
- [ ] See how the number of deaths and recoveries correlates with the number of infected cases.
- [ ] Find out how long a typical disease lasts by visually correlating the infection rate and the death rate and looking for anomalies. You may need to look at different countries to find that out.
- [ ] Calculate the fatality rate and how it changes over time. *You may want to take into account the length of the disease in days to shift one time series before doing calculations.*
## COVID-19 Papers Analysis
- [ ] Build a co-occurrence matrix of different medications, and see which medications often occur together (i.e. are mentioned in one abstract). You can modify the code for building a co-occurrence matrix of medications and diagnoses.
- [ ] Visualize this matrix using a heatmap.
- [ ] As a stretch goal, visualize the co-occurrence of medications using a [chord diagram](https://en.wikipedia.org/wiki/Chord_diagram). [This library](https://pypi.org/project/chord/) may help you draw a chord diagram.
- [ ] As another stretch goal, extract dosages of different medications (such as **400mg** in *take 400mg of chloroquine daily*) using regular expressions, and build a dataframe that shows different dosages for different medications. **Note**: consider numeric values that are in close textual vicinity of the medicine name.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
All tasks are complete, graphically illustrated and explained, including at least one of the two stretch goals | More than 5 tasks are complete, no stretch goals are attempted, or the results are not clear | Fewer than 5 (but more than 3) tasks are complete, and visualizations do not help to demonstrate the point

Binary file not shown.

After

Width:  |  Height:  |  Size: 10 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 97 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 94 KiB

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@ -1,15 +1,19 @@
# Working with Data: Data Preparation
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/08-DataPreparation.png)|
|:---:|
|Data Preparation - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## Pre-Lecture Quiz
[Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/14)
## 🚀 Challenge
## Post-Lecture Quiz
[Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/15)
## Review & Self Study

@ -1,13 +1,16 @@
# Working with Data
![data love](images/data-love.jpg)
> Photo by <a href="https://unsplash.com/@swimstaralex?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Alexander Sinn</a> on <a href="https://unsplash.com/s/photos/data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In these lessons, you will learn some of the ways that data can be managed, manipulated, and used in applications. You will learn about relational and non-relational databases and how data can be stored in them. You'll learn the fundamentals of working with Python to manage data, and you'll discover some of the many ways that you can work with Python to manage and mine data.
### Topics
1. [Relational databases](05-relational-databases/README.md)
2. [Non-relational databases](06-non-relational/README.md)
3. [Working with Python](07-python/README.md)
4. [Preparing data](08-data-preparation/README.md)
### Credits
These lessons were written with ❤️ by [Christopher Harrison](https://twitter.com/geektrainer), [Dmitry Soshnikov](https://twitter.com/shwars) and [Jasmine Greenaway](https://twitter.com/paladique)

Binary file not shown.

After

Width:  |  Height:  |  Size: 2.4 MiB

@ -1,9 +1,11 @@
# Visualizing Quantities
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/09-Visualizing-Quantities.png)|
|:---:|
| Visualizing Quantities - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson you will explore how to use one of the many available Python libraries to create interesting visualizations all around the concept of quantity. Using a cleaned dataset about the birds of Minnesota, you can learn many interesting facts about local wildlife.
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/16)
## Observe wingspan with Matplotlib
@ -192,9 +194,7 @@ In this plot you can see the range, per category, of the Minimum Length and Maxi
## 🚀 Challenge
This bird dataset offers a wealth of information about different types of birds within a particular ecosystem. Search around the internet and see if you can find other bird-oriented datasets. Practice building charts and graphs around these birds to discover facts you didn't realize.
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/17)
## Review & Self Study

@ -1,9 +1,12 @@
# Visualizing Distributions
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/10-Visualizing-Distributions.png)|
|:---:|
| Visualizing Distributions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In the previous lesson, you learned some interesting facts from a dataset about the birds of Minnesota. You found some erroneous data by visualizing outliers and looked at the differences between bird categories by their maximum length.
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/18)
## Explore the birds dataset
Another way to dig into data is by looking at its distribution, or how the data is organized along an axis. Perhaps, for example, you'd like to learn about the general distribution, for this dataset, of maximum wingspan or maximum body mass for the birds of Minnesota.
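For example, a histogram shows such a distribution in a couple of lines. Here is a minimal sketch; the file path and the `MaxWingspan` column name are assumptions based on the dataset described above:

```python
import pandas as pd
import matplotlib.pyplot as plt

birds = pd.read_csv('../../data/birds.csv')  # path is an assumption
birds['MaxWingspan'].hist(bins=30)           # distribution along one axis
plt.xlabel('Max wingspan')
plt.ylabel('Number of birds')
plt.show()
```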
@ -178,9 +181,7 @@ Perhaps it's worth researching whether the cluster of 'Vulnerable' birds accordi
Histograms are a more sophisticated type of chart than basic scatterplots, bar charts, or line charts. Search the internet for good examples of the use of histograms. How are they used, what do they demonstrate, and in what fields or areas of inquiry do they tend to be used?
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/19)
## Review & Self Study

@ -1,20 +1,22 @@
# Visualizing Proportions
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/11-Visualizing-Proportions.png)|
|:---:|
|Visualizing Proportions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you will use a different nature-focused dataset to visualize proportions, such as how many different types of fungi populate a given dataset about mushrooms. Let's explore these fascinating fungi using a dataset sourced from Audubon listing details about 23 species of gilled mushrooms in the Agaricus and Lepiota families. You will experiment with tasty visualizations such as:
- Pie charts 🥧
- Donut charts 🍩
- Waffle charts 🧇
> 💡 A very interesting project called [Charticulator](https://charticulator.com) by Microsoft Research offers a free drag and drop interface for data visualizations. In one of their tutorials they also use this mushroom dataset! So you can explore the data and learn the library at the same time: [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html).
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/20)
## Get to know your mushrooms 🍄
Mushrooms are very interesting. Let's import a dataset to study them:
```python
import pandas as pd
@ -32,7 +34,7 @@ A table is printed out with some great data for analysis:
| Edible | Bell | Smooth | White | Bruises | Anise | Free | Close | Broad | Brown | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Meadows |
| Poisonous | Convex | Scaly | White | Bruises | Pungent | Free | Close | Narrow | Brown | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
Right away, you notice that all the data is textual. You will have to convert this data to be able to use it in a chart. Most of the data, in fact, is represented as an object:
```python
print(mushrooms.select_dtypes(["object"]).columns)
@ -74,7 +76,7 @@ plt.pie(edibleclass['population'],labels=labels,autopct='%.1f %%')
plt.title('Edible?')
plt.show()
```
Voila, a pie chart showing the proportions of this data according to these two classes of mushrooms. It's quite important to get the order of the labels correct, especially here, so be sure to verify the order with which the label array is built!
![pie chart](images/pie1.png)
@ -82,7 +84,7 @@ Voila, a pie chart showing the proportions of this data according to these two c
A somewhat more visually interesting pie chart is a donut chart, which is a pie chart with a hole in the middle. Let's look at our data using this method.
Take a look at the various habitats where mushrooms grow:
```python
habitat=mushrooms.groupby(['habitat']).count()
@ -108,9 +110,9 @@ plt.show()
![donut chart](images/donut.png)
This code draws a chart and a center circle, then adds that center circle in the chart. Edit the width of the center circle by changing `0.40` to another value.
Donut charts can be tweaked in several ways to change the labels. The labels in particular can be highlighted for readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut).
Now that you know how to group your data and then display it as a pie or donut, you can explore other types of charts. Try a waffle chart, which is just a different way of exploring quantity.
## Waffles!
@ -151,20 +153,18 @@ fig = plt.figure(
)
```
Using a waffle chart, you can plainly see the proportions of cap colors in this mushroom dataset. Interestingly, there are many green-capped mushrooms!
![waffle chart](images/waffle.png)
✅ Pywaffle supports icons within the charts that use any icon available in [Font Awesome](https://fontawesome.com/). Do some experiments to create an even more interesting waffle chart using icons instead of squares.
In this lesson, you learned three ways to visualize proportions. First, you need to group your data into categories and then decide which is the best way to display the data - pie, donut, or waffle. All are delicious and gratify the user with an instant snapshot of a dataset.
## 🚀 Challenge
Try recreating these tasty charts in [Charticulator](https://charticulator.com).
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/21)
## Review & Self Study

@ -2,10 +2,10 @@
## Instructions
Did you know you can create donut, pie, and waffle charts in Excel? Using a dataset of your choice, create these three charts right in an Excel spreadsheet.
## Rubric
| Exemplary | Adequate | Needs Improvement |
| ------------------------------------------------------- | ------------------------------------------------- | ------------------------------------------------------ |
| An Excel spreadsheet is presented with all three charts | An Excel spreadsheet is presented with two charts | An Excel spreadsheet is presented with only one chart |

@ -1,14 +1,16 @@
# Visualizing Relationships: All About Honey 🍯
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/12-visualizing-relationships.png)|
|:---:|
|Visualizing Relationships - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Continuing with the nature focus of our research, let's discover interesting visualizations to show the relationships between various types of honey, according to a dataset derived from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
This dataset of about 600 items displays honey production in many U.S. states. So, for example, you can look at the number of colonies, yield per colony, total production, stocks, price per pound, and value of the honey produced in a given state from 1998-2012, with one row per year for each state.
It will be interesting to visualize the relationship between a given state's production per year and, for example, the price of honey in that state. Alternately, you could visualize the relationship between states' honey yield per colony. This year span covers the devastating 'CCD' or 'Colony Collapse Disorder' first seen in 2006 (http://npic.orst.edu/envir/ccd.html), so it is a poignant dataset to study. 🐝
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/22)
In this lesson, you can use Seaborn, which you have used before, as a good library to visualize relationships between variables. Particularly interesting is the use of Seaborn's `relplot` function, which allows scatter plots and line plots to quickly visualize '[statistical relationships](https://seaborn.pydata.org/tutorial/relational.html?highlight=relationships)', helping the data scientist better understand how variables relate to each other.
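As a minimal sketch of what `relplot` gives you (the file path and the column names are assumptions about the honey dataset described above):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

honey = pd.read_csv('../../data/honey.csv')  # path is an assumption

# One point per state-year: total production vs. price per pound
sns.relplot(x='totalprod', y='priceperlb', data=honey, height=4, aspect=2)
plt.show()
```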
@ -162,9 +164,7 @@ Go, bees, go!
## 🚀 Challenge
In this lesson, you learned a bit more about other uses of scatterplots and line grids, including facet grids. Challenge yourself to create a facet grid using a different dataset, maybe one you used prior to these lessons. Note how long they take to create and how you need to be careful about how many grids you need to draw using these techniques.
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/23)
## Review & Self Study

@ -1,5 +1,9 @@
# Making Meaningful Visualizations
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/13-MeaningfulViz.png)|
|:---:|
| Meaningful Visualizations - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase)
One of the basic skills of a data scientist is the ability to create a meaningful data visualization that helps answer questions you might have. Prior to visualizing your data, you need to ensure that it has been cleaned and prepared, as you did in prior lessons. After that, you can start deciding how best to present the data.
@ -13,14 +17,12 @@ In this lesson, you will review:
5. How to build animated or 3D charting solutions
6. How to build a creative visualization
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/24)
## Choose the right chart type
In previous lessons, you experimented with building all kinds of interesting data visualizations using Matplotlib and Seaborn for charting. In general, you can select the [right kind of chart](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) for the question you are asking using this table:
| You need to: | You should use: |
| -------------------------- | ------------------------------- |
| Show data trends over time | Line |
@ -31,11 +33,12 @@ In previous lessons, you experimented with building all kinds of interesting dat
| Show proportions | Pie, Donut, Waffle |
> ✅ Depending on the makeup of your data, you might need to convert it from text to numeric to get a given chart to support it.
## Avoid deception
Even if a data scientist is careful to choose the right chart for the right data, there are plenty of ways that data can be displayed in a way to prove a point, often at the cost of undermining the data itself. There are many examples of deceptive charts and infographics!
[![How Charts Lie by Alberto Cairo](./images/tornado.png)](https://www.youtube.com/watch?v=oX74Nge8Wkw "How charts lie")
> 🎥 Click the image above for a conference talk about deceptive charts
@ -79,21 +82,25 @@ While [color meaning](https://colormatters.com/color-symbolism/the-meanings-of-c
| orange | vibrance |
If you are tasked with building a chart with custom colors, ensure that your charts are accessible and that the colors you choose coincide with the meaning you are trying to convey.
## Styling your charts for readability
Charts are not meaningful if they are not readable! Take a moment to consider styling the width and height of your chart to scale well with your data. If one variable (such as all 50 states) needs to be displayed, show the values vertically on the Y axis if possible, so as to avoid a horizontally-scrolling chart.
Label your axes, provide a legend if necessary, and offer tooltips for better comprehension of data.
If your data is textual and verbose on the X axis, you can angle the text for better readability. [Matplotlib](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html) offers 3d plotting, if your data supports it. Sophisticated data visualizations can be produced using `mpl_toolkits.mplot3d`.
![3d plots](images/3d.png)
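As a minimal sketch of these readability tips (labeled axes, a legend, and angled tick labels; the data here is placeholder):

```python
import matplotlib.pyplot as plt

# Placeholder data with long categorical labels on the X axis
states = ['California', 'North Dakota', 'South Dakota', 'Pennsylvania', 'Massachusetts']
colonies = [410, 255, 265, 22, 4]

plt.figure(figsize=(8, 4))           # size the chart to fit the data
plt.bar(states, colonies, label='Colonies (thousands)')
plt.xlabel('State')
plt.ylabel('Number of colonies')
plt.xticks(rotation=45, ha='right')  # angle verbose labels for readability
plt.legend()
plt.tight_layout()
plt.show()
```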
## Animation and 3D chart display
Some of the best data visualizations today are animated. Shirley Wu has amazing ones done with D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/)', where each flower is a visualization of a movie. Another example for the Guardian is 'bussed out', an interactive experience combining visualizations with Greensock and D3 plus a scrollytelling article format to show how NYC handles its homeless problem by bussing people out of the city.
![busing](images/busing.png)
> "Bussed Out: How America Moves its Homeless" from [the Guardian](https://www.theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study). Visualizations by Nadieh Bremer & Shirley Wu
While this lesson cannot go into depth on these powerful visualization libraries, try your hand at D3 in a Vue.js app using a library to display a visualization of the book "Dangerous Liaisons" as an animated social network.
> "Les Liaisons Dangereuses" is an epistolary novel, or a novel presented as a series of letters. Written in 1782 by Choderlos de Laclos, it tells the story of the vicious, morally-bankrupt social maneuvers of two dueling protagonists of the French aristocracy in the late 18th century, the Vicomte de Valmont and the Marquise de Merteuil. Both meet their demise in the end but not without inflicting a great deal of social damage. The novel unfolds as a series of letters written to various people in their circles, plotting for revenge or simply to make trouble. Create a visualization of these letters to discover the major kingpins of the narrative, visually.
@ -101,6 +108,7 @@ While this lesson is insufficient to go into depth to teach these powerful visua
You will complete a web app that will display an animated view of this social network. It uses a library that was built to create a [visual of a network](https://github.com/emiliorizzo/vue-d3-network) using Vue.js and D3. When the app is running, you can pull the nodes around on the screen to shuffle the data around.
![liaisons](images/liaisons.png)
## Project: Build a chart to show a network using D3.js
> This lesson folder includes a `solution` folder where you can find the completed project, for your reference.
@ -129,14 +137,15 @@ Loop through the .json object to capture the 'to' and 'from' data for the letter
}
this.links.push({ sid: f, tid: t });
}
```
Run your app from the terminal (`npm run serve`) and enjoy the visualization!
## 🚀 Challenge
Take a tour of the internet to discover deceptive visualizations. How does the author fool the user, and is it intentional? Try correcting the visualizations to show how they should look.
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/25)
## Review & Self Study
@ -150,9 +159,10 @@ Take a look at these interest visualizations for historical assets and artifacts
https://handbook.pubpub.org/
Look through this article on how animation can enhance your visualizations:
https://medium.com/@EvanSinar/use-animation-to-supercharge-data-visualization-cd905a882ad4
## Assignment
[Build your own custom visualization](assignment.md)

@ -1710,15 +1710,6 @@
"integrity": "sha512-nQyp0o1/mNdbTO1PO6kHkwSrmgZ0MT/jCCpNiwbUjGoRN4dlBhqJtoQuCnEOKzgTVwg0ZWiCoQy6SxMebQVh8A==",
"dev": true
},
"ansi-styles": {
"version": "4.3.0",
"resolved": "https://registry.npmjs.org/ansi-styles/-/ansi-styles-4.3.0.tgz",
"integrity": "sha512-zbB9rCJAT1rbjiVDb2hqKFHNYLxgtk8NURxZ3IZwD3F6NtxbXZQCnnSi1Lkx+IDohdPlFp222wVALIheZJQSEg==",
"dev": true,
"requires": {
"color-convert": "^2.0.1"
}
},
"cacache": {
"version": "13.0.1",
"resolved": "https://registry.npmjs.org/cacache/-/cacache-13.0.1.tgz",
@ -1745,48 +1736,6 @@
"unique-filename": "^1.1.1"
}
},
"chalk": {
"version": "4.1.2",
"resolved": "https://registry.npmjs.org/chalk/-/chalk-4.1.2.tgz",
"integrity": "sha512-oKnbhFyRIXpUuez8iBMmyEa4nbj4IOQyuhc/wy9kY7/WVPcwIO9VA668Pu8RkO7+0G76SLROeyw9CpQ061i4mA==",
"dev": true,
"requires": {
"ansi-styles": "^4.1.0",
"supports-color": "^7.1.0"
}
},
"color-convert": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/color-convert/-/color-convert-2.0.1.tgz",
"integrity": "sha512-RRECPsj7iu/xb5oKYcsFHSppFNnsj/52OVTRKb4zP5onXwVF3zVmmToNcOfGC+CRDpfK/U584fMg38ZHCaElKQ==",
"dev": true,
"requires": {
"color-name": "~1.1.4"
}
},
"color-name": {
"version": "1.1.4",
"resolved": "https://registry.npmjs.org/color-name/-/color-name-1.1.4.tgz",
"integrity": "sha512-dOy+3AuW3a2wNbZHIuMZpTcgjGuLU/uBL/ubcZF9OXbDo8ff4O8yVp5Bf0efS8uEoYo5q4Fx7dY9OgQGXgAsQA==",
"dev": true
},
"has-flag": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/has-flag/-/has-flag-4.0.0.tgz",
"integrity": "sha512-EykJT/Q1KjTWctppgIAgfSO0tKVuZUjhgMr17kqTumMl6Afv3EISleU7qZUzoXDFTAHTDC4NOoG/ZxU3EvlMPQ==",
"dev": true
},
"loader-utils": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/loader-utils/-/loader-utils-2.0.0.tgz",
"integrity": "sha512-rP4F0h2RaWSvPEkD7BLDFQnvSf+nK+wr3ESUjNTyAGobqrijmW92zc+SO6d4p4B1wh7+B/Jg1mkQe5NYUEHtHQ==",
"dev": true,
"requires": {
"big.js": "^5.2.2",
"emojis-list": "^3.0.0",
"json5": "^2.1.2"
}
},
"source-map": {
"version": "0.6.1",
"resolved": "https://registry.npmjs.org/source-map/-/source-map-0.6.1.tgz",
@ -1803,15 +1752,6 @@
"minipass": "^3.1.1"
}
},
"supports-color": {
"version": "7.2.0",
"resolved": "https://registry.npmjs.org/supports-color/-/supports-color-7.2.0.tgz",
"integrity": "sha512-qpCAvRl9stuOHveKsn7HncJRvv501qIacKzQlO/+Lwxc9+0q2wLyv4Dfvt80/DPn2pqOBsJdDiogXGR9+OvwRw==",
"dev": true,
"requires": {
"has-flag": "^4.0.0"
}
},
"terser-webpack-plugin": {
"version": "2.3.8",
"resolved": "https://registry.npmjs.org/terser-webpack-plugin/-/terser-webpack-plugin-2.3.8.tgz",
@ -1828,17 +1768,6 @@
"terser": "^4.6.12",
"webpack-sources": "^1.4.3"
}
},
"vue-loader-v16": {
"version": "npm:vue-loader@16.5.0",
"resolved": "https://registry.npmjs.org/vue-loader/-/vue-loader-16.5.0.tgz",
"integrity": "sha512-WXh+7AgFxGTgb5QAkQtFeUcHNIEq3PGVQ8WskY5ZiFbWBkOwcCPRs4w/2tVyTbh2q6TVRlO3xfvIukUtjsu62A==",
"dev": true,
"requires": {
"chalk": "^4.1.0",
"hash-sum": "^2.0.0",
"loader-utils": "^2.0.0"
}
}
}
},
@ -5691,9 +5620,9 @@
}
},
"glob-parent": {
"version": "5.1.1",
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-5.1.1.tgz",
"integrity": "sha512-FnI+VGOpnlGHWZxthPGR+QhR78fuiK0sNLkHQv+bL9fQi57lNNdquIbna/WrfROrolq8GK5Ek6BiMwqL/voRYQ==",
"version": "5.1.2",
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-5.1.2.tgz",
"integrity": "sha512-AOIgSQCepiJYwP3ARnGx+5VnTu2HBYdzbGP45eLw1vr3zB3vZLeyed1sC9hnbcOc9/SrMyM5RPQrkGz4aS9Zow==",
"dev": true,
"requires": {
"is-glob": "^4.0.1"
@ -11068,6 +10997,87 @@
}
}
},
"vue-loader-v16": {
"version": "npm:vue-loader@16.5.0",
"resolved": "https://registry.npmjs.org/vue-loader/-/vue-loader-16.5.0.tgz",
"integrity": "sha512-WXh+7AgFxGTgb5QAkQtFeUcHNIEq3PGVQ8WskY5ZiFbWBkOwcCPRs4w/2tVyTbh2q6TVRlO3xfvIukUtjsu62A==",
"dev": true,
"optional": true,
"requires": {
"chalk": "^4.1.0",
"hash-sum": "^2.0.0",
"loader-utils": "^2.0.0"
},
"dependencies": {
"ansi-styles": {
"version": "4.3.0",
"resolved": "https://registry.npmjs.org/ansi-styles/-/ansi-styles-4.3.0.tgz",
"integrity": "sha512-zbB9rCJAT1rbjiVDb2hqKFHNYLxgtk8NURxZ3IZwD3F6NtxbXZQCnnSi1Lkx+IDohdPlFp222wVALIheZJQSEg==",
"dev": true,
"optional": true,
"requires": {
"color-convert": "^2.0.1"
}
},
"chalk": {
"version": "4.1.2",
"resolved": "https://registry.npmjs.org/chalk/-/chalk-4.1.2.tgz",
"integrity": "sha512-oKnbhFyRIXpUuez8iBMmyEa4nbj4IOQyuhc/wy9kY7/WVPcwIO9VA668Pu8RkO7+0G76SLROeyw9CpQ061i4mA==",
"dev": true,
"optional": true,
"requires": {
"ansi-styles": "^4.1.0",
"supports-color": "^7.1.0"
}
},
"color-convert": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/color-convert/-/color-convert-2.0.1.tgz",
"integrity": "sha512-RRECPsj7iu/xb5oKYcsFHSppFNnsj/52OVTRKb4zP5onXwVF3zVmmToNcOfGC+CRDpfK/U584fMg38ZHCaElKQ==",
"dev": true,
"optional": true,
"requires": {
"color-name": "~1.1.4"
}
},
"color-name": {
"version": "1.1.4",
"resolved": "https://registry.npmjs.org/color-name/-/color-name-1.1.4.tgz",
"integrity": "sha512-dOy+3AuW3a2wNbZHIuMZpTcgjGuLU/uBL/ubcZF9OXbDo8ff4O8yVp5Bf0efS8uEoYo5q4Fx7dY9OgQGXgAsQA==",
"dev": true,
"optional": true
},
"has-flag": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/has-flag/-/has-flag-4.0.0.tgz",
"integrity": "sha512-EykJT/Q1KjTWctppgIAgfSO0tKVuZUjhgMr17kqTumMl6Afv3EISleU7qZUzoXDFTAHTDC4NOoG/ZxU3EvlMPQ==",
"dev": true,
"optional": true
},
"loader-utils": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/loader-utils/-/loader-utils-2.0.0.tgz",
"integrity": "sha512-rP4F0h2RaWSvPEkD7BLDFQnvSf+nK+wr3ESUjNTyAGobqrijmW92zc+SO6d4p4B1wh7+B/Jg1mkQe5NYUEHtHQ==",
"dev": true,
"optional": true,
"requires": {
"big.js": "^5.2.2",
"emojis-list": "^3.0.0",
"json5": "^2.1.2"
}
},
"supports-color": {
"version": "7.2.0",
"resolved": "https://registry.npmjs.org/supports-color/-/supports-color-7.2.0.tgz",
"integrity": "sha512-qpCAvRl9stuOHveKsn7HncJRvv501qIacKzQlO/+Lwxc9+0q2wLyv4Dfvt80/DPn2pqOBsJdDiogXGR9+OvwRw==",
"dev": true,
"optional": true,
"requires": {
"has-flag": "^4.0.0"
}
}
}
},
"vue-style-loader": {
"version": "4.1.2",
"resolved": "https://registry.npmjs.org/vue-style-loader/-/vue-style-loader-4.1.2.tgz",

@ -9,11 +9,11 @@ Visualizing data is one of the most important tasks of a data scientist. Images
In these five lessons, you will explore data sourced from nature and create interesting and beautiful visualizations using various techniques.
### Topics
1. [Visualizing quantities](09-visualization-quantities/README.md)
1. [Visualizing distribution](10-visualization-distributions/README.md)
1. [Visualizing proportions](11-visualization-proportions/README.md)
1. [Visualizing relationships](12-visualization-relationships/README.md)
1. [Making Meaningful Visualizations](13-meaningful-visualizations/README.md)
### Credits

@ -1,6 +1,21 @@
# Introduction to the Data Science Lifecycle
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/14-DataScience-Lifecycle.png)|
|:---:|
| Introduction to the Data Science Lifecycle - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## Pre-Lecture Quiz
[Pre-lecture quiz]()
At this point you've probably come to the realization that data science is a process. This process can be broken down into 5 stages:
- Capturing
- Processing
- Analysis
- Communication
- Maintenance
This lesson focuses on 3 parts of the lifecycle: capturing, processing, and maintenance.
@ -10,7 +25,7 @@ This lesson focuses on 3 parts of the life cycle: capturing, processing and main
## Capturing
The first stage of the lifecycle is very important as the next stages are dependent on it. It's practically two stages combined into one: acquiring the data and defining the purpose and problems that need to be addressed.
Defining the goals of the project will require deeper context into the problem or question. First, we need to identify and acquire those who need their problem solved. These may be stakeholders in a business or sponsors of the project, who can help identify who or what will benefit from this project, as well as what they need and why. A well-defined goal should be measurable and quantifiable to define an acceptable result.
Questions a data scientist may ask:
- Has this problem been approached before? What was discovered?
@ -32,7 +47,7 @@ Questions a data scientist may ask about the data:
## Processing
The processing stage of the lifecycle focuses on discovering patterns in the data as well as modeling. Some techniques used in the processing stage require statistical methods to uncover the patterns. Typically, this would be a tedious task for a human to do with a large data set and will rely on computers to do the heavy lifting to speed up the process. This stage is also where data science and machine learning will intersect. As you learned in the first lesson, machine learning is the process of building models to understand the data. Models are a representation of the relationship between variables in the data that help predict outcomes.
Common techniques used in this stage are covered in the ML for Beginners curriculum. Follow the links to learn more about them:
@ -54,15 +69,10 @@ On premise refers to hosting managing the data on your own equipment, like ownin
**Cold vs hot data**
When training your models, you may require more training data. If you're content with your model, more data will arrive for the model to serve its purpose. In any case, the cost of storing and accessing data will increase as you accumulate more of it. Separating rarely used data, known as cold data, from frequently accessed hot data can be a cheaper data storage option through hardware or software services. If cold data needs to be accessed, it may take a little longer to retrieve in comparison to hot data.
### Managing Data
As you work with data you may discover that some of the data needs to be cleaned using some of the techniques covered in the lesson focused on [data preparation](2-Working-With-Data/08-data-preparation) in order to build accurate models. When new data arrives, it will need some of the same applications to maintain consistency in quality. Some projects will involve the use of an automated tool for cleansing, aggregation, and compression before the data is moved to its final location. Azure Data Factory is an example of one of these tools.
### Securing the Data
One of the main goals of securing data is ensuring that those working with it are in control of what is collected and in what context it is being used. Keeping data secure involves limiting access to only those who need it, adhering to local laws and regulations, as well as maintaining ethical standards, as covered in the [ethics lesson](1-Introduction/02-ethics).
Here are some things that a team may do with security in mind:
@ -72,18 +82,27 @@ Heres some things that a team may do with security in mind:
- Let only certain project members alter the data
There are many versions of the Data Science Lifecycle; each step may have a different name and the number of stages may vary, but every version contains the same processes mentioned within this lesson.
## 🚀 Challenge
Explore the [Team Data Science Process lifecycle](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle) and the [Cross-industry standard process for data mining](https://www.datascience-pm.com/crisp-dm-2/). Name 3 similarities and differences between the two.
|Team Data Science Process (TDSP)|Cross-industry standard process for data mining (CRISP-DM)|
|--|--|
|![TDSP lifecycle](../images/tdsp-lifecycle2.png)<br>Photo by [Microsoft](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle)|![CRISP-DM](../images/CRISP-DM.png)<br>Photo by [Data Science Process Alliance](https://www.datascience-pm.com/crisp-dm-2/)|
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
Applying the Data Science Lifecycle involves multiple roles and tasks, where some may focus on particular parts of each stage. The Team Data Science Process provides a few resources that explain the types of roles and tasks that someone may have in a project.
* [Team Data Science Process roles and tasks](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/roles-tasks)
* [Execute data science tasks: exploration, modeling, and deployment](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/execute-data-science-tasks)
## Assignment
[Exploring and Assessing a Dataset](assignment.md)
@ -0,0 +1,20 @@
# Exploring and Assessing a Dataset
A client has approached your team for help in investigating taxi customers' seasonal spending habits in New York City.
They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is in the [Capturing](Readme.md#Capturing) stage of the Data Science Lifecycle and you are in charge of exploring the dataset. You have been provided a notebook and data from Azure Open Datasets to explore and assess if the data can answer the client's question. You have decided to select a small sample of 1 summer month and 1 winter month in the year 2019.
## Instructions
In this directory is a [notebook](notebook.ipynb) that uses Python to load yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets) for the months of January and July 2019. These datasets have been joined together in a Pandas dataframe.
Your task is to identify the columns that are most likely required to answer this question, then reorganize the joined dataset so that these columns are displayed first.
Finally, write 3 questions that you would ask the client for more clarification and better understanding of the problem.
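If you're unsure where to start, a minimal pandas sketch of the reordering step might look like this. The chosen columns are an assumption based on the TLC data dictionary, not an answer key, and `df` is the joined dataframe from the notebook:

```python
# Hypothetical pick of relevant columns from the TLC data dictionary:
# when the trip happened, what was tipped and paid, and how payment
# was made (tips are generally only recorded for card payments).
key_columns = ["tpep_pickup_datetime", "tip_amount", "fare_amount", "payment_type"]

# Move the key columns to the front, keeping every other column after them.
remaining = [c for c in df.columns if c not in key_columns]
df = df[key_columns + remaining]
df.head()
```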
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | ---
@ -0,0 +1,76 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\r\n",
"\r\n",
"Licensed under the MIT License."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"# Exploring NYC Taxi data in Winter and Summer\r\n",
"\r\n",
"Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to explore the columns that have been provided.\r\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"!pip install pandas"
],
"outputs": [],
"metadata": {
"scrolled": true
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import pandas as pd\r\n",
"import glob\r\n",
"\r\n",
"path = '../../data/Taxi/yellow_tripdata_2019-{}.csv'\r\n",
"july_taxi = pd.read_csv(path.format('07'))\r\n",
"january_taxi = pd.read_csv(path.format('01'))\r\n",
"\r\n",
"df = pd.concat([january_taxi, july_taxi])\r\n",
"\r\n",
"print(df)"
],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.9.7 64-bit ('venv': venv)"
},
"language_info": {
"mimetype": "text/x-python",
"name": "python",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"version": "3.9.7",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"name": "04-nyc-taxi-join-weather-in-pandas",
"notebookId": 1709144033725344,
"interpreter": {
"hash": "6b9b57232c4b57163d057191678da2030059e733b8becc68f245de5a75abe84e"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -1,7 +1,17 @@
# Analyzing for answers
This lesson continues the process of the Data Science Lifecycle.
They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is in the [Analyzing](Readme.md) stage of the Data Science Lifecycle. You have been provided a notebook and data from Azure Open Datasets to explore. For summer you choose June, July, and August, and for winter you choose January, February, and December.
## Instructions
In this directory is a [notebook](notebook.ipynb) that uses Python to load 6 months of yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets) and Integrated Surface Data from NOAA. These datasets have been joined together in a Pandas dataframe.
Your task is to ___
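While the task above is left as a placeholder, here is a hedged sketch of the kind of seasonal comparison this stage might involve, assuming the notebook's joined dataframe `df` and the TLC columns `tpep_pickup_datetime` and `tip_amount`:

```python
import pandas as pd

# Label each trip as winter or summer by its pickup month, then compare tips.
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["season"] = df["tpep_pickup_datetime"].dt.month.map(
    lambda m: "winter" if m in (12, 1, 2) else "summer"
)
print(df.groupby("season")["tip_amount"].agg(["mean", "median", "count"]))
```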
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | ---
@ -0,0 +1,25 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"source": [
"# print(pd.read_csv('../../data/Taxi/yellow_tripdata_2019-01.csv'))\r\n",
"# all_files = glob.glob('../../data/Taxi/*.csv')\r\n",
"\r\n",
"# df = pd.concat((pd.read_csv(f) for f in all_files))\r\n",
"# print(df)"
],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"orig_nbformat": 4,
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -1,19 +1,220 @@
# The Data Science Lifecycle: Communication
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/16-Communicating.png)|
|:---:|
| Data Science Lifecycle: Communication - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/30)
Test your knowledge of what's to come with the Pre-Lecture Quiz above!
# Introduction
### What is Communication?
Let's start this lesson by defining what it means to communicate. **To communicate is to convey or exchange information.** Information can be ideas, thoughts, feelings, messages, covert signals, data: anything that a **_sender_** (someone sending information) wants a **_receiver_** (someone receiving information) to understand. In this lesson, we will refer to senders as communicators, and receivers as the audience.
### Data Communication & Storytelling
We understand that when communicating, the aim is to convey or exchange information. But when communicating data, your aim shouldn't be to simply pass along numbers to your audience. Your aim should be to communicate a story that is informed by your data - effective data communication and storytelling go hand-in-hand. Your audience is more likely to remember a story you tell, than a number you give. Later in this lesson, we will go over a few ways that you can use storytelling to communicate your data more effectively.
### Types of Communication
Throughout this lesson, two different types of communication will be discussed: One-Way Communication and Two-Way Communication.
**One way communication** happens when a sender sends information to a receiver, without any feedback or response. We see examples of one-way communication every day in bulk/mass emails, when the news delivers the most recent stories, or even when a television commercial comes on and informs you about why their product is great. In each of these instances, the sender is not seeking an exchange of information. They are only seeking to convey or deliver information.
**Two-way communication** happens when all involved parties act as both senders and receivers. A sender will begin by communicating to a receiver, and the receiver will provide feedback or a response. Two-way communication is what we traditionally think of when we talk about communication. We usually think of people engaged in a conversation - either in person, or over a phone call, social media, or text message.
When communicating data, there will be cases where you will be using one-way communication (think about presenting at a conference, or to a large group where questions won't be asked directly after) and there will be cases where you will use two-way communication (think about using data to persuade a few stakeholders for buy-in, or to convince a teammate that time and effort should be spent building something new).
# Effective Communication
### Your Responsibilities as a Communicator
When communicating, it is your job to make sure that your receiver(s) are taking away the information that you want them to take away. When you're communicating data, you don't just want your receivers to take away numbers; you want them to take away a story that's informed by your data. A good data communicator is a good storyteller.
How do you tell a story with data? There are infinite ways, but below are 5 that we will talk about in this lesson.
1. Understand Your Audience, Your Medium, & Your Communication Method
2. Begin with the End in Mind
3. Approach it Like an Actual Story
4. Use Meaningful Words & Phrases
5. Use Emotion
Each of these strategies is explained in greater detail below.
### 1. Understand Your Audience, Your Channel & Your Communication Method
The way you communicate with family members is likely different than the way you communicate with your friends. You probably use different words and phrases that the people you're speaking to are more likely to understand. You should take the same approach when communicating data. Think about who you're communicating to. Think about their goals and the context that they have around the situation that you're explaining to them.
You can likely group the majority of your audience within a category. In a _Harvard Business Review_ article, “[How to Tell a Story with Data](http://blogs.hbr.org/2013/04/how-to-tell-a-story-with-data/),” Dell Executive Strategist Jim Stikeleather identifies five categories of audiences.
- **Novice**: first exposure to the subject, but doesn't want oversimplification
- **Generalist**: aware of the topic, but looking for an overview understanding and major themes
- **Managerial**: in-depth, actionable understanding of intricacies and interrelationships with access to detail
- **Expert**: more exploration and discovery and less storytelling, with great detail
- **Executive**: only has time to glean the significance and conclusions of weighted probabilities
These categories can inform the way you present data to your audience.
In addition to thinking about your audience's category, you should also consider the channel you're using to communicate with your audience. Your approach should be slightly different if you're writing a memo or email vs having a meeting or presenting at a conference.
On top of understanding your audience, knowing how you will be communicating with them (using one-way communication or two-way) is also critical.
If you are communicating with a majority Novice audience and you're using one-way communication, you must first educate the audience and give them proper context. Then you must present your data to them and tell them what your data means and why your data matters. In this instance, you may want to be laser-focused on driving clarity, because your audience will not be able to ask you any direct questions.
If you are communicating with a majority Managerial audience and you're using two-way communication, you likely won't need to educate your audience or provide them with much context. You may be able to jump straight into discussing the data that you've collected and why it matters. In this scenario though, you should be focused on timing and controlling your presentation. When using two-way communication (especially with a Managerial audience who is seeking an “actionable understanding of intricacies and interrelationships with access to detail”), questions may pop up during your interaction that may take the discussion in a direction that doesn't relate to the story that you're trying to tell. When this happens, you can take action and move the discussion back on track with your story.
### 2. Begin With The End In Mind
Beginning with the end in mind means understanding your intended takeaways for your audience before you start communicating with them. Being thoughtful about what you want your audience to take away ahead of time can help you craft a story that your audience can follow. Beginning with the end in mind is appropriate for both one-way communication and two-way communication.
How do you begin with the end in mind? Before communicating your data, write down your key takeaways. Then, every step of the way as you're preparing the story that you want to tell with your data, ask yourself, "How does this integrate into the story I'm telling?"
Be aware: while starting with the end in mind is ideal, you don't want to communicate only the data that supports your intended takeaways. Doing this is called Cherry-Picking, which happens when a communicator only communicates data that supports the point they are trying to make and ignores all other data.
If all the data that you collected clearly supports your intended takeaways, great. But if there is data that you collected that doesn't support your takeaways, or even supports an argument against your key takeaways, you should communicate that data as well. If this happens, be upfront with your audience and let them know why you're choosing to stick with your story even though all the data doesn't necessarily support it.
### 3. Approach it Like an Actual Story
A traditional story happens in 5 Phases. You may have heard these phases expressed as Exposition, Rising Action, Climax, Falling Action, and Denouement. Or the easier to remember Context, Conflict, Climax, Closure, Conclusion. When communicating your data and your story, you can take a similar approach.
You can begin with context, set the stage and make sure your audience is all on the same page. Then introduce the conflict. Why did you need to collect this data? What problems were you seeking to solve? After that, the climax. What is the data? What does the data mean? What solutions does the data tell us we need? Then you get to the closure, where you can reiterate the problem, and the proposed solution(s). Lastly, we come to the conclusion, where you can summarize your key takeaways and the next steps you recommend the team takes.
### 4. Use Meaningful Words & Phrases
If you and I were working together on a product, and I said to you "Our users take a long time to onboard onto our platform," how long would you estimate that "long time" to be? An hour? A week? It's hard to know. What if I said that to an entire audience? Everyone in the audience may end up with a different idea of how long users take to onboard onto our platform.
Instead, what if I said, "Our users take, on average, 3 minutes to sign up and onboard onto our platform"?
That messaging is more clear. When communicating data, it can be easy to think that everyone in your audience is thinking just like you. But that is not always the case. Driving clarity around your data and what it means is one of your responsibilities as a communicator. If the data or your story is not clear, your audience will have a hard time following, and it is less likely that they will understand your key takeaways.
You can communicate data more clearly when you use meaningful words and phrases, instead of vague ones. Below are a few examples.
- We had an *impressive* year!
- One person could think an impressive year means a 2% - 3% increase in revenue, and another person could think it means a 50% - 60% increase.
- Our users' success rates increased *dramatically*.
- How large of an increase is a dramatic increase?
- This undertaking will require *significant* effort.
- How much effort is significant?
Using vague words could be useful as an introduction to more data that's coming, or as a summary of the story that you've just told. But consider ensuring that every part of your presentation is clear for your audience.
### 5. Use Emotion
Emotion is key in storytelling. It's even more important when you're telling a story with data. When you're communicating data, everything is focused on the takeaways you want your audience to have. When you evoke an emotion for an audience it helps them empathize, and makes them more likely to take action. Emotion also increases the likelihood that an audience will remember your message.
You may have encountered this before with TV commercials. Some commercials are very somber, and use a sad emotion to connect with their audience and make the data that they're presenting really stand out. Other commercials that are very upbeat and happy may make you associate their data with a happy feeling.
How do you use emotion when communicating data? Below are a couple of ways.
- Use Testimonials and Personal Stories
- When collecting data, try to collect both quantitative and qualitative data, and integrate both types of data when you're communicating. If your data is primarily quantitative, seek stories from individuals to learn more about their experience with whatever your data is telling you.
- Use Imagery
- Images help an audience see themselves in a situation. When you use images, you can push an audience toward the emotion that you feel they should have about your data.
- Use Color
- Different colors evoke different emotions. Popular colors and the emotions they evoke are below. Be aware that colors can have different meanings in different cultures.
- Blue usually evokes emotions of peace and trust
- Green is usually related to nature and the environment
- Red is usually passion and excitement
- Yellow is usually optimism and happiness
# Communication Case Study
Emerson is a Product Manager for a mobile app. Emerson has noticed that customers submit 42% more complaints and bug reports on the weekends. Emerson has also noticed that customers who submit a complaint that goes unanswered after 48 hours are 32% more likely to give the app a rating of 1 or 2 in the app store.
After doing research, Emerson has a couple of solutions that will address the issue. Emerson sets up a 30-minute meeting with the 3 company leads to communicate the data and the proposed solutions.
During this meeting, Emerson's goal is to have the company leads understand that the 2 solutions below can improve the app's rating, which will likely translate into higher revenue.
**Solution 1.** Hire customer service reps to work on weekends
**Solution 2.** Purchase a new customer service ticketing system where customer service reps can easily identify which complaints have been in the queue the longest so they can tell which to address most immediately.
In the meeting, Emerson spends 5 minutes explaining why having a low rating on the app store is bad, 10 minutes explaining the research process and how the trends were identified, 10 minutes going through some of the recent customer complaints, and the last 5 minutes glossing over the 2 potential solutions.
Was this an effective way for Emerson to communicate during this meeting?
During the meeting, one company lead fixated on the 10 minutes of customer complaints that Emerson went through. After the meeting, these complaints were the only thing that this team lead remembered. Another company lead primarily focused on Emerson describing the research process. The third company lead did remember the solutions proposed by Emerson but wasn't sure how those solutions could be implemented.
In the situation above, you can see that there was a significant gap between what Emerson wanted the team leads to take away, and what they ended up taking away from the meeting. Below is another approach that Emerson could consider.
How could Emerson improve this approach?
Context, Conflict, Climax, Closure, Conclusion
**Context** - Emerson could spend the first 5 minutes introducing the entire situation and making sure that the team leads understand how the problems affect metrics that are critical to the company, like revenue.
It could be laid out this way: "Currently, our app's rating in the app store is a 2.5. Ratings in the app store are critical to App Store Optimization, which impacts how many users see our app in search, and how our app is viewed by prospective users. And of course, the number of users we have is tied directly to revenue."
**Conflict** Emerson could then spend the next 5 minutes or so talking about the conflict.
It could go like this: “Users submit 42% more complaints and bug reports on the weekends. Customers who submit a complaint that goes unanswered after 48 hours are 32% less likely to give our app a rating over a 2 in the app store. Improving our app's rating in the app store to a 4 would improve our visibility by 20-30%, which I project would increase revenue by 10%." Of course, Emerson should be prepared to justify these numbers.
**Climax** After laying the groundwork, Emerson could then move to the Climax for 5 or so minutes.
Emerson could introduce the proposed solutions, lay out how those solutions will address the issues outlined, how those solutions could be implemented into existing workflows, how much the solutions cost, what the ROI of the solutions would be, and maybe even show some screenshots or wireframes of how the solutions would look if implemented. Emerson could also share testimonials from users who took over 48 hours to have their complaint addressed, and even a testimonial from a current customer service representative within the company who has comments on the current ticketing system.
**Closure** Now Emerson can spend 5 minutes restating the problems faced by the company, revisiting the proposed solutions, and reviewing why those solutions are the right ones.
**Conclusion** Because this is a meeting with a few stakeholders where two-way communication will be used, Emerson could then plan to leave 10 minutes for questions, to make sure that anything that was confusing to the team leads could be clarified before the meeting is over.
If Emerson took approach #2, it is much more likely that the team leads will take away from the meeting exactly what Emerson intended for them to take away: that the way complaints and bugs are handled could be improved, and that there are 2 solutions that could be put in place to make that improvement happen. This approach would be a much more effective way of communicating the data, and the story, that Emerson wants to communicate.
# Conclusion
### Summary of main points
- To communicate is to convey or exchange information.
- When communicating data, your aim shouldn't be to simply pass along numbers to your audience. Your aim should be to communicate a story that is informed by your data.
- There are 2 types of communication, One-Way Communication (information is communicated with no intention of a response) and Two-Way Communication (information is communicated back and forth).
- There are many strategies you can use to tell a story with your data. The 5 strategies we went over are:
- Understand Your Audience, Your Medium, & Your Communication Method
- Begin with the End in Mind
- Approach it Like an Actual Story
- Use Meaningful Words & Phrases
- Use Emotion
### Recommended Resources for Self Study
[The Five C's of Storytelling - Articulate Persuasion](http://articulatepersuasion.com/the-five-cs-of-storytelling/)
[1.4 Your Responsibilities as a Communicator Business Communication for Success (umn.edu)](https://open.lib.umn.edu/businesscommunication/chapter/1-4-your-responsibilities-as-a-communicator/)
[How to Tell a Story with Data (hbr.org)](https://hbr.org/2013/04/how-to-tell-a-story-with-data)
[Two-Way Communication: 4 Tips for a More Engaged Workplace (yourthoughtpartner.com)](https://www.yourthoughtpartner.com/blog/bid/59576/4-steps-to-increase-employee-engagement-through-two-way-communication)
[6 succinct steps to great data storytelling - BarnRaisers, LLC (barnraisersllc.com)](https://barnraisersllc.com/2021/05/02/6-succinct-steps-to-great-data-storytelling/)
[How to Tell a Story With Data | Lucidchart Blog](https://www.lucidchart.com/blog/how-to-tell-a-story-with-data)
[6 Cs of Effective Storytelling on Social Media | Cooler Insights](https://coolerinsights.com/2018/06/effective-storytelling-social-media/)
[The Importance of Emotions In Presentations | Ethos3 - A Presentation Training and Design Agency](https://ethos3.com/2015/02/the-importance-of-emotions-in-presentations/)
[Data storytelling: linking emotions and rational decisions (toucantoco.com)](https://www.toucantoco.com/en/blog/data-storytelling-dataviz)
[Emotional Advertising: How Brands Use Feelings to Get People to Buy (hubspot.com)](https://blog.hubspot.com/marketing/emotions-in-advertising-examples)
[Choosing Colors for Your Presentation Slides | Think Outside The Slide](https://www.thinkoutsidetheslide.com/choosing-colors-for-your-presentation-slides/)
[How To Present Data [10 Expert Tips] | ObservePoint](https://resources.observepoint.com/blog/10-tips-for-presenting-data)
[Microsoft Word - Persuasive Instructions.doc (tpsnva.org)](https://www.tpsnva.org/teach/lq/016/persinstr.pdf)
[The Power of Story for Your Data (thinkhdi.com)](https://www.thinkhdi.com/library/supportworld/2019/power-story-your-data.aspx)
[Common Mistakes in Data Presentation (perceptualedge.com)](https://www.perceptualedge.com/articles/ie/data_presentation.pdf)
[Infographic: Here are 15 Common Data Fallacies to Avoid (visualcapitalist.com)](https://www.visualcapitalist.com/here-are-15-common-data-fallacies-to-avoid/)
[Cherry Picking: When People Ignore Evidence that They Dislike - Effectiviology](https://effectiviology.com/cherry-picking/#How_to_avoid_cherry_picking)
[Tell Stories with Data: Communication in Data Science | by Sonali Verghese | Towards Data Science](https://towardsdatascience.com/tell-stories-with-data-communication-in-data-science-5266f7671d7)
[1. Communicating Data - Communicating Data with Tableau [Book] (oreilly.com)](https://www.oreilly.com/library/view/communicating-data-with/9781449372019/ch01.html)
## [Post-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/31)
Review what you've just learned with the Post-Lecture Quiz above!
@ -1,13 +1,15 @@
# The Data Science Lifecycle
![communication](images/communication.jpg)
> Photo by <a href="https://unsplash.com/@headwayio?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Headway</a> on <a href="https://unsplash.com/s/photos/communication?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In these lessons, you'll explore some of the aspects of the Data Science lifecycle, including analysis and communication around data.
### Topics
1. [Introduction](14-Introduction/README.md)
2. [Analyzing](15-Analyzing/README.md)
3. [Communication](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/4-Data-Science-Lifecycle/16-communication)
### Credits
These lessons were written with ❤️ by [Jalen McGee](https://twitter.com/JalenMCG)
Three binary image files added (not shown): 20 KiB, 3.2 MiB, and 279 KiB.
@ -1,4 +1,10 @@
# Data Science in the Cloud
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/17-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Introduction - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you will learn the fundamental principles of the Cloud, see why it can be beneficial to use Cloud services to run your data science projects, and look at some examples of data science projects run in the Cloud.
@ -11,16 +17,17 @@ In this lesson, you will learn the fundamental principles of the Cloud, then you
The Cloud, or Cloud Computing, is the delivery of a wide range of pay-as-you-go computing services hosted on an infrastructure over the internet. Services include solutions such as storage, databases, networking, software, analytics, and intelligent services.
We usually differentiate the Public, Private and Hybrid clouds as follows:
* Public cloud: a public cloud is owned and operated by a third-party cloud service provider which delivers its computing resources over the Internet to the public.
* Private cloud: refers to cloud computing resources used exclusively by a single business or organization, with services and an infrastructure maintained on a private network.
* Hybrid cloud: the hybrid cloud is a system that combines public and private clouds. Users opt for an on-premises datacenter, while allowing data and applications to be run on one or more public clouds.
Most cloud computing services fall into three categories: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).
* Infrastructure as a Service (IaaS): users rent an IT infrastructure such as servers and virtual machines (VMs), storage, networks and operating systems.
* Platform as a Service (PaaS): users rent an environment for developing, testing, delivering, and managing software applications. Users dont need to worry about setting up or managing the underlying infrastructure of servers, storage, network, and databases needed for development.
* Software as a Service (SaaS): users get access to software applications over the Internet, on demand and typically on a subscription basis. Users dont need to worry about hosting and managing the software application, the underlying infrastructure or the maintenance, like software upgrades and security patching.
Some of the largest Cloud providers are Amazon Web Services, Google Cloud Platform and Microsoft Azure.
@ -29,21 +36,21 @@ Some of the largest Cloud providers are Amazon Web Services, Google Cloud Platfo
Developers and IT professionals choose to work with the Cloud for many reasons, including the following:
* Innovation: you can power your applications by integrating innovative services created by Cloud providers directly into your apps.
* Flexibility: you only pay for the services that you need and can choose from a wide range of services. You typically pay as you go and adapt your services according to your evolving needs.
* Budget: you dont need to make initial investments to purchase hardware and software, set up and run on-site datacenters and you can just pay for what you use.
* Scalability: your resources can scale according to the needs of your project, which means that your apps can use more or less computing power, storage and bandwidth, by adapting to external factors at any given time.
* Productivity: you can focus on your business rather than spending time on tasks that can be managed by someone else, such as managing datacenters.
* Reliability: Cloud Computing offers several ways to continuously back up your data and you can set up disaster recovery plans to keep your business and services going, even in times of crisis.
* Security: you can benefit from policies, technologies and controls that strengthen the security of your project.
These are some of the most common reasons why people choose to use Cloud services. Now that we have a better understanding of what the Cloud is and what its main benefits are, let's look more specifically into the jobs of Data scientists and developers working with data, and how the Cloud can help them with several challenges they might face:
* Storing large amounts of data: instead of buying, managing and protecting big servers, you can store your data directly in the cloud, with solutions such as Azure Cosmos DB, Azure SQL Database and Azure Data Lake Storage.
* Performing Data Integration: data integration is an essential part of Data Science, that lets you make a transition from data collection to taking actions. With data integration services offered in the cloud, you can collect, transform and integrate data from various sources into a single data warehouse, with Data Factory.
* Processing data: processing vast amounts of data requires a lot of computing power, and not everyone has access to machines powerful enough for that, which is why many people choose to directly harness the clouds huge computing power to run and deploy their solutions.
* Using data analytics services: cloud services like Azure Synapse Analytics, Azure Stream Analytics and Azure Databricks help you turn your data into actionable insights.
* Using Machine Learning and data intelligence services: Instead of starting from scratch, you can use machine learning algorithms offered by the cloud provider, with services such as AzureML. You can also use cognitive services such as speech-to-text, text to speech, computer vision and more.
## Examples of Data Science in the Cloud
@ -53,11 +60,11 @@ Lets make this more tangible by looking at a couple of scenarios.
### Real-time social media sentiment analysis
Well start with a scenario commonly studied by people who start with machine learning: social media sentiment analysis in real time.
Let's say you run a news media website and you want to leverage live data to understand what content your readers could be interested in. To know more about that, you can build a program that performs real-time sentiment analysis of data from Twitter publications, on topics that are relevant to your readers.
The key indicators you will look at are the volume of tweets on specific topics (hashtags) and the sentiment, which is established using analytics tools that perform sentiment analysis around the specified topics.
The steps necessary to create this project are as follows:
* Create an event hub for streaming input, which will collect data from Twitter
* Configure and start a Twitter client application, which will call the Twitter Streaming APIs
@ -66,18 +73,18 @@ The steps necessary to create this projects are the following:
* Create an output sink and specify the job output
* Start the job
To view the full process, check out the [documentation](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?WT.mc_id=academic-40229-cxa&ocid=AID30411099).
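To make the idea concrete without any Azure setup, here is a small, purely illustrative Python sketch of the kind of aggregation such a job performs. The hashtags and sentiment scores are invented; a real deployment would use Event Hubs and Stream Analytics as described above:

```python
from collections import defaultdict

# Invented sample of (hashtag, sentiment score in [-1, 1]) pairs,
# standing in for a live tweet stream.
stream = [
    ("#climate", 0.6), ("#climate", -0.2), ("#elections", 0.1),
    ("#climate", 0.4), ("#elections", -0.5), ("#sports", 0.9),
]

volume = defaultdict(int)
sentiment_sum = defaultdict(float)
for topic, score in stream:
    volume[topic] += 1
    sentiment_sum[topic] += score

# Report volume and average sentiment per topic, the two key indicators.
for topic in volume:
    avg = sentiment_sum[topic] / volume[topic]
    print(f"{topic}: {volume[topic]} tweets, average sentiment {avg:+.2f}")
```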
### Scientific papers analysis
Lets take another example of a project created by [Dmitry Soshnikov](http://soshnikov.com), one of the authors of this curriculum.
Dmitry created a tool that analyses COVID papers. By reviewing this project, you will see how you can create a tool that extracts knowledge from scientific papers, gains insights and helps researchers navigate through large collections of papers in an efficient way.
Let's see the different steps used for this:
* Extracting and pre-processing information with [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
* Using [Azure ML](https://azure.microsoft.com/services/machine-learning?WT.mc_id=academic-40229-cxa&ocid=AID3041109) to parallelize the processing
* Storing and querying information with [Cosmos DB](https://azure.microsoft.com/services/cosmos-db?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
* Creating an interactive dashboard for data exploration and visualization using Power BI
To see the full process, visit [Dmitrys blog](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/).
@ -88,11 +95,10 @@ As you can see, we can leverage Cloud services in many ways to perform Data Scie
## Footnote
Sources:
* https://azure.microsoft.com/overview/what-is-cloud-computing?ocid=AID3041109
* https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?ocid=AID3041109
* https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/
## Post-Lecture Quiz
[Post-lecture quiz]()
@ -1,5 +1,9 @@
# Data Science in the Cloud: The "Low code/No code" way
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/18-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Low Code - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Table of contents:
- [Data Science in the Cloud: The "Low code/No code" way](#data-science-in-the-cloud-the-low-codeno-code-way)
@ -25,31 +29,45 @@ Table of contents:
## Pre-Lecture Quiz
[Pre-lecture quiz]()
1. What is the cloud?
1. A collection of Databases for storing big data.
2. TRUE : A collection of pay-as-you-go computing services over the internet.
3. A visible mass of particles suspended in the air.
2. What are some advantages of the cloud?
1. TRUE : Flexibility, Scalability, Reliability, Security
2. Flexibility, Scalability, Reliability, Variability
3. Clarity, Scalability, Reliability, Variability
3. Which one is not necessarily a good reason for choosing the cloud?
1. Using Machine Learning and data intelligence services
2. Processing large amounts of data
3. TRUE : Storing sensitive/confidential governmental data
## 1. Introduction
### 1.1 What is Azure Machine Learning?
Data scientists expend a lot of effort exploring and pre-processing data, and trying various types of model-training algorithms to produce accurate models. These tasks are time consuming, and often make inefficient use of expensive compute hardware.
[Azure ML](https://docs.microsoft.com/azure/machine-learning/overview-what-is-azure-machine-learning?WT.mc_id=academic-40229-cxa&ocid=AID3041109) is a cloud-based platform for building and operating machine learning solutions in Azure. It includes a wide range of features and capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage. Most importantly, it helps them to increase their efficiency by automating many of the time-consuming tasks associated with training models; and it enables them to use cloud-based compute resources that scale effectively, to handle large volumes of data while incurring costs only when actually used.
Azure ML provides all the tools developers and data scientists need for their machine learning workflows. These include:
- **Azure Machine Learning Studio**: a web portal in Azure Machine Learning for low-code and no-code options for model training, deployment, automation, tracking and asset management. The studio integrates with the Azure Machine Learning SDK for a seamless experience.
- **Jupyter Notebooks**: quickly prototype and test ML models.
- **Azure Machine Learning Designer**: allows you to drag-and-drop modules to build experiments and then deploy pipelines in a low-code environment.
- **Automated machine learning UI (AutoML)**: automates iterative tasks of machine learning model development, allowing you to build ML models at high scale, efficiency, and productivity, all while sustaining model quality.
- **Data Labelling**: an assisted ML tool to automatically label data.
- **Machine learning extension for Visual Studio Code**: provides a full-featured development environment for building and managing ML projects.
- **Machine learning CLI**: provides commands for managing Azure ML resources from the command line.
- **Integration with open-source frameworks** such as PyTorch, TensorFlow, scikit-learn and many more for training, deploying, and managing the end-to-end machine learning process.
- **MLflow**: an open-source library for managing the life cycle of your machine learning experiments. **MLflow Tracking** is a component of MLflow that logs and tracks your training run metrics and model artifacts, irrespective of your experiment's environment.
### 1.2 The Heart Failure Prediction Project:
There is no doubt that making and building projects is the best way to put your skills and knowledge to the test. In this lesson, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio: through Low code/No code and through the Azure ML SDK, as shown in the following schema:
![project-schema](img/project-schema.PNG)
Each way has its own pros and cons. The Low code/No code way is easier to start with, as it involves interacting with a GUI (Graphical User Interface) with no prior knowledge of code required. This method enables quick testing of the project's viability and the creation of a POC (Proof Of Concept). However, as the project grows and things need to be production ready, it is not feasible to create resources by hand through the GUI. We need to programmatically automate everything, from the creation of resources to the deployment of a model. This is where knowing how to use the Azure ML SDK becomes crucial.
| | Low code/No code | Azure ML SDK |
|-------------------|------------------|---------------------------|
@ -57,34 +75,34 @@ Both ways has its pro and cons. The Low code/No code way is easier to start with
| Time to develop | Fast and easy | Depends on code expertise |
| Production ready | No | Yes |
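As a taste of the SDK route, here is a hedged sketch using the `azureml-core` package; the workspace name, resource group, subscription ID and region are illustrative placeholders, and the Low code/No code path below achieves the same thing through the portal:

```python
from azureml.core import Workspace

# Create (or retrieve) an Azure ML workspace programmatically.
ws = Workspace.create(name="heart-failure-ws",
                      subscription_id="<your-subscription-id>",
                      resource_group="heart-failure-rg",
                      create_resource_group=True,
                      location="westeurope")

# Save a config file so later scripts can call Workspace.from_config().
ws.write_config()
```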
### 1.3 The Heart Failure Dataset:
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, accounting for 31% of all deaths worldwide. Environmental and behavioural risk factors such as use of tobacco, unhealthy diet and obesity, physical inactivity and harmful use of alcohol could be used as features for estimation models. Being able to estimate the probability of the development of a CVD could be of great use to prevent attacks in high risk people.
Kaggle has made a [Heart Failure dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) publicly available, which we are going to use for this project. You can download the dataset now. This is a tabular dataset with 13 columns (12 features and 1 target variable) and 299 rows.
| | Variable name | Type | Description | Example |
|----|---------------------------|-----------------|-----------------------------------------------------------|-------------------|
| 1 | age | numerical | age of the patient | 25 |
| 2 | anaemia | boolean | Decrease of red blood cells or haemoglobin | 0 or 1 |
| 3 | creatinine_phosphokinase | numerical | Level of CPK enzyme in the blood | 542 |
| 4 | diabetes | boolean | If the patient has diabetes | 0 or 1 |
| 5 | ejection_fraction | numerical | Percentage of blood leaving the heart on each contraction | 45 |
| 6 | high_blood_pressure | boolean | If the patient has hypertension | 0 or 1 |
| 7 | platelets | numerical | Platelets in the blood | 149000 |
| 8 | serum_creatinine | numerical | Level of serum creatinine in the blood | 0.5 |
| 9 | serum_sodium | numerical | Level of serum sodium in the blood | 137 |
| 10 | sex | boolean | woman or man | 0 or 1 |
| 11 | smoking | boolean | If the patient smokes | 0 or 1 |
| 12 | time | numerical | follow-up period (days) | 4 |
| 13 | DEATH_EVENT [Target] | boolean | If the patient dies during the follow-up period | 0 or 1 |
Once you have the dataset, we can start the project in Azure.
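Before moving to Azure, a quick local sanity check of the download can be done in pandas. This is a sketch; the file name is the one used on Kaggle, so treat it as an assumption and adjust if yours differs:

```python
import pandas as pd

# File name as distributed on Kaggle (assumption; adjust if needed).
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

print(df.shape)                           # expect (299, 13): 12 features + target
print(df["DEATH_EVENT"].value_counts())   # class balance of the target variable
```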
## 2. Low code/No code training of a model in Azure ML Studio
### 2.1 Create an Azure ML workspace
To train a model in Azure ML you first need to create an Azure ML workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. You use this information to determine which training run produces the best model. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-workspace?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
It is recommended to use the most up-to-date browser that's compatible with your operating system. The following browsers are supported:
@ -95,7 +113,7 @@ It is recommended to use the most up-to-date browser that's compatible with your
To use Azure Machine Learning, create a workspace in your Azure subscription. You can then use this workspace to manage data, compute resources, code, models, and other artifacts related to your machine learning workloads.
> **_NOTE:_** Your Azure subscription will be charged a small amount for data storage as long as the Azure Machine Learning workspace exists in your subscription, so we recommend that you delete the Azure Machine Learning workspace when you are no longer using it.
1. Sign into the [Azure portal](https://ms.portal.azure.com/) using the Microsoft credentials associated with your Azure subscription.
2. Select **Create a resource**
@ -110,7 +128,7 @@ To use Azure Machine Learning, create a workspace in your Azure subscription. Yo
![workspace-3](img/workspace-3.PNG)
Fill in the settings as follows:
- Subscription: Your Azure subscription
- Resource group: Create or select a resource group
- Workspace name: Enter a unique name for your workspace
@ -123,7 +141,7 @@ To use Azure Machine Learning, create a workspace in your Azure subscription. Yo
![workspace-4](img/workspace-4.PNG)
- Click on Review + create and then on the Create button
3. Wait for your workspace to be created (this can take a few minutes). Then go to it in the portal. You can find it through the Machine Learning Azure service.
4. On the Overview page for your workspace, launch Azure Machine Learning studio (or open a new browser tab and navigate to https://ml.azure.com), and sign into Azure Machine Learning studio using your Microsoft account. If prompted, select your Azure directory and subscription, and your Azure Machine Learning workspace.
![workspace-5](img/workspace-5.PNG)
@ -132,7 +150,7 @@ To use Azure Machine Learning, create a workspace in your Azure subscription. Yo
![workspace-6](img/workspace-6.PNG)
You can manage your workspace using the Azure portal, but for data scientists and Machine Learning operations engineers, Azure Machine Learning Studio provides a more focused user interface for managing workspace resources.
### 2.2 Compute Resources
There are some key factors to consider when creating a compute resource, and those choices can be critical decisions to make.
**Do you need CPU or GPU?**
A CPU (Central Processing Unit) is the electronic circuitry that executes instructions comprising a computer program. A GPU (Graphics Processing Unit) is a specialized electronic circuit that can execute graphics-related code at a very high rate.
The main difference between CPU and GPU architecture is that a CPU is designed to handle a wide range of tasks quickly (as measured by CPU clock speed), but is limited in the concurrency of tasks that can be running. GPUs are designed for parallel computing and are therefore much better at deep learning tasks.
**Cluster Size**
Larger clusters are more expensive but will result in better responsiveness. Therefore, if you have time but not enough money, you should start with a small cluster. Conversely, if you have money but not much time, you should start with a larger cluster.
**VM Size**
Depending on your time and budgetary constraints, you can vary the size of your RAM and disk, the number of cores, and the clock speed. Increasing all those parameters will be costlier, but will result in better performance.
**Dedicated or Low-Priority Instances?**
This is another consideration of time vs money, since interruptible instances are cheaper than dedicated ones.
#### 2.2.2 Creating a compute cluster
In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, go to Compute and you will be able to see the different compute resources we just discussed (i.e. compute instances, compute clusters, inference clusters and attached compute). For this project, we are going to need a compute cluster for model training. In the Studio, click on the "Compute" menu, then the "Compute cluster" tab, and click on the "+ New" button to create a compute cluster.
![22](img/cluster-1.PNG)
1. Choose your options: Dedicated vs Low priority, CPU or GPU, VM size and core number (you can keep the default settings for this project).
2. Click on the Next button.
![23](img/cluster-2.PNG)
3. Give the cluster a compute name
4. Choose your options: Minimum/Maximum number of nodes, Idle seconds before scale down, SSH access. Note that if the minimum number of nodes is 0, you will save money when the cluster is idle. Note that the higher the number of maximum nodes, the shorter the training will be. The maximum number of nodes recommended is 3.
5. Click on the "Create" button. This step may take a few minutes.
![29](img/cluster-3.PNG)
Awesome! Now that we have a compute cluster, we need to load the data into Azure Machine Learning.
### 2.3 Loading the Dataset
1. In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, click on "Datasets" in the left menu and click on the "+ Create dataset" button to create a dataset. Choose the "From local files" option and select the Kaggle dataset we downloaded earlier.
![24](img/dataset-1.PNG)
![26](img/dataset-3.PNG)
Great! Now that the dataset is in place and the compute cluster is created, we can start the training of the model!
### 2.4 Low code/No Code training with AutoML
Traditional machine learning model development is resource-intensive, requiring significant domain knowledge and time to produce and compare dozens of models.

Automated machine learning (AutoML) is the process of automating the time-consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity, all while sustaining model quality. It reduces the time it takes to get production-ready ML models, with great ease and efficiency. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
1. In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, click on "Automated ML" in the left menu and select the dataset you just uploaded. Click Next.
![28](img/aml-2.PNG)
3. Choose "Classification" and Click Finish. This step might take between 30 min to 1 hour depending on your compute cluster size.
3. Choose "Classification" and Click Finish. This step might take between 30 minutes to 1 hour, depending upon your compute cluster size.
![30](img/aml-3.PNG)
![31](img/aml-4.PNG)
Here you can see a detailed description of the best model that AutoML generated. You can also explore other models generated in the Models tab. Take a few minutes to explore the models in the Explanations (preview) tab. Once you have chosen the model you want to use (here we will choose the best model selected by AutoML), we will see how to deploy it.
## 3. Low code/No Code model deployment and endpoint consumption
### 3.1 Model deployment
The automated machine learning interface allows you to deploy the best model as a web service in a few steps. Deployment is the integration of the model so that it can make predictions on new data and identify potential areas of opportunity. For this project, deployment to a web service means that medical applications will be able to consume the model to make live predictions of their patients' risk of having a heart attack.
In the best model description, click on the "Deploy" button.
![deploy-2](img/deploy-2.PNG)
16. Once it has been deployed, click on the Endpoint tab and then on the endpoint you just deployed. Here you can find all the details you need to know about the endpoint.
![deploy-3](img/deploy-3.PNG)
The `url` variable is the REST endpoint found in the consume tab, and the `api_key` variable is the primary key, also found in the consume tab.
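A minimal sketch of such a consumption script is shown below. The endpoint URL and key are placeholders you copy from the consume tab, and the payload follows the sample Azure ML autogenerates, with every field left at 0 or false (the field names follow the heart failure dataset columns):

```python
import urllib.request
import json

# Placeholders: copy both values from the endpoint's "Consume" tab
url = '<your-REST-endpoint>'
api_key = '<your-primary-key>'

# Default sample: every feature left at 0 / false
data = {
    "data": [
        {
            "age": 0, "anaemia": False, "creatinine_phosphokinase": 0,
            "diabetes": False, "ejection_fraction": 0, "high_blood_pressure": False,
            "platelets": 0, "serum_creatinine": 0, "serum_sodium": 0,
            "sex": False, "smoking": False, "time": 0
        }
    ]
}

body = str.encode(json.dumps(data))
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + api_key}

# Send the request and print the prediction returned by the model
req = urllib.request.Request(url, body, headers)
response = urllib.request.urlopen(req)
print(response.read())
```

Running the script should print the response from the endpoint: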
```python
b'"{\\"result\\": [true]}"'
```
This means that the prediction of heart failure for the data given is true. This makes sense because if you look more closely at the data automatically generated in the script, everything is at 0 and false by default. You can change the data with the following input sample:
```python
data = {
    ...
}
```

The script should return:
```python
b'"{\\"result\\": [true, false]}"'
```
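For illustration, such a two-patient input sample could look like the following sketch; the field names follow the heart failure dataset columns, and all the values are made up:

```python
# Hypothetical two-patient payload; tweak the values and observe the predictions
data = {
    "data": [
        {
            "age": 75, "anaemia": 1, "creatinine_phosphokinase": 582,
            "diabetes": 0, "ejection_fraction": 20, "high_blood_pressure": 1,
            "platelets": 265000, "serum_creatinine": 1.9, "serum_sodium": 130,
            "sex": 1, "smoking": 0, "time": 4
        },
        {
            "age": 45, "anaemia": 0, "creatinine_phosphokinase": 100,
            "diabetes": 0, "ejection_fraction": 50, "high_blood_pressure": 0,
            "platelets": 250000, "serum_creatinine": 1.0, "serum_sodium": 140,
            "sex": 0, "smoking": 0, "time": 100
        }
    ]
}
```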
Congratulations! You just consumed a model deployed and trained on Azure ML!
> **_NOTE:_** Once you are done with the project, don't forget to delete all the resources.
## 🚀 Challenge
Look closely at the model explanations and details that AutoML generated for the top models. Try to understand why the best model is better than the other ones. What algorithms were compared? What are the differences between them? Why is the best one performing better in this case?
## Post-Lecture Quiz
[Post-lecture quiz]()
1. What do I need to create before accessing Azure ML Studio?
1. TRUE: A workspace
2. A compute instance
3. A compute cluster
2. Which of the following tasks are supported by Automated ML?
1. Image generation
2. TRUE: Classification
3. Natural Language generation
3. In which case do you need GPU over CPU?
1. When you have tabular data
2. When you have enough money to afford it
3. TRUE: When you do Deep Learning
## Review & Self Study
In this lesson, you learned how to train, deploy and consume a model to predict heart failure risk in a Low code/No code fashion in the cloud. If you have not done so yet, dive deeper into the model explanations that AutoML generated for the top models and try to understand why the best model is better than the others.
You can go further into Low code/No code AutoML by reading this [documentation](https://docs.microsoft.com/azure/machine-learning/tutorial-first-experiment-automated-ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109).
## Assignment

## Instructions
We saw how to use the Azure ML platform to train, deploy and consume a model in a Low code/No code fashion. Now look around for some data that you could use to train another model, deploy it and consume it. You can look for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-40229-cxa&ocid=AID3041109).
## Rubric

# Data Science in the Cloud: The "Azure ML SDK" way
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/19-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Azure ML SDK - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Table of contents:
- [Data Science in the Cloud: The "Azure ML SDK" way](#data-science-in-the-cloud-the-azure-ml-sdk-way)
## Pre-Lecture Quiz
[Pre-lecture quiz]()
1. On what would increasing the cluster size NOT have an impact?
1. Responsiveness
2. Cost
3. TRUE: Model performance
2. What is a benefit of using low code tools?
1. TRUE: No coding expertise required
2. Automatically label the dataset
3. Better security of the model
3. What is AutoML?
1. A tool for automating the preprocessing of data
2. A tool for automating the deployment of models
3. TRUE: A tool for automating the development of models
## 1. Introduction
Key areas of the SDK include:
- Use automated machine learning, which accepts configuration parameters and training data. It automatically iterates through algorithms and hyperparameter settings to find the best model for running predictions.
- Deploy web services to convert your trained models into RESTful services that can be consumed in any application.
[Learn more about the Azure Machine Learning SDK](https://docs.microsoft.com/en-us/python/api/overview/azure/ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
In the [previous lesson](../18-tbd/README.md), we saw how to train, deploy and consume a model in a Low code/No code fashion. We used the Heart Failure dataset to generate a heart failure prediction model. In this lesson, we are going to do the exact same thing but using the Azure Machine Learning SDK.
Congratulations, you have just created a compute instance! We will use this compute instance to create a Notebook.
Refer to the [previous lesson](../18-tbd/README.md), section **2.3 Loading the Dataset**, if you have not uploaded the dataset yet.
### 2.4 Creating Notebooks
> **_NOTE:_** For the next step you can either create a new notebook from scratch, or you can upload the [notebook we created](notebook.ipynb) in your Azure ML Studio. To upload it, simply click on the "Notebook" menu and upload the notebook.
Notebooks are a really important part of the data science process. They can be used to conduct Exploratory Data Analysis (EDA), call out to a compute cluster to train a model, or call out to an inference cluster to deploy an endpoint.
To create a Notebook, we need a compute node that serves the Jupyter Notebook instance. Go back to the [Azure ML workspace](https://ml.azure.com/) and click on Compute instances. In the list of compute instances, you should see the [compute instance we created earlier](#22-create-a-compute-instance).
Now that we have a Notebook, we can start training the model with the Azure ML SDK.
### 2.5 Training a model
First of all, if you ever have a doubt, refer to the [Azure ML SDK documentation](https://docs.microsoft.com/en-us/python/api/overview/azure/ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109). It contains all the necessary information to understand the modules we are going to see in this lesson.
#### 2.5.1 Setup Workspace, experiment, compute cluster and dataset
To get or create an experiment from a workspace, you request the experiment using the experiment name.
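As a short sketch, run from a notebook on the compute instance (where the workspace configuration file is already available), with `aml-experiment` as an illustrative experiment name:

```python
from azureml.core import Workspace, Experiment

# Load the workspace from the config.json present on the compute instance
ws = Workspace.from_config()

# Returns the experiment if it exists, otherwise creates it
experiment = Experiment(ws, 'aml-experiment')
```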
Now you need to create a compute cluster for the training using the following code. Note that this step can take a few minutes.
```python
from azureml.core.compute import AmlCompute

aml_name = "heart-f-cluster"
try:
    # Reuse the cluster if it already exists in the workspace
    aml_compute = AmlCompute(ws, aml_name)
    print('Found existing AML compute context.')
except Exception:
    # Otherwise provision a new one; the sizing values here are illustrative
    print('Creating new AML compute context.')
    aml_config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_v2", min_nodes=0, max_nodes=3)
    aml_compute = AmlCompute.create(ws, name=aml_name, provisioning_configuration=aml_config)
    aml_compute.wait_for_completion(show_output=True)
```

You can then load the dataset registered earlier and take a quick look at it (the name below assumes the dataset was registered as `heart-failure-records`):

```python
dataset = ws.datasets['heart-failure-records']
df = dataset.to_pandas_dataframe()
df.describe()
```
#### 2.5.2 AutoML Configuration and training
To set the AutoML configuration, use the [AutoMLConfig class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig(class)?WT.mc_id=academic-40229-cxa&ocid=AID3041109).
As described in the documentation, there are a lot of parameters you can play with. For this project, we will use the following parameters:
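Here is a sketch of what that configuration and the run submission could look like; the settings values are illustrative, and `DEATH_EVENT` assumes the heart failure dataset's target column:

```python
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

# Illustrative AutoML settings; see the AutoMLConfig class for the full list
automl_settings = {
    "experiment_timeout_minutes": 20,
    "primary_metric": 'AUC_weighted',
    "n_cross_validations": 5,
}

automl_config = AutoMLConfig(task='classification',
                             compute_target=aml_compute,
                             training_data=dataset,
                             label_column_name='DEATH_EVENT',
                             **automl_settings)

# Submit the experiment to the compute cluster and monitor it from the notebook
remote_run = experiment.submit(automl_config)
RunDetails(remote_run).show()
```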
### 3.1 Saving the best model
The `remote_run` is an object of type [AutoMLRun](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?WT.mc_id=academic-40229-cxa&ocid=AID3041109). This object contains the method `get_output()`, which returns the best run and the corresponding fitted model.
```python
best_run, fitted_model = remote_run.get_output()
```
You can see the parameters used for the best model by just printing the `fitted_model`, and see the properties of the best model by using the [get_properties()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#azureml_core_Run_get_properties?WT.mc_id=academic-40229-cxa&ocid=AID3041109) method.
```python
best_run.get_properties()
```
Now register the model with the [register_model](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?view=azure-ml-py#register-model-model-name-none--description-none--tags-none--iteration-none--metric-none-?WT.mc_id=academic-40229-cxa&ocid=AID3041109) method.
```python
model_name = best_run.properties['model_name']
script_file_name = 'inference/score.py'
# Download the scoring script that AutoML generated for the best run
# (the 'outputs/scoring_file_v_1_0_0.py' source path is the AutoML default)
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)

model = best_run.register_model(model_name = model_name,
                                description = 'Best AutoML model for heart failure prediction',
                                tags = None)
```
### 3.2 Model Deployment
Once the best model is saved, we can deploy it with the [InferenceConfig](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.inferenceconfig?view=azure-ml-py?ocid=AID3041109) class. InferenceConfig represents the configuration settings for a custom environment used for deployment. The [AciWebservice](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice.aciwebservice?view=azure-ml-py) class represents a machine learning model deployed as a web service endpoint on Azure Container Instances. A deployed service is created from a model, script, and associated files. The resulting web service is a load-balanced, HTTP endpoint with a REST API. You can send data to this API and receive the prediction returned by the model.
The model is deployed using the [deploy](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py#deploy-workspace--name--models--inference-config-none--deployment-config-none--deployment-target-none--overwrite-false--show-output-false-?WT.mc_id=academic-40229-cxa&ocid=AID3041109) method.
```python
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

inference_config = InferenceConfig(entry_script=script_file_name, environment=best_run.get_environment())

# ACI deployment configuration; the resource sizes and service name below are illustrative
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

aci_service = Model.deploy(ws, 'heart-failure-predict', [model], inference_config, aci_config)
aci_service.wait_for_deployment(show_output=True)
```
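Once the deployment completes, you can call the service directly from the notebook. Below is a minimal consumption sketch with a single, illustrative patient record (field names follow the heart failure dataset columns; the values are made up):

```python
import json

# Illustrative single-patient input
data = {
    "data": [
        {
            "age": 45, "anaemia": 0, "creatinine_phosphokinase": 100,
            "diabetes": 0, "ejection_fraction": 50, "high_blood_pressure": 0,
            "platelets": 250000, "serum_creatinine": 1.0, "serum_sodium": 140,
            "sex": 0, "smoking": 0, "time": 100
        }
    ]
}

# Encode the payload and call the ACI web service deployed above
test_sample = str.encode(json.dumps(data))
response = aci_service.run(input_data=test_sample)
response
```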
This should output `'{"result": [false]}'`. This means that the patient input we sent to the endpoint generated the prediction `false`, which means this person is not likely to have a heart attack.
Congratulations! You just consumed the model deployed and trained on Azure ML with the Azure ML SDK!
> **_NOTE:_** Once you are done with the project, don't forget to delete all the resources.
## 🚀 Challenge
There are many other things you can do through the SDK; unfortunately, we cannot cover them all in this lesson. But the good news is that learning how to skim through the SDK documentation can take you a long way on your own. Have a look at the Azure ML SDK documentation and find the `Pipeline` class, which allows you to create pipelines. A Pipeline is a collection of steps that can be executed as a workflow.
**HINT:** Go to the [SDK documentation](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-40229-cxa&ocid=AID3041109) and type keywords in the search bar like "Pipeline". You should have the `azureml.pipeline.core.Pipeline` class in the search results.
## Post-Lecture Quiz
[Post-lecture quiz]()
1. What is the reason for creating an AutoMLConfig?
1. It is where the training and the testing data are split
2. TRUE: It provides all the details of your AutoML experiment
3. It is where you specify the model to be trained
2. Which of the following metrics is supported by Automated ML for a classification task?
1. TRUE: accuracy
2. r2_score
3. normalized_root_mean_error
3. What is NOT an advantage of using the SDK?
1. It can be used to automate multiple tasks and runs
2. It makes it easier to programmatically edit runs
3. It can be used through a Graphical User Interface
## Review & Self Study
In this lesson, you learned how to train, deploy and consume a model to predict heart failure risk with the Azure ML SDK in the cloud. Check this [documentation](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-40229-cxa&ocid=AID3041109) for further information about the Azure ML SDK. Try to create your own model with the Azure ML SDK.
## Assignment
[Data Science project using Azure ML SDK](assignment.md)

# Data Science project using Azure ML SDK
## Instructions
We saw how to use the Azure ML platform to train, deploy and consume a model with the Azure ML SDK. Now look around for some data that you could use to train another model, deploy it and consume it. You can look for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-40229-cxa&ocid=AID3041109).
## Rubric
| Exemplary | Adequate | Needs Improvement |
|-----------|----------|-------------------|
|When doing the AutoML Configuration, you went through the SDK documentation to see what parameters you could use. You ran a training on a dataset through AutoML using Azure ML SDK, and you checked the model explanations. You deployed the best model and you were able to consume it through the Azure ML SDK. | You ran a training on a dataset through AutoML using Azure ML SDK, and you checked the model explanations. You deployed the best model and you were able to consume it through the Azure ML SDK. | You ran a training on a dataset through AutoML using Azure ML SDK. You deployed the best model and you were able to consume it through the Azure ML SDK. |
