[Pre-lecture quiz]()
## What is Data?
In our everyday life, we are always surrounded by **data**. The text you are reading now is data, the list of your friends' phone numbers on your smartphone is data, as well as the current time displayed on your watch. As human beings, we naturally operate with data, counting the money we have or writing letters to our friends.
However, data became much more important with the creation of **computers**. The main role of computers is to perform *computations*, but they need data to operate on. Thus, we need to understand how computers store and process data.
With the emergence of the Internet, the role of computers as data handling devices increased. If you think about it, we now use computers more and more for data processing and communication, rather than actual computations. When we write an e-mail to a friend or search for some information on the Internet, we are essentially creating, storing, transmitting, and manipulating data.
> Can you remember the last time you used a computer to actually compute something?
## What is Data Science?
In [Wikipedia](https://en.wikipedia.org/wiki/Data_science), **Data Science** is defined as *a scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains*.
This definition highlights the following important aspects of data science:
* The main goal of data science is to **extract knowledge** from data, in other words - to **understand** data, find hidden relationships and build a **model**.
* Data science uses **scientific methods**, such as probability and statistics. In fact, when the term *data science* was first introduced, some people argued that data science was just a new fancy name for statistics. Nowadays it has become evident that the field is much broader.
* Obtained knowledge should be applied to produce some **actionable insights**.
* We should be able to operate on both **structured** and **unstructured** data. We will come back to discuss different types of data later in the course.
* **Application domain** is an important concept, and a data scientist often needs at least some degree of expertise in the problem domain.
> Another important aspect of Data Science is that it studies how data can be gathered, stored and operated upon using computers. While statistics gives us mathematical foundations, data science applies mathematical concepts to actually draw insights from data.
One of the ways (attributed to [Jim Gray](https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist))) to look at data science is to consider it a separate paradigm of science:
* **Empirical**, in which we rely mostly on observations and the results of experiments
* **Theoretical**, where new concepts emerge from existing scientific knowledge
* **Computational**, where we discover new principles based on some computational experiments
* **Data-Driven**, based on discovering relationships and patterns in the data
## Other Related Fields
Since data is a pervasive concept, data science itself is also a broad field, touching many other related disciplines.
<dl>
<dt>Databases</dt>
<dd>
The most obvious thing to consider is **how to store** the data, i.e. how to structure them in a way that allows faster processing. There are different types of databases that store structured and unstructured data, which [we will consider in our course](../../2-Working-With-Data/README.md).
</dd>
<dt>Big Data</dt>
<dd>
Often we need to store and process really large quantities of data with relatively simple structure. There are special approaches and tools to store that data in a distributed manner on a computer cluster, and process them efficiently.
</dd>
<dt>Machine Learning</dt>
<dd>
One of the ways to understand the data is to **build a model** that will be able to predict desired outcome. Being able to learn such models from data is the area studied in **machine learning**. You may want to have a look at our [Machine Learning for Beginners](https://github.com/microsoft/ML-For-Beginners/) Curriculum to get deeper into that field.
</dd>
<dt>Artificial Intelligence</dt>
<dd>
Like machine learning, artificial intelligence also relies on data, and it involves building highly complex models that exhibit behavior similar to that of a human being. Also, AI methods often allow us to turn unstructured data (e.g. natural language) into structured data by extracting some insights.
</dd>
<dt>Visualization</dt>
<dd>
Vast amounts of data are incomprehensible for a human being, but once we create useful visualizations, we can start making much more sense of the data and drawing conclusions. Thus, it is important to know many ways to visualize information - something that we will cover in [Section 3](../../3-Data-Visualization/README.md) of our course. Related fields also include **Infographics**, and **Human-Computer Interaction** in general.
</dd>
</dl>
## Types of Data
As we have already mentioned - data is everywhere, we just need to capture it in the right way! It is useful to distinguish between **structured** and **unstructured** data - the former is typically represented in some well-structured form, often as a table or number of tables, while the latter is just a collection of files. Sometimes we can also talk about **semi-structured** data, which has some sort of structure that may vary greatly.
| Structured | Semi-structured | Unstructured |
|----------- |-----------------|--------------|
| List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopædia Britannica |
| Temperature in all rooms of a building at every minute for the last 20 years | Collection of scientific papers in JSON format with authors, date of publication, and abstract | File share with corporate documents |
| Data for age and gender for all people entering the building | Internet pages | Raw video feed from surveillance camera |
## Where to get Data
There are many possible sources of data, and it would be impossible to list all of them! However, let's mention some of the typical places where you can get data:
## What you can do with Data
## Digitalization and Digital Transformation
In the last decade, many businesses started to understand the importance of data when making business decisions. To apply data science principles to running a business, one first needs to collect some data, i.e. somehow turn business processes into digital form. This is known as **digitalization**. Following it up with data science techniques to guide decisions often leads to a significant increase in productivity (or even a business pivot), called **digital transformation**.
Let's consider an example. Suppose we have a data science course (like this one), which we deliver online to students, and we want to use data science to improve it. How can we do it?
We can start by asking "what can be digitized?". The simplest way would be to measure the time it takes each student to complete each module, and the knowledge obtained (e.g. by giving a multiple-choice test at the end of each module). By averaging time-to-complete across all students, we can find out which modules cause students the most problems, and work on simplifying them.
> You may argue that this approach is not ideal, because modules can be of different lengths. It is probably fairer to divide the time by the length of the module (in number of characters), and compare those values instead.
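As a concrete sketch of that normalization step - assuming (hypothetically) that completion logs are collected in a pandas DataFrame; the column names and numbers below are invented for illustration:
```python
import pandas as pd
# Hypothetical completion log: one row per (student, module) pair
log = pd.DataFrame({
    'module':              ['intro', 'intro', 'stats', 'stats'],
    'time_minutes':        [35, 45, 90, 120],
    'module_length_chars': [12000, 12000, 30000, 30000],
})
# Normalize time-to-complete by module length, then average across students
log['time_per_char'] = log['time_minutes'] / log['module_length_chars']
print(log.groupby('module')['time_per_char'].mean())
```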
When we start analyzing results of multiple-choice tests, we can try to find out specific concepts that students understand poorly, and improve the content. To do that, we need to design tests in such a way that each question maps to a certain concept or chunk of knowledge.
If we want to get even more sophisticated, we can plot the time taken for each module against the age category of students. We might find out that for some age categories it takes an inappropriately long time to complete the module, or that students drop out at a certain point. This can help us provide age recommendations for the module, and minimize dissatisfaction caused by wrong expectations.
## 🚀 Challenge
In this challenge, we will try to find concepts relevant to the field of Data Science by looking at texts. We will take the Wikipedia article on Data Science, download and process the text, and then build a word cloud like this one:
![Word Cloud for Data Science](images/ds_wordcloud.png)
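One possible way to approach the challenge is sketched below - assuming the `requests`, `beautifulsoup4`, `wordcloud` and `matplotlib` packages are installed; the lesson notebook may structure this differently:
```python
import requests
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from wordcloud import WordCloud
# Fetch the Wikipedia article and strip the HTML markup
url = 'https://en.wikipedia.org/wiki/Data_science'
text = BeautifulSoup(requests.get(url).content, 'html.parser').get_text()
# Build and display the word cloud from the raw text
wc = WordCloud(background_color='white', width=800, height=400).generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
```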
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
## Assignment
[Assignment Title](assignment.md)

## 1. Ethics Fundamentals
[Back To Introduction](README.md#introduction)
### 1.1 What is Ethics?
The term "ethics" [comes from](https://en.wikipedia.org/wiki/Ethics) the Greek term "ethikos" - and its root "ethos", meaning _character or moral nature_. Think of ethics as the set of **shared values** or **moral principles** that govern our behavior in society. Our code of ethics is based on widely accepted ideas on what is _right vs. wrong_, creating informal rules (or "norms") that we follow voluntarily to ensure the good of the community.
Ethics is critical for scientific research and technology advancement. The [Research Ethics timeline](https://www.niehs.nih.gov/research/resources/bioethics/timeline/index.cfm) gives examples from the past four centuries - including Charles Babbage's 1830 [Reflections on the Decline of Science in England ..](https://books.google.com/books/about/Reflections_on_the_Decline_of_Science_in.html) where he discusses dishonesty in data science approaches including fabrication of data to support desired outcomes. Ethics became _guardrails_ to prevent data misuse and protect society from unintended consequences or harms.
_Applied Ethics_ is about the practical adoption of ethical principles and practices when developing new processes or products. It's about asking moral questions ("is this right or wrong?"), evaluating tradeoffs ("does this help or harm society more?") and taking informed actions to ensure compliance at individual and organizational levels. Ethics are *not* laws. But they can influence the creation of legal or social frameworks that support governance such as:
* **Professional codes of conduct** | For users or groups - e.g., the [Hippocratic Oath](https://en.wikipedia.org/wiki/Hippocratic_Oath) (460-370 BC) for medical ethics defined principles like data confidentiality (which led to _doctor-patient privilege_ laws) and non-maleficence (popularly known as _first, do no harm_) that are still adopted today.
* **Regulatory standards** | For organizations or industries e.g., The [1996 Health Insurance Portability and Accountability Act](https://en.wikipedia.org/wiki/Health_Insurance_Portability_and_Accountability_Act) (HIPAA) mandated theft and fraud protections for _personally identifiable information_ (PII) collected by the healthcare industry - and stipulated how that data could be used or disclosed.
### 1.2 What is Data Ethics?
Data ethics is the application of ethical considerations to the domain of big data and data-driven algorithms.
* [Wikipedia](https://en.wikipedia.org/wiki/Big_data_ethics) defines big data ethics as _systemizing, defending, and recommending, concepts of right and wrong conduct in relation to data_ - focusing on implications for **personal data**.
* A [Royal Society article](https://royalsocietypublishing.org/doi/full/10.1098/rsta.2016.0360#sec-1) defines data ethics as a new branch of ethics that _studies and evaluates moral problems related to **data, algorithms and corresponding practices** .. to formulate and support morally good solutions (right conduct or values)_.
The first definition puts it in perspective of users ("personal data") while the second puts it in perspective of operations ("data, algorithms, practices") where:
* `data` = generation, recording, curation, dissemination, sharing & usage
* `algorithms` = AI, agents, machine learning, bots
* `practices` = responsible innovation, ethical hacking, codes of conduct
Based on this, we can define data ethics as the study and evaluation of _moral questions_ related to data collection, algorithm development, and industry-wide models for governance. We'll explore these questions in the "how" section, but first let's talk about the "why".
### 1.3 Why Data Ethics?
To answer this question, let's look at recent trends in the big data and AI industries:
* [_Statista_](https://www.statista.com/statistics/871513/worldwide-data-created/) - By 2025, we will be creating and consuming over **180 zettabytes of data**.
* _[Gartner](https://www.gartner.com/smarterwithgartner/gartner-top-10-trends-in-data-and-analytics-for-2020/)_ - By 2022, 35% of large orgs will buy & sell data in **online Marketplaces and Exchanges**.
* _[Gartner](https://www.gartner.com/smarterwithgartner/2-megatrends-dominate-the-gartner-hype-cycle-for-artificial-intelligence-2020/)_ - AI **democratization and industrialization** are the new Hype Cycle megatrends.
The first trend tells us that _data scientists_ will have unprecedented levels of access to personal data at global scale, building algorithms to fuel an AI-driven economy. The second trend tells us that economies of scale and efficiencies in distribution will make it easier and cheaper for _developers_ to integrate AI into more everyday consumer experiences.
The potential for harm occurs when algorithms and AI get _weaponized_ against society in unforeseen ways. In [Weapons of Math Destruction](https://www.youtube.com/watch?v=TQHs8SA1qpk) author Cathy O'Neil talks about the three elements of AI algorithms that pose a danger to society: _opacity_, _scale_ and _damage_.
* **Opacity** refers to the black box nature of many algorithms - do we understand why a specific decision was made, and can we _explain or interpret_ the data reasoning that drove the predictions behind it?
* **Scale** refers to the speed with which algorithms can be deployed and replicated - how quickly can a minor algorithm design flaw get "baked in" with use, leading to irreversible societal harms to affected users?
* **Damage** refers to the social and economic impact of poor algorithmic decision-making - how can bad or unrepresentative data lead to unfair algorithms that disproportionately harm specific user groups?
So why does data ethics matter? Because the democratization of AI can speed up weaponization, creating harms at scale in the absence of ethical guardrails. Meanwhile, the industrialization of AI will motivate better governance - giving data ethics an important role in shaping policies and standards for developing responsible AI solutions.
### 1.4 How To Apply Ethics?
We know what data ethics is, and why it matters. But how do we _apply_ ethical principles or practices as data scientists or developers? It starts with us asking the right questions at every step of our data-driven pipelines and processes. These [Six questions about data science ethics](https://halpert3.medium.com/six-questions-about-data-science-ethics-252b5ae31fec) are a good starting point:
1. Is the data fair and unbiased?
2. Is the data being used fairly and ethically?
3. Is (user) privacy being protected?
4. To whom does data belong, company or user?
5. What effects do data and algorithms have on society?
6. Is the data manipulated or deceiving?
The [22 questions for ethics in data and AI](https://medium.com/the-organization/22-questions-for-ethics-in-data-and-ai-efb68fd19429) article expands this into a framework, grouping questions by stage of processing: _design_, _implementation & management_, _systems & organization_. The [O'Reilly Ethics and Data Science](https://resources.oreilly.com/examples/0636920203964/) book advocates strongly for _checklists_, asking simple `have we done this? (y/n)` questions that improve ethics oversight without the overheads caused by analysis paralysis.
And tools like [deon](https://deon.drivendata.org/) make it frictionless to integrate [ethics checklists](https://deon.drivendata.org/#data-science-ethics-checklist) into your project workflows. Deon builds on [industry practices](https://deon.drivendata.org/#checklist-citations), shares [real-world examples](https://deon.drivendata.org/examples/) that put the ethical challenges in context, and allows practitioners to derive custom checklists from the defaults, to suit specific scenarios or industries.
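As a minimal sketch of that workflow (based on deon's documented defaults - check the project docs for current options):
```
pip install deon
deon -o ETHICS.md   # writes the default data science ethics checklist to ETHICS.md
```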
### 1.5 Ethics Concepts
Ethics checklists often revolve around yes/no questions related to core ethics concepts and challenges. Let's look at _a subset_ of these issues - inspired in part by the [deon ethics checklist](https://deon.drivendata.org/#data-science-ethics-checklist) - in two contexts: data (collection and storage) and algorithms (analysis and modeling).
**Data Collection & Storage**
* _Ownership_: Does the user own the data? Or the organization? Is there an agreement that defines this?
* _Informed Consent_: Did human subjects give permission for data capture & understand purpose/usage?
* _Collection Bias_: Is data representative of audience? Did we identify and mitigate biases?
* _Data Security_: Is data stored and transmitted securely? Are valid access controls enforced?
* _Data Privacy_: Does data contain personally identifiable information? Is anonymity preserved?
* _Right to be Forgotten_: Does user have mechanism to request deletion of their personal information?
**Data Modeling & Analysis**
* _Data Validity_: Does data capture relevant features? Is it timely? Is the data model valid?
* _Misrepresentation_: Does analysis communicate honestly reported data in a deceptive manner?
* _Auditability_: Is the data analysis or algorithm design documented well enough to be reproducible later?
* _Explainability_: Can we explain why the data model or learning algorithm made a specific decision?
* _Fairness_: Is the model fair (e.g., shows similar accuracy) across diverse groups of affected users?
Finally, let's talk about two abstract concepts that often underlie users' ethics concerns around technology:
* **Trust**: Can we trust an organization with our personal data? Can we trust that algorithmic decisions are fair and do no harm? Can we trust that information is not misrepresented?
* **Choice**: Do I have free will when I make a choice in a consumer UI/UX? Are data-driven [choice architectures](https://en.wikipedia.org/wiki/Choice_architecture) nudging me towards good choices or are [dark patterns](https://www.darkpatterns.org/) working against my self-interest?
### 1.6 Ethics History
Knowing ethics concepts is one thing - understanding the intent behind them, and the potential harms or societal consequences they bring, is another. Let's look at some case studies that help frame ethics discussions in a more concrete way with real-world examples.
| Historical Example | Ethics Issues |
|---|---|
| _[Facebook Data Breach](https://www.npr.org/2021/04/09/986005820/after-data-breach-exposes-530-million-facebook-says-it-will-not-notify-users)_ exposes data for 530M users. Facebook pays $5B to FTC, does not notify users. | Data Privacy, Data Security, Transparency, Accountability |
| [Tuskegee Syphilis Study](https://en.wikipedia.org/wiki/Tuskegee_syphilis_experiment) - African-American men were enrolled in the study without being told its true purpose. Treatments were withheld. | Informed Consent, Fairness, Social / Economic Harms |
| [MIT Gender Shades Study](http://gendershades.org/index.html) - evaluated accuracy of industry AI gender classification models (used by law enforcement), detected bias | Fairness, Social/ Economic Harms, Collection Bias |
| [Learning app ABCmouse pays $10 million to settle FTC complaint it trapped parents in subscriptions they couldn't cancel](https://www.washingtonpost.com/business/2020/09/04/abcmouse-10-million-ftc-settlement/) - user experience masked context, nudged users towards choices with financial harms | Misrepresentation, Free Choice, Dark Patterns, Economic Harms |
| [Netflix Prize Dataset de-anonymized by correlation](https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/) - showed how the Netflix Prize dataset of ~500K users was easily de-anonymized by cross-correlation with public IMDb comments (and other such datasets) | Data Privacy, Anonymity, De-identification |
| [Georgia COVID-19 cases not declining as quickly as data suggested](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) - released graphs had an x-axis that was not ordered chronologically, misleading viewers | Misrepresentation, Social Harms |
We covered just a subset of examples, but recommend you explore these resources for more:
* [Ethics Unwrapped](https://ethicsunwrapped.utexas.edu/case-studies) - ethics dilemmas across diverse industries.
* [Data Science Ethics course](https://www.coursera.org/learn/data-science-ethics#syllabus) - landmark case studies in data ethics.
* [Where things have gone wrong](https://deon.drivendata.org/examples/) - deon checklist examples of ethical issues

## 2. Data Collection
[Back To Introduction](README.md)

## 3. Data Privacy
[Back To Introduction](README.md)

## 4. Algorithm Fairness
[Back To Introduction](README.md)

## 5. Societal Consequences
[Back To Introduction](README.md)

## 6. Summary & Resources
[Back To Introduction](README.md)

# Data Ethics
## Pre-Lecture Quiz 🎯
[Pre-lecture quiz]()
## Sketchnote 🖼
| A Visual Guide to Data Ethics by [Nitya Narasimhan](https://twitter.com/nitya) / [(@sketchthedocs)](https://sketchthedocs.dev)|
|---|
| <br/> |
## Introduction
What is ethics? What does data ethics mean, and how is it relevant to data scientists and developers in the context of big data, machine learning, and artificial intelligence? This lesson explores these ideas under the following sections:
* [**Fundamentals**](1-fundamentals) - Understand definitions, motivation and core concepts.
* [**Data Collection**](2-collection) - Explore data ethics issues around data ownership, user consent and control.
* [**Data Privacy**](3-privacy) - Understand degrees of privacy, challenges in anonymity and leakage, and user rights.
* [**Algorithm Fairness**](4-fairness) - Explore consequences & harms of algorithm bias and data misrepresentation.
* [**Societal Consequences**](5-consequences) - Explore socio-economic issues and case studies related to data ethics.
* [**Summary & Resources**](6-summary) - Wrap-up with a review of current data ethics practices and resources.
---
[1. Ethics Fundamentals](1-fundamentals.md ':include')
[2. Data Collection](2-collection.md ':include')
[3. Data Privacy](3-privacy.md ':include')
[4. Algorithm Fairness](4-fairness.md ':include')
[5. Societal Consequences](5-consequences.md ':include')
[6. Summary & Resources](6-summary.md ':include')
---
## Challenge 🚀
## Post-Lecture Quiz 🎯
[Post-lecture quiz]()
## Review & Self Study
---
# Assignment
[Assignment Title](assignment.md ':include')
---
# Resources
[Related Resources](resources.md ':include')

## Title
## Instructions

## Courses
## Articles

# A Brief Introduction to Statistics and Probability
Statistics and Probability Theory are two closely related areas of Mathematics that are highly relevant to Data Science. It is possible to operate with data without deep knowledge of mathematics, but it is still better to know at least some basic concepts. Here we will present a short introduction that will help you get started.
## Pre-Lecture Quiz
[Pre-lecture quiz]()
## Probability and Random Variables
**Probability** is a number between 0 and 1 that expresses how probable an **event** is. It is defined as the number of favorable outcomes (those that lead to the event), divided by the total number of outcomes, given that all outcomes are equally probable. For example, when we roll a die, the probability of getting an even number is 3/6 = 0.5.
When we talk about events, we use **random variables**. For example, the random variable that represents the number obtained when rolling a die takes values from 1 to 6. The set of numbers from 1 to 6 is called the **sample space**. We can talk about the probability of a random variable taking a certain value, for example P(X=3)=1/6.
The random variable in the previous example is called **discrete**, because it has a countable sample space, i.e. there are separate values that can be enumerated. There are cases when the sample space is a range of real numbers, or the whole set of real numbers. Such variables are called **continuous**. A good example is the time when the bus arrives.
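As a quick sanity check of the die example above, here is a minimal simulation sketch (assuming `numpy` is available):
```python
import numpy as np
# Simulate 100,000 rolls of a fair six-sided die
rolls = np.random.randint(1, 7, size=100_000)
# Fraction of even outcomes - should be close to 3/6 = 0.5
print((rolls % 2 == 0).mean())
```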
## Probability Distribution
In the case of discrete random variables, it is easy to describe the probability of each event by a function P(X). For each value *s* from the sample space *S*, it gives a number from 0 to 1, such that the sum of all values of P(X=s) over all events is 1.
The most well-known discrete distribution is the **uniform distribution**, in which there is a sample space of N elements, with an equal probability of 1/N for each of them.
It is more difficult to describe the probability distribution of a continuous variable, with values drawn from some interval [a,b] or from the whole set of real numbers &Ropf;. Consider the case of bus arrival time: in fact, for each exact arrival time $t$, the probability of the bus arriving at exactly that time is 0!
> Now you know that events with 0 probability happen, and very often! At least every time the bus arrives!
We can only talk about the probability of a variable falling in a given interval of values, e.g. P(t<sub>1</sub>&le;X&lt;t<sub>2</sub>). In this case, the probability distribution is described by a **probability density function** p(x), such that
$$P(t_1\le X<t_2)=\int_{t_1}^{t_2}p(x)dx$$
The continuous analog of the uniform distribution is called the **continuous uniform distribution**, which is defined on a finite interval [a,b]. The probability that the value X falls into an interval of length l is proportional to l, and the probability of falling anywhere within [a,b] is 1.
Another important distribution is **normal distribution**, which we will talk about in more detail below.
## Mean, Variance and Standard Deviation
Suppose we draw a sequence of n samples of a random variable X: x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>. We can define the **mean** (or **arithmetic average**) value of the sequence in the traditional way as (x<sub>1</sub>+x<sub>2</sub>+...+x<sub>n</sub>)/n. As we grow the size of the sample (i.e. take the limit with n&rarr;&infin;), we obtain the mean (also called the **expectation**) of the distribution. We will denote expectation by **E**(x).
> It can be demonstrated that for any discrete distribution with values {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>N</sub>} and corresponding probabilities p<sub>1</sub>, p<sub>2</sub>, ..., p<sub>N</sub>, the expectation equals E(X)=x<sub>1</sub>p<sub>1</sub>+x<sub>2</sub>p<sub>2</sub>+...+x<sub>N</sub>p<sub>N</sub>.
To identify how far the values are spread, we can compute the variance &sigma;<sup>2</sup> = &sum;(x<sub>i</sub> - &mu;)<sup>2</sup>/n, where &mu; is the mean of the sequence. The value &sigma; is called the **standard deviation**, and &sigma;<sup>2</sup> is called the **variance**.
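To connect these definitions to code, here is a minimal sketch (assuming `numpy`) that computes all three statistics for a small sample; note that `np.var` uses the same &sum;(x<sub>i</sub> - &mu;)<sup>2</sup>/n formula given above:
```python
import numpy as np
x = np.array([180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0])
print(np.mean(x))  # mean (arithmetic average)
print(np.var(x))   # variance σ²
print(np.std(x))   # standard deviation σ
```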
## Mode, Median and Quartiles
Sometimes, the mean does not adequately represent the "typical" value for data. For example, when there are a few extreme values that are completely out of range, they can skew the mean. Another good indicator is the **median**, a value such that half of the data points are lower than it, and the other half are higher.
To help us understand the distribution of data, it is helpful to talk about **quartiles**:
* The first quartile, or Q1, is a value such that 25% of the data fall below it
* The third quartile, or Q3, is a value such that 75% of the data fall below it
Graphically, we can represent the relationship between the median and quartiles in a diagram called a **box plot**:
![Box Plot](images/boxplot_explanation.png)
Here we also compute the **inter-quartile range** IQR=Q3-Q1, and so-called **outliers** - values that lie outside the boundaries [Q1-1.5*IQR, Q3+1.5*IQR].
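The quartiles, IQR and outlier boundaries described above can be computed as in this sketch (assuming `numpy`):
```python
import numpy as np
x = np.array([160.0, 176.0, 180.0, 185.0, 188.0, 197.0, 210.0, 215.0, 231.0])
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
# Values outside these boundaries are considered outliers
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, median, q3, (low, high))
```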
For a finite distribution that contains a small number of possible values, a good "typical" value is the one that appears most frequently, called the **mode**. It is often applied to categorical data, such as colors. Consider a situation where we have two groups of people - some that strongly prefer red, and others who prefer blue. If we code colors by numbers, the mean value for favourite color would be somewhere in the orange-green spectrum, which does not indicate the actual preference of either group. However, the mode would be one of the colors, or both colors if the number of people voting for them is equal (in this case we call the sample **multimodal**).
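A minimal sketch (assuming `pandas`) of finding the mode of categorical data, as in the color example above:
```python
import pandas as pd
colors = pd.Series(['red', 'blue', 'blue', 'red', 'blue', 'blue'])
print(colors.mode())  # ['blue']; returns several values if the sample is multimodal
```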
## Real-world Data
When we analyze real-life data, they are often not random variables as such, in the sense that we do not perform experiments with unknown results. For example, consider a team of baseball players and their body data, such as height, weight and age. Those numbers are not exactly random, but we can still apply the same mathematical concepts. For example, a sequence of people's weights can be considered to be a sequence of values drawn from some random variable. Below is a sequence of weights of actual baseball players from [Major League Baseball](http://mlb.mlb.com/index.jsp), taken from [this dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights) (for your convenience, only the first 20 values are shown):
```
[180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0, 188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]
```
Here is the box plot showing mean, median and quartiles for our data:
![Weight Box Plot](images/weight-boxplot.png)
> When working with real-world data, we assume that all data points are samples drawn from some probability distribution. This assumption allows us to apply machine learning techniques and build working predictive models.
To see the distribution of our data, we can plot a graph called a **histogram**. The x-axis contains a number of different weight intervals (so-called **bins**), and the vertical axis shows the number of times our random variable sample falls inside a given interval.
![Histogram of real world data](images/weight-histogram.png)
From this histogram you can see that all values are centered around a certain mean weight, and the further we go from that weight, the fewer weights of that value are encountered. I.e., it is very improbable for the weight of a baseball player to be very different from the mean weight. The variance of weights shows the extent to which weights are likely to differ from the mean.
> If we take weights of other people, not from the baseball league, the distribution is likely to be different. The shape of the distribution may stay the same, but the mean and variance will change. So, if we train our model on baseball players, it is likely to give wrong results when applied to students of a university, because the underlying distribution is different.
## Normal Distribution
The distribution of weights that we have seen above is very typical, and many measurements from the real world follow the same type of distribution, but with a different mean and variance. This distribution is called the **normal distribution**, and it plays a very important role in statistics.
Using the normal distribution is a correct way to generate random weights of potential baseball players. Once we know the mean weight `mean` and the standard deviation `std`, we can generate 1000 weight samples in the following way:
```python
import numpy as np
# `mean` and `std` are assumed known (see text above)
samples = np.random.normal(mean, std, 1000)
```
If we plot a histogram of the generated samples, we will see a picture very similar to the one shown above. And if we increase the number of samples and the number of bins, we can generate a picture of a normal distribution that is closer to the ideal:
![Normal Distribution with mean=0 and std.dev=1](images/normal-histogram.png)
*Normal Distribution with mean=0 and std.dev=1*
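The plotting step referenced above can be sketched as follows (assuming `numpy` and `matplotlib`; `mean` and `std` here are placeholder values):
```python
import numpy as np
import matplotlib.pyplot as plt
mean, std = 0, 1  # placeholder parameters
samples = np.random.normal(mean, std, 10_000)
plt.hist(samples, bins=100)  # more samples and bins give a smoother bell curve
plt.show()
```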
## Law of Large Numbers and Central Limit Theorem
One of the reasons why the normal distribution is so important is the so-called **central limit theorem**. Suppose we have a large sample of N independent values X<sub>1</sub>, ..., X<sub>N</sub>, sampled from any distribution with mean &mu; and variance &sigma;<sup>2</sup>. Then, for sufficiently large N (in other words, when N&rarr;&infin;), the sample mean (&Sigma;<sub>i</sub>X<sub>i</sub>)/N would be normally distributed, with mean &mu; and variance &sigma;<sup>2</sup>/N.
> Another way to interpret the central limit theorem is to say that regardless of the original distribution, when you compute the mean of a large number of random variable values you end up with a normal distribution.
From the central limit theorem it also follows that, when N&rarr;&infin;, the probability of the sample mean being close to &mu; approaches 1. This is known as **the law of large numbers**.
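A short simulation sketch (assuming `numpy` and `matplotlib`) makes the theorem tangible: means of samples drawn from a *uniform* distribution still pile up in a bell shape around &mu;=0.5:
```python
import numpy as np
import matplotlib.pyplot as plt
N = 100  # size of each sample
# Compute the mean of N uniform values, repeated 10,000 times
means = [np.random.uniform(0, 1, N).mean() for _ in range(10_000)]
plt.hist(means, bins=50)  # approximately normal, centered at 0.5
plt.show()
```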
## Covariance and Correlation
One of the things Data Science does is find relations between data. We say that two sequences **correlate** when they exhibit similar behavior at the same time, i.e. they either rise/fall simultaneously, or one sequence rises when the other falls and vice versa. In other words, there seems to be some relation between the two sequences.
> Correlation does not necessarily indicate a causal relationship between two sequences; sometimes both variables can depend on some external cause, or it can be purely by chance that the two sequences correlate. However, strong mathematical correlation is a good indication that two variables are somehow connected.
Mathematically, the main concept that shows the relation between two random variables is **covariance**, computed like this: Cov(X,Y) = **E**\[(X-**E**(X))(Y-**E**(Y))\]. We compute the deviation of both variables from their mean values, and then the product of those deviations. If both variables deviate together, the product will always be a positive value that adds up to a positive covariance. If the variables deviate out-of-sync (i.e. one falls below average when the other rises above average), we will always get negative numbers that add up to a negative covariance. If the deviations are not dependent, they will add up to roughly zero.
The absolute value of covariance does not tell us much about how strong the correlation is, because it depends on the magnitude of the actual values. To normalize it, we can divide the covariance by the standard deviations of both variables, to get **correlation**. The good thing is that correlation is always in the range [-1,1], where 1 indicates a strong positive correlation between values, -1 a strong negative correlation, and 0 no correlation at all (variables are independent).
**Example**: We can compute the correlation between the weights and heights of baseball players from the dataset mentioned above:
```python
import numpy as np
# weights and heights are arrays of player data from the dataset above
print(np.corrcoef(weights, heights))
```
As a result, we get a **correlation matrix** like this one:
```
array([[1.        , 0.52959196],
       [0.52959196, 1.        ]])
```
> A correlation matrix C can be computed for any number of input sequences S<sub>1</sub>, ..., S<sub>n</sub>. The value of C<sub>ij</sub> is the correlation between S<sub>i</sub> and S<sub>j</sub>, and the diagonal elements are always 1 (the self-correlation of S<sub>i</sub>).
In our case, the value 0.53 indicates that there is some correlation between the weight and height of a person. We can also make a scatter plot of one value against the other to see the relationship visually:
![Relationship between weight and height](images/weight-height-relationship.png)
> More examples of correlation and covariance can be found in [accompanying notebook](notebook.ipynb).
## 🚀 Challenge

# Introduction
[Brief description about the lessons in this section]
### Topics
1. [Defining Data Science](01-defining-data-science/README.md)
2. [Data Science Ethics](02-ethics/README.md)
3. [Defining Data](03-defining-data/README.md)
4. [Introduction to Statistics and Probability](04-stats-and-probability/README.md)
### Credits

In this lesson, you will use a different nature-focused dataset to visualize proportions, such as how many different types of fungi populate a given dataset about mushrooms. Let's explore these fascinating fungi using a dataset sourced from Audubon listing details about 23 species of gilled mushrooms in the Agaricus and Lepiota families. You will experiment with tasty visualizations such as:
- Pie charts 🥧
- Donut charts 🍩
- Waffle charts 🧇
## Pre-Lecture Quiz
Now, if you print out the mushrooms data, you can see that it has been grouped into categories...
If you follow the order presented in this table to create your class category labels, you can build a pie chart:
## Pie!
```python
labels=['Edible','Poisonous']
plt.pie(edibleclass['population'],labels=labels,autopct='%.1f %%')
plt.title('Edible?')
plt.show()
```
Voila, a pie chart showing the proportions of this data according to these two classes of mushroom. It's quite important to get the order of labels correct, especially here, so be sure to verify the order with which the label array is built!
![pie chart](images/pie1.png) ![pie chart](images/pie1.png)
## Donuts!
A somewhat more visually interesting pie chart is a donut chart, which is a pie chart with a hole in the middle. Let's look at our data using this method.
Take a look at the various habitats where mushrooms grow.
```python
habitat=mushrooms.groupby(['habitat']).count()
habitat
```
Here, you are grouping your data by habitat. There are 7 listed, so use those as labels for your donut chart:
```python
labels=['Grasses','Leaves','Meadows','Paths','Urban','Waste','Wood']
plt.pie(habitat['class'], labels=labels,
autopct='%1.1f%%', pctdistance=0.85)
center_circle = plt.Circle((0, 0), 0.40, fc='white')
fig = plt.gcf()
fig.gca().add_artist(center_circle)
plt.title('Mushroom Habitats')
plt.show()
```
![donut chart](images/donut.png)
This code draws the pie chart and a white center circle, then adds that circle to the figure to create the donut hole. Edit the width of the hole by changing `0.40` to another value.
Donut charts can be tweaked several ways to change the labels. The labels in particular can be highlighted for readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut).
Now that you know how to group your data and then display it as a pie or donut, you can explore other types of charts. Try a waffle chart, which is just a different way of exploring quantity.
## Waffles!
A 'waffle' type chart is a different way to visualize quantities as a 2D array of squares. Try visualizing the different quantities of mushroom cap colors in this dataset. To do this, you need to install a helper library called [PyWaffle](https://pypi.org/project/pywaffle/) and use Matplotlib:
```
pip install pywaffle
```
Select a segment of your data to group:
```python
capcolor=mushrooms.groupby(['cap-color']).count()
capcolor
```
Create a waffle chart by creating labels and then grouping your data:
```python
import pandas as pd
import matplotlib.pyplot as plt
from pywaffle import Waffle
data = {'color': ['brown', 'buff', 'cinnamon', 'green', 'pink', 'purple', 'red', 'white', 'yellow'],
        'amount': capcolor['class']}
df = pd.DataFrame(data)
fig = plt.figure(
    FigureClass=Waffle,
    rows=100,
    values=df.amount,
    labels=list(df.color),
    figsize=(30, 30),
    colors=["brown", "tan", "maroon", "green", "pink", "purple", "red", "whitesmoke", "yellow"],
)
```
Using a waffle chart, you can plainly see the proportions of cap color of this mushroom dataset. Interestingly, there are many green-capped mushrooms!
![waffle chart](images/waffle.png)
✅ Pywaffle supports icons within the charts that use any icon available in [Font Awesome](https://fontawesome.com/). Do some experiments to create an even more interesting waffle chart using icons instead of squares.
In this lesson you learned three ways to visualize proportions. First, you need to group your data into categories and then decide which is the best way to display the data - pie, donut, or waffle. All are delicious and gratify the user with an instant snapshot of a dataset.
## 🚀 Challenge


The accompanying `package-lock.json` changes bump the following dev dependencies (with resolved URLs and integrity hashes updated accordingly):
| Package | From | To |
|---------|------|-----|
| `path-parse` | 1.0.6 | 1.0.7 |
| `url-parse` | 1.5.1 | 1.5.3 |
| `vue-loader-v16` (npm:vue-loader) | 16.3.0 | 16.5.0 |
| `chalk` | 4.1.1 | 4.1.2 |
