The term "ethics" [comes from](https://en.wikipedia.org/wiki/Ethics) the Greek term "ethikos" - and its root "ethos", meaning _character or moral nature_. Think of ethics as the set of **shared values** or **moral principles** that govern our behavior in society. Our code of ethics is based on widely accepted ideas on what is _right vs. wrong_, creating informal rules (or "norms") that we follow voluntarily to ensure the good of the community.
Ethics is critical for scientific research and technology advancement. The [Research Ethics timeline](https://www.niehs.nih.gov/research/resources/bioethics/timeline/index.cfm) gives examples from the past four centuries - including Charles Babbage's 1830 [Reflections on the Decline of Science in England ..](https://books.google.com/books/about/Reflections_on_the_Decline_of_Science_in.html) where he discusses dishonesty in data science approaches including fabrication of data to support desired outcomes. Ethics became _guardrails_ to prevent data misuse and protect society from unintended consequences or harms.
_Applied Ethics_ is about the practical adoption of ethical principles and practices when developing new processes or products. It's about asking moral questions ("is this right or wrong?"), evaluating tradeoffs ("does this help or harm society more?") and taking informed actions to ensure compliance at individual and organizational levels. Ethics are *not* laws. But they can influence the creation of legal or social frameworks that support governance such as:
* **Professional codes of conduct** - for users or groups, e.g., the [Hippocratic Oath](https://en.wikipedia.org/wiki/Hippocratic_Oath) (460-370 BC) for medical ethics defined principles like data confidentiality (led to _doctor-patient privilege_ laws) and non-maleficence (popularly known as _first, do no harm_) that are still adopted today.
* **Regulatory standards** - for organizations or industries, e.g., the [1996 Health Insurance Portability and Accountability Act](https://en.wikipedia.org/wiki/Health_Insurance_Portability_and_Accountability_Act) (HIPAA) mandated theft and fraud protections for _personally identifiable information_ (PII) collected by the healthcare industry - and stipulated how that data could be used or disclosed.
### 1.2 What is Data Ethics?
Data ethics is the application of ethics considerations to the domain of big data and data-driven algorithms.
* [Wikipedia](https://en.wikipedia.org/wiki/Big_data_ethics) defines big data ethics as _systemizing, defending, and recommending, concepts of right and wrong conduct in relation to data_ - focusing on implications for **personal data**.
* A [Royal Society article](https://royalsocietypublishing.org/doi/full/10.1098/rsta.2016.0360#sec-1) defines data ethics as a new branch of ethics that _studies and evaluates moral problems related to **data, algorithms and corresponding practices** .. to formulate and support morally good solutions (right conduct or values)_.
The first definition puts it in perspective of users ("personal data") while the second puts it in perspective of operations ("data, algorithms, practices") where:
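* `data` = generation, recording, curation, dissemination, sharing and usage
* `algorithms` = AI, machine learning, bots
* `practices` = responsible innovation, ethical hacking, codes of conduct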
Based on this, we can define data ethics as the study and evaluation of _moral questions_ related to data collection, algorithm development, and industry-wide models for governance. We'll explore these questions in the "how" section, but first let's talk about the "why".
### 1.3 Why Data Ethics?
To answer this question, let's look at recent trends in the big data and AI industries:
* [_Statista_](https://www.statista.com/statistics/871513/worldwide-data-created/) - By 2025, we will be creating and consuming over **180 zettabytes of data**.
* _[Gartner](https://www.gartner.com/smarterwithgartner/gartner-top-10-trends-in-data-and-analytics-for-2020/)_ - By 2022, 35% of large orgs will buy & sell data in **online Marketplaces and Exchanges**
* _[Gartner](https://www.gartner.com/smarterwithgartner/2-megatrends-dominate-the-gartner-hype-cycle-for-artificial-intelligence-2020/)_ - AI **democratization and industrialization** are the new Hype Cycle megatrends.
The first trend tells us that _data scientists_ will have unprecedented levels of access to personal data at global scale, building algorithms to fuel an AI-driven economy. The second trend tells us that economies of scale and efficiencies in distribution will make it easier and cheaper for _developers_ to integrate AI into more everyday consumer experiences.
The potential for harm occurs when algorithms and AI get _weaponized_ against society in unforeseen ways. In [Weapons of Math Destruction](https://www.youtube.com/watch?v=TQHs8SA1qpk) author Cathy O'Neil talks about the three elements of AI algorithms that pose a danger to society: _opacity_, _scale_ and _damage_.
* **Opacity** refers to the black box nature of many algorithms - do we understand why a specific decision was made, and can we _explain or interpret_ the data reasoning that drove the predictions behind it?
* **Scale** refers to the speed with which algorithms can be deployed and replicated - how quickly can a minor algorithm design flaw get "baked in" with use, leading to irreversible societal harms to affected users?
* **Damage** refers to the social and economic impact of poor algorithmic decision-making - how can bad or unrepresentative data lead to unfair algorithms that disproportionately harm specific user groups?
So why does data ethics matter? Because democratization of AI can speed up weaponization, creating harms at scale in the absence of ethical guardrails. Meanwhile, industrialization of AI will motivate better governance, giving data ethics an important role in shaping policies and standards for developing responsible AI solutions.
### 1.4 How To Apply Ethics?
We know what data ethics is, and why it matters. But how do we _apply_ ethical principles or practices as data scientists or developers? It starts with us asking the right questions at every step of our data-driven pipelines and processes. These [Six questions about data science ethics](https://halpert3.medium.com/six-questions-about-data-science-ethics-252b5ae31fec) are a good starting point:
1. Is the data fair and unbiased?
2. Is the data being used fairly and ethically?
3. Is (user) privacy being protected?
4. To whom does data belong, company or user?
5. What effects do data and algorithms have on society?
6. Is the data manipulated or deceiving?
The [22 questions for ethics in data and AI](https://medium.com/the-organization/22-questions-for-ethics-in-data-and-ai-efb68fd19429) article expands this into a framework, grouping questions by stage of processing: _design_, _implementation & management_, _systems & organization_. The [O'Reilly Ethics and Data Science](https://resources.oreilly.com/examples/0636920203964/) book advocates strongly for _checklists_, asking simple `have we done this? (y/n)` questions that improve ethics oversight without the overheads caused by analysis paralysis.
And tools like [deon](https://deon.drivendata.org/) make it frictionless to integrate [ethics checklists](https://deon.drivendata.org/#data-science-ethics-checklist) into your project workflows. Deon builds on [industry practices](https://deon.drivendata.org/#checklist-citations), shares [real-world examples](https://deon.drivendata.org/examples/) that put the ethical challenges in context, and allows practitioners to derive custom checklists from the defaults, to suit specific scenarios or industries.
### 1.5 Ethics Concepts
Ethics checklists often revolve around yes/no questions related to core ethics concepts and challenges. Let's look at _a subset_ of these issues - inspired in part by the [deon ethics checklist](https://deon.drivendata.org/#data-science-ethics-checklist) - in two contexts: data (collection and storage) and algorithms (analysis and modeling).
**Data Collection & Storage**
* _Ownership_: Does the user own the data? Or the organization? Is there an agreement that defines this?
* _Informed Consent_: Did human subjects give permission for data capture & understand purpose/usage?
* _Collection Bias_: Is data representative of audience? Did we identify and mitigate biases?
* _Data Security_: Is data stored and transmitted securely? Are valid access controls enforced?
* _Data Privacy_: Does data contain personally identifiable information? Is anonymity preserved?
* _Right to be Forgotten_: Does user have mechanism to request deletion of their personal information?
**Data Modeling & Analysis**
* _Data Validity_: Does data capture relevant features? Is it timely? Is the data model valid?
* _Misrepresentation_: Does analysis communicate honestly reported data in a deceptive manner?
* _Auditability_: Is the data analysis or algorithm design documented well enough to be reproducible later?
* _Explainability_: Can we explain why the data model or learning algorithm made a specific decision?
* _Fairness_: Is the model fair (e.g., shows similar accuracy) across diverse groups of affected users?
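The _Fairness_ question above ("shows similar accuracy across diverse groups") can be turned into a simple measurement. The following is a minimal sketch, not a complete fairness audit: the function names `accuracy_by_group` and `accuracy_gap`, the group tags and the toy data are all hypothetical, and accuracy parity is only one of several possible fairness criteria.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for paired labels, predictions and group tags."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def accuracy_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical toy data, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group, accuracy_gap(per_group))
```

A large gap does not prove the model is unfair, but it is a signal that the question deserves a closer look before release.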
Finally, let's talk about two abstract concepts that often underlie users' ethics concerns around technology:
* **Trust**: Can we trust an organization with our personal data? Can we trust that algorithmic decisions are fair and do no harm? Can we trust that information is not misrepresented?
* **Choice**: Do I have free will when I make a choice in a consumer UI/UX? Are data-driven [choice architectures](https://en.wikipedia.org/wiki/Choice_architecture) nudging me towards good choices or are [dark patterns](https://www.darkpatterns.org/) working against my self-interest?
### 1.6 Ethics History
Knowing ethics concepts is one thing - understanding the intent behind them, and the potential harms or societal consequences they bring, is another. Let's look at some case studies that help frame ethics discussions in a more concrete way with real-world examples.
| Historical Example | Ethics Issues |
|---|---|
| _[Facebook Data Breach](https://www.npr.org/2021/04/09/986005820/after-data-breach-exposes-530-million-facebook-says-it-will-not-notify-users)_ exposes data for 530M users. Facebook, which had agreed to a $5B FTC settlement in 2019 over earlier privacy violations, does not notify users. | Data Privacy, Data Security, Transparency, Accountability |
| [Tuskegee Syphilis Study](https://en.wikipedia.org/wiki/Tuskegee_syphilis_experiment) - African-American men were enrolled in study without being told its true purpose. Treatments were withheld. | Informed Consent, Fairness, Social / Economic Harms |
| [MIT Gender Shades Study](http://gendershades.org/index.html) - evaluated accuracy of industry AI gender classification models (used by law enforcement), detected bias | Fairness, Social/ Economic Harms, Collection Bias |
| [Learning app ABCmouse pays $10 million to settle FTC complaint it trapped parents in subscription they couldn’t cancel](https://www.washingtonpost.com/business/2020/09/04/abcmouse-10-million-ftc-settlement/) - user experience masked context, nudged user towards choices with financial harms| Misrepresentation, Free Choice, Dark Patterns, Economic Harms |
| [Netflix Prize Dataset de-anonymized by correlation](https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/) - showed how Netflix prize dataset of 500K customers was easily de-anonymized by cross-correlation with public IMDb comments (and other such datasets) | Data Privacy, Anonymity, De-identification |
| [Georgia COVID-19 cases not declining as quickly as data suggested](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) - graphs released had x-axis not ordered chronologically, misleading viewers| Misrepresentation, Social Harms |
We covered just a subset of examples, but recommend you explore these resources for more:
* [Ethics Unwrapped](https://ethicsunwrapped.utexas.edu/case-studies) - ethics dilemmas across diverse industries.
* [Data Science Ethics course](https://www.coursera.org/learn/data-science-ethics#syllabus) - landmark case studies in data ethics.
* [Where things have gone wrong](https://deon.drivendata.org/examples/) - deon checklist examples of ethical issues
> A Visual Guide to Data Ethics by [Nitya Narasimhan](https://twitter.com/nitya) / [(@sketchthedocs)](https://sketchthedocs.dev)
## 1. Introduction
This lesson will look at the field of _data ethics_ - from core concepts (ethical challenges & societal consequences) to applied ethics (ethical principles, practices and culture). What is ethics? What does data ethics mean, and how is it relevant to data scientists and developers in the context of big data, machine learning, and artificial intelligence? This lesson explores these ideas under the following sections:
* [**Fundamentals**](1-fundamentals) - Understand definitions, motivation and core concepts.
* [**Data Collection**](2-collection) - Explore data ethics issues around data ownership, user consent and control.
* [**Data Privacy**](3-privacy) - Understand degrees of privacy, challenges in anonymity and leakage, and user rights.
* [**Algorithm Fairness**](4-fairness) - Explore consequences & harms of algorithm bias and data misrepresentation.
* [**Societal Consequences**](5-consequences) - Explore socio-economic issues and case studies related to data ethics.
* [**Summary & Resources**](6-summary) - Wrap-up with a review of current data ethics practices and resources.

Let's start with the basics: definitions and motivations.
### 1.1 Definitions
---
**Ethics** [comes from the Greek word "ethikos" and its root "ethos"](https://en.wikipedia.org/wiki/Ethics). It refers to the set of _shared values and moral principles_ that govern our behavior in society and is based on widely-accepted ideas of _right vs. wrong_. Ethics are not laws! They can't be legally enforced but they can influence corporate initiatives and government regulations that help with compliance and governance.
**Data Ethics** is [defined as a new branch of ethics](https://royalsocietypublishing.org/doi/full/10.1098/rsta.2016.0360#sec-1) that "studies and evaluates moral problems related to _data, algorithms and corresponding practices_ .. to formulate and support morally good solutions" where:
* `data` = generation, recording, curation, dissemination, sharing and usage
* `algorithms` = AI, machine learning, bots
* `practices` = responsible innovation, ethical hacking, codes of conduct
**Applied Ethics** is the [_practical application of moral considerations_](https://en.wikipedia.org/wiki/Applied_ethics). It focuses on understanding how ethical issues impact real-world actions, products and processes, by asking moral questions - like _"is this fair?"_ and _"how can this harm individuals or society as a whole?"_ - when working with big data and AI algorithms. Applied ethics practices can then focus on taking corrective measures - like employing checklists (_"did we test data model accuracy with diverse groups, for fairness?"_) - to minimize or prevent any unintended consequences.
**Ethics Culture**: Applied ethics focuses on identifying moral questions and adopting ethically-motivated actions with respect to real-world scenarios and projects. Ethics culture is about _operationalizing_ these practices, collaboratively and at scale, to ensure governance across organizations and industries. [Establishing an ethics culture](https://hbr.org/2019/05/how-to-design-an-ethical-organization) requires identifying and addressing _systemic_ issues (historical or ingrained) and creating norms & incentives that keep members accountable for adherence to ethical principles.
### 1.2 Motivation
Let's look at some emerging trends in big data and AI:
* [By 2022](https://www.gartner.com/smarterwithgartner/gartner-top-10-trends-in-data-and-analytics-for-2020/) one-in-three large organizations will buy and sell data via online Marketplaces and Exchanges.
* [By 2025](https://www.statista.com/statistics/871513/worldwide-data-created/) we'll be creating and consuming over 180 zettabytes of data.
**Data scientists** will have unimaginable levels of access to personal and behavioral data, helping them develop the algorithms to fuel an AI-driven economy. This raises data ethics issues around _protection of data privacy_ with implications for individual rights around personal data collection and usage.
**App developers** will find it easier and cheaper to integrate AI into everyday consumer experiences, thanks to the economies of scale and efficiencies of distribution in centralized exchanges. This raises ethical issues around the [_weaponization of AI_](https://www.youtube.com/watch?v=TQHs8SA1qpk) with implications for societal harms caused by unfairness, misrepresentation and systemic biases.
**Democratization and Industrialization of AI** are seen as the two megatrends in Gartner's 2020 [Hype Cycle for AI](https://www.gartner.com/smarterwithgartner/2-megatrends-dominate-the-gartner-hype-cycle-for-artificial-intelligence-2020/), shown below. The first positions developers to be a major force in driving increased AI adoption, while the second makes responsible AI and governance a priority for industries.
Data ethics are now **necessary guardrails** ensuring developers ask the right moral questions and adopt the right practices (to uphold ethical values). And they influence the regulations and frameworks defined (for governance) by governments and organizations.
## 2. Core Concepts
A data ethics culture requires an understanding of three things: the _shared values_ we embrace as a society, the _moral questions_ we ask (to ensure adherence to those values), and the potential _harms & consequences_ (of non-adherence).
### 2.1 Ethical AI Values
Our shared values reflect our ideas of wrong-vs-right when it comes to big data and AI. Different organizations have their own views of what responsible AI and ethical AI principles look like.
Here is an example - the [Responsible AI Framework](https://docs.microsoft.com/en-gb/azure/cognitive-services/personalizer/media/ethics-and-responsible-use/ai-values-future-computed.png) from Microsoft defines 6 core ethics principles for all products and processes to follow, when implementing AI solutions:
* **Accountability**: ensure AI designers & developers take _responsibility_ for its operation.
* **Transparency**: make AI operations and decisions _understandable_ to users.
* **Fairness**: understand biases and ensure AI _behaves comparably_ across target groups.
* **Reliability & Safety**: make sure AI behaves consistently, and _without malicious intent_.
* **Security & Privacy**: get _informed consent_ for data collection, provide data privacy controls.
* **Inclusiveness**: adapt AI behaviors to _broad range of human needs_ and capabilities.

Note that accountability and transparency are _cross-cutting_ concerns that are foundational to the top 4 values, and can be explored in their contexts. In the next section we'll look at the ethical challenges (moral questions) raised in two core contexts:
* Data Privacy - focused on **personal data** collection & use, with consequences to individuals.
* Fairness - focused on **algorithm** design & use, with consequences to society at large.
### 2.2 Ethics of Personal Data
[Personal data](https://en.wikipedia.org/wiki/Personal_data) or personally-identifiable information (PII) is _any data that relates to an identified or identifiable living individual_. It can also [extend to diverse pieces of non-personal data](https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en) that collectively can lead to the identification of a specific individual. Examples include: participant data from research studies, social media interactions, mobile & web app data, online commerce transactions and more.
Here are _some_ ethical concepts and moral questions to explore in context:
* **Data Ownership**. Who owns the data - user or organization? How does this impact users' rights?
* **Informed Consent**. Did users give permissions for data capture? Did they understand purpose?
* **Intellectual Property**. Does data have economic value? What are the users' rights & controls?
* **Data Privacy**. Is data secured from hacks/leaks? Is anonymity preserved on data use or sharing?
* **Right to be Forgotten**. Can user request their data be deleted or removed to reclaim privacy?
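The _Data Privacy_ question above ("is anonymity preserved on data use or sharing?") can be probed with a simple group-size check: if some combination of seemingly harmless attributes is shared by only one record, that record may be re-identifiable by linking it with other datasets. The sketch below is illustrative only; the function name `smallest_group_size`, the field names and the records are hypothetical, and passing this check does not by itself guarantee anonymity.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values.

    Assumes `records` is a non-empty list of dicts.
    """
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    return min(Counter(keys).values())


# Hypothetical records with no names, but with attributes that could be linked elsewhere.
records = [
    {"zip": "02139", "age_band": "30-39", "rating": 5},
    {"zip": "02139", "age_band": "30-39", "rating": 2},
    {"zip": "02139", "age_band": "40-49", "rating": 4},
]

k = smallest_group_size(records, ["zip", "age_band"])
print(k)  # a value of 1 means at least one record is unique on these attributes
```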
### 2.3 Ethics of Algorithms
Algorithm design begins with collecting & curating datasets relevant to a specific AI problem or domain, then processing & analyzing it to create models that can help predict outcomes or automate decisions in real-world applications. Moral questions can now arise in various contexts, at any one of these stages.
Here are _some_ ethical concepts and moral questions to explore in context:
* **Dataset Bias** - Is data representative of target audience? Have we checked for different [data biases](https://towardsdatascience.com/survey-d4f168791e57)?
* **Data Quality** - Does dataset and feature selection provide the required [data quality assurance](https://lakefs.io/data-quality-testing/)?
* **Algorithm Fairness** - Does the data model [systematically discriminate](https://towardsdatascience.com/what-is-algorithm-fairness-3182e161cf9f) against some subgroups?
* **Misrepresentation** - Are we [communicating honestly reported data in a deceptive manner?](https://www.sciencedirect.com/topics/computer-science/misrepresentation)
* **Explainable AI** - Are the results of AI [understandable by humans](https://en.wikipedia.org/wiki/Explainable_artificial_intelligence)? White-box (vs. black-box) models.
* **Free Choice** - Did user exercise free will or did algorithm nudge them towards a desired outcome?
### 2.4 Case Studies
The above are a subset of the core ethical challenges posed for big data and AI. More organizations are defining and adopting _responsible AI_ or _ethical AI_ frameworks that may identify additional shared values and related ethical challenges for specific domains or needs.
To understand the potential _harms and consequences_ of neglecting or violating these data ethics principles, it helps to explore this in a real-world context. Here are some famous case studies and recent examples to get you started:
* `1972`: The [Tuskegee Syphilis Study](https://en.wikipedia.org/wiki/Tuskegee_Syphilis_Study) is a landmark case study for **informed consent** in data science. African American men who participated in the study were promised free medical care _but deceived_ by researchers who failed to inform subjects of their diagnosis or about availability of treatment. Many subjects died; some partners or children were affected by complications. The study lasted 40 years.
* `2007`: The Netflix data prize provided researchers with [_10M anonymized movie rankings from 500K customers_](https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/) to help improve recommendation algorithms. This became a landmark case study in **de-identification (data privacy)** where researchers were able to correlate the anonymized data with _other datasets_ (e.g., IMDb) that had personally identifiable information - helping them "de-anonymize" users.
* `2013`: The City of Boston [developed Street Bump](https://www.boston.gov/transportation/street-bump), an app that let citizens report potholes, giving the city better roadway data to find and fix issues. This became a case study for **collection bias** where [people in lower income groups had less access to cars and phones](https://hbr.org/2013/04/the-hidden-biases-in-big-data), making their roadway issues invisible in this app. Developers worked with academics to address _equitable access and digital divide_ issues for fairness.
* `2018`: The MIT [Gender Shades Study](http://gendershades.org/overview.html) evaluated the accuracy of gender classification AI products, exposing gaps in accuracy for women and persons of color. A [2019 Apple Card](https://www.wired.com/story/the-apple-card-didnt-see-genderand-thats-the-problem/) seemed to offer less credit to women than men. Both these illustrated issues in **algorithmic fairness** and discrimination.
* `2020`: The [Georgia Department of Public Health released COVID-19 charts](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) that appeared to mislead citizens about trends in confirmed cases with non-chronological ordering on the x-axis. This illustrates **data misrepresentation** where honest data is presented dishonestly to support a desired narrative.
* `2020`: Learning app [ABCmouse paid $10M to settle an FTC complaint](https://www.washingtonpost.com/business/2020/09/04/abcmouse-10-million-ftc-settlement/) where parents were trapped into paying for subscriptions they couldn't cancel. This highlights the **illusion of free choice** in algorithmic decision-making, and potential harms from dark patterns that exploit user insights.
* `2021`: Facebook [Data Breach](https://www.npr.org/2021/04/09/986005820/after-data-breach-exposes-530-million-facebook-says-it-will-not-notify-users) exposed data from 530M users. Facebook, which had agreed to a $5B FTC settlement in 2019 over earlier privacy violations, declined to notify users of the breach - raising issues like **data privacy**, **data security** and **accountability**, including user rights to redress for those affected.
Want to explore more case studies on your own? Check out these resources:
* [Ethics Unwrapped](https://ethicsunwrapped.utexas.edu/case-studies) - ethics dilemmas across diverse industries.
* [Data Science Ethics course](https://www.coursera.org/learn/data-science-ethics#syllabus) - landmark case studies in data ethics.
* [Where things have gone wrong](https://deon.drivendata.org/examples/) - deon checklist examples of ethical issues
## 3. Applied Ethics
We've learned about data ethics values, and the ethical challenges (+ moral questions) associated with adherence to these values. But how do we _implement_ these ideas in real-world contexts? Here are some tools & practices that can help.
### 3.1 Have Professional Codes
Professional codes are _moral guidelines_ for professional behavior, helping employees or members _make decisions that align with organizational principles_. Codes may not be legally enforceable, making them only as good as the willing compliance of members. An organization may inspire adherence by imposing incentives & penalties accordingly.
Professional _codes of conduct_ are prescriptive rules and responsibilities that members must follow to remain in good standing with an organization. A professional *code of ethics* is more [_aspirational_](https://keydifferences.com/difference-between-code-of-ethics-and-code-of-conduct.html), defining the shared values and ideas of the organization. The terms are sometimes used interchangeably.
Examples include:
* [Oxford Munich](http://www.code-of-ethics.org/code-of-conduct/) Code of Ethics
* [Data Science Association](http://datascienceassn.org/code-of-conduct.html) Code of Conduct (created 2013)
* [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics) (since 1993)
### 3.2 Ask Moral Questions
Assuming you've already identified your shared values or ethical principles at a team or organization level, the next step is to identify the moral questions relevant to your specific use case and operational workflow.
Here are [6 basic questions about data ethics](https://halpert3.medium.com/six-questions-about-data-science-ethics-252b5ae31fec) that you can build on:
* Is the data you're collecting fair and unbiased?
* Is the data being used ethically and fairly?
* Is user privacy being protected?
* To whom does data belong - the company or the user?
* What effects do the data and algorithms have on society (individual and collective)?
* Is the data manipulated or deceptive?
For a larger team or project scope, you can choose to expand on questions that reflect a specific stage of the workflow. For example, here are [22 questions on ethics in data and AI](https://medium.com/the-organization/22-questions-for-ethics-in-data-and-ai-efb68fd19429) that were grouped into _design_, _implementation & management_, _systems & organization_ categories for convenience.
### 3.3 Adopt Ethics Checklists
While professional codes define required _ethical behavior_ from practitioners, they [have known limitations](https://resources.oreilly.com/examples/0636920203964/blob/master/of_oaths_and_checklists.md) for implementation, particularly in large-scale projects. In [Ethics and Data Science](https://resources.oreilly.com/examples/0636920203964/blob/master/of_oaths_and_checklists.md), experts instead advocate for ethics checklists that can **connect principles to practices** in more deterministic and actionable ways.
Checklists convert questions into "yes/no" tasks that can be tracked and validated before product release. Tools like [deon](https://deon.drivendata.org/) make this frictionless, creating default checklists aligned to [industry recommendations](https://deon.drivendata.org/#checklist-citations) and enabling users to customize and integrate them into workflows using a command-line tool. Deon also provides [real-world examples](https://deon.drivendata.org/examples/) of ethical challenges to provide context for these decisions.
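To make the "yes/no tasks tracked before release" idea concrete, here is a minimal sketch of a checklist gate in plain Python. It does not use or imitate deon's own file format or command-line interface; the `ChecklistItem` class, the `unresolved` helper and the two example questions are hypothetical and only illustrate the principle.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One yes/no ethics question and whether the team has signed it off."""
    question: str
    done: bool = False


def unresolved(items):
    """Return the questions that have not been answered 'yes' yet."""
    return [item.question for item in items if not item.done]


# Hypothetical checklist, for illustration only.
checklist = [
    ChecklistItem("Did human subjects give informed consent?", done=True),
    ChecklistItem("Did we test model accuracy across diverse groups?", done=False),
]

open_items = unresolved(checklist)
ready_for_release = not open_items  # release only when nothing is left open

print("Ready for release:", ready_for_release)
for question in open_items:
    print("Still open:", question)
```

A check like this could run as part of a release pipeline, so that open ethics questions are surfaced alongside failing tests rather than in a separate document.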
### 3.4 Track Ethics Compliance
**Ethics** is about doing the right thing, even if there are no laws to enforce it. **Compliance** is about following the law, when defined and where applicable.
**Governance** is the broader umbrella that covers all the ways in which an organization (company or government) operates to enforce ethical principles & comply with laws.
Companies are creating their own ethics frameworks (e.g., [Microsoft](https://www.microsoft.com/en-us/ai/responsible-ai), [IBM](https://www.ibm.com/cloud/learn/ai-ethics), [Google](https://ai.google/principles), [Facebook](https://ai.facebook.com/blog/facebooks-five-pillars-of-responsible-ai/), [Accenture](https://www.accenture.com/_acnmedia/PDF-149/Accenture-Responsible-AI-Final.pdf#zoom=50)) for governance, while state and national governments tend to focus on regulations that protect the data privacy and rights of their citizens.
Here are some landmark data privacy regulations to know:
* `1974`, [US Privacy Act](https://www.justice.gov/opcl/privacy-act-1974) - regulates _federal govt._ collection, use and disclosure of personal information.
* `1996`, [US Health Insurance Portability & Accountability Act (HIPAA)](https://www.cdc.gov/phlp/publications/topic/hipaa.html) - protects personal health data.
* `1998`, [US Children's Online Privacy Protection Act (COPPA)](https://www.ftc.gov/enforcement/rules/rulemaking-regulatory-reform-proceedings/childrens-online-privacy-protection-rule) - protects data privacy of children under 13.
* `2018`, [General Data Protection Regulation (GDPR)](https://gdpr-info.eu/) - provides user rights, data protection and privacy.
* `2018`, [California Consumer Privacy Act (CCPA)](https://www.oag.ca.gov/privacy/ccpa) gives consumers more _rights_ over their personal data.
In Aug 2021, China passed the [Personal Information Protection Law](https://www.reuters.com/world/china/china-passes-new-personal-data-privacy-law-take-effect-nov-1-2021-08-20/) (to go into effect Nov 1) which, with its Data Security Law, will create one of the strongest online data privacy regulations in the world.
There remains an intangible gap between compliance ("doing enough to meet the letter of the law") and addressing systemic issues ([like ossification, information asymmetry and distributional unfairness](https://www.coursera.org/learn/data-science-ethics/home/week/4)) that can create self-fulfilling feedback loops that weaponize AI further. This is motivating calls for [formalizing data ethics cultures](https://www.codeforamerica.org/news/formalizing-an-ethical-data-culture/) in organizations, where everyone is empowered to [pull the Andon cord](https://en.wikipedia.org/wiki/Andon_(manufacturing)) to raise ethics concerns early. It is also motivating exploration of [collaborative approaches to defining this culture](https://towardsdatascience.com/why-ai-ethics-requires-a-culture-driven-approach-26f451afa29f) that build emotional connections and consistent beliefs across organizations and industries.
---
@ -52,10 +191,10 @@ What is ethics? What does data ethics mean, and how is it relevant to data scien
Here we also compute the **inter-quartile range** IQR=Q3-Q1, and the so-called **outliers**: values that lie outside the boundaries [Q1-1.5*IQR,Q3+1.5*IQR].
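As a minimal sketch of the outlier rule above, the following uses only Python's standard library. The `data` values are made up for illustration, and the choice of `method="inclusive"` for the quartiles is an assumption; other tools may interpolate quartiles slightly differently.

```python
from statistics import quantiles

# Hypothetical measurements; 30 is deliberately far from the rest
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Boundaries from the rule above: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Everything outside the boundaries is flagged as an outlier
outliers = [x for x in data if x < lower or x > upper]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=[{lower}, {upper}], outliers={outliers}")
```

The factor 1.5 is a convention, not a law; a larger factor flags fewer points.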
@ -92,6 +92,82 @@ If we plot the histogram of the generated samples we will see the picture very s
*Normal Distribution with mean=0 and std.dev=1*
## Confidence Intervals
When we talk about weights of baseball players, we assume that there is a certain **random variable W** that corresponds to the ideal probability distribution of weights of all baseball players. Our sequence of weights corresponds to a subset of all baseball players, a **sample** drawn from what we call the **population**. An interesting question is, can we know the parameters of the distribution of W, i.e. its mean and variance?
The easiest answer would be to calculate the mean and variance of our sample. However, it could happen that our random sample does not accurately represent the complete population. Thus it makes sense to talk about a **confidence interval**.
Suppose we have a sample X<sub>1</sub>, ..., X<sub>n</sub> from our distribution. Each time we draw a sample from our distribution, we would end up with a different mean value μ. Thus μ can be considered to be a random variable. A **confidence interval** with confidence p is a pair of values (L<sub>p</sub>,R<sub>p</sub>), such that **P**(L<sub>p</sub>≤μ≤R<sub>p</sub>) = p, i.e. the probability of the measured mean value falling within the interval equals p.
It goes beyond our short intro to discuss how those confidence intervals are calculated. Some more details can be found [on Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval). An example of calculating confidence intervals for weights and heights is given in the [accompanying notebooks](notebook.ipynb).
| p | Weight mean |
|-----|-----------|
| 0.85 | 201.73±0.94 |
| 0.90 | 201.73±1.08 |
| 0.95 | 201.73±1.28 |
Notice that the higher the confidence probability, the wider the confidence interval.
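To make the relationship between confidence and interval width concrete, here is a rough sketch using only the standard library. It assumes a normal approximation for the sample mean (for small samples, the Student distribution would give slightly wider intervals), and both the `weights` list and the helper `mean_confidence_interval` are hypothetical, not taken from the notebook.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, p):
    """Return (low, high) for the sample mean, using a normal approximation."""
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    # Two-sided critical value: p of the probability mass lies within +/- z
    z = NormalDist().inv_cdf((1 + p) / 2)
    return m - z * standard_error, m + z * standard_error

# Hypothetical weights, for illustration only
weights = [180, 215, 210, 210, 188, 176, 209, 200, 231, 180, 188, 180]

for p in (0.85, 0.90, 0.95):
    low, high = mean_confidence_interval(weights, p)
    print(f"p={p:.2f}: {low:.2f} .. {high:.2f}")
```

Running it should show the same pattern as the table: the interval is always centred on the sample mean and grows as p grows.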
## Hypothesis Testing
In our baseball players dataset, there are different player roles, that can be summarized below (look at the [accompanying notebook](notebook.ipynb) to see how this table can be calculated):
We can notice that the mean height of first basemen is higher than that of second basemen. Thus, we may be tempted to conclude that **first basemen are taller than second basemen**.
> This statement is called **a hypothesis**, because we do not know whether the fact is actually true or not.
However, it is not always obvious whether we can make this conclusion. From the discussion above we know that each mean has an associated confidence interval, and thus this difference can just be a statistical error. We need some more formal way to test our hypothesis.
Let's compute confidence intervals separately for heights of first and second basemen:
| Confidence | First Basemen | Second Basemen |
|------------|---------------|----------------|
| 0.85 | 73.62..74.38 | 71.04..71.69 |
| 0.90 | 73.56..74.44 | 70.99..71.73 |
| 0.95 | 73.47..74.53 | 70.92..71.81 |
We can see that the intervals do not overlap at any of these confidence levels. That strongly supports our hypothesis that first basemen are taller than second basemen.
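The overlap check can also be done mechanically. This small sketch copies the interval bounds from the table above; the helper name `overlaps` is our own, introduced only for illustration.

```python
def overlaps(a, b):
    """Two closed intervals (low, high) overlap if each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]

# confidence -> (first basemen interval, second basemen interval), from the table
intervals = {
    0.85: ((73.62, 74.38), (71.04, 71.69)),
    0.90: ((73.56, 74.44), (70.99, 71.73)),
    0.95: ((73.47, 74.53), (70.92, 71.81)),
}

any_overlap = any(overlaps(first, second) for first, second in intervals.values())
print("Any overlap:", any_overlap)
```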
More formally, the problem we are solving is to see if **two probability distributions are the same**, or at least have the same parameters. Depending on the distribution, we need to use different tests for that. If we know that our distributions are normal, we can apply the **[Student t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)**.
In the Student t-test, we compute the so-called **t-value**, which indicates the difference between means, taking into account the variance. It has been shown that the t-value follows the **Student distribution**, which allows us to get the threshold value for a given confidence level **p** (this can be computed, or looked up in numerical tables). We then compare the t-value to this threshold to accept or reject the hypothesis.
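To show what "difference between means, taking into account the variance" looks like in practice, here is a sketch that computes a t-value by hand. It uses Welch's form of the statistic (which does not assume equal variances) rather than the pooled form, and the height lists are invented for illustration; neither choice is taken from the notebook.

```python
from math import sqrt
from statistics import mean, variance

def t_value(a, b):
    """Welch's t statistic: difference of means divided by its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical heights in inches, not the real dataset
first_basemen = [74, 75, 73, 76, 74, 75]
second_basemen = [71, 72, 70, 71, 73, 71]

t = t_value(first_basemen, second_basemen)
print(f"t-value: {t:.2f}")
```

A large positive t-value means the first group's mean is well above the second's relative to the spread in the data. When both groups have the same size, as here, the Welch and pooled forms give the same t-value.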
In Python, we can use the **SciPy** package, which includes the `ttest_ind` function (in addition to many other useful statistical functions!). It computes the t-value for us, and also does the reverse lookup of the p-value, so that we can just look at that value to draw the conclusion.
For example, our comparison between heights of first and second basemen gives us the following results:
In our case, p-value is very low, meaning that there is strong evidence supporting that first basemen are taller.
> **Challenge**: Use the sample code in the notebook to test other hypotheses: (1) First basemen are older than second basemen; (2) First basemen are taller than third basemen; (3) Shortstops are taller than second basemen
There are different types of hypothesis that we might want to test, for example:
* To prove that a given sample follows some distribution. In our case we have assumed that heights are normally distributed, but that needs formal statistical verification.
* To prove that a mean value of a sample corresponds to some predefined value
@ -189,17 +189,16 @@ In this plot you can see the range, per category, of the Minimum Length and Maxi

## 🚀 Challenge
This bird dataset offers a wealth of information about different types of birds within a particular ecosystem. Search around the internet and see if you can find other bird-oriented datasets. Practice building charts and graphs around these birds to discover facts you didn't realize.
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
This first lesson has given you some information about how to use Matplotlib to visualize quantities. Do some research around other ways to work with datasets for visualization. [Plotly](https://github.com/plotly/plotly.py) is one that we won't cover in these lessons, so take a look at what it can offer.
In this lesson, you worked with line charts, scatterplots, and bar charts to show interesting facts about this dataset. In this assignment, dig deeper into the dataset to discover a fact about a given type of bird. For example, create a notebook visualizing all the interesting data you can uncover about Snow Geese. Use the three plots mentioned above to tell a story in your notebook.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
A notebook is presented with good annotations, solid storytelling, and attractive graphs | The notebook is missing one of these elements | The notebook is missing two of these elements
@ -176,6 +176,7 @@ Perhaps it's worth researching whether the cluster of 'Vulnerable' birds accordi
## 🚀 Challenge
Histograms are a more sophisticated type of chart than basic scatterplots, bar charts, or line charts. Go on a search on the internet to find good examples of the use of histograms. How are they used, what do they demonstrate, and in what fields or areas of inquiry do they tend to be used?
## Post-Lecture Quiz
@ -183,7 +184,8 @@ Perhaps it's worth researching whether the cluster of 'Vulnerable' birds accordi
## Review & Self Study
In this lesson, you used Matplotlib and started working with Seaborn to show more sophisticated charts. Do some research on `kdeplot` in Seaborn, a "continuous probability density curve in one or more dimensions". Read through [the documentation](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to understand how it works.
So far, you have worked with the Minnesota birds dataset to discover information about bird quantities and population density. Practice your application of these techniques by trying a different dataset, perhaps sourced from [Kaggle]. Build a notebook to tell a story about this dataset, and make sure to use histograms when discussing it.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
A notebook is presented with annotations about this dataset, including its source, and uses at least 5 histograms to discover facts about the data. | A notebook is presented with incomplete annotations or bugs | A notebook is presented without annotations and includes bugs
@ -6,6 +6,8 @@ In this lesson, you will use a different nature-focused dataset to visualize pro
- Donut charts 🍩
- Waffle charts 🧇
> 💡 A very interesting project called [Charticulator](https://charticulator.com) by Microsoft Research offers a free drag and drop interface for data visualizations. In one of their tutorials they also use this mushroom dataset! So you can explore the data and learn the library at the same time: https://charticulator.com/tutorials/tutorial4.html
## Pre-Lecture Quiz
[Pre-lecture quiz]()
@ -157,17 +159,26 @@ Using a waffle chart, you can plainly see the proportions of cap color of this m
In this lesson you learned three ways to visualize proportions. First, you need to group your data into categories and then decide which is the best way to display the data - pie, donut, or waffle. All are delicious and gratify the user with an instant snapshot of a dataset.
## 🚀 Challenge
Try recreating these tasty charts in [Charticulator](https://charticulator.com).
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
Sometimes it's not obvious when to use a pie, donut, or waffle chart. Here are some articles to read on this topic:
Did you know you can create donut, pie and waffle charts in Excel? Using a dataset of your choice, create these three charts right in an Excel spreadsheet
| An Excel spreadsheet is presented with all three charts | An Excel spreadsheet is presented with two charts | An Excel spreadsheet is presented with only one chart |
@ -10,15 +10,165 @@ It will be interesting to visualize the relationship between a given state's pro
[Pre-lecture quiz]()
In this lesson, you can use Seaborn, which you have used before, as a good library to visualize relationships between variables. Particularly interesting is Seaborn's `relplot` function, which lets you quickly draw scatter plots and line plots of '[statistical relationships](https://seaborn.pydata.org/tutorial/relational.html?highlight=relationships)', helping the data scientist better understand how variables relate to each other.
## Scatterplots
Use a scatterplot to show how the price of honey has evolved, year over year, per state. Seaborn, using `relplot`, conveniently groups the state data and displays data points for both categorical and numeric data.
Let's start by importing the data and Seaborn:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
honey = pd.read_csv('../../data/honey.csv')
honey.head()
```
You notice that the honey data has several interesting columns, including year and price per pound. Let's explore this data, grouped by U.S. state:
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
Create a basic scatterplot to show the relationship between the price per pound of honey and its U.S. state of origin. Make the `y` axis tall enough to display all the states:
Now, show the same data with a honey color scheme to show how the price evolves over the years. You can do this by adding a 'hue' parameter to show the change, year over year:
> ✅ Learn more about the [color palettes you can use in Seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) - try a beautiful rainbow color scheme!
With this color scheme change, you can see that there's obviously a strong progression over the years in terms of honey price per pound. Indeed, if you look at a sample set in the data to verify (pick a given state, Arizona for example) you can see a pattern of price increases year over year, with few exceptions:
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
Another way to visualize this progression is to use size, rather than color. For colorblind users, this might be a better option. Edit your visualization to show an increase of price by an increase in dot circumference:
You can see the size of the dots gradually increasing.

Is this a simple case of supply and demand? Due to factors such as climate change and colony collapse, is there less honey available for purchase year over year, and thus the price increases?
To discover a correlation between some of the variables in this dataset, let's explore some line charts.
## Line charts
Question: Is there a clear rise in price of honey per pound year over year? You can most easily discover that by creating a single line chart:
Answer: Yes, with some exceptions around the year 2003:

✅ Because Seaborn is aggregating data around one line, it displays "the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean". [source](https://seaborn.pydata.org/tutorial/relational.html). This time-consuming behavior can be disabled by adding `ci=None`.
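To see what "aggregating around one line" means, here is a standard-library sketch of the mean-per-year step. The `rows` list is a hypothetical stand-in for the `year` and `priceperlb` columns of the honey data (the values are invented), and the sketch computes only the mean, not the confidence band.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (year, priceperlb) rows: several states report a price each year
rows = [
    (1998, 0.72), (1998, 0.64),
    (1999, 0.70), (1999, 0.62),
    (2000, 0.68), (2000, 0.60),
]

# Collect all measurements that share the same x value (the year)
prices_by_year = defaultdict(list)
for year, price in rows:
    prices_by_year[year].append(price)

# One mean per year: these are the points a single aggregated line passes through
mean_price = {year: round(mean(prices), 2) for year, prices in sorted(prices_by_year.items())}
print(mean_price)
```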
Question: Well, in 2003 can we also see a spike in the honey supply? What if you look at total production year over year?
Answer: Not really. If you look at total production, it actually seems to have increased in that particular year, even though generally speaking the amount of honey being produced is in decline during these years.
Question: In that case, what could have caused that spike in the price of honey around 2003?
To discover this, you can explore a facet grid.
## Facet grids
Facet grids take one facet of your dataset (in our case, you can choose 'year' to avoid having too many facets produced). Seaborn can then make a plot for each of those facets of your chosen x and y coordinates for more easy visual comparison. Does 2003 stand out in this type of comparison?
Create a facet grid by continuing to use `relplot` as recommended by [Seaborn's documentation](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html?highlight=facetgrid#seaborn.FacetGrid).
```python
sns.relplot(
    data=honey,
    x="yieldpercol", y="numcol",
    col="year",      # one plot per year
    col_wrap=3,      # start a new row after three plots
    kind="line"
)
```
In this visualization, you can compare the yield per colony and number of colonies year over year, side by side with a wrap set at 3 for the columns:

For this dataset, nothing particularly stands out with regard to the number of colonies and their yield, year over year and state over state. Is there a different way to look for a correlation between these two variables?
## Dual-line Plots
Try a multiline plot by superimposing two lineplots on top of each other, using Seaborn's 'despine' to remove their top and right spines, and using `ax.twinx` [derived from Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.twinx.html). Twinx allows a chart to share the x axis and display two y axes. So, display the yield per colony and number of colonies, superimposed:
While nothing jumps out to the eye around the year 2003, it does allow us to end this lesson on a little happier note: while there are overall a declining number of colonies, their numbers might seem to be stabilizing and their yield per colony is actually increasing, even with fewer bees.
Go, bees, go!
🐝❤️
## 🚀 Challenge
In this lesson, you learned a bit more about other uses of scatterplots and line grids, including facet grids. Challenge yourself to create a facet grid using a different dataset, maybe one you used prior to these lessons. Note how long they take to create and how you need to be careful about how many grids you need to draw using these techniques.
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
Line plots can be simple or quite complex. Do a bit of reading in the [Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.lineplot.html) on the various ways you can build them. Try to enhance the line charts you built in this lesson with other methods listed in the docs.
In this lesson you started looking at a dataset around bees and their honey production over a period of time that saw losses in the bee colony population overall. Dig deeper into this dataset and build a notebook that can tell the story of the health of the bee population, state by state and year by year. Do you discover anything interesting about this dataset?
| A notebook is presented with a story annotated with at least three different charts showing aspects of the dataset, state over state and year over year | The notebook lacks one of these elements | The notebook lacks two of these elements |
> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase)
One of the basic skills of a data scientist is the ability to create a meaningful data visualization that helps answer questions you might have. Prior to visualizing your data, you need to ensure that it has been cleaned and prepared, as you did in prior lessons. After that, you can start deciding how best to present the data.
In this lesson, you will review:
1. How to choose the right chart type
2. How to avoid deceptive charting
3. How to work with color
4. How to style your charts for readability
5. How to build animated or 3D charting solutions
6. How to build a creative visualization
[Pre-lecture quiz]()
## Pre-Lecture Quiz
## 🚀 Challenge
## Choose the right chart type
In previous lessons, you experimented with building all kinds of interesting data visualizations using Matplotlib and Seaborn for charting. In general, you can select the [right kind of chart](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) for the question you are asking using this table:
| Show relationships | Scatter, Line, Facet, Dual Line |
| Show distributions | Scatter, Histogram, Box |
| Show proportions | Pie, Donut, Waffle |
> ✅ Depending on the makeup of your data, you might need to convert it from text to numeric to get a given chart to support it.
## Avoid deception
Even if a data scientist is careful to choose the right chart for the right data, there are plenty of ways that data can be displayed in a way to prove a point, often at the cost of undermining the data itself. There are many examples of deceptive charts and infographics!
[](https://www.youtube.com/Low28hx4wyk "Deceptive charts")
> 🎥 Click the image above for a conference talk about deceptive charts
This chart reverses the X axis to show the opposite of the truth, based on date:

[This chart](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg) is even more deceptive, as the eye is drawn to the right to conclude that, over time, COVID cases have declined in the various counties. In fact, if you look closely at the dates, you find that they have been rearranged to give that deceptive downward trend.

This notorious example uses color AND a flipped Y axis to deceive: instead of concluding that gun deaths spiked after the passage of gun-friendly legislation, the eye is fooled into thinking that the opposite is true:

This strange chart shows how proportion can be manipulated, to hilarious effect:

Comparing the incomparable is yet another shady trick. There is a [wonderful web site](https://tylervigen.com/spurious-correlations) all about 'spurious correlations' displaying 'facts' correlating things like the divorce rate in Maine and the consumption of margarine. A Reddit group also collects the [ugly uses](https://www.reddit.com/r/dataisugly/top/?t=all) of data.
It's important to understand how easily the eye can be fooled by deceptive charts. Even if the data scientist's intention is good, the choice of a bad type of chart, such as a pie chart showing too many categories, can be deceptive.
## Color
You saw in the 'Florida gun violence' chart above how color can provide an additional layer of meaning to charts, especially ones not designed using libraries such as Matplotlib and Seaborn, which come with various vetted color libraries and palettes. If you are making a chart by hand, do a little study of [color theory](https://colormatters.com/color-and-design/basic-color-theory).
> ✅ Be aware, when designing charts, that accessibility is an important aspect of visualization. Some of your users might be color blind - does your chart display well for users with visual impairments?
Be careful when choosing colors for your chart, as color can convey meaning you might not intend. The 'pink ladies' in the 'height' chart above convey a distinctly 'feminine' ascribed meaning that adds to the bizarreness of the chart itself.
[Color meanings](https://colormatters.com/color-symbolism/the-meanings-of-colors) might differ in different parts of the world, and tend to change according to shade. Generally speaking, color meanings include:
| Color | Meaning |
| ------ | ------------------- |
| red | power |
| blue | trust, loyalty |
| yellow | happiness, caution |
| green | ecology, luck, envy |
| purple | happiness |
| orange | vibrance |
If you are tasked with building a chart with custom colors, ensure that your charts are both accessible and the color you choose coincides with the meaning you are trying to convey.
## Styling your charts for readability
Charts are not meaningful if they are not readable! Take a moment to consider styling the width and height of your chart to scale well with your data. If one variable (such as all 50 states) needs to be displayed, show it vertically on the Y axis if possible so as to avoid a horizontally-scrolling chart.
Label your axes, provide a legend if necessary, and offer tooltips for better comprehension of data.
If your data is textual and verbose on the X-axis, you can angle the text for better readability. [Matplotlib](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html) offers 3D plotting, if your data supports it. Sophisticated data visualizations can be produced using `mpl_toolkits.mplot3d`.

## Animation and 3D chart display
Some of the best data visualizations today are animated. Shirley Wu has amazing ones done with D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/)', where each flower is a visualization of a movie. Another example for the Guardian is 'bussed out', an interactive experience combining visualizations with Greensock and D3 plus a scrollytelling article format to show how NYC handles its homeless problem by busing people out of the city.

While this lesson is insufficient to go into depth to teach these powerful visualization libraries, try your hand at D3 in a Vue.js app using a library to display a visualization of the book "Dangerous Liaisons" as an animated social network.
> "Les Liaisons Dangereuses" is an epistolary novel, or a novel presented as a series of letters. Written in 1782 by Choderlos de Laclos, it tells the story of the vicious, morally-bankrupt social maneuvers of two dueling protagonists of the French aristocracy in the late 18th century, the Vicomte de Valmont and the Marquise de Merteuil. Both meet their demise in the end but not without inflicting a great deal of social damage. The novel unfolds as a series of letters written to various people in their circles, plotting for revenge or simply to make trouble. Create a visualization of these letters to discover the major kingpins of the narrative, visually.
You will complete a web app that will display an animated view of this social network. It uses a library that was built to create a [visual of a network](https://github.com/emiliorizzo/vue-d3-network) using Vue.js and D3. When the app is running, you can pull the nodes around on the screen to shuffle the data around.

## Project: Build a chart to show a network using D3.js
> This lesson folder includes a `solution` folder where you can find the completed project, for your reference.
1. Follow the instructions in the README.md file in the starter folder's root. Make sure you have NPM and Node.js running on your machine before installing your project's dependencies.
2. Open the `starter/src` folder. You'll discover an `assets` folder where you can find a .json file with all the letters from the novel, numbered, with a 'to' and 'from' annotation.
3. Complete the code in `components/Nodes.vue` to enable the visualization. Look for the method called `createLinks()` and add the following nested loop.
Loop through the .json object to capture the 'to' and 'from' data for the letters and build up the `links` object so that the visualization library can consume it:
```javascript
//loop through letters
let f = 0;
let t = 0;
for (var i = 0; i <letters.length;i++){
for (var j = 0; j <characters.length;j++){
if (characters[j] == letters[i].from) {
f = j;
}
if (characters[j] == letters[i].to) {
t = j;
}
}
this.links.push({ sid: f, tid: t });
}
```
Run your app from the terminal (npm run serve) and enjoy the visualization!
## 🚀 Challenge
Take a tour of the internet to discover deceptive visualizations. How does the author fool the user, and is it intentional? Try correcting the visualizations to show how they should look.
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
Here are some articles to read about deceptive data visualization:
Using the code sample in this project to create a social network, mock up data of your own social interactions. You could map your usage of social media or make a diagram of your family members. Create an interesting web app that shows a unique visualization of a social network.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
A GitHub repo is presented with code that runs properly (try deploying it as a static web app) and has an annotated README explaining the project | The repo does not run properly or is not documented well | The repo does not run properly and is not documented well
To get started, you need to ensure that you have NPM and Node running on your machine. Install the dependencies (npm install) and then run the project locally (npm run serve):
## Project setup
```
npm install
```
### Compiles and hot-reloads for development
```
npm run serve
```
### Compiles and minifies for production
```
npm run build
```
### Lints and fixes files
```
npm run lint
```
### Customize configuration
See [Configuration Reference](https://cli.vuejs.org/config/).
@ -8,18 +8,133 @@ In this lesson, you will learn the fundamental principles of the Cloud, then you
## What is the Cloud?
The Cloud, or Cloud Computing, is the delivery of a wide range of pay-as-you-go computing services hosted on an infrastructure over the internet. Services include solutions such as storage, databases, networking, software, analytics, and intelligent services.
We usually differentiate the Public, Private and Hybrid clouds:
* Public cloud: a public cloud is owned and operated by a third-party cloud service provider which delivers their computing resources over the Internet to the public
* Private cloud: refers to cloud computing resources used exclusively by a single business or organization, with services and an infrastructure maintained on a private network.
* Hybrid cloud: the hybrid cloud is a system that combines public and private clouds. Users opt for an on-premises datacenter, while allowing data and applications to be run on one or more public clouds.
Most cloud computing services fall into three categories: infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS).
* Infrastructure as a service (IaaS): users rent an IT infrastructure — servers and virtual machines (VMs), storage, networks, operating systems
* Platform as a service (PaaS): users rent an environment for developing, testing, delivering, and managing software applications. Users don’t need to worry about setting up or managing the underlying infrastructure of servers, storage, network, and databases needed for development.
* Software as a service (SaaS): users get access to software applications over the Internet, on demand and typically on a subscription basis. Users don’t need to worry about hosting, managing the software application, the underlying infrastructure or the maintenance, like software upgrades and security patching.
Some of the largest Cloud providers are Amazon Web Services, Google Cloud Platform and Microsoft Azure.
## Why Choose the Cloud for Data Science?
Developers and IT professionals choose to work with the Cloud for many reasons, including the following:
* Innovation: you can power your applications by integrating innovative services created by Cloud providers directly into your apps
* Flexibility: you only pay for the services that you need and can choose from a wide range of services. You typically pay as you go and adapt your services according to your evolving needs.
* Budget: you don’t need to make initial investments to purchase hardware and software, set up and run on-site datacenters and you can just pay for what you use
* Scalability: your resources can scale according to the needs of your project, which means that your apps can use more or less computing power, storage and bandwidth, by adapting to external factors at any given time
* Productivity: you can focus on your business rather than spend time on tasks that can be managed by someone else, such as managing datacenters
* Reliability: cloud computing offers several ways to continuously back up your data and you can set up disaster recovery plans to keep your business going
* Security: you can benefit from policies, technologies, and controls that strengthen the security of your project
These are some of the most common reasons why people choose to use Cloud services. Now that we have a better understanding of what the Cloud is and what its main benefits are, let's look more specifically into the jobs of data scientists and developers working with data, and how the Cloud can help them with several of the specific challenges they face:
* Storing large amounts of data: instead of buying, managing and protecting big servers, you can store your data directly in the cloud, with solutions such as Azure Cosmos DB, Azure SQL Database and Azure Data Lake Storage
* Performing data integration: data integration is an essential part of data science, that lets you go from data collection to taking actions. With data integration services offered in the cloud, you can collect, transform and integrate data from various sources into a single data warehouse, with Data Factory
* Processing data: processing vast amounts of data requires a lot of computing power, and not everyone has access to machines powerful enough for that, which is why many people choose to directly harness the cloud’s huge computing power to run and deploy their solutions
* Using data analytics services: to turn your data into actionable insights, with Azure Synapse Analytics, Azure Stream Analytics, Azure Databricks
* Using Machine Learning and data intelligence services: Instead of starting from scratch, you can use machine learning algorithms offered by the cloud provider, with services such as AzureML, and you can use cognitive services such as speech-to-text, text to speech, computer vision and more
## Examples of Data Science in the Cloud
Let’s make this more tangible by looking at a couple of scenarios.
We’ll start with a scenario commonly studied by people who start with machine learning: social media sentiment analysis in real time.
Let's say you run a news media website, and you want to leverage live data to understand what content your readers could be interested in. To know more about that, you can build a program that performs real-time sentiment analysis of data from Twitter publications, on topics that are relevant to your readers.
The key indicators you will look at are the volume of tweets on specific topics (hashtags), and sentiment, which is established using analytics tools that perform sentiment analysis around the specified topics.
The steps necessary to create this project are the following:
* Create an event hub for streaming input, which will collect data from Twitter
* Configure and start a Twitter client application, which will call the Twitter Streaming APIs
* Create a Stream Analytics job
* Specify the job input and query
* Create an output sink and specify the job output
* Start the job
To view the full process, check out the [documentation](https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends).
Let’s take a more original example of a project created by Dmitry Soshnikov, one of the authors of this curriculum.
Dmitry created a Generative Adversarial Network and taught it to create artificial paintings. We will see the different steps used for this:
* Creating artificial paintings
* Training the network on paintings from a dataset
* Training a GAN, or Generative Adversarial Network
* Creating a discriminator model and a generator model
* Training the script in the Cloud with Azure ML
* Generating new images
As you can see, we can leverage Cloud services in many ways to perform Data Science.
To see the full process, visit [Dmitry’s blog](https://soshnikov.com/scienceart/creating-generative-art-using-gan-on-azureml).
Here is a paragraph with a footnote <span id="a1">[[1]](#f1)</span>.