Introduction to Data Ethics
Data Science Ethics - Sketchnote by @nitya
We are all data citizens living in a world shaped by data.
Market trends indicate that by 2022, one in three large organizations will buy and sell their data through online marketplaces and exchanges. For app developers, this will make it easier and more cost-effective to integrate data-driven insights and algorithm-based automation into everyday user experiences. However, as AI becomes more widespread, we must also understand the potential harms caused by the weaponization of such algorithms at scale.
Projections suggest that by 2025, we will generate and consume over 180 zettabytes of data. For Data Scientists, this data explosion offers unprecedented access to personal and behavioral information. This enables the creation of detailed user profiles and the subtle influence of decision-making—often fostering an illusion of free choice. While this can be used to nudge users toward desired outcomes, it also raises critical questions about data privacy, autonomy, and the ethical limits of algorithmic influence.
Data ethics now serve as essential guardrails for data science and engineering, helping to minimize potential harms and unintended consequences of data-driven actions. The Gartner Hype Cycle for AI highlights trends in digital ethics, responsible AI, and AI governance as key drivers for larger megatrends around the democratization and industrialization of AI.
In this lesson, we will delve into the fascinating field of data ethics—from core concepts and challenges to case studies and applied AI concepts like governance—that help establish an ethical culture within teams and organizations working with data and AI.
Pre-lecture quiz 🎯
Basic Definitions
Let’s begin by understanding some basic terminology.
The word "ethics" originates from the Greek word "ethikos" (and its root "ethos"), meaning character or moral nature.
Ethics refers to the shared values and moral principles that guide our behavior in society. Ethics are not based on laws but on widely accepted norms of what is "right versus wrong." However, ethical considerations can influence corporate governance initiatives and government regulations, creating incentives for compliance.
Data Ethics is a new branch of ethics that "studies and evaluates moral problems related to data, algorithms, and corresponding practices." Here, "data" focuses on actions like generation, recording, curation, processing, dissemination, sharing, and usage; "algorithms" focuses on AI, agents, machine learning, and robots; and "practices" addresses topics like responsible innovation, programming, hacking, and ethical codes.
Applied Ethics refers to the practical application of moral considerations. It involves actively investigating ethical issues in the context of real-world actions, products, and processes and taking corrective measures to ensure alignment with defined ethical values.
Ethics Culture is about operationalizing applied ethics to ensure that ethical principles and practices are consistently and scalably adopted across an organization. Successful ethics cultures define organization-wide ethical principles, provide meaningful incentives for compliance, and reinforce ethical norms by encouraging and amplifying desired behaviors at every level of the organization.
Ethics Concepts
In this section, we’ll discuss concepts like shared values (principles) and ethical challenges (problems) in data ethics—and explore case studies to understand these concepts in real-world contexts.
1. Ethics Principles
Every data ethics strategy begins with defining ethical principles—the "shared values" that describe acceptable behaviors and guide compliant actions in data and AI projects. These principles can be defined at an individual or team level. However, most large organizations outline them in an ethical AI mission statement or framework, defined at the corporate level and enforced consistently across all teams.
Example: Microsoft's Responsible AI mission statement reads: "We are committed to the advancement of AI driven by ethical principles that put people first"—identifying six ethical principles in the framework below:
Let’s briefly explore these principles. Transparency and accountability are foundational values upon which other principles are built—so let’s start there:
- Accountability ensures practitioners are responsible for their data and AI operations and compliance with ethical principles.
- Transparency ensures that data and AI actions are understandable (interpretable) to users, explaining the what and why behind decisions.
- Fairness focuses on ensuring AI treats all people fairly, addressing systemic or implicit socio-technical biases in data and systems.
- Reliability & Safety ensures AI behaves consistently with defined values, minimizing potential harms or unintended consequences.
- Privacy & Security involves understanding data lineage and providing data privacy and related protections to users.
- Inclusiveness focuses on designing AI solutions intentionally, adapting them to meet a broad range of human needs and capabilities.
🚨 Think about what your data ethics mission statement could be. Explore ethical AI frameworks from other organizations—examples include IBM, Google, and Facebook. What shared values do they have in common? How do these principles relate to the AI products or industries they operate in?
2. Ethics Challenges
Once ethical principles are defined, the next step is to evaluate data and AI actions to ensure alignment with those shared values. Consider your actions in two categories: data collection and algorithm design.
In data collection, actions often involve personal data or personally identifiable information (PII) for identifiable living individuals. They can also involve non-personal data items that, taken collectively, identify an individual. Ethical challenges may relate to data privacy, data ownership, and related topics like informed consent and intellectual property rights for users.
In algorithm design, actions involve collecting and curating datasets, then using them to train and deploy data models that predict outcomes or automate decisions in real-world contexts. Ethical challenges may arise from dataset bias, data quality issues, unfairness, and misrepresentation in algorithms—including systemic issues.
In both cases, ethical challenges highlight areas where actions may conflict with shared values. To detect, mitigate, minimize, or eliminate these concerns, we must ask moral "yes/no" questions about our actions and take corrective measures as needed. Let’s examine some ethical challenges and the moral questions they raise:
2.1 Data Ownership
Data collection often involves personal data that can identify individuals. Data ownership concerns control and user rights related to the creation, processing, and dissemination of data.
Moral questions to consider:
- Who owns the data? (user or organization)
- What rights do data subjects have? (e.g., access, erasure, portability)
- What rights do organizations have? (e.g., rectifying malicious user reviews)
2.2 Informed Consent
Informed consent involves users agreeing to an action (like data collection) with a full understanding of relevant facts, including the purpose, potential risks, and alternatives.
Questions to explore:
- Did the user (data subject) give permission for data capture and usage?
- Did the user understand the purpose of data collection?
- Did the user understand the potential risks of their participation?
2.3 Intellectual Property
Intellectual property refers to intangible creations resulting from human initiative that may have economic value to individuals or businesses.
Questions to explore:
- Does the collected data have economic value to a user or business?
- Does the user have intellectual property rights here?
- Does the organization have intellectual property rights here?
- If these rights exist, how are they being protected?
2.4 Data Privacy
Data privacy refers to preserving user privacy and protecting user identity with respect to personally identifiable information.
Questions to explore:
- Is users' (personal) data secured against hacks and leaks?
- Is users' data accessible only to authorized users and contexts?
- Is users' anonymity preserved when data is shared or disseminated?
- Can a user be re-identified from an anonymized dataset?
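The last question above is not hypothetical: "anonymized" records can often be linked back to identities via quasi-identifiers. Here is a minimal sketch of such a linkage attack, using entirely fabricated records and field names for illustration:

```python
# Toy illustration of a linkage (re-identification) attack: an "anonymized"
# dataset is joined with a public dataset on quasi-identifiers (zip code and
# birth year), linking records back to named individuals.

anonymized_ratings = [  # names removed, but quasi-identifiers remain
    {"zip": "02138", "birth_year": 1985, "rating": 5},
    {"zip": "90210", "birth_year": 1990, "rating": 2},
]

public_profiles = [  # e.g., a public forum with self-reported details
    {"name": "Alice", "zip": "02138", "birth_year": 1985},
    {"name": "Bob", "zip": "10001", "birth_year": 1975},
]

def link_records(anonymized, public):
    """Join the two datasets on quasi-identifiers (zip, birth_year)."""
    index = {(p["zip"], p["birth_year"]): p["name"] for p in public}
    matches = []
    for row in anonymized:
        name = index.get((row["zip"], row["birth_year"]))
        if name:
            matches.append({"name": name, **row})
    return matches

# Alice's "anonymous" rating is linked back to her identity.
reidentified = link_records(anonymized_ratings, public_profiles)
```

This is essentially how the Netflix Prize de-anonymization worked: external data (IMDb comments) played the role of `public_profiles`.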
2.5 Right To Be Forgotten
The Right To Be Forgotten or Right to Erasure provides additional personal data protection to users. It allows users to request deletion or removal of personal data from Internet searches and other locations, under specific circumstances—giving them a fresh start online without past actions being held against them.
Questions to explore:
- Does the system allow data subjects to request erasure?
- Should the withdrawal of user consent trigger automated erasure?
- Was data collected without consent or by unlawful means?
- Are we compliant with government regulations for data privacy?
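To make the first question concrete, an erasure request ultimately needs a code path that removes a subject's data from every store and records the action for compliance. A minimal sketch, with hypothetical store and field names:

```python
# Sketch of a right-to-erasure workflow: on a verified request, the user's
# personal data is removed from every data store, and the action is logged
# for compliance auditing. Stores are simple dicts for illustration.

user_store = {"u123": {"email": "a@example.com", "name": "Alice"}}
analytics_store = {"u123": {"page_views": 42}}
audit_log = []

def erase_user(user_id, stores, log):
    """Delete a data subject's records from all stores; log the erasure."""
    erased_from = []
    for store_name, store in stores.items():
        if store.pop(user_id, None) is not None:
            erased_from.append(store_name)
    log.append({"user_id": user_id, "erased_from": erased_from})
    return erased_from

erase_user("u123", {"users": user_store, "analytics": analytics_store}, audit_log)
```

A real implementation would also cover backups, caches, and downstream processors, which is where most erasure workflows fall short.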
2.6 Dataset Bias
Dataset or Collection Bias involves selecting a non-representative subset of data for algorithm development, potentially creating unfair outcomes for diverse groups. Types of bias include selection or sampling bias, volunteer bias, and instrument bias.
Questions to explore:
- Did we recruit a representative set of data subjects?
- Did we test our collected or curated dataset for various biases?
- Can we mitigate or remove any discovered biases?
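The first two questions above can be partly automated. Below is a minimal sketch of a representation check that compares subgroup proportions in a collected sample against known population proportions; the group labels, numbers, and tolerance are illustrative assumptions:

```python
from collections import Counter

# Compare subgroup shares in a collected sample against known population
# shares, flagging any group whose sample share falls short by more than
# a chosen tolerance.

population_share = {"group_a": 0.50, "group_b": 0.30, "group_c": 0.20}
sample = ["group_a"] * 70 + ["group_b"] * 25 + ["group_c"] * 5

def underrepresented(sample, population_share, tolerance=0.10):
    """Return groups whose sample share trails the population share."""
    counts = Counter(sample)
    n = len(sample)
    flagged = []
    for group, expected in population_share.items():
        observed = counts.get(group, 0) / n
        if expected - observed > tolerance:
            flagged.append(group)
    return flagged

flagged = underrepresented(sample, population_share)
```

Here `group_c` makes up 5% of the sample against 20% of the population, so it is flagged; this mirrors the Street Bump case, where lower-income neighborhoods were underrepresented in the collected data.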
2.7 Data Quality
Data Quality examines the validity of the curated dataset used to develop algorithms, ensuring features and records meet the accuracy and consistency requirements for the AI purpose.
Questions to explore:
- Did we capture valid features for our use case?
- Was data captured consistently across diverse data sources?
- Is the dataset complete for diverse conditions or scenarios?
- Does the captured information accurately reflect reality?
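Completeness and consistency checks like those above are routinely codified as validation rules over the curated dataset. A minimal sketch, with hypothetical field names and ranges:

```python
# Basic data-quality checks over a curated dataset: completeness (no missing
# required fields) and consistency (values within a plausible range).

records = [
    {"age": 34, "country": "US"},
    {"age": None, "country": "DE"},   # incomplete: missing required value
    {"age": 250, "country": "FR"},    # inconsistent: age out of range
]

def quality_report(records, required=("age", "country"), age_range=(0, 120)):
    """Classify each record as incomplete, inconsistent, or valid."""
    report = {"incomplete": 0, "inconsistent": 0, "valid": 0}
    for rec in records:
        if any(rec.get(field) is None for field in required):
            report["incomplete"] += 1
        elif not (age_range[0] <= rec["age"] <= age_range[1]):
            report["inconsistent"] += 1
        else:
            report["valid"] += 1
    return report

report = quality_report(records)
```

In practice such rules would live in a data-validation tool and run on every new batch of data, before model training.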
2.8 Algorithm Fairness
Algorithm Fairness examines whether the design of an algorithm systematically discriminates against specific subgroups of data subjects, leading to potential harms in allocation (where resources are denied or withheld from that group) and quality of service (where AI is less accurate for certain subgroups compared to others).
Questions to consider:
- Have we evaluated model accuracy across diverse subgroups and conditions?
- Have we examined the system for potential harms (e.g., stereotyping)?
- Can we revise the data or retrain models to address identified harms?
Explore resources like AI Fairness checklists to learn more.
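The first question above, evaluating model accuracy across subgroups, is often called disaggregated evaluation. A minimal sketch with fabricated predictions and hypothetical group labels:

```python
from collections import defaultdict

# Disaggregated evaluation: compute model accuracy separately per subgroup
# to surface quality-of-service gaps (as in the Gender Shades study).

# Each tuple: (subgroup, true_label, predicted_label)
results = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 1, 1), ("group_b", 0, 0),
]

def accuracy_by_group(results):
    """Return per-subgroup accuracy from (group, y_true, y_pred) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in results:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}

per_group = accuracy_by_group(results)
```

An aggregate accuracy of 62.5% would hide the gap this reveals: 75% for `group_a` versus 50% for `group_b`.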
2.9 Misrepresentation
Data Misrepresentation involves questioning whether insights derived from honestly reported data are being communicated in a deceptive way to support a desired narrative.
Questions to consider:
- Are we reporting incomplete or inaccurate data?
- Are we visualizing data in ways that lead to misleading conclusions?
- Are we using selective statistical techniques to manipulate outcomes?
- Are there alternative explanations that might lead to different conclusions?
2.10 Free Choice
The Illusion of Free Choice occurs when system "choice architectures" use decision-making algorithms to subtly push people toward a preferred outcome while appearing to offer them options and control. These dark patterns can cause social and economic harm to users. Since user decisions influence behavior profiles, these actions can potentially drive future choices, amplifying or extending the impact of these harms.
Questions to consider:
- Did the user understand the implications of making that choice?
- Was the user aware of (alternative) choices and the pros & cons of each?
- Can the user reverse an automated or influenced choice later?
3. Case Studies
To contextualize these ethical challenges, it’s helpful to examine case studies that highlight the potential harms and consequences for individuals and society when ethical violations are overlooked.
Here are some examples:
| Ethics Challenge | Case Study |
|---|---|
| Informed Consent | 1972 - Tuskegee Syphilis Study - African American men who participated in the study were promised free medical care but deceived by researchers who failed to inform subjects of their diagnosis or the availability of treatment. Many subjects died, and their partners or children were affected; the study lasted 40 years. |
| Data Privacy | 2007 - The Netflix data prize provided researchers with 10M anonymized movie rankings from 50K customers to improve recommendation algorithms. However, researchers were able to correlate anonymized data with personally identifiable data in external datasets (e.g., IMDb comments), effectively "de-anonymizing" some Netflix subscribers. |
| Collection Bias | 2013 - The City of Boston developed Street Bump, an app that let citizens report potholes, giving the city better roadway data to find and fix issues. However, people in lower-income groups had less access to cars and phones, making their roadway issues invisible in this app. Developers worked with academics to address equitable access and digital divides for fairness. |
| Algorithmic Fairness | 2018 - The MIT Gender Shades Study evaluated the accuracy of gender classification AI products, exposing gaps in accuracy for women and persons of color. In 2019, the Apple Card appeared to offer women less credit than men. Both illustrate algorithmic bias leading to socio-economic harms. |
| Data Misrepresentation | 2020 - The Georgia Department of Public Health released COVID-19 charts that appeared to mislead citizens about trends in confirmed cases with non-chronological ordering on the x-axis. This illustrates misrepresentation through visualization tricks. |
| Illusion of free choice | 2020 - Learning app ABCmouse paid $10M to settle an FTC complaint where parents were trapped into paying for subscriptions they couldn't cancel. This illustrates dark patterns in choice architectures, where users were nudged toward potentially harmful choices. |
| Data Privacy & User Rights | 2021 - A Facebook data breach exposed data from 530M users. Facebook, which had earlier paid a $5B settlement to the FTC over privacy violations, declined to notify users of this breach, violating user rights around data transparency and access. |
Want to explore more case studies? Check out these resources:
- Ethics Unwrapped - ethics dilemmas across diverse industries.
- Data Science Ethics course - landmark case studies explored.
- Where things have gone wrong - Deon checklist with examples.
🚨 Think about the case studies you've seen - have you experienced, or been affected by, a similar ethical challenge in your life? Can you think of at least one other case study that illustrates one of the ethical challenges we've discussed in this section?
Applied Ethics
We’ve discussed ethical concepts, challenges, and case studies in real-world contexts. But how do we start applying ethical principles and practices in our projects? And how do we operationalize these practices for better governance? Let’s explore some real-world solutions:
1. Professional Codes
Professional Codes offer one option for organizations to "incentivize" members to support their ethical principles and mission statement. Codes are moral guidelines for professional behavior, helping employees or members make decisions that align with their organization's principles. They are only as effective as the voluntary compliance from members; however, many organizations offer additional rewards and penalties to motivate compliance.
Examples include:
- Oxford Munich Code of Ethics
- Data Science Association Code of Conduct (created 2013)
- ACM Code of Ethics and Professional Conduct (since 1993)
🚨 Do you belong to a professional engineering or data science organization? Explore their site to see if they define a professional code of ethics. What does this say about their ethical principles? How are they "incentivizing" members to follow the code?
2. Ethics Checklists
While professional codes define required ethical behavior from practitioners, they have known limitations in enforcement, particularly in large-scale projects. Instead, many data science experts advocate for checklists, which can connect principles to practices in more deterministic and actionable ways.
Checklists convert questions into "yes/no" tasks that can be operationalized, allowing them to be tracked as part of standard product release workflows.
Examples include:
- Deon - a general-purpose data ethics checklist created from industry recommendations with a command-line tool for easy integration.
- Privacy Audit Checklist - provides general guidance for information handling practices from legal and social exposure perspectives.
- AI Fairness Checklist - created by AI practitioners to support the adoption and integration of fairness checks into AI development cycles.
- 22 questions for ethics in data and AI - a more open-ended framework, structured for initial exploration of ethical issues in design, implementation, and organizational contexts.
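Because checklist items are "yes/no" tasks, they can be wired directly into a release workflow. Below is a minimal sketch of a checklist acting as a release gate; the item names are hypothetical, loosely modeled on checklists like Deon:

```python
# Operationalizing an ethics checklist as a release gate: every item must
# be explicitly marked complete before the release can proceed.

checklist = {
    "informed_consent_obtained": True,
    "pii_access_restricted": True,
    "dataset_bias_reviewed": False,   # unresolved item blocks the release
}

def release_blocked_by(checklist):
    """Return the checklist items that still block the release."""
    return sorted(item for item, done in checklist.items() if not done)

blockers = release_blocked_by(checklist)
if blockers:
    print(f"Release blocked by: {', '.join(blockers)}")
```

In a CI pipeline, a non-empty `blockers` list would fail the build, making the ethics review as deterministic and trackable as any other release check.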
3. Ethics Regulations
Ethics is about defining shared values and doing the right thing voluntarily. Compliance is about following the law if and where defined. Governance broadly covers all the ways in which organizations operate to enforce ethical principles and comply with established laws.
Today, governance takes two forms within organizations. First, it’s about defining ethical AI principles and establishing practices to operationalize adoption across all AI-related projects in the organization. Second, it’s about complying with all government-mandated data protection regulations for regions it operates in.
Examples of data protection and privacy regulations:
- 1974, US Privacy Act - regulates federal govt. collection, use, and disclosure of personal information.
- 1996, US Health Insurance Portability & Accountability Act (HIPAA) - protects personal health data.
- 1998, US Children's Online Privacy Protection Act (COPPA) - protects data privacy of children under 13.
- 2018, General Data Protection Regulation (GDPR) - provides user rights, data protection, and privacy.
- 2018, California Consumer Privacy Act (CCPA) - gives consumers more rights over their (personal) data.
- 2021, China's Personal Information Protection Law - one of the strongest online data privacy regulations worldwide.
🚨 The European Union’s GDPR (General Data Protection Regulation) remains one of the most influential data privacy regulations today. Did you know it also defines 8 user rights to protect citizens’ digital privacy and personal data? Learn about what these are, and why they matter.
4. Ethics Culture
There remains an intangible gap between compliance (doing enough to meet "the letter of the law") and addressing systemic issues (like ossification, information asymmetry, and distributional unfairness) that can accelerate the weaponization of AI.
The latter requires collaborative approaches to defining ethics cultures that build emotional connections and consistent shared values across organizations in the industry. This calls for more formalized data ethics cultures in organizations—allowing anyone to pull the Andon cord (to raise ethics concerns early in the process) and making ethical assessments (e.g., in hiring) a core criterion for team formation in AI projects.
Post-lecture quiz 🎯
Review & Self Study
Courses and books help with understanding core ethics concepts and challenges, while case studies and tools help with applied ethics practices in real-world contexts. Here are a few resources to start with.
- Machine Learning For Beginners - lesson on Fairness, provided by Microsoft.
- Principles of Responsible AI - free learning path available on Microsoft Learn.
- Ethics and Data Science - O'Reilly EBook by M. Loukides, H. Mason, and others.
- Data Science Ethics - online course offered by the University of Michigan.
- Ethics Unwrapped - case studies from the University of Texas.
Assignment
Write A Data Ethics Case Study
Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we aim for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

