Merge pull request #863 from microsoft/update-translations

🌐 Update translations via Co-op Translator

@ -0,0 +1,159 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "69389392fa6346e0dfa30f664b7b6fec",
"translation_date": "2025-09-06T10:54:05+00:00",
"source_file": "1-Introduction/1-intro-to-ML/README.md",
"language_code": "en"
}
-->
# Introduction to Machine Learning
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
---
[![ML for beginners - Introduction to Machine Learning for Beginners](https://img.youtube.com/vi/6mSx_KJxcHI/0.jpg)](https://youtu.be/6mSx_KJxcHI "ML for beginners - Introduction to Machine Learning for Beginners")
> 🎥 Click the image above for a short video that walks you through this lesson.
Welcome to this course on classical machine learning for beginners! Whether you're completely new to the topic or an experienced ML practitioner looking to refresh your knowledge, we're glad to have you here! We aim to create a welcoming starting point for your ML journey and would love to hear your [feedback](https://github.com/microsoft/ML-For-Beginners/discussions) to improve this course.
[![Introduction to ML](https://img.youtube.com/vi/h0e2HAPTGF4/0.jpg)](https://youtu.be/h0e2HAPTGF4 "Introduction to ML")
> 🎥 Click the image above for a video: MIT's John Guttag introduces machine learning.
---
## Getting Started with Machine Learning
Before diving into this curriculum, make sure your computer is set up to run notebooks locally.
- **Set up your machine with these videos**. Use the following links to learn [how to install Python](https://youtu.be/CXZYvNRIAKM) on your system and [set up a text editor](https://youtu.be/EU8eayHWoZg) for development.
- **Learn Python**. It's recommended to have a basic understanding of [Python](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-77952-leestott), a programming language widely used by data scientists and utilized in this course.
- **Learn Node.js and JavaScript**. We occasionally use JavaScript in this course for building web apps, so you'll need [Node.js](https://nodejs.org) and [npm](https://www.npmjs.com/) installed, along with [Visual Studio Code](https://code.visualstudio.com/) for both Python and JavaScript development.
- **Create a GitHub account**. If you found us on [GitHub](https://github.com), you might already have an account. If not, create one and fork this curriculum to use on your own. (Feel free to give us a star, too 😊)
- **Explore Scikit-learn**. Familiarize yourself with [Scikit-learn](https://scikit-learn.org/stable/user_guide.html), a set of ML libraries referenced throughout these lessons.
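To see what Scikit-learn code looks like before the lessons begin, here is a minimal sketch of the fit/predict pattern you'll meet throughout the course, using the library's built-in iris dataset (a toy example, not one of the curriculum's exercises):
```python
# A first taste of Scikit-learn: load a built-in dataset, fit a model, predict.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # features (X) and labels (y)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                              # learn patterns from the data
print(model.predict(X[:2]))                  # predict labels for two samples
```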
---
## What is Machine Learning?
The term "machine learning" is one of the most popular and widely used buzzwords today. If you have any familiarity with technology, you've likely heard it at least once, regardless of your field. However, the mechanics of machine learning remain a mystery to many. For beginners, the subject can sometimes feel overwhelming. That's why it's important to understand what machine learning truly is and learn about it step by step through practical examples.
---
## The Hype Curve
![ml hype curve](../../../../1-Introduction/1-intro-to-ML/images/hype.png)
> Google Trends shows the recent "hype curve" of the term "machine learning."
---
## A Mysterious Universe
We live in a universe filled with fascinating mysteries. Great scientists like Stephen Hawking, Albert Einstein, and many others have dedicated their lives to uncovering meaningful information about the world around us. This quest for knowledge is part of the human condition: as children, we learn new things and gradually understand the structure of our world as we grow.
---
## The Child's Brain
A child's brain and senses perceive their surroundings and gradually learn hidden patterns of life. These patterns help the child develop logical rules to identify and understand what they've learned. This learning process makes humans the most advanced living beings on Earth. By continuously discovering hidden patterns and innovating upon them, we improve ourselves throughout our lives. This ability to learn and adapt is linked to a concept called [brain plasticity](https://www.simplypsychology.org/brain-plasticity.html). On a surface level, we can draw motivational parallels between the human brain's learning process and the principles of machine learning.
---
## The Human Brain
The [human brain](https://www.livescience.com/29365-human-brain.html) perceives information from the real world, processes it, makes rational decisions, and takes actions based on circumstances. This is what we call intelligent behavior. When we program a machine to mimic this intelligent behavior, we call it artificial intelligence (AI).
---
## Some Terminology
Although the terms are often confused, machine learning (ML) is a significant subset of artificial intelligence. **ML focuses on using specialized algorithms to uncover meaningful insights and hidden patterns from data, supporting rational decision-making processes.**
---
## AI, ML, Deep Learning
![AI, ML, deep learning, data science](../../../../1-Introduction/1-intro-to-ML/images/ai-ml-ds.png)
> A diagram showing the relationships between AI, ML, deep learning, and data science. Infographic by [Jen Looper](https://twitter.com/jenlooper) inspired by [this graphic](https://softwareengineering.stackexchange.com/questions/366996/distinction-between-ai-ml-neural-networks-deep-learning-and-data-mining).
---
## Concepts to Cover
In this curriculum, we will focus on the core concepts of machine learning that every beginner should know. We will explore "classical machine learning," primarily using Scikit-learn, a popular library for learning the basics. A solid understanding of machine learning is essential for grasping broader concepts in artificial intelligence or deep learning, and we aim to provide that foundation here.
---
## In This Course, You Will Learn:
- Core concepts of machine learning
- The history of ML
- ML and fairness
- Regression ML techniques
- Classification ML techniques
- Clustering ML techniques
- Natural language processing ML techniques
- Time series forecasting ML techniques
- Reinforcement learning
- Real-world applications of ML
---
## What We Will Not Cover
- Deep learning
- Neural networks
- AI
To keep the learning experience manageable, we will avoid the complexities of neural networks, "deep learning" (which involves building multi-layered models using neural networks), and AI. These topics will be covered in a separate curriculum. Additionally, we plan to offer a data science curriculum in the future to focus on that aspect of this broader field.
---
## Why Study Machine Learning?
From a systems perspective, machine learning is the creation of automated systems that can learn hidden patterns from data to make intelligent decisions.
This concept is loosely inspired by how the human brain learns from the data it perceives in the world.
✅ Take a moment to think about why a business might prefer using machine learning strategies over creating a hard-coded, rules-based system.
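To make that contrast concrete, here is a small sketch comparing a hand-coded rule with a model that learns its decision boundary from examples. All numbers are invented for illustration:
```python
from sklearn.linear_model import LogisticRegression

# Hard-coded rule: a person picks the threshold and must revise it by hand.
def approve_by_rule(income_k):
    return income_k > 50                     # arbitrary, invented cutoff

# Learned rule: the model infers its boundary from labeled examples.
# Income is in thousands to keep the toy example numerically simple.
incomes  = [[20], [35], [52], [80], [110]]   # invented data
approved = [0, 0, 1, 1, 1]
model = LogisticRegression().fit(incomes, approved)

print(approve_by_rule(60))                   # True, because we said so
print(model.predict([[60]]))                 # boundary learned from examples
```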
---
## Applications of Machine Learning
Machine learning applications are everywhere, as ubiquitous as the data generated by our smartphones, connected devices, and other systems. Given the immense potential of state-of-the-art ML algorithms, researchers are exploring their ability to solve complex, real-world problems across multiple disciplines, often with remarkable results.
---
## Examples of Applied ML
**Machine learning can be used in many ways**:
- Predicting the likelihood of disease based on a patient's medical history or reports.
- Using weather data to forecast weather events.
- Analyzing text to understand sentiment.
- Detecting fake news to prevent the spread of misinformation.
Fields like finance, economics, earth science, space exploration, biomedical engineering, cognitive science, and even the humanities have adopted machine learning to tackle data-intensive challenges in their domains.
---
## Conclusion
Machine learning automates the discovery of patterns by extracting meaningful insights from real-world or generated data. It has proven to be highly valuable in business, healthcare, finance, and other fields.
In the near future, understanding the basics of machine learning will become essential for people in any domain due to its widespread adoption.
---
## 🚀 Challenge
Sketch, on paper or using an online app like [Excalidraw](https://excalidraw.com/), your understanding of the differences between AI, ML, deep learning, and data science. Include examples of problems that each technique is well-suited to solve.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
---
## Review & Self-Study
To learn more about working with ML algorithms in the cloud, follow this [Learning Path](https://docs.microsoft.com/learn/paths/create-no-code-predictive-models-azure-machine-learning/?WT.mc_id=academic-77952-leestott).
Take a [Learning Path](https://docs.microsoft.com/learn/modules/introduction-to-machine-learning/?WT.mc_id=academic-77952-leestott) to explore the basics of ML.
---
## Assignment
[Get up and running](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,23 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "4c4698044bb8af52cfb6388a4ee0e53b",
"translation_date": "2025-09-06T10:54:29+00:00",
"source_file": "1-Introduction/1-intro-to-ML/assignment.md",
"language_code": "en"
}
-->
# Get Up and Running
## Instructions
In this non-graded assignment, you should refresh your Python knowledge and set up your environment to be able to run notebooks.
Follow this [Python Learning Path](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-77952-leestott), and then prepare your systems by watching these introductory videos:
https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,164 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "6a05fec147e734c3e6bfa54505648e2b",
"translation_date": "2025-09-06T10:54:33+00:00",
"source_file": "1-Introduction/2-history-of-ML/README.md",
"language_code": "en"
}
-->
# History of Machine Learning
![Summary of History of Machine Learning in a sketchnote](../../../../sketchnotes/ml-history.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
---
[![ML for beginners - History of Machine Learning](https://img.youtube.com/vi/N6wxM4wZ7V0/0.jpg)](https://youtu.be/N6wxM4wZ7V0 "ML for beginners - History of Machine Learning")
> 🎥 Click the image above for a short video covering this lesson.
In this lesson, we'll explore the key milestones in the history of machine learning and artificial intelligence.
The history of artificial intelligence (AI) as a field is closely tied to the history of machine learning, as the algorithms and computational advancements that drive ML have contributed to the development of AI. It's worth noting that while these fields began to take shape as distinct areas of study in the 1950s, significant [algorithmic, statistical, mathematical, computational, and technical discoveries](https://wikipedia.org/wiki/Timeline_of_machine_learning) predate and overlap this period. In fact, people have been pondering these ideas for [centuries](https://wikipedia.org/wiki/History_of_artificial_intelligence). This article delves into the intellectual foundations of the concept of a "thinking machine."
---
## Notable Discoveries
- 1763, 1812 [Bayes' Theorem](https://wikipedia.org/wiki/Bayes%27_theorem) and its predecessors. This theorem and its applications form the basis of inference, describing the probability of an event occurring based on prior knowledge (stated formally after this list).
- 1805 [Least Square Theory](https://wikipedia.org/wiki/Least_squares) by French mathematician Adrien-Marie Legendre. This theory, which you'll learn about in our Regression unit, aids in data fitting.
- 1913 [Markov Chains](https://wikipedia.org/wiki/Markov_chain), named after Russian mathematician Andrey Markov, are used to describe sequences of possible events based on a previous state.
- 1957 [Perceptron](https://wikipedia.org/wiki/Perceptron), a type of linear classifier invented by American psychologist Frank Rosenblatt, laid the groundwork for advances in deep learning.
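For reference, Bayes' Theorem from the first item above can be stated as:
$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
$$
where $P(A)$ is the prior probability, $P(B \mid A)$ the likelihood of the evidence, and $P(A \mid B)$ the updated (posterior) probability once the evidence $B$ is taken into account.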
---
- 1967 [Nearest Neighbor](https://wikipedia.org/wiki/Nearest_neighbor), originally designed for mapping routes, is used in ML to detect patterns.
- 1970 [Backpropagation](https://wikipedia.org/wiki/Backpropagation) is employed to train [feedforward neural networks](https://wikipedia.org/wiki/Feedforward_neural_network).
- 1982 [Recurrent Neural Networks](https://wikipedia.org/wiki/Recurrent_neural_network), derived from feedforward neural networks, create temporal graphs.
✅ Do some research. What other dates stand out as pivotal in the history of ML and AI?
---
## 1950: Machines That Think
Alan Turing, an extraordinary individual who was voted [by the public in 2019](https://wikipedia.org/wiki/Icons:_The_Greatest_Person_of_the_20th_Century) as the greatest scientist of the 20th century, is credited with laying the foundation for the concept of a "machine that can think." He addressed skeptics and his own need for empirical evidence by creating the [Turing Test](https://www.bbc.com/news/technology-18475646), which you'll explore in our NLP lessons.
---
## 1956: Dartmouth Summer Research Project
"The Dartmouth Summer Research Project on artificial intelligence was a landmark event for AI as a field," and it was here that the term "artificial intelligence" was coined ([source](https://250.dartmouth.edu/highlights/artificial-intelligence-ai-coined-dartmouth)).
> Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.
---
The lead researcher, mathematics professor John McCarthy, aimed "to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." Participants included another prominent figure in the field, Marvin Minsky.
The workshop is credited with sparking discussions on topics such as "the rise of symbolic methods, systems focused on limited domains (early expert systems), and deductive systems versus inductive systems." ([source](https://wikipedia.org/wiki/Dartmouth_workshop)).
---
## 1956 - 1974: "The Golden Years"
From the 1950s to the mid-1970s, there was great optimism about AI's potential to solve numerous problems. In 1967, Marvin Minsky confidently stated, "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved." (Minsky, Marvin (1967), Computation: Finite and Infinite Machines, Englewood Cliffs, N.J.: Prentice-Hall)
Research in natural language processing flourished, search algorithms were refined and made more powerful, and the concept of "micro-worlds" emerged, where simple tasks were completed using plain language instructions.
---
Government agencies provided generous funding, computational and algorithmic advancements were made, and prototypes of intelligent machines were developed. Some of these machines include:
* [Shakey the robot](https://wikipedia.org/wiki/Shakey_the_robot), which could navigate and decide how to perform tasks "intelligently."
![Shakey, an intelligent robot](../../../../1-Introduction/2-history-of-ML/images/shakey.jpg)
> Shakey in 1972
---
* Eliza, an early "chatterbot," could converse with people and act as a primitive "therapist." You'll learn more about Eliza in the NLP lessons.
![Eliza, a bot](../../../../1-Introduction/2-history-of-ML/images/eliza.png)
> A version of Eliza, a chatbot
---
* "Blocks world" was an example of a micro-world where blocks could be stacked and sorted, allowing experiments in teaching machines to make decisions. Advances using libraries like [SHRDLU](https://wikipedia.org/wiki/SHRDLU) propelled language processing forward.
[![blocks world with SHRDLU](https://img.youtube.com/vi/QAJz4YKUwqw/0.jpg)](https://www.youtube.com/watch?v=QAJz4YKUwqw "blocks world with SHRDLU")
> 🎥 Click the image above for a video: Blocks world with SHRDLU
---
## 1974 - 1980: "AI Winter"
By the mid-1970s, it became clear that the complexity of creating "intelligent machines" had been underestimated and that its promise, given the available computational power, had been overstated. Funding dried up, and confidence in the field waned. Some factors that contributed to this decline included:
---
- **Limitations**. Computational power was insufficient.
- **Combinatorial explosion**. The number of parameters required for training grew exponentially as more was demanded of computers, without a corresponding evolution in computational power and capability.
- **Lack of data**. A shortage of data hindered the testing, development, and refinement of algorithms.
- **Are we asking the right questions?** Researchers began to question the very questions they were pursuing:
- Turing tests faced criticism, including the "Chinese room theory," which argued that "programming a digital computer may make it appear to understand language but could not produce real understanding." ([source](https://plato.stanford.edu/entries/chinese-room/))
- Ethical concerns arose about introducing artificial intelligences like the "therapist" ELIZA into society.
---
During this time, different schools of thought in AI emerged. A dichotomy developed between ["scruffy" vs. "neat AI"](https://wikipedia.org/wiki/Neats_and_scruffies) approaches. _Scruffy_ labs tweaked programs extensively to achieve desired results, while _neat_ labs focused on logic and formal problem-solving. ELIZA and SHRDLU were well-known _scruffy_ systems. In the 1980s, as the demand for reproducible ML systems grew, the _neat_ approach gained prominence due to its more explainable results.
---
## 1980s: Expert Systems
As the field matured, its value to businesses became evident, leading to the proliferation of "expert systems" in the 1980s. "Expert systems were among the first truly successful forms of artificial intelligence (AI) software." ([source](https://wikipedia.org/wiki/Expert_system))
These systems were _hybrid_, combining a rules engine that defined business requirements with an inference engine that used the rules to deduce new facts.
This era also saw growing interest in neural networks.
---
## 1987 - 1993: AI 'Chill'
The rise of specialized expert systems hardware had the unintended consequence of becoming overly specialized. Meanwhile, the advent of personal computers competed with these large, centralized systems. The democratization of computing had begun, eventually paving the way for the modern explosion of big data.
---
## 1993 - 2011
This period marked a new chapter for ML and AI, enabling solutions to earlier challenges caused by limited data and computational power. Data availability grew rapidly, especially with the introduction of smartphones around 2007. Computational power expanded exponentially, and algorithms evolved in tandem. The field began to mature, transitioning from its freewheeling early days into a structured discipline.
---
## Now
Today, machine learning and AI influence nearly every aspect of our lives. This era demands a thoughtful understanding of the risks and potential impacts of these algorithms on human lives. As Microsoft's Brad Smith has noted, "Information technology raises issues that go to the heart of fundamental human-rights protections like privacy and freedom of expression. These issues heighten responsibility for tech companies that create these products. In our view, they also call for thoughtful government regulation and for the development of norms around acceptable uses" ([source](https://www.technologyreview.com/2019/12/18/102365/the-future-of-ais-impact-on-society/)).
---
The future remains uncertain, but understanding these systems, their software, and algorithms is crucial. We hope this curriculum will help you gain the knowledge needed to form your own perspective.
[![The history of deep learning](https://img.youtube.com/vi/mTtDfKgLm54/0.jpg)](https://www.youtube.com/watch?v=mTtDfKgLm54 "The history of deep learning")
> 🎥 Click the image above for a video: Yann LeCun discusses the history of deep learning in this lecture
---
## 🚀Challenge
Dive deeper into one of these historical moments and learn more about the people behind them. The characters are fascinating, and no scientific discovery happens in isolation. What do you uncover?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
---
## Review & Self Study
Here are some resources to watch and listen to:
[This podcast where Amy Boyd discusses the evolution of AI](http://runasradio.com/Shows/Show/739)
[![The history of AI by Amy Boyd](https://img.youtube.com/vi/EJt3_bFYKss/0.jpg)](https://www.youtube.com/watch?v=EJt3_bFYKss "The history of AI by Amy Boyd")
---
## Assignment
[Create a timeline](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "eb6e4d5afd1b21a57d2b9e6d0aac3969",
"translation_date": "2025-09-06T10:54:57+00:00",
"source_file": "1-Introduction/2-history-of-ML/assignment.md",
"language_code": "en"
}
-->
# Create a timeline
## Instructions
Using [this repo](https://github.com/Digital-Humanities-Toolkit/timeline-builder), create a timeline about some aspect of the history of algorithms, mathematics, statistics, AI, or ML, or a combination of these. You can focus on one person, one idea, or a long period of thought. Be sure to include multimedia elements.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------- | --------------------------------------- | ---------------------------------------------------------------- |
| | A deployed timeline is presented as a GitHub page | The code is incomplete and not deployed | The timeline is incomplete, not well researched, and not deployed |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,170 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "9a6b702d1437c0467e3c5c28d763dac2",
"translation_date": "2025-09-06T10:53:05+00:00",
"source_file": "1-Introduction/3-fairness/README.md",
"language_code": "en"
}
-->
# Building Machine Learning solutions with responsible AI
![Summary of responsible AI in Machine Learning in a sketchnote](../../../../sketchnotes/ml-fairness.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Introduction
In this curriculum, you will begin to explore how machine learning is shaping and influencing our daily lives. Today, systems and models play a role in decision-making processes such as healthcare diagnoses, loan approvals, or fraud detection. It is crucial that these models perform well and deliver trustworthy outcomes. Like any software application, AI systems can fail to meet expectations or produce undesirable results. This is why it is essential to understand and explain the behavior of AI models.
Consider what might happen if the data used to build these models lacks representation of certain demographics, such as race, gender, political views, or religion, or if it disproportionately represents some groups. What if the model's output favors one demographic over another? What are the consequences for the application? Furthermore, what happens when the model causes harm? Who is accountable for the behavior of AI systems? These are some of the questions we will explore in this curriculum.
In this lesson, you will:
- Learn about the importance of fairness in machine learning and the harms related to unfairness.
- Understand the practice of exploring outliers and unusual scenarios to ensure reliability and safety.
- Recognize the need to design inclusive systems that empower everyone.
- Explore the importance of protecting privacy and security for both data and individuals.
- Understand the value of a transparent approach to explain AI model behavior.
- Appreciate how accountability is key to building trust in AI systems.
## Prerequisite
Before starting, please complete the "Responsible AI Principles" Learn Path and watch the video below on the topic:
Learn more about Responsible AI by following this [Learning Path](https://docs.microsoft.com/learn/modules/responsible-ai-principles/?WT.mc_id=academic-77952-leestott)
[![Microsoft's Approach to Responsible AI](https://img.youtube.com/vi/dnC8-uUZXSc/0.jpg)](https://youtu.be/dnC8-uUZXSc "Microsoft's Approach to Responsible AI")
> 🎥 Click the image above for a video: Microsoft's Approach to Responsible AI
## Fairness
AI systems should treat everyone fairly and avoid disadvantaging similar groups of people. For example, when AI systems provide recommendations for medical treatment, loan applications, or employment, they should offer the same guidance to individuals with similar symptoms, financial situations, or qualifications. As humans, we all carry biases that influence our decisions and actions. These biases can also appear in the data used to train AI systems, sometimes unintentionally. It can be challenging to recognize when bias is being introduced into data.
**“Unfairness”** refers to negative impacts, or “harms,” experienced by a group of people, such as those defined by race, gender, age, or disability. The main types of fairness-related harms include:
- **Allocation**: Favoring one gender, ethnicity, or group over another.
- **Quality of service**: Training a model for one specific scenario while ignoring the complexity of real-world situations, leading to poor performance. For example, a soap dispenser that fails to detect people with dark skin. [Reference](https://gizmodo.com/why-cant-this-soap-dispenser-identify-dark-skin-1797931773)
- **Denigration**: Unfairly criticizing or labeling someone or something. For instance, an image labeling system that mislabeled dark-skinned individuals as gorillas.
- **Over- or under-representation**: When a group is absent or underrepresented in a profession, and systems perpetuate this imbalance.
- **Stereotyping**: Associating a group with predefined attributes. For example, a language translation system between English and Turkish may produce errors due to gender-based stereotypes.
![translation to Turkish](../../../../1-Introduction/3-fairness/images/gender-bias-translate-en-tr.png)
> translation to Turkish
![translation back to English](../../../../1-Introduction/3-fairness/images/gender-bias-translate-tr-en.png)
> translation back to English
When designing and testing AI systems, it is essential to ensure that AI is fair and does not make biased or discriminatory decisions—just as humans are prohibited from doing. Achieving fairness in AI and machine learning is a complex sociotechnical challenge.
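One concrete way to probe for such harms is to disaggregate a model's performance by a sensitive feature. Below is a minimal sketch, assuming the open-source [Fairlearn](https://fairlearn.org) library; the toy arrays are invented for illustration:
```python
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

# Invented toy predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]
group  = ["A", "A", "A", "B", "B", "B"]   # hypothetical sensitive feature

mf = MetricFrame(metrics=accuracy_score, y_true=y_true, y_pred=y_pred,
                 sensitive_features=group)
print(mf.overall)    # accuracy over everyone
print(mf.by_group)   # accuracy per group; a large gap signals potential harm
```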
### Reliability and safety
To build trust, AI systems must be reliable, safe, and consistent under both normal and unexpected conditions. It is important to understand how AI systems behave in various situations, especially outliers. When developing AI solutions, significant attention must be given to handling a wide range of scenarios the system might encounter. For example, a self-driving car must prioritize people's safety. The AI powering the car must account for scenarios like nighttime driving, thunderstorms, blizzards, children running into the street, pets, road construction, and more. The reliability and safety of an AI system reflect the level of foresight and preparation by the data scientist or AI developer during design and testing.
> [🎥 Click here for a video: reliability and safety in AI](https://www.microsoft.com/videoplayer/embed/RE4vvIl)
### Inclusiveness
AI systems should be designed to engage and empower everyone. Data scientists and AI developers must identify and address potential barriers in the system that could unintentionally exclude people. For example, there are 1 billion people with disabilities worldwide. Advances in AI can help them access information and opportunities more easily in their daily lives. Addressing barriers creates opportunities to innovate and develop AI products that provide better experiences for everyone.
> [🎥 Click here for a video: inclusiveness in AI](https://www.microsoft.com/videoplayer/embed/RE4vl9v)
### Security and privacy
AI systems must be safe and respect people's privacy. People are less likely to trust systems that put their privacy, information, or lives at risk. When training machine learning models, data is essential for producing accurate results. However, the origin and integrity of the data must be considered. For example, was the data user-submitted or publicly available? Additionally, AI systems must protect confidential information and resist attacks. As AI becomes more widespread, safeguarding privacy and securing personal and business information are increasingly critical and complex. Privacy and data security require special attention because AI systems rely on data to make accurate predictions and decisions.
> [🎥 Click here for a video: security in AI](https://www.microsoft.com/videoplayer/embed/RE4voJF)
- The industry has made significant progress in privacy and security, driven by regulations like GDPR (General Data Protection Regulation).
- However, AI systems face a tension between the need for personal data to improve effectiveness and the need to protect privacy.
- Just as the internet brought new security challenges, AI has led to a rise in security issues.
- At the same time, AI is being used to enhance security, such as in modern antivirus scanners powered by AI heuristics.
- Data science processes must align with the latest privacy and security practices.
### Transparency
AI systems should be understandable. Transparency involves explaining the behavior of AI systems and their components. Stakeholders need to understand how and why AI systems function to identify potential performance issues, safety and privacy concerns, biases, exclusionary practices, or unintended outcomes. Those who use AI systems should also be transparent about when, why, and how they deploy them, as well as the systems' limitations. For example, if a bank uses AI for lending decisions, it is important to examine the outcomes and understand which data influences the system's recommendations. Governments are beginning to regulate AI across industries, so data scientists and organizations must ensure their systems meet regulatory requirements, especially in cases of undesirable outcomes.
> [🎥 Click here for a video: transparency in AI](https://www.microsoft.com/videoplayer/embed/RE4voJF)
- AI systems are complex, making it difficult to understand how they work and interpret their results.
- This lack of understanding affects how systems are managed, operationalized, and documented.
- More importantly, it impacts the decisions made based on the systems results.
### Accountability
The people who design and deploy AI systems must be accountable for their operation. Accountability is especially critical for sensitive technologies like facial recognition. For example, law enforcement agencies may use facial recognition to find missing children, but the same technology could enable governments to infringe on citizens' freedoms through continuous surveillance. Data scientists and organizations must take responsibility for how their AI systems impact individuals and society.
[![Leading AI Researcher Warns of Mass Surveillance Through Facial Recognition](../../../../1-Introduction/3-fairness/images/accountability.png)](https://www.youtube.com/watch?v=Wldt8P5V6D0 "Microsoft's Approach to Responsible AI")
> 🎥 Click the image above for a video: Warnings of Mass Surveillance Through Facial Recognition
One of the most significant questions for our generation, as the first to bring AI to society, is how to ensure that computers remain accountable to people and that the people designing these systems remain accountable to everyone else.
## Impact assessment
Before training a machine learning model, it is important to conduct an impact assessment to understand the purpose of the AI system, its intended use, where it will be deployed, and who will interact with it. This helps reviewers or testers identify potential risks and expected consequences.
Key areas to focus on during an impact assessment include:
- **Adverse impact on individuals**: Be aware of any restrictions, requirements, unsupported uses, or known limitations that could hinder the system's performance and cause harm.
- **Data requirements**: Understand how and where the system will use data to identify any regulatory requirements (e.g., GDPR or HIPAA) and ensure the data source and quantity are sufficient for training.
- **Summary of impact**: Identify potential harms that could arise from using the system and review whether these issues are addressed throughout the ML lifecycle.
- **Applicable goals** for each of the six core principles: Assess whether the goals of each principle are met and identify any gaps.
## Debugging with responsible AI
Debugging an AI system is similar to debugging a software application—it involves identifying and resolving issues. Many factors can cause a model to perform unexpectedly or irresponsibly. Traditional model performance metrics, which are often quantitative aggregates, are insufficient for analyzing how a model violates responsible AI principles. Additionally, machine learning models are often black boxes, making it difficult to understand their outcomes or explain their mistakes. Later in this course, we will learn how to use the Responsible AI dashboard to debug AI systems. This dashboard provides a comprehensive tool for data scientists and AI developers to:
- **Perform error analysis**: Identify error distributions that affect fairness or reliability.
- **Review the model overview**: Discover disparities in the model's performance across data cohorts.
- **Analyze data**: Understand data distribution and identify potential biases that could impact fairness, inclusiveness, and reliability.
- **Interpret the model**: Understand what influences the model's predictions, which is essential for transparency and accountability.
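As a preview, wiring a trained model into the dashboard typically looks like the sketch below. The names `model`, `train_df`, `test_df`, and `"target"` are placeholders, and the sketch assumes the `responsibleai` and `raiwidgets` packages; treat it as an outline rather than a definitive recipe:
```python
# Sketch only: model, train_df, test_df, and "target" are placeholders.
# Assumes `pip install responsibleai raiwidgets` and a fitted classifier.
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

insights = RAIInsights(model, train_df, test_df,
                       target_column="target", task_type="classification")
insights.explainer.add()          # model interpretation
insights.error_analysis.add()     # error distribution across cohorts
insights.compute()
ResponsibleAIDashboard(insights)  # renders the interactive dashboard
```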
## 🚀 Challenge
To prevent harms from being introduced in the first place, we should:
- Ensure diversity in the backgrounds and perspectives of the people working on AI systems.
- Invest in datasets that reflect the diversity of society.
- Develop better methods throughout the machine learning lifecycle to detect and address responsible AI issues.
Think about real-life scenarios where a model's lack of trustworthiness is evident during development or use. What else should we consider?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
In this lesson, you have learned the basics of fairness and unfairness in machine learning.
Watch this workshop to explore the topics in more detail:
- In pursuit of responsible AI: Applying principles in practice by Besmira Nushi, Mehrnoosh Sameki, and Amit Sharma
[![Responsible AI Toolbox: An open-source framework for building responsible AI](https://img.youtube.com/vi/tGgJCrA-MZU/0.jpg)](https://www.youtube.com/watch?v=tGgJCrA-MZU "RAI Toolbox: An open-source framework for building responsible AI")
> 🎥 Click the image above to watch the video: RAI Toolbox: An open-source framework for building responsible AI by Besmira Nushi, Mehrnoosh Sameki, and Amit Sharma
Additionally, check out:
- Microsoft's RAI resource center: [Responsible AI Resources - Microsoft AI](https://www.microsoft.com/ai/responsible-ai-resources?activetab=pivot1%3aprimaryr4)
- Microsoft's FATE research group: [FATE: Fairness, Accountability, Transparency, and Ethics in AI - Microsoft Research](https://www.microsoft.com/research/theme/fate/)
RAI Toolbox:
- [Responsible AI Toolbox GitHub repository](https://github.com/microsoft/responsible-ai-toolbox)
Learn about Azure Machine Learning's tools for ensuring fairness:
- [Azure Machine Learning](https://docs.microsoft.com/azure/machine-learning/concept-fairness-ml?WT.mc_id=academic-77952-leestott)
## Assignment
[Explore RAI Toolbox](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "dbda60e7b1fe5f18974e7858eff0004e",
"translation_date": "2025-09-06T10:53:35+00:00",
"source_file": "1-Introduction/3-fairness/assignment.md",
"language_code": "en"
}
-->
# Explore the Responsible AI Toolbox
## Instructions
In this lesson, you learned about the Responsible AI Toolbox, an "open-source, community-driven project designed to help data scientists analyze and improve AI systems." For this assignment, explore one of RAI Toolbox's [notebooks](https://github.com/microsoft/responsible-ai-toolbox/blob/main/notebooks/responsibleaidashboard/getting-started.ipynb) and share your findings in a paper or presentation.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A paper or PowerPoint presentation is provided discussing Fairlearn's systems, the notebook that was executed, and the conclusions derived from running it | A paper is provided without conclusions | No paper is provided |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,132 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "9d91f3af3758fdd4569fb410575995ef",
"translation_date": "2025-09-06T10:53:40+00:00",
"source_file": "1-Introduction/4-techniques-of-ML/README.md",
"language_code": "en"
}
-->
# Techniques of Machine Learning
The process of creating, using, and maintaining machine learning models and the data they rely on is quite different from many other development workflows. In this lesson, we will break down the process and outline the key techniques you need to understand. You will:
- Gain a high-level understanding of the processes behind machine learning.
- Explore foundational concepts such as 'models,' 'predictions,' and 'training data.'
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
[![ML for beginners - Techniques of Machine Learning](https://img.youtube.com/vi/4NGM0U2ZSHU/0.jpg)](https://youtu.be/4NGM0U2ZSHU "ML for beginners - Techniques of Machine Learning")
> 🎥 Click the image above for a short video walkthrough of this lesson.
## Introduction
At a high level, the process of creating machine learning (ML) systems involves several steps (a code sketch follows this list):
1. **Define the question**. Most ML processes begin with a question that cannot be answered using a simple conditional program or rules-based system. These questions often focus on making predictions based on a dataset.
2. **Collect and prepare data**. To answer your question, you need data. The quality and sometimes the quantity of your data will determine how well you can address your initial question. Visualizing data is an important part of this phase. This phase also includes splitting the data into training and testing sets to build a model.
3. **Select a training method**. Depending on your question and the nature of your data, you need to choose how to train a model to best represent your data and make accurate predictions. This step often requires specific expertise and a significant amount of experimentation.
4. **Train the model**. Using your training data, you'll apply various algorithms to train a model to recognize patterns in the data. The model may use internal weights that can be adjusted to prioritize certain parts of the data over others to improve its performance.
5. **Evaluate the model**. You use unseen data (your testing data) from your dataset to assess how well the model performs.
6. **Tune parameters**. Based on the model's performance, you can repeat the process using different parameters or variables that control the behavior of the algorithms used to train the model.
7. **Make predictions**. Use new inputs to test the model's accuracy.
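The sketch below maps steps 2 through 7 onto Scikit-learn, using its built-in diabetes dataset so that it runs as-is; a real project would substitute its own data, training method, and evaluation metrics:
```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)                    # 2. collect & prepare
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)                 #    split the data

model = Ridge(alpha=1.0)                                 # 3. choose a method
model.fit(X_train, y_train)                              # 4. train
print("R^2:", r2_score(y_test, model.predict(X_test)))   # 5. evaluate

model = Ridge(alpha=0.1).fit(X_train, y_train)           # 6. tune a parameter
print(model.predict(X_test[:1]))                         # 7. predict new input
```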
## What question to ask
Computers excel at uncovering hidden patterns in data. This capability is particularly useful for researchers who have questions about a specific domain that cannot be easily answered by creating a rules-based system. For example, in an actuarial task, a data scientist might create handcrafted rules to analyze the mortality rates of smokers versus non-smokers.
However, when many other variables are introduced, an ML model might be more effective at predicting future mortality rates based on past health data. A more optimistic example could involve predicting the weather for April in a specific location using data such as latitude, longitude, climate change, proximity to the ocean, jet stream patterns, and more.
✅ This [slide deck](https://www2.cisl.ucar.edu/sites/default/files/2021-10/0900%20June%2024%20Haupt_0.pdf) on weather models provides a historical perspective on using ML for weather analysis.
## Pre-building tasks
Before you start building your model, there are several tasks you need to complete. To test your question and form a hypothesis based on the model's predictions, you need to identify and configure several elements.
### Data
To answer your question with confidence, you need a sufficient amount of data of the right type. At this stage, you need to:
- **Collect data**. Referencing the previous lesson on fairness in data analysis, collect your data carefully. Be mindful of its sources, any inherent biases, and document its origin.
- **Prepare data**. Data preparation involves several steps. You may need to combine and normalize data from different sources. You can enhance the quality and quantity of your data through methods like converting strings to numbers (as seen in [Clustering](../../5-Clustering/1-Visualize/README.md)). You might also generate new data based on the original (as seen in [Classification](../../4-Classification/1-Introduction/README.md)). You can clean and edit the data (as we will do before the [Web App](../../3-Web-App/README.md) lesson). Additionally, you may need to randomize and shuffle the data depending on your training techniques.
✅ After collecting and processing your data, take a moment to evaluate whether its structure will allow you to address your intended question. Sometimes, the data may not perform well for your specific task, as we discover in our [Clustering](../../5-Clustering/1-Visualize/README.md) lessons!
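As an illustration of the preparation steps described above, here is a small sketch that converts a string column to numbers and shuffles the rows; the tiny DataFrame is invented for the example:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "red", "blue"],
                   "size":  [3, 1, 2, 4]})                     # invented sample
df["color"] = LabelEncoder().fit_transform(df["color"])        # strings -> ints
df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle rows
print(df)
```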
### Features and Target
A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets, it is represented as a column heading like 'date,' 'size,' or 'color.' Feature variables, often represented as `X` in code, are the input variables used to train the model.
A target is what you are trying to predict. Targets, usually represented as `y` in code, are the answers to the questions you are asking of your data: In December, what **color** pumpkins will be cheapest? In San Francisco, which neighborhoods will have the best real estate **prices**? Sometimes, the target is also referred to as the label attribute.
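In code, separating features from the target in a pandas DataFrame typically looks like the sketch below; the pumpkin columns are invented for illustration:
```python
import pandas as pd

# Invented pumpkin data; the real lessons load a CSV instead.
pumpkins = pd.DataFrame({"month": [9, 10, 12], "size": [4, 7, 5],
                         "color": ["orange", "white", "orange"]})
X = pumpkins[["month", "size"]]   # feature variables
y = pumpkins["color"]             # target (label) we want to predict
```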
### Selecting your feature variable
🎓 **Feature Selection and Feature Extraction** How do you decide which variables to use when building a model? You will likely go through a process of feature selection or feature extraction to identify the best variables for the most effective model. These processes differ: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." ([source](https://wikipedia.org/wiki/Feature_selection))
### Visualize your data
Visualization is a powerful tool in a data scientist's toolkit. Libraries like Seaborn or Matplotlib allow you to represent your data visually, which can help uncover hidden correlations that you can leverage. Visualizations can also reveal bias or imbalances in your data (as seen in [Classification](../../4-Classification/2-Classifiers-1/README.md)).
### Split your dataset
Before training, you need to divide your dataset into two or more parts of unequal size that still represent the data well.
- **Training**. This portion of the dataset is used to train your model. It typically constitutes the majority of the original dataset.
- **Testing**. A test dataset is an independent subset of the original data used to validate the model's performance.
- **Validating**. A validation set is a smaller independent subset used to fine-tune the model's hyperparameters or architecture to improve its performance. Depending on the size of your data and the question you are addressing, you may not need to create this third set (as noted in [Time Series Forecasting](../../7-TimeSeries/1-Introduction/README.md)).
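A sketch of such a split with Scikit-learn, carving a validation set out of the training portion; the 60/20/20 proportions are a common convention, not a rule:
```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)        # hold out 20% for testing
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25,           # 0.25 * 0.8 = 20% validation
    random_state=0)
print(len(X_train), len(X_val), len(X_test))
```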
## Building a model
Using your training data, your goal is to build a model—a statistical representation of your data—using various algorithms to **train** it. Training a model exposes it to data, enabling it to identify patterns, validate them, and accept or reject them.
### Decide on a training method
Depending on your question and the nature of your data, you will select a method to train the model. By exploring [Scikit-learn's documentation](https://scikit-learn.org/stable/user_guide.html)—which we use in this course—you can examine various ways to train a model. Depending on your experience, you may need to try multiple methods to build the best model. Data scientists often evaluate a model's performance by testing it with unseen data, checking for accuracy, bias, and other issues, and selecting the most suitable training method for the task.
### Train a model
With your training data, you are ready to 'fit' it to create a model. In many ML libraries, you will encounter the code 'model.fit'—this is where you input your feature variable as an array of values (usually 'X') and a target variable (usually 'y').
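For example, with Scikit-learn (reusing `X_train` and `y_train` from the split sketch above):
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)  # X: 2-D array of features, y: 1-D array of targets
```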
### Evaluate the model
Once the training process is complete (it may require many iterations, or 'epochs,' to train a large model), you can evaluate the model's quality using test data to measure its performance. This test data is a subset of the original data that the model has not previously analyzed. You can generate a table of metrics to assess the model's quality.
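Continuing the same sketch, a few common regression metrics computed on the held-out test data:
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)   # test data the model has never seen
print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```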
🎓 **Model fitting**
In machine learning, model fitting refers to how accurately the model's underlying function analyzes data it has not encountered before.
🎓 **Underfitting** and **overfitting** are common issues that reduce a model's quality. An underfit model fails to analyze both its training data and unseen data accurately. An overfit model performs too well on training data because it has learned the data's details and noise excessively. Both scenarios lead to poor predictions.
![overfitting model](../../../../1-Introduction/4-techniques-of-ML/images/overfitting.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
## Parameter tuning
After initial training, evaluate the model's quality and consider improving it by adjusting its 'hyperparameters.' Learn more about this process [in the documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?WT.mc_id=academic-77952-leestott).
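A minimal grid-search sketch; the parameter grid is illustrative, since every model type has its own hyperparameters:
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={"n_neighbors": [3, 5, 7, 9]},
                    cv=5)                        # 5-fold cross-validation
grid.fit(X_train, y_train)                       # reuses the earlier split
print(grid.best_params_, grid.best_score_)
```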
## Prediction
This is the stage where you use entirely new data to test your model's accuracy. In an applied ML setting, such as building web applications for production, this process might involve gathering user input (e.g., a button press) to set a variable and send it to the model for inference or evaluation.
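In a web setting, that flow might look like the hedged Flask sketch below. The route name and payload shape are invented, and `model` is assumed to be a model trained earlier; the Web App lesson builds a complete version:
```python
# Hypothetical inference endpoint: user input -> model.predict -> response.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = [request.json["features"]]   # e.g. {"features": [9, 4]}
    return jsonify(prediction=model.predict(features).tolist())
```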
In these lessons, you will learn how to prepare, build, test, evaluate, and predict—covering all the steps of a data scientist and more as you progress toward becoming a 'full stack' ML engineer.
---
## 🚀Challenge
Create a flow chart illustrating the steps of an ML practitioner. Where do you see yourself in this process right now? Where do you anticipate challenges? What seems straightforward to you?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Search online for interviews with data scientists discussing their daily work. Here is [one](https://www.youtube.com/watch?v=Z3IjgbbCEfs).
## Assignment
[Interview a data scientist](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "70d65aeddc06170bc1aed5b27805f930",
"translation_date": "2025-09-06T10:54:01+00:00",
"source_file": "1-Introduction/4-techniques-of-ML/assignment.md",
"language_code": "en"
}
-->
# Interview a data scientist
## Instructions
In your company, a user group, or among your friends or classmates, talk to someone who works professionally as a data scientist. Write a short essay (500 words) about their daily tasks. Are they specialists, or do they work as 'full stack' professionals?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------------------------------ | ------------------------------------------------------------------ | --------------------- |
| | An essay of the correct length, with attributed sources, is presented as a .doc file | The essay is poorly attributed or shorter than the required length | No essay is presented |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,37 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "cf8ecc83f28e5b98051d2179eca08e08",
"translation_date": "2025-09-06T10:52:57+00:00",
"source_file": "1-Introduction/README.md",
"language_code": "en"
}
-->
# Introduction to machine learning
In this section of the curriculum, you will be introduced to the fundamental concepts behind the field of machine learning, what it entails, and learn about its history and the techniques researchers use to work with it. Let's dive into this exciting world of ML together!
![globe](../../../1-Introduction/images/globe.jpg)
> Photo by <a href="https://unsplash.com/@bill_oxford?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Bill Oxford</a> on <a href="https://unsplash.com/s/photos/globe?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
### Lessons
1. [Introduction to machine learning](1-intro-to-ML/README.md)
1. [The History of machine learning and AI](2-history-of-ML/README.md)
1. [Fairness and machine learning](3-fairness/README.md)
1. [Techniques of machine learning](4-techniques-of-ML/README.md)
### Credits
"Introduction to Machine Learning" was created with ♥️ by a team including [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan), [Ornella Altunyan](https://twitter.com/ornelladotcom), and [Jen Looper](https://twitter.com/jenlooper)
"The History of Machine Learning" was created with ♥️ by [Jen Looper](https://twitter.com/jenlooper) and [Amy Boyd](https://twitter.com/AmyKateNicho)
"Fairness and Machine Learning" was created with ♥️ by [Tomomi Imura](https://twitter.com/girliemac)
"Techniques of Machine Learning" was created with ♥️ by [Jen Looper](https://twitter.com/jenlooper) and [Chris Noring](https://twitter.com/softchris)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,239 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "fa81d226c71d5af7a2cade31c1c92b88",
"translation_date": "2025-09-06T10:46:54+00:00",
"source_file": "2-Regression/1-Tools/README.md",
"language_code": "en"
}
-->
# Get started with Python and Scikit-learn for regression models
![Summary of regressions in a sketchnote](../../../../sketchnotes/ml-regression.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
> ### [This lesson is available in R!](../../../../2-Regression/1-Tools/solution/R/lesson_1.html)
## Introduction
In these four lessons, you will learn how to build regression models. We'll discuss their purpose shortly. But before diving in, ensure you have the necessary tools set up to begin!
In this lesson, you will:
- Set up your computer for local machine learning tasks.
- Work with Jupyter notebooks.
- Install and use Scikit-learn.
- Explore linear regression through a hands-on exercise.
## Installations and configurations
[![ML for beginners - Setup your tools ready to build Machine Learning models](https://img.youtube.com/vi/-DfeD2k2Kj0/0.jpg)](https://youtu.be/-DfeD2k2Kj0 "ML for beginners - Setup your tools ready to build Machine Learning models")
> 🎥 Click the image above for a short video on configuring your computer for ML.
1. **Install Python**. Ensure that [Python](https://www.python.org/downloads/) is installed on your computer. Python is essential for many data science and machine learning tasks. Most systems already have Python installed. You can also use [Python Coding Packs](https://code.visualstudio.com/learn/educators/installers?WT.mc_id=academic-77952-leestott) to simplify the setup process.
Some Python tasks require specific versions of the software. To manage this, it's helpful to work within a [virtual environment](https://docs.python.org/3/library/venv.html).
2. **Install Visual Studio Code**. Make sure Visual Studio Code is installed on your computer. Follow these instructions to [install Visual Studio Code](https://code.visualstudio.com/) for the basic setup. Since you'll use Python in Visual Studio Code during this course, you might want to review how to [configure Visual Studio Code](https://docs.microsoft.com/learn/modules/python-install-vscode?WT.mc_id=academic-77952-leestott) for Python development.
> Familiarize yourself with Python by exploring this collection of [Learn modules](https://docs.microsoft.com/users/jenlooper-2911/collections/mp1pagggd5qrq7?WT.mc_id=academic-77952-leestott)
>
> [![Setup Python with Visual Studio Code](https://img.youtube.com/vi/yyQM70vi7V8/0.jpg)](https://youtu.be/yyQM70vi7V8 "Setup Python with Visual Studio Code")
>
> 🎥 Click the image above for a video on using Python within VS Code.
3. **Install Scikit-learn** by following [these instructions](https://scikit-learn.org/stable/install.html). Since Python 3 is required, using a virtual environment is recommended. If you're installing this library on an M1 Mac, refer to the special instructions on the linked page.
4. **Install Jupyter Notebook**. You'll need to [install the Jupyter package](https://pypi.org/project/jupyter/).
## Your ML authoring environment
You'll use **notebooks** to develop Python code and create machine learning models. Notebooks are a popular tool for data scientists, identifiable by their `.ipynb` extension.
Notebooks provide an interactive environment where developers can code, add notes, and write documentation alongside their code—ideal for experimental or research-oriented projects.
[![ML for beginners - Set up Jupyter Notebooks to start building regression models](https://img.youtube.com/vi/7E-jC8FLA2E/0.jpg)](https://youtu.be/7E-jC8FLA2E "ML for beginners - Set up Jupyter Notebooks to start building regression models")
> 🎥 Click the image above for a short video on setting up Jupyter Notebooks.
### Exercise - work with a notebook
In this folder, you'll find the file _notebook.ipynb_.
1. Open _notebook.ipynb_ in Visual Studio Code.
A Jupyter server will start with Python 3+. Throughout the notebook you'll find code cells that can be `run`; execute a code block by clicking the play button icon.
2. Select the `md` icon and add some markdown with the text **# Welcome to your notebook**.
Next, add some Python code.
3. Type **print('hello notebook')** in the code block.
4. Click the arrow to run the code.
You should see the printed output:
```output
hello notebook
```
![VS Code with a notebook open](../../../../2-Regression/1-Tools/images/notebook.jpg)
You can intersperse your code with comments to document the notebook.
✅ Reflect on how a web developer's working environment differs from that of a data scientist.
## Up and running with Scikit-learn
Now that Python is set up locally and you're comfortable with Jupyter notebooks, let's get familiar with Scikit-learn (pronounced `sci` as in `science`). Scikit-learn offers an [extensive API](https://scikit-learn.org/stable/modules/classes.html#api-ref) for performing ML tasks.
According to its [website](https://scikit-learn.org/stable/getting_started.html), "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities."
In this course, you'll use Scikit-learn and other tools to build machine learning models for 'traditional machine learning' tasks. Neural networks and deep learning are excluded, as they are covered in our upcoming 'AI for Beginners' curriculum.
Scikit-learn simplifies model building and evaluation. It primarily focuses on numeric data and includes several ready-made datasets for learning. It also provides pre-built models for experimentation. Let's explore how to load prepackaged data and use a built-in estimator to create your first ML model with Scikit-learn.
## Exercise - your first Scikit-learn notebook
> This tutorial was inspired by the [linear regression example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) on Scikit-learn's website.
[![ML for beginners - Your First Linear Regression Project in Python](https://img.youtube.com/vi/2xkXL5EUpS0/0.jpg)](https://youtu.be/2xkXL5EUpS0 "ML for beginners - Your First Linear Regression Project in Python")
> 🎥 Click the image above for a short video on this exercise.
In the _notebook.ipynb_ file associated with this lesson, clear all cells by clicking the 'trash can' icon.
In this section, you'll work with a small diabetes dataset built into Scikit-learn for learning purposes. Imagine testing a treatment for diabetic patients. Machine learning models can help identify which patients might respond better to the treatment based on variable combinations. Even a basic regression model, when visualized, can reveal insights about variables that could guide clinical trials.
✅ There are various regression methods, and your choice depends on the question you're trying to answer. For example, if you want to predict someone's probable height based on their age, you'd use linear regression to find a **numeric value**. If you're determining whether a cuisine is vegan, you'd use logistic regression for a **category assignment**. Think about questions you could ask of data and which method would be most suitable.
Let's begin.
### Import libraries
For this task, we'll import the following libraries:
- **matplotlib**: A useful [graphing tool](https://matplotlib.org/) for creating visualizations.
- **numpy**: [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html) is a library for handling numeric data in Python.
- **sklearn**: The [Scikit-learn](https://scikit-learn.org/stable/user_guide.html) library.
Import the libraries you'll need for this task.
1. Add imports by typing the following code:
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection
```
Here, you're importing `matplotlib` and `numpy`, along with `datasets`, `linear_model`, and `model_selection` from `sklearn`. `model_selection` is used for splitting data into training and test sets.
### The diabetes dataset
The built-in [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) contains 442 samples of diabetes-related data with 10 feature variables, including:
- age: age in years
- bmi: body mass index
- bp: average blood pressure
- s1 tc: T-Cells (a type of white blood cell)
✅ This dataset includes 'sex' as a feature variable, which is significant in diabetes research. Many medical datasets use binary classifications like this. Consider how such categorizations might exclude certain populations from treatments.
Now, load the X and y data.
> 🎓 Remember, this is supervised learning, so we need a labeled 'y' target.
In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The argument `return_X_y=True` ensures that `X` is a data matrix and `y` is the regression target.
1. Add print commands to display the shape of the data matrix and its first element:
```python
X, y = datasets.load_diabetes(return_X_y=True)
print(X.shape)
print(X[0])
```
The response is a tuple. You're assigning the first two values of the tuple to `X` and `y`. Learn more [about tuples](https://wikipedia.org/wiki/Tuple).
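As a quick standalone illustration of tuple unpacking (with made-up values):

```python
# A tuple groups values; unpacking assigns each item to its own name
pair = (1, 2)
a, b = pair
print(a, b)  # 1 2
```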
You'll see the data has 442 items shaped into arrays of 10 elements:
```text
(442, 10)
[ 0.03807591 0.05068012 0.06169621 0.02187235 -0.0442235 -0.03482076
-0.04340085 -0.00259226 0.01990842 -0.01764613]
```
✅ Reflect on the relationship between the data and the regression target. Linear regression predicts relationships between feature X and target variable y. Can you find the [target](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) for the diabetes dataset in the documentation? What does this dataset demonstrate, given the target?
2. Select a portion of the dataset to plot by choosing the 3rd column. Use the `:` operator to select all rows, and then select the 3rd column using the index (2). Reshape the data into a 2D array for plotting using `reshape(n_rows, n_columns)`. If one parameter is -1, the corresponding dimension is calculated automatically.
```python
X = X[:, 2]
X = X.reshape((-1,1))
```
✅ Print the data at any time to check its shape.
3. With the data ready for plotting, split both the data (X) and the target (y) into training and test sets so the model can be evaluated on data it has never seen. Scikit-learn provides a simple way to do this; here, a third of the data is held out for testing.
```python
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33)
```
4. Train your model! Load the linear regression model and train it with your X and y training sets using `model.fit()`:
```python
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
```
`model.fit()` is a function you'll encounter in many ML libraries like TensorFlow.
5. Create a prediction using test data with the `predict()` function. This will help draw the line between data groups.
```python
y_pred = model.predict(X_test)
```
6. Finally, visualize the data using Matplotlib. Create a scatterplot of all the X and y test data, and use the prediction to draw the fitted regression line through them.
```python
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xlabel('Scaled BMIs')
plt.ylabel('Disease Progression')
plt.title('A Graph Plot Showing Diabetes Progression Against BMI')
plt.show()
```
![a scatterplot showing datapoints around diabetes](../../../../2-Regression/1-Tools/images/scatterplot.png)
✅ Think about what's happening here. A straight line is passing through many small data points, but what is its purpose? Can you see how this line can help predict where a new, unseen data point might align with the plot's y-axis? Try to describe the practical application of this model in your own words.
Congratulations, you've built your first linear regression model, made a prediction with it, and visualized it in a plot!
---
## 🚀Challenge
Plot a different variable from this dataset. Hint: modify this line: `X = X[:,2]`. Considering the target of this dataset, what insights can you gain about the progression of diabetes as a disease?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
In this tutorial, you worked with simple linear regression, rather than multiple linear regression. Take some time to read about the differences between these methods, or watch [this video](https://www.coursera.org/lecture/quantifying-relationships-regression-models/linear-vs-nonlinear-categorical-variables-ai2Ef).
Learn more about the concept of regression and reflect on the types of questions this technique can help answer. Consider taking [this tutorial](https://docs.microsoft.com/learn/modules/train-evaluate-regression-models?WT.mc_id=academic-77952-leestott) to deepen your understanding.
## Assignment
[A different dataset](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,27 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "74a5cf83e4ebc302afbcbc4f418afd0a",
"translation_date": "2025-09-06T10:47:21+00:00",
"source_file": "2-Regression/1-Tools/assignment.md",
"language_code": "en"
}
-->
# Regression with Scikit-learn
## Instructions
Explore the [Linnerud dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud) available in Scikit-learn. This dataset includes multiple [targets](https://scikit-learn.org/stable/datasets/toy_dataset.html#linnerrud-dataset): 'It contains three exercise (data) and three physiological (target) variables collected from twenty middle-aged men at a fitness club.'
In your own words, explain how to build a Regression model that visualizes the relationship between waist circumference and the number of sit-ups performed. Repeat this process for the other data points in the dataset.
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| ------------------------------ | ---------------------------------- | ----------------------------- | -------------------------- |
| Submit a descriptive paragraph | A well-crafted paragraph is provided | A few sentences are provided | No description is provided |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:47:26+00:00",
"source_file": "2-Regression/1-Tools/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,226 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "7c077988328ebfe33b24d07945f16eca",
"translation_date": "2025-09-06T10:47:29+00:00",
"source_file": "2-Regression/2-Data/README.md",
"language_code": "en"
}
-->
# Build a regression model using Scikit-learn: prepare and visualize data
![Data visualization infographic](../../../../2-Regression/2-Data/images/data-visualization.png)
Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
> ### [This lesson is available in R!](../../../../2-Regression/2-Data/solution/R/lesson_2.html)
## Introduction
Now that you have the tools needed to start building machine learning models with Scikit-learn, you're ready to begin asking questions of your data. When working with data and applying ML solutions, it's crucial to know how to ask the right questions to unlock the full potential of your dataset.
In this lesson, you will learn:
- How to prepare your data for model building.
- How to use Matplotlib for data visualization.
## Asking the right question of your data
The type of question you want answered will determine the ML algorithms you use. The quality of the answer you get will largely depend on the nature of your data.
Take a look at the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson. You can open this .csv file in VS Code. A quick glance reveals blanks, a mix of strings and numeric data, and a peculiar column called 'Package' with values like 'sacks', 'bins', and others. The data is, frankly, a bit messy.
[![ML for beginners - How to Analyze and Clean a Dataset](https://img.youtube.com/vi/5qGjczWTrDQ/0.jpg)](https://youtu.be/5qGjczWTrDQ "ML for beginners - How to Analyze and Clean a Dataset")
> 🎥 Click the image above for a short video on preparing the data for this lesson.
It's rare to receive a dataset that's completely ready for building a machine learning model. In this lesson, you'll learn how to prepare a raw dataset using standard Python libraries. You'll also explore techniques for visualizing the data.
## Case study: 'the pumpkin market'
In this folder, you'll find a .csv file in the root `data` folder called [US-pumpkins.csv](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv), which contains 1757 rows of data about the pumpkin market, grouped by city. This raw data was extracted from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) provided by the United States Department of Agriculture.
### Preparing data
This data is in the public domain and can be downloaded in separate files for each city from the USDA website. To simplify things, we've combined all the city data into one spreadsheet, so the data has already been partially prepared. Next, let's examine the data more closely.
### The pumpkin data - early conclusions
What do you notice about this data? As mentioned earlier, there's a mix of strings, numbers, blanks, and odd values that need to be understood.
What question can you ask of this data using a regression technique? For example: "Predict the price of a pumpkin for sale during a given month." Looking at the data again, you'll need to make some adjustments to structure it properly for this task.
## Exercise - analyze the pumpkin data
Let's use [Pandas](https://pandas.pydata.org/) (short for `Python Data Analysis`), a powerful tool for shaping data, to analyze and prepare the pumpkin data.
### First, check for missing dates
Start by checking for missing dates:
1. Convert the dates to a month format (these are US dates, so the format is `MM/DD/YYYY`).
2. Extract the month into a new column.
Open the _notebook.ipynb_ file in Visual Studio Code and import the spreadsheet into a new Pandas dataframe.
1. Use the `head()` function to view the first five rows.
```python
import pandas as pd
pumpkins = pd.read_csv('../data/US-pumpkins.csv')
pumpkins.head()
```
✅ What function would you use to view the last five rows?
1. Check for missing data in the current dataframe:
```python
pumpkins.isnull().sum()
```
There is missing data, but it might not affect the task at hand.
1. To simplify your dataframe, select only the columns you need using the `loc` function. This function extracts rows (first parameter) and columns (second parameter) from the original dataframe. The `:` expression below means "all rows."
```python
columns_to_select = ['Package', 'Low Price', 'High Price', 'Date']
pumpkins = pumpkins.loc[:, columns_to_select]
```
### Second, determine the average price of a pumpkin
Think about how to calculate the average price of a pumpkin in a given month. Which columns would you use for this task? Hint: you'll need three columns.
Solution: Take the average of the `Low Price` and `High Price` columns to populate a new `Price` column, and convert the `Date` column to show only the month. Fortunately, based on the earlier check, there is no missing data for dates or prices.
1. To calculate the average, add the following code:
```python
price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2
month = pd.DatetimeIndex(pumpkins['Date']).month
```
✅ Feel free to print any data you'd like to check using `print(month)`.
2. Copy your converted data into a new Pandas dataframe:
```python
new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'],'High Price': pumpkins['High Price'], 'Price': price})
```
Printing your dataframe will show a clean, organized dataset ready for building your regression model.
### But wait! There's something odd here
Looking at the `Package` column, pumpkins are sold in various configurations. Some are sold in '1 1/9 bushel' measures, others in '1/2 bushel' measures, some per pumpkin, some per pound, and some in large boxes of varying sizes.
> Pumpkins seem very hard to weigh consistently
Examining the original data, anything with `Unit of Sale` equal to 'EACH' or 'PER BIN' also has the `Package` type listed as per inch, per bin, or 'each'. Pumpkins are difficult to weigh consistently, so let's filter the data to include only pumpkins with the string 'bushel' in their `Package` column.
1. Add a filter at the top of the file, under the initial .csv import:
```python
pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]
```
If you print the data now, you'll see only the 415 rows containing pumpkins sold by the bushel.
### But wait! There's one more thing to do
Did you notice that the bushel amount varies per row? You'll need to normalize the pricing to show the price per bushel. Perform some calculations to standardize it.
1. Add these lines after the block creating the new_pumpkins dataframe:
```python
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9)
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2)
```
✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all quite complex! For simplicity, let's price by the bushel without converting to pounds. This study of bushels highlights the importance of understanding your data!
Now, you can analyze pricing per unit based on bushel measurements. If you print the data again, you'll see it's standardized.
✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: Smaller pumpkins are pricier than larger ones, likely because more of them fit into a bushel, leaving less unused space compared to one large hollow pumpkin.
## Visualization Strategies
A data scientist's role often involves demonstrating the quality and characteristics of the data they're working with. This is done by creating visualizations—plots, graphs, and charts—that reveal relationships and gaps that might otherwise be hard to identify.
[![ML for beginners - How to Visualize Data with Matplotlib](https://img.youtube.com/vi/SbUkxH6IJo0/0.jpg)](https://youtu.be/SbUkxH6IJo0 "ML for beginners - How to Visualize Data with Matplotlib")
> 🎥 Click the image above for a short video on visualizing the data for this lesson.
Visualizations can also help determine the most suitable machine learning technique for the data. For example, a scatterplot that follows a line suggests the data is a good candidate for linear regression.
One data visualization library that works well in Jupyter notebooks is [Matplotlib](https://matplotlib.org/) (introduced in the previous lesson).
> Gain more experience with data visualization in [these tutorials](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-77952-leestott).
## Exercise - experiment with Matplotlib
Try creating some basic plots to display the new dataframe you just created. What insights can a simple scatterplot provide?
1. Import Matplotlib at the top of the file, under the Pandas import:
```python
import matplotlib.pyplot as plt
```
1. Rerun the entire notebook to refresh.
1. At the bottom of the notebook, add a cell to plot the data as a scatterplot:
```python
price = new_pumpkins.Price
month = new_pumpkins.Month
plt.scatter(price, month)
plt.show()
```
![A scatterplot showing price to month relationship](../../../../2-Regression/2-Data/images/scatterplot.png)
Is this plot useful? Does anything about it surprise you?
It's not particularly useful, as it simply displays the data as a spread of points for each month.
### Make it useful
To create more meaningful charts, you often need to group the data. Let's try creating a bar chart that shows the average pumpkin price for each month.
1. Add a cell to create a grouped bar chart:
```python
new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Pumpkin Price")
```
![A bar chart showing price to month relationship](../../../../2-Regression/2-Data/images/barchart.png)
This visualization is more useful! It suggests that pumpkin prices peak in September and October. Does this match your expectations? Why or why not?
---
## 🚀Challenge
Explore the different types of visualizations offered by Matplotlib. Which types are most suitable for regression problems?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Investigate the various ways to visualize data. Make a list of available libraries and note which are best for specific tasks, such as 2D vs. 3D visualizations. What do you discover?
## Assignment
[Exploring visualization](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,23 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "4485a1ed4dd1b5647365e3d87456515d",
"translation_date": "2025-09-06T10:47:54+00:00",
"source_file": "2-Regression/2-Data/assignment.md",
"language_code": "en"
}
-->
# Exploring Visualizations
There are several different libraries available for data visualization. Use the Pumpkin data from this lesson to create some visualizations with matplotlib and seaborn in a sample notebook. Which libraries are easier to use?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A notebook is submitted with two explorations/visualizations | A notebook is submitted with one exploration/visualization | A notebook is not submitted |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:47:59+00:00",
"source_file": "2-Regression/2-Data/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,381 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "40e64f004f3cb50aa1d8661672d3cd92",
"translation_date": "2025-09-06T10:44:32+00:00",
"source_file": "2-Regression/3-Linear/README.md",
"language_code": "en"
}
-->
# Build a regression model using Scikit-learn: regression four ways
![Linear vs polynomial regression infographic](../../../../2-Regression/3-Linear/images/linear-polynomial.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
> ### [This lesson is available in R!](../../../../2-Regression/3-Linear/solution/R/lesson_3.html)
### Introduction
Up to this point, you've explored what regression is using sample data from the pumpkin pricing dataset, which will be used throughout this lesson. You've also visualized the data using Matplotlib.
Now, you're ready to delve deeper into regression for machine learning. While visualization helps you understand the data, the true power of machine learning lies in _training models_. Models are trained on historical data to automatically capture data dependencies, enabling predictions for new data the model hasn't seen before.
In this lesson, you'll learn more about two types of regression: _basic linear regression_ and _polynomial regression_, along with some of the mathematics behind these techniques. These models will help us predict pumpkin prices based on various input data.
[![ML for beginners - Understanding Linear Regression](https://img.youtube.com/vi/CRxFT8oTDMg/0.jpg)](https://youtu.be/CRxFT8oTDMg "ML for beginners - Understanding Linear Regression")
> 🎥 Click the image above for a short video overview of linear regression.
> Throughout this curriculum, we assume minimal math knowledge and aim to make it accessible for students from other fields. Look out for notes, 🧮 callouts, diagrams, and other tools to aid comprehension.
### Prerequisite
By now, you should be familiar with the structure of the pumpkin dataset we're analyzing. This lesson's _notebook.ipynb_ file contains the preloaded and pre-cleaned data. In the file, pumpkin prices are displayed per bushel in a new data frame. Ensure you can run these notebooks in Visual Studio Code kernels.
### Preparation
As a reminder, you're loading this data to answer specific questions:
- When is the best time to buy pumpkins?
- What price can I expect for a case of miniature pumpkins?
- Should I buy them in half-bushel baskets or 1 1/9 bushel boxes?
Let's continue exploring this data.
In the previous lesson, you created a Pandas data frame and populated it with part of the original dataset, standardizing the pricing by the bushel. However, this only provided about 400 data points, mostly for the fall months.
Take a look at the data preloaded in this lesson's accompanying notebook. The data is preloaded, and an initial scatterplot is charted to show month data. Perhaps we can uncover more details by cleaning the data further.
## A linear regression line
As you learned in Lesson 1, the goal of linear regression is to plot a line that:
- **Shows variable relationships**. Illustrates the relationship between variables.
- **Makes predictions**. Accurately predicts where a new data point would fall relative to the line.
A **Least-Squares Regression** is typically used to draw this type of line. The term 'least-squares' refers to squaring and summing the distances of all data points from the regression line. Ideally, this sum is as small as possible, minimizing errors or `least-squares`.
This approach models a line with the least cumulative distance from all data points. Squaring the terms ensures we're focused on magnitude rather than direction.
> **🧮 Show me the math**
>
> This line, called the _line of best fit_, can be expressed by [an equation](https://en.wikipedia.org/wiki/Simple_linear_regression):
>
> ```
> Y = a + bX
> ```
>
> `X` is the 'explanatory variable', and `Y` is the 'dependent variable'. The slope of the line is `b`, and `a` is the y-intercept, which represents the value of `Y` when `X = 0`.
>
>![calculate the slope](../../../../2-Regression/3-Linear/images/slope.png)
>
> First, calculate the slope `b`. Infographic by [Jen Looper](https://twitter.com/jenlooper)
>
> For example, in our pumpkin dataset's original question: "predict the price of a pumpkin per bushel by month," `X` would represent the price, and `Y` would represent the month of sale.
>
>![complete the equation](../../../../2-Regression/3-Linear/images/calculation.png)
>
> Calculate the value of Y. If you're paying around $4, it must be April! Infographic by [Jen Looper](https://twitter.com/jenlooper)
>
> The math behind the line calculation determines the slope, which also depends on the intercept, or where `Y` is located when `X = 0`.
>
> You can explore the calculation method for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) website. Also, check out [this Least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to see how the numbers affect the line.
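To make that calculation concrete, here is a minimal NumPy sketch, using made-up data, that computes the slope `b` and intercept `a` from the standard closed-form least-squares formulas:

```python
import numpy as np

# Hypothetical (X, Y) pairs, for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Slope: b = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
# Intercept: a = mean(Y) - b * mean(X)
a = Y.mean() - b * X.mean()

print(f'Y = {a:.2f} + {b:.2f}X')
```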
## Correlation
Another important term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatterplot, you can quickly visualize this coefficient. A plot with data points forming a neat line has high correlation, while a plot with scattered data points has low correlation.
A good linear regression model will have a high (closer to 1 than 0) Correlation Coefficient using the Least-Squares Regression method with a regression line.
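As a quick illustration with made-up numbers, NumPy can compute this coefficient directly:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns a 2x2 correlation matrix; [0, 1] is the x-to-y coefficient
r = np.corrcoef(x, y)[0, 1]
print(r)  # close to 1: a strong positive linear relationship
```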
✅ Run the notebook accompanying this lesson and examine the Month-to-Price scatterplot. Does the data associating Month to Price for pumpkin sales appear to have high or low correlation based on your visual interpretation of the scatterplot? Does this change if you use a more detailed measure, such as *day of the year* (i.e., the number of days since the start of the year)?
In the code below, we assume the data has been cleaned and a data frame called `new_pumpkins` has been obtained, similar to the following:
ID | Month | DayOfYear | Variety | City | Package | Low Price | High Price | Price
---|-------|-----------|---------|------|---------|-----------|------------|-------
70 | 9 | 267 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 15.0 | 15.0 | 13.636364
71 | 9 | 267 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 18.0 | 18.0 | 16.363636
72 | 10 | 274 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 18.0 | 18.0 | 16.363636
73 | 10 | 274 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 17.0 | 17.0 | 15.454545
74 | 10 | 281 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 15.0 | 15.0 | 13.636364
> The code to clean the data is available in [`notebook.ipynb`](../../../../2-Regression/3-Linear/notebook.ipynb). We performed the same cleaning steps as in the previous lesson and calculated the `DayOfYear` column using the following expression:
```python
from datetime import datetime
day_of_year = pd.to_datetime(pumpkins['Date']).apply(lambda dt: (dt-datetime(dt.year,1,1)).days)
```
Now that you understand the math behind linear regression, let's create a regression model to predict which pumpkin package offers the best prices. This information could be useful for someone buying pumpkins for a holiday pumpkin patch to optimize their purchases.
## Looking for Correlation
[![ML for beginners - Looking for Correlation: The Key to Linear Regression](https://img.youtube.com/vi/uoRq-lW2eQo/0.jpg)](https://youtu.be/uoRq-lW2eQo "ML for beginners - Looking for Correlation: The Key to Linear Regression")
> 🎥 Click the image above for a short video overview of correlation.
From the previous lesson, you've likely observed that the average price for different months looks like this:
<img alt="Average price by month" src="../2-Data/images/barchart.png" width="50%"/>
This suggests there might be some correlation, and we can attempt to train a linear regression model to predict the relationship between `Month` and `Price`, or between `DayOfYear` and `Price`. Here's the scatterplot showing the latter relationship:
<img alt="Scatter plot of Price vs. Day of Year" src="images/scatter-dayofyear.png" width="50%" />
Let's check for correlation using the `corr` function:
```python
print(new_pumpkins['Month'].corr(new_pumpkins['Price']))
print(new_pumpkins['DayOfYear'].corr(new_pumpkins['Price']))
```
The correlation appears to be quite small: -0.15 for `Month` and -0.17 for `DayOfYear`. However, there might be another significant relationship. It seems there are distinct price clusters corresponding to different pumpkin varieties. To confirm this hypothesis, let's plot each pumpkin category using a different color. By passing an `ax` parameter to the `scatter` plotting function, we can plot all points on the same graph:
```python
ax = None
colors = ['red', 'blue', 'green', 'yellow']
# Plot each variety in its own color on a shared set of axes
for i, var in enumerate(new_pumpkins['Variety'].unique()):
    df = new_pumpkins[new_pumpkins['Variety'] == var]
    ax = df.plot.scatter('DayOfYear', 'Price', ax=ax, c=colors[i], label=var)
```
<img alt="Scatter plot of Price vs. Day of Year" src="images/scatter-dayofyear-color.png" width="50%" />
Our investigation suggests that variety has a greater impact on price than the actual selling date. This can be visualized with a bar graph:
```python
new_pumpkins.groupby('Variety')['Price'].mean().plot(kind='bar')
```
<img alt="Bar graph of price vs variety" src="images/price-by-variety.png" width="50%" />
Let's focus on one pumpkin variety, the 'pie type,' and examine the effect of the date on price:
```python
pie_pumpkins = new_pumpkins[new_pumpkins['Variety']=='PIE TYPE']
pie_pumpkins.plot.scatter('DayOfYear','Price')
```
<img alt="Scatter plot of Price vs. Day of Year" src="images/pie-pumpkins-scatter.png" width="50%" />
If we calculate the correlation between `Price` and `DayOfYear` using the `corr` function, we get approximately `-0.27`, indicating that training a predictive model is worthwhile.
> Before training a linear regression model, it's crucial to ensure the data is clean. Linear regression doesn't work well with missing values, so it's a good idea to remove any empty cells:
```python
pie_pumpkins.dropna(inplace=True)
pie_pumpkins.info()
```
Another approach would be to fill empty values with the mean values from the corresponding column.
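For example, a minimal sketch of that alternative, assuming the `pie_pumpkins` dataframe from the steps above:

```python
# Instead of dropping rows, fill missing prices with the column mean
pie_pumpkins['Price'] = pie_pumpkins['Price'].fillna(pie_pumpkins['Price'].mean())
```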
## Simple Linear Regression
[![ML for beginners - Linear and Polynomial Regression using Scikit-learn](https://img.youtube.com/vi/e4c_UP2fSjg/0.jpg)](https://youtu.be/e4c_UP2fSjg "ML for beginners - Linear and Polynomial Regression using Scikit-learn")
> 🎥 Click the image above for a short video overview of linear and polynomial regression.
To train our Linear Regression model, we'll use the **Scikit-learn** library.
```python
import numpy as np                   # needed later for np.sqrt
import matplotlib.pyplot as plt     # needed later to plot the results
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
```
We start by separating input values (features) and the expected output (label) into separate numpy arrays:
```python
X = pie_pumpkins['DayOfYear'].to_numpy().reshape(-1,1)
y = pie_pumpkins['Price']
```
> Note that we had to perform `reshape` on the input data for the Linear Regression package to interpret it correctly. Linear Regression expects a 2D-array as input, where each row corresponds to a vector of input features. Since we have only one input, we need an array with shape N×1, where N is the dataset size.
Next, we split the data into training and testing datasets to validate our model after training:
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
Finally, training the Linear Regression model requires just two lines of code. We define the `LinearRegression` object and fit it to our data using the `fit` method:
```python
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)
```
The `LinearRegression` object after fitting contains all the regression coefficients, accessible via the `.coef_` property. In our case, there's just one coefficient, which should be around `-0.017`. This indicates that prices tend to drop slightly over time, by about 2 cents per day. The intersection point of the regression with the Y-axis can be accessed using `lin_reg.intercept_`, which will be around `21` in our case, representing the price at the start of the year.
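For instance, assuming the fitted `lin_reg` from above, you can inspect both values directly:

```python
print(lin_reg.coef_)       # e.g. array([-0.017...]): the price drops slightly each day
print(lin_reg.intercept_)  # e.g. ~21: the modeled price at the start of the year
```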
To evaluate the model's accuracy, we can predict prices on the test dataset and measure how close the predictions are to the expected values. A common metric is the mean squared error (MSE), the mean of all squared differences between expected and predicted values; here we take its square root so the error is expressed in the same units as the price.
```python
pred = lin_reg.predict(X_test)

# Square root of the MSE, reported in price units and as a percentage of the mean prediction
mse = np.sqrt(mean_squared_error(y_test,pred))
print(f'Mean error: {mse:3.3} ({mse/np.mean(pred)*100:3.3}%)')
```
Our error seems to be around 2 points, which is ~17%. Not great. Another way to evaluate model quality is the **coefficient of determination**, which can be calculated like this:
```python
score = lin_reg.score(X_train,y_train)
print('Model determination: ', score)
```
If the value is 0, it means the model doesn't consider the input data and acts as the *worst linear predictor*, which is simply the mean value of the result. A value of 1 means we can perfectly predict all expected outputs. In our case, the coefficient is around 0.06, which is quite low.
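To see why a value of 0 corresponds to "always predict the mean," here is a small sketch scoring that baseline against the training labels (assuming the `y_train` array from above):

```python
import numpy as np
from sklearn.metrics import r2_score

# A constant predictor that always outputs the training mean scores exactly 0
baseline_pred = np.full(len(y_train), np.mean(y_train))
print(r2_score(y_train, baseline_pred))  # 0.0
```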
We can also plot the test data along with the regression line to better understand how regression works in our case:
```python
plt.scatter(X_test,y_test)
plt.plot(X_test,pred)
```
<img alt="Linear regression" src="images/linear-results.png" width="50%" />
## Polynomial Regression
Another type of Linear Regression is Polynomial Regression. While sometimes there is a linear relationship between variables—like the larger the pumpkin's volume, the higher the price—other times these relationships can't be represented as a plane or straight line.
✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could benefit from Polynomial Regression.
Take another look at the relationship between Date and Price. Does this scatterplot seem like it should necessarily be analyzed with a straight line? Can't prices fluctuate? In this case, you can try polynomial regression.
✅ Polynomials are mathematical expressions that may include one or more variables and coefficients.
Polynomial regression creates a curved line to better fit nonlinear data. In our case, if we include a squared `DayOfYear` variable in the input data, we should be able to fit our data with a parabolic curve, which will have a minimum at a certain point in the year.
Scikit-learn provides a useful [pipeline API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=pipeline#sklearn.pipeline.make_pipeline) to combine different steps of data processing. A **pipeline** is a chain of **estimators**. In our case, we will create a pipeline that first adds polynomial features to our model and then trains the regression:
```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(PolynomialFeatures(2), LinearRegression())
pipeline.fit(X_train,y_train)
```
Using `PolynomialFeatures(2)` means we will include all second-degree polynomials from the input data. In our case, this will just mean `DayOfYear`<sup>2</sup>, but with two input variables X and Y, this would add X<sup>2</sup>, XY, and Y<sup>2</sup>. We can also use higher-degree polynomials if needed.
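A tiny standalone demonstration of that expansion, using two hypothetical input values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)
# One sample with two features, x=2 and y=3
print(poly.fit_transform(np.array([[2.0, 3.0]])))
# [[1. 2. 3. 4. 6. 9.]]  ->  bias, x, y, x^2, xy, y^2
```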
Pipelines can be used in the same way as the original `LinearRegression` object, meaning we can `fit` the pipeline and then use `predict` to get prediction results. Below is the graph showing test data and the approximation curve:
<img alt="Polynomial regression" src="images/poly-results.png" width="50%" />
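A minimal sketch of how such a graph could be produced, assuming the `pipeline`, `X_test`, and `y_test` from the steps above:

```python
# Predict with the fitted pipeline and draw the curve in day-of-year order
pred = pipeline.predict(X_test)
order = np.argsort(X_test[:, 0])
plt.scatter(X_test, y_test)
plt.plot(X_test[order], pred[order], color='red')
plt.show()
```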
Using Polynomial Regression, we can achieve slightly lower MSE and higher determination, but not significantly. We need to consider other features!
> Notice that the lowest pumpkin prices occur around Halloween. Why do you think that is?
🎃 Congratulations, you've just created a model that can help predict the price of pie pumpkins. You could repeat the same process for all pumpkin types, but that would be tedious. Let's now learn how to include pumpkin variety in our model!
## Categorical Features
Ideally, we want to predict prices for different pumpkin varieties using the same model. However, the `Variety` column is different from columns like `Month` because it contains non-numeric values. Such columns are called **categorical**.
[![ML for beginners - Categorical Feature Predictions with Linear Regression](https://img.youtube.com/vi/DYGliioIAE0/0.jpg)](https://youtu.be/DYGliioIAE0 "ML for beginners - Categorical Feature Predictions with Linear Regression")
> 🎥 Click the image above for a short video overview of using categorical features.
Here's how the average price depends on variety:
<img alt="Average price by variety" src="images/price-by-variety.png" width="50%" />
To include variety in our model, we first need to convert it to numeric form, or **encode** it. There are several ways to do this:
* Simple **numeric encoding** creates a table of different varieties and replaces the variety name with an index from that table. This isn't ideal for linear regression because the model treats the numeric index as a value and multiplies it by a coefficient. In our case, the relationship between the index number and the price is clearly non-linear, even if we carefully order the indices.
* **One-hot encoding** replaces the `Variety` column with multiple columns—one for each variety. Each column contains `1` if the corresponding row belongs to that variety, and `0` otherwise. This means linear regression will have one coefficient for each variety, representing the "starting price" (or "additional price") for that variety.
The code below demonstrates how to one-hot encode a variety:
```python
pd.get_dummies(new_pumpkins['Variety'])
```
ID | FAIRYTALE | MINIATURE | MIXED HEIRLOOM VARIETIES | PIE TYPE
----|-----------|-----------|--------------------------|----------
70 | 0 | 0 | 0 | 1
71 | 0 | 0 | 0 | 1
... | ... | ... | ... | ...
1738 | 0 | 1 | 0 | 0
1739 | 0 | 1 | 0 | 0
1740 | 0 | 1 | 0 | 0
1741 | 0 | 1 | 0 | 0
1742 | 0 | 1 | 0 | 0
To train linear regression using one-hot encoded variety as input, we just need to initialize `X` and `y` data correctly:
```python
X = pd.get_dummies(new_pumpkins['Variety'])
y = new_pumpkins['Price']
```
The rest of the code is the same as what we used earlier to train Linear Regression. If you try it, you'll see that the mean squared error is about the same, but the coefficient of determination improves significantly (~77%). To make even more accurate predictions, we can include additional categorical features and numeric features like `Month` or `DayOfYear`. To combine all features into one large array, we can use `join`:
```python
X = pd.get_dummies(new_pumpkins['Variety']) \
        .join(new_pumpkins['Month']) \
        .join(pd.get_dummies(new_pumpkins['City'])) \
        .join(pd.get_dummies(new_pumpkins['Package']))
y = new_pumpkins['Price']
```
Here, we also include `City` and `Package` type, which results in an MSE of 2.84 (10%) and a determination coefficient of 0.94!
## Putting it all together
To create the best model, we can combine the one-hot encoded categorical data and numeric data from the previous example with Polynomial Regression. Here's the complete code for your reference:
```python
# set up training data
X = pd.get_dummies(new_pumpkins['Variety']) \
        .join(new_pumpkins['Month']) \
        .join(pd.get_dummies(new_pumpkins['City'])) \
        .join(pd.get_dummies(new_pumpkins['Package']))
y = new_pumpkins['Price']
# make train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# setup and train the pipeline
pipeline = make_pipeline(PolynomialFeatures(2), LinearRegression())
pipeline.fit(X_train,y_train)
# predict results for test data
pred = pipeline.predict(X_test)
# calculate MSE and determination
mse = np.sqrt(mean_squared_error(y_test,pred))
print(f'Mean error: {mse:3.3} ({mse/np.mean(pred)*100:3.3}%)')
score = pipeline.score(X_train,y_train)
print('Model determination: ', score)
```
This should give us the best determination coefficient of nearly 97% and an MSE of 2.23 (~8% prediction error).
| Model | MSE | Determination |
|-------|-----|---------------|
| `DayOfYear` Linear | 2.77 (17.2%) | 0.07 |
| `DayOfYear` Polynomial | 2.73 (17.0%) | 0.08 |
| `Variety` Linear | 5.24 (19.7%) | 0.77 |
| All features Linear | 2.84 (10.5%) | 0.94 |
| All features Polynomial | 2.23 (8.25%) | 0.97 |
🏆 Well done! You've created four Regression models in one lesson and improved the model quality to 97%. In the final section on Regression, you'll learn about Logistic Regression to classify categories.
---
## 🚀Challenge
Experiment with different variables in this notebook to see how correlation affects model accuracy.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
In this lesson, we explored Linear Regression. There are other important types of Regression. Read about Stepwise, Ridge, Lasso, and ElasticNet techniques. A great resource to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).
## Assignment
[Build a Model](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "cc471fa89c293bc735dd3a9a0fb79b1b",
"translation_date": "2025-09-06T10:45:18+00:00",
"source_file": "2-Regression/3-Linear/assignment.md",
"language_code": "en"
}
-->
# Create a Regression Model
## Instructions
In this lesson, you learned how to create a model using both Linear and Polynomial Regression. Using this knowledge, find a dataset or use one of Scikit-learn's built-in datasets to develop a new model. In your notebook, explain why you chose the specific technique and demonstrate the accuracy of your model. If the model is not accurate, provide an explanation for the result.
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | ------------------------------------------------------------ | -------------------------- | ------------------------------- |
| | provides a complete notebook with a well-documented solution | the solution is incomplete | the solution contains errors or issues |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:45:23+00:00",
"source_file": "2-Regression/3-Linear/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,418 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "abf86d845c84330bce205a46b382ec88",
"translation_date": "2025-09-06T10:46:08+00:00",
"source_file": "2-Regression/4-Logistic/README.md",
"language_code": "en"
}
-->
# Logistic regression to predict categories
![Logistic vs. linear regression infographic](../../../../2-Regression/4-Logistic/images/linear-vs-logistic.png)
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
> ### [This lesson is available in R!](../../../../2-Regression/4-Logistic/solution/R/lesson_4.html)
## Introduction
In this final lesson on Regression, one of the foundational _classic_ ML techniques, we will explore Logistic Regression. This technique is used to identify patterns for predicting binary categories. For example: Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?
In this lesson, you will learn:
- A new library for data visualization
- Techniques for logistic regression
✅ Deepen your understanding of this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-77952-leestott)
## Prerequisite
Having worked with the pumpkin data, we now know that there is one binary category we can focus on: `Color`.
Let's build a logistic regression model to predict, based on certain variables, _what color a given pumpkin is likely to be_ (orange 🎃 or white 👻).
> Why are we discussing binary classification in a lesson about regression? For simplicity, as logistic regression is [technically a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), though it is linear-based. You'll learn about other classification methods in the next lesson group.
## Define the question
For our purposes, we will frame this as a binary question: 'White' or 'Not White'. While our dataset includes a 'striped' category, there are very few instances of it, so we will exclude it. It also disappears when we remove null values from the dataset.
> 🎃 Fun fact: White pumpkins are sometimes called 'ghost' pumpkins. They're not as easy to carve as orange ones, so they're less popular, but they look pretty cool! We could also reframe our question as: 'Ghost' or 'Not Ghost'. 👻
## About logistic regression
Logistic regression differs from linear regression, which you learned about earlier, in several key ways.
[![ML for beginners - Understanding Logistic Regression for Machine Learning Classification](https://img.youtube.com/vi/KpeCT6nEpBY/0.jpg)](https://youtu.be/KpeCT6nEpBY "ML for beginners - Understanding Logistic Regression for Machine Learning Classification")
> 🎥 Click the image above for a short video overview of logistic regression.
### Binary classification
Logistic regression doesn't provide the same capabilities as linear regression. The former predicts binary categories ("white or not white"), while the latter predicts continuous values, such as estimating _how much the price of a pumpkin will increase_ based on its origin and harvest time.
![Pumpkin classification Model](../../../../2-Regression/4-Logistic/images/pumpkin-classifier.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
### Other classifications
There are other types of logistic regression, such as multinomial and ordinal:
- **Multinomial**, which involves more than two categories, e.g., "Orange, White, and Striped."
- **Ordinal**, which involves ordered categories, useful for logically ranked outcomes, like pumpkin sizes (mini, small, medium, large, extra-large, extra-extra-large).
![Multinomial vs ordinal regression](../../../../2-Regression/4-Logistic/images/multinomial-vs-ordinal.png)
### Variables DO NOT have to correlate
Unlike linear regression, which works better with highly correlated variables, logistic regression does not require the variables to be strongly correlated. This makes it suitable for our dataset, which has relatively weak correlations.
### You need a lot of clean data
Logistic regression performs better with larger datasets. Our small dataset is not ideal for this task, so keep that in mind.
[![ML for beginners - Data Analysis and Preparation for Logistic Regression](https://img.youtube.com/vi/B2X4H9vcXTs/0.jpg)](https://youtu.be/B2X4H9vcXTs "ML for beginners - Data Analysis and Preparation for Logistic Regression")
> 🎥 Click the image above for a short video overview of preparing data for logistic regression.
✅ Consider what types of data are well-suited for logistic regression.
## Exercise - tidy the data
First, clean the data by removing null values and selecting specific columns:
1. Add the following code:
```python
columns_to_select = ['City Name','Package','Variety', 'Origin','Item Size', 'Color']
pumpkins = full_pumpkins.loc[:, columns_to_select]
pumpkins.dropna(inplace=True)
```
You can always preview your new dataframe:
```python
pumpkins.info()
```
### Visualization - categorical plot
By now, you've loaded the [starter notebook](../../../../2-Regression/4-Logistic/notebook.ipynb) with pumpkin data and cleaned it to retain a dataset with a few variables, including `Color`. Let's visualize the dataframe in the notebook using a new library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on Matplotlib (used earlier).
Seaborn provides interesting ways to visualize data. For example, you can compare distributions of `Variety` and `Color` using a categorical plot.
1. Create a categorical plot using the `catplot` function, specifying a color mapping for each pumpkin category (orange or white):
```python
import seaborn as sns
palette = {
'ORANGE': 'orange',
'WHITE': 'wheat',
}
sns.catplot(
data=pumpkins, y="Variety", hue="Color", kind="count",
palette=palette,
)
```
![A grid of visualized data](../../../../2-Regression/4-Logistic/images/pumpkins_catplot_1.png)
By observing the data, you can see how `Color` relates to `Variety`.
✅ Based on this categorical plot, what interesting patterns or questions come to mind?
### Data pre-processing: feature and label encoding
The pumpkins dataset contains string values in all columns. While categorical data is intuitive for humans, machines work better with numerical data. Encoding is a crucial step in pre-processing, as it converts categorical data into numerical data without losing information. Good encoding leads to better models.
For feature encoding, there are two main types:
1. **Ordinal encoder**: Suitable for ordinal variables, which have a logical order (e.g., `Item Size`). It maps each category to a number based on its order.
```python
from sklearn.preprocessing import OrdinalEncoder
item_size_categories = [['sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo']]
ordinal_features = ['Item Size']
ordinal_encoder = OrdinalEncoder(categories=item_size_categories)
```
2. **Categorical encoder**: Suitable for nominal variables, which lack a logical order (e.g., all features except `Item Size`). This uses one-hot encoding, where each category is represented by a binary column.
```python
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['City Name', 'Package', 'Variety', 'Origin']
categorical_encoder = OneHotEncoder(sparse_output=False)
```
Then, `ColumnTransformer` combines multiple encoders into a single step and applies them to the appropriate columns.
```python
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers=[
('ord', ordinal_encoder, ordinal_features),
('cat', categorical_encoder, categorical_features)
])
ct.set_output(transform='pandas')
encoded_features = ct.fit_transform(pumpkins)
```
For label encoding, we use Scikit-learn's `LabelEncoder` class to normalize labels to values between 0 and n_classes-1 (here, 0 and 1).
```python
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
encoded_label = label_encoder.fit_transform(pumpkins['Color'])
```
After encoding the features and label, merge them into a new dataframe `encoded_pumpkins`.
```python
encoded_pumpkins = encoded_features.assign(Color=encoded_label)
```
✅ What are the benefits of using an ordinal encoder for the `Item Size` column?
### Analyze relationships between variables
With pre-processed data, analyze relationships between features and the label to assess how well the model might predict the label. Visualization is the best way to do this. Use Seaborn's `catplot` to explore relationships between `Item Size`, `Variety`, and `Color`. Use the encoded `Item Size` column and the unencoded `Variety` column for better visualization.
```python
palette = {
'ORANGE': 'orange',
'WHITE': 'wheat',
}
pumpkins['Item Size'] = encoded_pumpkins['ord__Item Size']
g = sns.catplot(
data=pumpkins,
x="Item Size", y="Color", row='Variety',
kind="box", orient="h",
sharex=False, margin_titles=True,
height=1.8, aspect=4, palette=palette,
)
g.set(xlabel="Item Size", ylabel="").set(xlim=(0,6))
g.set_titles(row_template="{row_name}")
```
![A catplot of visualized data](../../../../2-Regression/4-Logistic/images/pumpkins_catplot_2.png)
### Use a swarm plot
Since `Color` is a binary category (White or Not), it requires a [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) for visualization. Seaborn offers various ways to visualize relationships between variables.
1. Try a 'swarm' plot to show the distribution of values:
```python
palette = {
0: 'orange',
1: 'wheat'
}
sns.swarmplot(x="Color", y="ord__Item Size", data=encoded_pumpkins, palette=palette)
```
![A swarm of visualized data](../../../../2-Regression/4-Logistic/images/swarm_2.png)
**Note**: The code above may generate a warning because Seaborn struggles to represent a large number of data points in a swarm plot. You can reduce marker size using the 'size' parameter, but this may affect readability.
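If you hit that warning, a hedged variation of the call above that shrinks the markers might look like this (the `size` value of 4 is only an illustration):

```python
# Smaller markers ease the overlap warning, at some cost to readability
sns.swarmplot(x="Color", y="ord__Item Size", data=encoded_pumpkins,
              palette=palette, size=4)
```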
> **🧮 Show Me The Math**
>
> Logistic regression relies on 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A sigmoid function maps values to a range between 0 and 1, forming an 'S'-shaped curve (logistic curve). Its formula is:
>
> ![logistic function](../../../../2-Regression/4-Logistic/images/sigmoid.png)
>
> The midpoint of the sigmoid is at x=0, L is the curve's maximum value, and k determines steepness. If the function's output exceeds 0.5, the label is classified as '1'; otherwise, it's classified as '0'.
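> As a minimal sketch of that function in code (the parameter names `L`, `k`, and `x0` mirror the formula above; nothing here comes from the lesson notebook):
>
> ```python
> import numpy as np
>
> def sigmoid(x, L=1.0, k=1.0, x0=0.0):
>     # General logistic function: L is the maximum, k the steepness, x0 the midpoint
>     return L / (1 + np.exp(-k * (x - x0)))
>
> print(sigmoid(0.0))          # 0.5, the usual decision threshold
> print(sigmoid(2.0) > 0.5)    # True -> classified as '1'
> print(sigmoid(-2.0) > 0.5)   # False -> classified as '0'
> ```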
## Build your model
Building a binary classification model is straightforward with Scikit-learn.
[![ML for beginners - Logistic Regression for classification of data](https://img.youtube.com/vi/MmZS2otPrQ8/0.jpg)](https://youtu.be/MmZS2otPrQ8 "ML for beginners - Logistic Regression for classification of data")
> 🎥 Click the image above for a short video overview of building a logistic regression model.
1. Select the variables for your classification model and split the data into training and test sets using `train_test_split()`:
```python
from sklearn.model_selection import train_test_split
X = encoded_pumpkins[encoded_pumpkins.columns.difference(['Color'])]
y = encoded_pumpkins['Color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
2. Train your model using `fit()` with the training data, and print the results:
```python
from sklearn.metrics import f1_score, classification_report
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('F1-score: ', f1_score(y_test, predictions))
```
Check your model's performance. It's not bad, considering the dataset has only about 1,000 rows:
```output
precision recall f1-score support
0 0.94 0.98 0.96 166
1 0.85 0.67 0.75 33
accuracy 0.92 199
macro avg 0.89 0.82 0.85 199
weighted avg 0.92 0.92 0.92 199
Predicted labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 1 0 0 0 0 0 0 0 0 1 1]
F1-score: 0.7457627118644068
```
## Better comprehension via a confusion matrix
While you can evaluate your model using [classification report terms](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report), a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) can provide a clearer picture of its performance.
> 🎓 A '[confusion matrix](https://wikipedia.org/wiki/Confusion_matrix)' (or 'error matrix') is a table that compares true vs. false positives and negatives, helping gauge prediction accuracy.
1. Use `confusion_matrix()` to generate the matrix:
```python
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
```
View your model's confusion matrix:
```output
array([[162, 4],
[ 11, 22]])
```
In Scikit-learn, confusion matrix rows (axis 0) represent actual labels, while columns (axis 1) represent predicted labels.
| | 0 | 1 |
| :---: | :---: | :---: |
| 0 | TN | FP |
| 1 | FN | TP |
Here's what the matrix means:
- **True Negative (TN)**: The model predicts 'not white,' and the pumpkin is actually 'not white.'
- **False Negative (FN)**: The model predicts 'not white,' but the pumpkin is actually 'white.'
- **False Positive (FP)**: The model predicts 'white,' but the pumpkin is actually 'not white.'
- **True Positive (TP)**: The model predicts 'white,' and the pumpkin is actually 'white.'
Ideally, you want more true positives and true negatives, and fewer false positives and false negatives, indicating better model performance.
How does the confusion matrix relate to precision and recall? Remember, the classification report printed above showed precision (0.85) and recall (0.67).
Precision = tp / (tp + fp) = 22 / (22 + 4) = 0.8461538461538461
Recall = tp / (tp + fn) = 22 / (22 + 11) = 0.6666666666666666
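To double-check these numbers programmatically, here is a small sketch that unpacks the matrix and recomputes the metrics, assuming the `y_test` and `predictions` variables from the model you trained above:

```python
from sklearn.metrics import confusion_matrix

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

precision = tp / (tp + fp)  # 22 / 26 ≈ 0.846
recall = tp / (tp + fn)     # 22 / 33 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)
print(f'precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}')
```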
✅ Q: According to the confusion matrix, how did the model do?
A: Not bad; there are a good number of true negatives but also a few false negatives.
Let's revisit the terms we saw earlier with the help of the confusion matrix's mapping of TP/TN and FP/FN:
🎓 Precision: TP/(TP + FP)
The fraction of relevant instances among the retrieved instances (e.g., which labels were well-labeled).
🎓 Recall: TP/(TP + FN)
The fraction of relevant instances that were retrieved, whether well-labeled or not.
🎓 f1-score: (2 * precision * recall)/(precision + recall)
The harmonic mean of precision and recall, with the best being 1 and the worst being 0.
🎓 Support:
The number of occurrences of each label retrieved.
🎓 Accuracy: (TP + TN)/(TP + TN + FP + FN)
The percentage of labels predicted accurately for a sample.
🎓 Macro Avg:
The calculation of the unweighted mean metrics for each label, not taking label imbalance into account.
🎓 Weighted Avg:
The calculation of the mean metrics for each label, taking label imbalance into account by weighting them by their support (the number of true instances for each label).
✅ Can you think which metric you should watch if you want your model to reduce the number of false negatives?
## Visualize the ROC curve of this model
[![ML for beginners - Analyzing Logistic Regression Performance with ROC Curves](https://img.youtube.com/vi/GApO575jTA0/0.jpg)](https://youtu.be/GApO575jTA0 "ML for beginners - Analyzing Logistic Regression Performance with ROC Curves")
> 🎥 Click the image above for a short video overview of ROC curves
Let's do one more visualization to see the so-called 'ROC' curve:
```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
y_scores = model.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 6))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
```
Using Matplotlib, plot the model's [Receiver Operating Characteristic](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html?highlight=roc) or ROC curve. ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. "ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis." Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly:
![ROC](../../../../2-Regression/4-Logistic/images/ROC_2.png)
Finally, use Scikit-learn's [`roc_auc_score` API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html?highlight=roc_auc#sklearn.metrics.roc_auc_score) to compute the actual 'Area Under the Curve' (AUC):
```python
auc = roc_auc_score(y_test, y_scores[:,1])
print(auc)
```
The result is `0.9749908725812341`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is _pretty good_.
In future lessons on classification, you will learn how to iterate to improve your model's scores. But for now, congratulations! You've completed these regression lessons!
---
## 🚀Challenge
There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about which tasks are better suited to each of the regression types we have studied so far. Which would work best?
## Assignment
[Retrying this regression](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "8af40209a41494068c1f42b14c0b450d",
"translation_date": "2025-09-06T10:46:47+00:00",
"source_file": "2-Regression/4-Logistic/assignment.md",
"language_code": "en"
}
-->
# Retrying some Regression
## Instructions
In this lesson, you worked with a subset of the pumpkin data. Now, return to the original dataset and use all of it—cleaned and standardized—to build a Logistic Regression model.
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | ----------------------------------------------------------------------- | ------------------------------------------------------------ | ----------------------------------------------------------- |
| | A notebook is provided with a well-explained and high-performing model | A notebook is provided with a model that performs adequately | A notebook is provided with a poorly performing model or none |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:46:51+00:00",
"source_file": "2-Regression/4-Logistic/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,54 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "508582278dbb8edd2a8a80ac96ef416c",
"translation_date": "2025-09-06T10:44:22+00:00",
"source_file": "2-Regression/README.md",
"language_code": "en"
}
-->
# Regression models for machine learning
## Regional topic: Regression models for pumpkin prices in North America 🎃
In North America, pumpkins are often carved into spooky faces for Halloween. Let's explore more about these intriguing vegetables!
![jack-o-lanterns](../../../2-Regression/images/jack-o-lanterns.jpg)
> Photo by <a href="https://unsplash.com/@teutschmann?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Beth Teutschmann</a> on <a href="https://unsplash.com/s/photos/jack-o-lanterns?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## What you will learn
[![Introduction to Regression](https://img.youtube.com/vi/5QnJtDad4iQ/0.jpg)](https://youtu.be/5QnJtDad4iQ "Regression Introduction video - Click to Watch!")
> 🎥 Click the image above for a quick introduction video to this lesson
The lessons in this section focus on types of regression within the context of machine learning. Regression models are useful for identifying the _relationship_ between variables. These models can predict values like length, temperature, or age, helping to uncover patterns and relationships as they analyze data points.
In this series of lessons, you'll learn the differences between linear and logistic regression, and understand when to use one over the other.
[![ML for beginners - Introduction to Regression models for Machine Learning](https://img.youtube.com/vi/XA3OaoW86R8/0.jpg)](https://youtu.be/XA3OaoW86R8 "ML for beginners - Introduction to Regression models for Machine Learning")
> 🎥 Click the image above for a short video introducing regression models.
In this set of lessons, you'll prepare to start machine learning tasks, including setting up Visual Studio Code to manage notebooks, a common environment for data scientists. You'll explore Scikit-learn, a machine learning library, and build your first models, focusing on regression models in this chapter.
> There are helpful low-code tools available to assist you in learning about regression models. Check out [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-regression-model-azure-machine-learning-designer/?WT.mc_id=academic-77952-leestott)
### Lessons
1. [Tools of the trade](1-Tools/README.md)
2. [Managing data](2-Data/README.md)
3. [Linear and polynomial regression](3-Linear/README.md)
4. [Logistic regression](4-Logistic/README.md)
---
### Credits
"ML with regression" was created with ♥️ by [Jen Looper](https://twitter.com/jenlooper)
♥️ Quiz contributors include: [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan) and [Ornella Altunyan](https://twitter.com/ornelladotcom)
The pumpkin dataset is recommended by [this project on Kaggle](https://www.kaggle.com/usda/a-year-of-pumpkin-prices) and its data is sourced from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) provided by the United States Department of Agriculture. We have added some points related to color based on variety to normalize the distribution. This data is in the public domain.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,357 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "e0b75f73e4a90d45181dc5581fe2ef5c",
"translation_date": "2025-09-06T10:55:08+00:00",
"source_file": "3-Web-App/1-Web-App/README.md",
"language_code": "en"
}
-->
# Build a Web App to use a ML Model
In this lesson, you will train a machine learning model using a fascinating dataset: _UFO sightings over the past century_, sourced from NUFORC's database.
You will learn:
- How to save a trained model using 'pickle'
- How to integrate that model into a Flask web application
We'll continue using notebooks to clean data and train our model, but we'll take it a step further by exploring how to use the model in a real-world scenario: a web app.
To achieve this, you'll need to build a web app using Flask.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Building an app
There are multiple ways to create web apps that utilize machine learning models. The architecture of your web app may influence how the model is trained. Imagine you're working in a company where the data science team has trained a model, and they want you to integrate it into an app.
### Considerations
Here are some important questions to consider:
- **Is it a web app or a mobile app?** If you're building a mobile app or need to use the model in an IoT context, you could use [TensorFlow Lite](https://www.tensorflow.org/lite/) to integrate the model into an Android or iOS app.
- **Where will the model be hosted?** Will it reside in the cloud or locally?
- **Offline support.** Does the app need to function offline?
- **What technology was used to train the model?** The technology used may dictate the tools required for integration.
- **Using TensorFlow.** If the model was trained using TensorFlow, you can convert it for use in a web app with [TensorFlow.js](https://www.tensorflow.org/js/).
- **Using PyTorch.** If the model was built using [PyTorch](https://pytorch.org/), you can export it in [ONNX](https://onnx.ai/) (Open Neural Network Exchange) format for use in JavaScript web apps with [Onnx Runtime](https://www.onnxruntime.ai/). This approach will be covered in a future lesson for a Scikit-learn-trained model.
- **Using Lobe.ai or Azure Custom Vision.** If you used an ML SaaS platform like [Lobe.ai](https://lobe.ai/) or [Azure Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77952-leestott), these tools provide options to export the model for various platforms, including creating a custom API for cloud-based queries.
You could also build a complete Flask web app capable of training the model directly in a web browser. This can be achieved using TensorFlow.js in a JavaScript environment.
For this lesson, since we've been working with Python-based notebooks, we'll focus on exporting a trained model from a notebook into a format that can be used in a Python-built web app.
## Tool
To complete this task, you'll need two tools: Flask and Pickle, both of which run on Python.
✅ What is [Flask](https://palletsprojects.com/p/flask/)? Flask is a lightweight web framework for Python that provides essential features for building web applications, including a templating engine for creating web pages. Check out [this Learn module](https://docs.microsoft.com/learn/modules/python-flask-build-ai-web-app?WT.mc_id=academic-77952-leestott) to practice working with Flask.
✅ What is [Pickle](https://docs.python.org/3/library/pickle.html)? Pickle 🥒 is a Python module used to serialize and deserialize Python object structures. When you 'pickle' a model, you flatten its structure for use in a web app. Note that Pickle is not inherently secure, so be careful when asked to 'un-pickle' a file. Pickled files typically have the `.pkl` extension.
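As a quick illustration of that round trip, here is a throwaway sketch (the dictionary is invented for this example, not part of the lesson's model):

```python
import pickle

favorites = {'vegetable': 'pumpkin', 'sightings': 42}

# Serialize ('pickle') the object to bytes, then restore it
data = pickle.dumps(favorites)
restored = pickle.loads(data)
print(restored == favorites)  # True
```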
## Exercise - clean your data
In this lesson, you'll work with data from 80,000 UFO sightings collected by [NUFORC](https://nuforc.org) (The National UFO Reporting Center). The dataset includes intriguing descriptions of UFO sightings, such as:
- **Long example description.** "A man emerges from a beam of light that shines on a grassy field at night and he runs towards the Texas Instruments parking lot."
- **Short example description.** "The lights chased us."
The [ufos.csv](../../../../3-Web-App/1-Web-App/data/ufos.csv) spreadsheet contains columns for the `city`, `state`, and `country` where the sighting occurred, the object's `shape`, and its `latitude` and `longitude`.
In the blank [notebook](../../../../3-Web-App/1-Web-App/notebook.ipynb) provided in this lesson:
1. Import `pandas`, `matplotlib`, and `numpy` as you did in previous lessons, and load the UFO dataset. Here's a sample of the data:
```python
import pandas as pd
import numpy as np
ufos = pd.read_csv('./data/ufos.csv')
ufos.head()
```
1. Convert the UFO data into a smaller dataframe with updated column names. Check the unique values in the `Country` field.
```python
ufos = pd.DataFrame({'Seconds': ufos['duration (seconds)'], 'Country': ufos['country'],'Latitude': ufos['latitude'],'Longitude': ufos['longitude']})
ufos.Country.unique()
```
1. Reduce the dataset by removing rows with null values and keeping only sightings that lasted between 1 and 60 seconds:
```python
ufos.dropna(inplace=True)
ufos = ufos[(ufos['Seconds'] >= 1) & (ufos['Seconds'] <= 60)]
ufos.info()
```
1. Use Scikit-learn's `LabelEncoder` library to convert text values in the `Country` column into numeric values:
✅ LabelEncoder encodes data alphabetically.
```python
from sklearn.preprocessing import LabelEncoder
ufos['Country'] = LabelEncoder().fit_transform(ufos['Country'])
ufos.head()
```
Your cleaned data should look like this:
```output
Seconds Country Latitude Longitude
2 20.0 3 53.200000 -2.916667
3 20.0 4 28.978333 -96.645833
14 30.0 4 35.823889 -80.253611
23 60.0 4 45.582778 -122.352222
24 3.0 3 51.783333 -0.783333
```
## Exercise - build your model
Now, divide the data into training and testing sets to prepare for model training.
1. Select three features for your X vector, and use the `Country` column as your y vector. The goal is to input `Seconds`, `Latitude`, and `Longitude` to predict a country ID.
```python
from sklearn.model_selection import train_test_split
Selected_features = ['Seconds','Latitude','Longitude']
X = ufos[Selected_features]
y = ufos['Country']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
1. Train your model using logistic regression:
```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))
```
The accuracy is quite good **(around 95%)**, which is expected since `Country` correlates strongly with `Latitude` and `Longitude`.
While the model isn't groundbreaking—predicting a `Country` from its `Latitude` and `Longitude` is straightforward—it serves as a valuable exercise in cleaning data, training a model, exporting it, and using it in a web app.
## Exercise - 'pickle' your model
Next, save your trained model using Pickle. After pickling the model, load it and test it with a sample data array containing values for seconds, latitude, and longitude:
```python
import pickle
model_filename = 'ufo-model.pkl'
pickle.dump(model, open(model_filename,'wb'))
model = pickle.load(open('ufo-model.pkl','rb'))
print(model.predict([[50,44,-12]]))
```
The model predicts **'3'**, which corresponds to the UK. Fascinating! 👽
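That '3' comes from `LabelEncoder`'s alphabetical encoding of the country codes. If you keep a reference to the encoder when you encode the column (a small, hedged variation on the earlier cleaning step), you can translate predictions back into readable labels:

```python
from sklearn.preprocessing import LabelEncoder

country_encoder = LabelEncoder()
ufos['Country'] = country_encoder.fit_transform(ufos['Country'])

# Later, invert a numeric prediction back to its original country code
print(country_encoder.inverse_transform([3]))  # e.g. ['gb'] for the UK
```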
## Exercise - build a Flask app
Now, create a Flask app to call your model and display results in a user-friendly format.
1. Start by creating a folder named **web-app** next to the _notebook.ipynb_ file where your _ufo-model.pkl_ file is located.
1. Inside the **web-app** folder, create two subfolders: **static** (with a **css** folder inside it) and **templates**. Your directory structure should look like this:
```output
web-app/
static/
css/
templates/
notebook.ipynb
ufo-model.pkl
```
✅ Refer to the solution folder for a completed app example.
1. Create a **requirements.txt** file in the **web-app** folder. This file lists the app's dependencies, similar to _package.json_ in JavaScript apps. Add the following lines to **requirements.txt**:
```text
scikit-learn
pandas
numpy
flask
```
1. Navigate to the **web-app** folder and run the following command:
```bash
cd web-app
```
1. Install the libraries listed in **requirements.txt** by running the following in your terminal:
```bash
pip install -r requirements.txt
```
1. Create three additional files to complete the app:
1. **app.py** in the root directory.
2. **index.html** in the **templates** folder.
3. **styles.css** in the **static/css** folder.
1. Add some basic styles to the **styles.css** file:
```css
body {
width: 100%;
height: 100%;
font-family: 'Helvetica';
background: black;
color: #fff;
text-align: center;
letter-spacing: 1.4px;
font-size: 30px;
}
input {
min-width: 150px;
}
.grid {
width: 300px;
border: 1px solid #2d2d2d;
display: grid;
justify-content: center;
margin: 20px auto;
}
.box {
color: #fff;
background: #2d2d2d;
padding: 12px;
display: inline-block;
}
```
1. Build the **index.html** file:
```html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>🛸 UFO Appearance Prediction! 👽</title>
<link rel="stylesheet" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<body>
<div class="grid">
<div class="box">
<p>According to the number of seconds, latitude and longitude, which country is likely to have reported seeing a UFO?</p>
<form action="{{ url_for('predict')}}" method="post">
<input type="number" name="seconds" placeholder="Seconds" required="required" min="0" max="60" />
<input type="text" name="latitude" placeholder="Latitude" required="required" />
<input type="text" name="longitude" placeholder="Longitude" required="required" />
<button type="submit" class="btn">Predict country where the UFO is seen</button>
</form>
<p>{{ prediction_text }}</p>
</div>
</div>
</body>
</html>
```
Notice the templating syntax in this file. Variables provided by the app, such as the prediction text, are enclosed in `{{}}`. The form posts data to the `/predict` route.
1. Finally, create the Python file that handles the model and displays predictions:
```python
import numpy as np
from flask import Flask, request, render_template
import pickle
app = Flask(__name__)
model = pickle.load(open("./ufo-model.pkl", "rb"))
@app.route("/")
def home():
return render_template("index.html")
@app.route("/predict", methods=["POST"])
def predict():
int_features = [int(x) for x in request.form.values()]
final_features = [np.array(int_features)]
prediction = model.predict(final_features)
output = prediction[0]
countries = ["Australia", "Canada", "Germany", "UK", "US"]
return render_template(
"index.html", prediction_text="Likely country: {}".format(countries[output])
)
if __name__ == "__main__":
app.run(debug=True)
```
> 💡 Tip: Adding [`debug=True`](https://www.askpython.com/python-modules/flask/flask-debug-mode) while running the Flask app allows you to see changes immediately without restarting the server. However, avoid enabling this mode in production.
Run `python app.py` or `python3 app.py` to start your local web server. You can then fill out the form to discover where UFOs have been sighted!
Before testing the app, review the structure of `app.py`:
1. Dependencies are loaded, and the app is initialized.
1. The model is imported.
1. The home route renders the `index.html` file.
On the `/predict` route, the following occurs when the form is submitted:
1. Form variables are collected and converted into a numpy array. The array is sent to the model, which returns a prediction.
2. Predicted country codes are converted into readable text and sent back to `index.html` for display.
Using a model with Flask and Pickle is relatively simple. The key challenge is understanding the data format required by the model for predictions, which depends on how the model was trained. In this case, three data points are needed for predictions.
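One lightweight way to confirm that, assuming a fitted scikit-learn estimator and a reasonably recent version of the library, is the `n_features_in_` attribute:

```python
import pickle

# A quick sanity check on what the unpickled model expects
model = pickle.load(open("./ufo-model.pkl", "rb"))
print(model.n_features_in_)  # 3 -> seconds, latitude, longitude
```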
In a professional setting, clear communication between the team training the model and the team integrating it into an app is crucial. In this lesson, you're both teams!
---
## 🚀 Challenge
Instead of training the model in a notebook and importing it into the Flask app, try training the model directly within the Flask app on a route called `train`. What are the advantages and disadvantages of this approach?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
There are various ways to build a web app that utilizes machine learning models. Make a list of methods for using JavaScript or Python to create such an app. Consider the architecture: should the model reside in the app or in the cloud? If hosted in the cloud, how would you access it? Sketch an architectural diagram for an applied ML web solution.
## Assignment
[Try a different model](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a8e8ae10be335cbc745b75ee552317ff",
"translation_date": "2025-09-06T10:55:32+00:00",
"source_file": "3-Web-App/1-Web-App/assignment.md",
"language_code": "en"
}
-->
# Try a different model
## Instructions
Now that you've created a web app using a trained Regression model, try using one of the models from a previous Regression lesson to rebuild this web app. You can keep the same style or redesign it to better suit the pumpkin data. Make sure to adjust the inputs to align with the training method of your chosen model.
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------------------------- | -------------------------------------------------------- | --------------------------------------------------------- | -------------------------------------- |
| | The web app works as intended and is successfully deployed to the cloud | The web app has some issues or produces unexpected results | The web app fails to function correctly |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,35 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "9836ff53cfef716ddfd70e06c5f43436",
"translation_date": "2025-09-06T10:55:01+00:00",
"source_file": "3-Web-App/README.md",
"language_code": "en"
}
-->
# Build a web app to use your ML model
In this part of the curriculum, you'll explore a practical application of machine learning: how to save your Scikit-learn model as a file that can be used to make predictions in a web application. Once the model is saved, you'll learn how to integrate it into a web app built with Flask. You'll start by creating a model using data about UFO sightings! Then, you'll develop a web app that allows users to input a number of seconds along with latitude and longitude values to predict which country reported the UFO sighting.
![UFO Parking](../../../3-Web-App/images/ufo.jpg)
Photo by <a href="https://unsplash.com/@mdherren?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Michael Herren</a> on <a href="https://unsplash.com/s/photos/ufo?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## Lessons
1. [Build a Web App](1-Web-App/README.md)
## Credits
"Build a Web App" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper).
♥️ The quizzes were created by Rohan Raj.
The dataset is provided by [Kaggle](https://www.kaggle.com/NUFORC/ufo-sightings).
The web app architecture was partially inspired by [this article](https://towardsdatascience.com/how-to-easily-deploy-machine-learning-models-using-flask-b95af8fe34d4) and [this repo](https://github.com/abhinavsagar/machine-learning-deployment) by Abhinav Sagar.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,313 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "aaf391d922bd6de5efba871d514c6d47",
"translation_date": "2025-09-06T10:57:23+00:00",
"source_file": "4-Classification/1-Introduction/README.md",
"language_code": "en"
}
-->
# Introduction to classification
In these four lessons, you will dive into one of the core areas of traditional machine learning: _classification_. We'll explore various classification algorithms using a dataset about the diverse and delicious cuisines of Asia and India. Get ready to whet your appetite!
![just a pinch!](../../../../4-Classification/1-Introduction/images/pinch.png)
> Celebrate pan-Asian cuisines in these lessons! Image by [Jen Looper](https://twitter.com/jenlooper)
Classification is a type of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that shares many similarities with regression techniques. Machine learning is all about predicting values or assigning labels to data using datasets, and classification typically falls into two categories: _binary classification_ and _multiclass classification_.
[![Introduction to classification](https://img.youtube.com/vi/eg8DJYwdMyg/0.jpg)](https://youtu.be/eg8DJYwdMyg "Introduction to classification")
> 🎥 Click the image above for a video: MIT's John Guttag introduces classification
Key points to remember:
- **Linear regression** helps predict relationships between variables and make accurate predictions about where a new data point might fall relative to a line. For example, you could predict _the price of a pumpkin in September versus December_.
- **Logistic regression** helps identify "binary categories": at a certain price point, _is this pumpkin orange or not-orange_?
Classification uses different algorithms to determine how to assign a label or class to a data point. In this lesson, we'll use cuisine data to see if we can predict the cuisine of origin based on a set of ingredients.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
> ### [This lesson is available in R!](../../../../4-Classification/1-Introduction/solution/R/lesson_10.html)
### Introduction
Classification is a fundamental task for machine learning researchers and data scientists. From simple binary classification ("is this email spam or not?") to complex image classification and segmentation using computer vision, the ability to sort data into categories and analyze it is invaluable.
To put it more scientifically, classification involves creating a predictive model that maps the relationship between input variables and output variables.
![binary vs. multiclass classification](../../../../4-Classification/1-Introduction/images/binary-multiclass.png)
> Binary vs. multiclass problems for classification algorithms to handle. Infographic by [Jen Looper](https://twitter.com/jenlooper)
Before we start cleaning, visualizing, and preparing our data for machine learning tasks, let's learn about the different ways machine learning can be used to classify data.
Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification in traditional machine learning uses features like `smoker`, `weight`, and `age` to predict the _likelihood of developing a certain disease_. As a supervised learning technique similar to regression, classification uses labeled data to train algorithms to classify and predict features (or 'labels') of a dataset and assign them to a group or outcome.
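As a tiny, purely illustrative sketch of that idea (the data below is invented for this example, not drawn from any lesson dataset):

```python
from sklearn.linear_model import LogisticRegression

# Invented toy data: [smoker (0/1), weight (kg), age (years)]
X = [[1, 90, 60], [0, 70, 30], [1, 85, 55], [0, 60, 25],
     [1, 95, 65], [0, 75, 40], [0, 65, 35], [1, 80, 50]]
y = [1, 0, 1, 0, 1, 0, 0, 1]  # 1 = developed the disease

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1, 88, 58]]))        # predicted class
print(clf.predict_proba([[1, 88, 58]]))  # predicted likelihood per class
```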
✅ Take a moment to imagine a dataset about cuisines. What could a multiclass model answer? What could a binary model answer? For instance, could you predict whether a given cuisine is likely to use fenugreek? Or, if you were handed a grocery bag containing star anise, artichokes, cauliflower, and horseradish, could you determine if you could make a typical Indian dish?
[![Crazy mystery baskets](https://img.youtube.com/vi/GuTeDbaNoEU/0.jpg)](https://youtu.be/GuTeDbaNoEU "Crazy mystery baskets")
> 🎥 Click the image above for a video. The premise of the show 'Chopped' is the 'mystery basket,' where chefs must create dishes using random ingredients. Imagine how helpful a machine learning model could be!
## Hello 'classifier'
The question we want to answer with this cuisine dataset is a **multiclass question**, as we have several possible national cuisines to consider. Given a set of ingredients, which of these classes does the data belong to?
Scikit-learn provides several algorithms for classifying data, depending on the type of problem you're solving. In the next two lessons, you'll explore some of these algorithms.
## Exercise - clean and balance your data
Before diving into the project, the first step is to clean and **balance** your data to achieve better results. Start with the blank _notebook.ipynb_ file in the root of this folder.
The first thing you'll need to install is [imblearn](https://imbalanced-learn.org/stable/). This Scikit-learn package helps balance datasets (you'll learn more about this shortly).
1. Install `imblearn` using `pip install`:
```python
pip install imblearn
```
1. Import the necessary packages to load and visualize your data, and also import `SMOTE` from `imblearn`.
```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from imblearn.over_sampling import SMOTE
```
You're now ready to import the data.
1. Import the data:
```python
df = pd.read_csv('../data/cuisines.csv')
```
Use `read_csv()` to load the contents of the _cuisines.csv_ file into the variable `df`.
1. Check the shape of the data:
```python
df.head()
```
The first five rows look like this:
```output
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 65 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 66 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 67 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 68 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 69 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
```
1. Get information about the data using `info()`:
```python
df.info()
```
The output looks like this:
```output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2448 entries, 0 to 2447
Columns: 385 entries, Unnamed: 0 to zucchini
dtypes: int64(384), object(1)
memory usage: 7.2+ MB
```
## Exercise - learning about cuisines
Now the fun begins! Let's explore the distribution of data across cuisines.
1. Plot the data as horizontal bars using `barh()`:
```python
df.cuisine.value_counts().plot.barh()
```
![cuisine data distribution](../../../../4-Classification/1-Introduction/images/cuisine-dist.png)
There are a limited number of cuisines, but the data distribution is uneven. You can fix this! Before doing so, let's explore further.
1. Check how much data is available for each cuisine and print the results:
```python
thai_df = df[(df.cuisine == "thai")]
japanese_df = df[(df.cuisine == "japanese")]
chinese_df = df[(df.cuisine == "chinese")]
indian_df = df[(df.cuisine == "indian")]
korean_df = df[(df.cuisine == "korean")]
print(f'thai df: {thai_df.shape}')
print(f'japanese df: {japanese_df.shape}')
print(f'chinese df: {chinese_df.shape}')
print(f'indian df: {indian_df.shape}')
print(f'korean df: {korean_df.shape}')
```
The output looks like this:
```output
thai df: (289, 385)
japanese df: (320, 385)
chinese df: (442, 385)
indian df: (598, 385)
korean df: (799, 385)
```
## Discovering ingredients
Now let's dig deeper into the data to identify typical ingredients for each cuisine. You'll need to clean out recurring data that might cause confusion between cuisines.
1. Create a Python function `create_ingredient_df()` to generate an ingredient dataframe. This function will drop an unhelpful column and sort ingredients by their count:
```python
def create_ingredient_df(df):
ingredient_df = df.T.drop(['cuisine','Unnamed: 0']).sum(axis=1).to_frame('value')
ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
ingredient_df = ingredient_df.sort_values(by='value', ascending=False,
inplace=False)
return ingredient_df
```
Use this function to identify the top ten most popular ingredients for each cuisine.
1. Call `create_ingredient_df()` and plot the results using `barh()`:
```python
thai_ingredient_df = create_ingredient_df(thai_df)
thai_ingredient_df.head(10).plot.barh()
```
![thai](../../../../4-Classification/1-Introduction/images/thai.png)
1. Repeat the process for Japanese cuisine:
```python
japanese_ingredient_df = create_ingredient_df(japanese_df)
japanese_ingredient_df.head(10).plot.barh()
```
![japanese](../../../../4-Classification/1-Introduction/images/japanese.png)
1. Do the same for Chinese cuisine:
```python
chinese_ingredient_df = create_ingredient_df(chinese_df)
chinese_ingredient_df.head(10).plot.barh()
```
![chinese](../../../../4-Classification/1-Introduction/images/chinese.png)
1. Plot the ingredients for Indian cuisine:
```python
indian_ingredient_df = create_ingredient_df(indian_df)
indian_ingredient_df.head(10).plot.barh()
```
![indian](../../../../4-Classification/1-Introduction/images/indian.png)
1. Finally, plot the ingredients for Korean cuisine:
```python
korean_ingredient_df = create_ingredient_df(korean_df)
korean_ingredient_df.head(10).plot.barh()
```
![korean](../../../../4-Classification/1-Introduction/images/korean.png)
1. Remove common ingredients that might cause confusion between cuisines using `drop()`:
Everyone loves rice, garlic, and ginger!
```python
feature_df = df.drop(['cuisine','Unnamed: 0','rice','garlic','ginger'], axis=1)
labels_df = df.cuisine
feature_df.head()
```
## Balance the dataset
Now that the data is cleaned, use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) - "Synthetic Minority Over-sampling Technique" - to balance it.
1. Use `fit_resample()` to generate new samples through interpolation.
```python
oversample = SMOTE()
transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
```
Balancing your data improves classification results. For example, in binary classification, if most of your data belongs to one class, the model will predict that class more often simply because there's more data for it. Balancing the data reduces this bias.
1. Check the number of labels per ingredient:
```python
print(f'new label count: {transformed_label_df.value_counts()}')
print(f'old label count: {df.cuisine.value_counts()}')
```
The output looks like this:
```output
new label count: korean 799
chinese 799
indian 799
japanese 799
thai 799
Name: cuisine, dtype: int64
old label count: korean 799
indian 598
chinese 442
japanese 320
thai 289
Name: cuisine, dtype: int64
```
The data is now clean, balanced, and ready for analysis!
1. Save the balanced data, including labels and features, into a new dataframe for export:
```python
transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')
```
1. Take one last look at the data using `transformed_df.head()` and `transformed_df.info()`. Save a copy of this data for use in future lessons:
```python
transformed_df.head()
transformed_df.info()
transformed_df.to_csv("../data/cleaned_cuisines.csv")
```
The new CSV file is now available in the root data folder.
---
## 🚀Challenge
This curriculum includes several interesting datasets. Explore the `data` folders to find datasets suitable for binary or multiclass classification. What questions could you ask of these datasets?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Learn more about SMOTE's API. What use cases is it best suited for? What problems does it address?
## Assignment
[Explore classification methods](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b2a01912beb24cfb0007f83594dba801",
"translation_date": "2025-09-06T10:57:42+00:00",
"source_file": "4-Classification/1-Introduction/assignment.md",
"language_code": "en"
}
-->
# Explore classification methods
## Instructions
In [Scikit-learn documentation](https://scikit-learn.org/stable/supervised_learning.html), you'll find an extensive list of ways to classify data. Take some time to explore these docs: your goal is to identify classification methods and connect them to a dataset from this curriculum, a question you can pose about it, and a classification technique. Create a spreadsheet or table in a .doc file and describe how the dataset would interact with the classification algorithm.
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | ----------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| | A document is provided that reviews 5 algorithms along with a classification technique. The review is thorough and well-articulated. | A document is provided that reviews 3 algorithms along with a classification technique. The review is thorough and well-articulated. | A document is provided that reviews fewer than three algorithms along with a classification technique, and the review lacks clarity or detail. |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:57:47+00:00",
"source_file": "4-Classification/1-Introduction/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,254 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1a6e9e46b34a2e559fbbfc1f95397c7b",
"translation_date": "2025-09-06T10:55:54+00:00",
"source_file": "4-Classification/2-Classifiers-1/README.md",
"language_code": "en"
}
-->
# Cuisine classifiers 1
In this lesson, you will use the dataset you saved from the previous lesson, which contains balanced and clean data about cuisines.
You will use this dataset with various classifiers to _predict the national cuisine based on a set of ingredients_. Along the way, you'll learn more about how algorithms can be applied to classification tasks.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Preparation
If you completed [Lesson 1](../1-Introduction/README.md), ensure that a _cleaned_cuisines.csv_ file exists in the root `/data` folder for these four lessons.
## Exercise - predict a national cuisine
1. In this lesson's folder, open the _notebook.ipynb_ file and import the data along with the Pandas library:
```python
import pandas as pd
cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
cuisines_df.head()
```
The data looks like this:
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1. Next, import several additional libraries:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np
```
1. Separate the X and y coordinates into two dataframes for training. Use `cuisine` as the labels dataframe:
```python
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()
```
It will look like this:
```output
0 indian
1 indian
2 indian
3 indian
4 indian
Name: cuisine, dtype: object
```
1. Drop the `Unnamed: 0` column and the `cuisine` column using `drop()`. Save the remaining data as trainable features:
```python
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()
```
Your features will look like this:
| | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| ---: | -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | ---: | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | ---: | ----: | -----: | -------: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Now you're ready to train your model!
## Choosing your classifier
With your data clean and ready for training, it's time to decide which algorithm to use for the task.
Scikit-learn categorizes classification under Supervised Learning, offering a wide range of classification methods. [The options](https://scikit-learn.org/stable/supervised_learning.html) can seem overwhelming at first glance. These methods include:
- Linear Models
- Support Vector Machines
- Stochastic Gradient Descent
- Nearest Neighbors
- Gaussian Processes
- Decision Trees
- Ensemble methods (Voting Classifier)
- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)
> You can also use [neural networks for classification](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), but that is beyond the scope of this lesson.
### Which classifier should you choose?
So, how do you decide on a classifier? Often, testing several options and comparing results is a good approach. Scikit-learn provides a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a sample dataset, showcasing KNeighbors, SVC (two variations), GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB, and QuadraticDiscriminantAnalysis, with visualized results:
![comparison of classifiers](../../../../4-Classification/2-Classifiers-1/images/comparison.png)
> Plots generated from Scikit-learn's documentation
> AutoML simplifies this process by running these comparisons in the cloud, helping you select the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-77952-leestott)
### A more informed approach
Instead of guessing, you can refer to this downloadable [ML Cheat Sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-77952-leestott). For our multiclass problem, it suggests several options:
![cheatsheet for multiclass problems](../../../../4-Classification/2-Classifiers-1/images/cheatsheet.png)
> A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options
✅ Download this cheat sheet, print it out, and keep it handy!
### Reasoning
Let's evaluate different approaches based on our constraints:
- **Neural networks are too resource-intensive**. Given our clean but small dataset and the fact that we're training locally in notebooks, neural networks are not ideal for this task.
- **Avoid two-class classifiers**. Since this is not a binary classification problem, two-class classifiers like one-vs-all are not suitable.
- **Decision tree or logistic regression could work**. Both decision trees and logistic regression are viable options for multiclass data.
- **Multiclass Boosted Decision Trees are not suitable**. These are better for nonparametric tasks like ranking, which is not relevant here.
### Using Scikit-learn
We'll use Scikit-learn to analyze our data. Logistic regression in Scikit-learn offers several options. Check out the [parameters](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression) you can configure.
Two key parameters to set are `multi_class` and `solver`. These determine the behavior and algorithm used for logistic regression. Not all solvers are compatible with all `multi_class` values.
According to the documentation, for multiclass classification:
- **The one-vs-rest (OvR) scheme** is used if `multi_class` is set to `ovr`.
- **Cross-entropy loss** is used if `multi_class` is set to `multinomial`. (The `multinomial` option is supported only by the lbfgs, sag, saga, and newton-cg solvers.)
> 🎓 The 'scheme' refers to how logistic regression handles multiclass classification. It can be 'ovr' (one-vs-rest) or 'multinomial'. These schemes adapt logistic regression, which is primarily designed for binary classification, to handle multiclass tasks. [source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)
> 🎓 The 'solver' is the algorithm used to optimize the problem. [source](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
Scikit-learn provides this table to explain how solvers handle different challenges based on data structures:
![solvers](../../../../4-Classification/2-Classifiers-1/images/solvers.png)
## Exercise - split the data
Let's start with logistic regression for our first training attempt, as you recently learned about it in a previous lesson.
Split your data into training and testing sets using `train_test_split()`:
```python
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
```
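✅ `train_test_split()` shuffles the data randomly, so your exact numbers will differ between runs. If you want a reproducible split, you can pass a `random_state` (an optional tweak, not part of the original notebook):
```python
X_train, X_test, y_train, y_test = train_test_split(
    cuisines_feature_df, cuisines_label_df, test_size=0.3, random_state=0
)
```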
## Exercise - apply logistic regression
Since this is a multiclass problem, you need to choose a _scheme_ and a _solver_. Use LogisticRegression with a multiclass setting and the **liblinear** solver for training.
1. Create a logistic regression model with `multi_class` set to `ovr` and the solver set to `liblinear`:
```python
lr = LogisticRegression(multi_class='ovr',solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))
accuracy = model.score(X_test, y_test)
print ("Accuracy is {}".format(accuracy))
```
✅ Try using a different solver like `lbfgs`, which is often the default option.
> Note, use Pandas [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) function to flatten your data when needed.
With the `liblinear` solver, the accuracy is good at over **80%**!
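As suggested above, a minimal sketch of swapping in the `lbfgs` solver with the `multinomial` scheme might look like this (same variables as before; your accuracy will differ slightly, and recent versions of Scikit-learn may warn that `multi_class` is deprecated):
```python
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model = lr.fit(X_train, np.ravel(y_train))
# score the multinomial model on the same test split
print("Accuracy is {}".format(model.score(X_test, y_test)))
```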
1. You can see this model in action by testing one row of data (#50):
```python
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')
```
The result is printed:
```output
ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
cuisine: indian
```
✅ Try a different row number and check the results.
1. Digging deeper, you can check the accuracy of this prediction:
```python
test= X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)
topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()
```
The result is printed - Indian cuisine is its best guess, with good probability:
| | 0 |
| -------: | -------: |
| indian | 0.715851 |
| chinese | 0.229475 |
| japanese | 0.029763 |
| korean | 0.017277 |
| thai | 0.007634 |
✅ Can you explain why the model is quite confident this is an Indian cuisine?
1. Get more detail by printing a classification report, as you did in the regression lessons:
```python
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))
```
| | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| chinese | 0.73 | 0.71 | 0.72 | 229 |
| indian | 0.91 | 0.93 | 0.92 | 254 |
| japanese | 0.70 | 0.75 | 0.72 | 220 |
| korean | 0.86 | 0.76 | 0.81 | 242 |
| thai | 0.79 | 0.85 | 0.82 | 254 |
| accuracy     |           |        | 0.80     | 1199    |
| macro avg | 0.80 | 0.80 | 0.80 | 1199 |
| weighted avg | 0.80 | 0.80 | 0.80 | 1199 |
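The imports at the top of this lesson include `cross_val_score`, which the walkthrough doesn't use. As a hedged extra, here is a minimal sketch of how you might use it to get a more robust accuracy estimate than a single train/test split, reusing the dataframes created earlier:
```python
# 5-fold cross-validation over the whole feature/label set
scores = cross_val_score(
    LogisticRegression(multi_class='ovr', solver='liblinear'),
    cuisines_feature_df,
    np.ravel(cuisines_label_df),
    cv=5
)
print("Cross-validated accuracy: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std()))
```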
## 🚀Challenge
In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Dig a little more into the math behind logistic regression in [this lesson](https://people.eecs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2006.pdf)
## Assignment
[Study the solvers](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,24 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "de6025f96841498b0577e9d1aee18d1f",
"translation_date": "2025-09-06T10:56:30+00:00",
"source_file": "4-Classification/2-Classifiers-1/assignment.md",
"language_code": "en"
}
-->
# Study the solvers
## Instructions
In this lesson, you learned about the different solvers that combine algorithms with a machine learning process to build an accurate model. Review the solvers mentioned in the lesson and choose two. In your own words, compare and contrast these two solvers. What type of problem do they solve? How do they interact with different data structures? Why would you choose one over the other?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------ | ---------------------------- |
| | A .doc file is provided with two paragraphs, each discussing one solver and comparing them thoughtfully. | A .doc file is provided with only one paragraph | The assignment is incomplete |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:56:35+00:00",
"source_file": "4-Classification/2-Classifiers-1/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,249 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "49047911108adc49d605cddfb455749c",
"translation_date": "2025-09-06T10:57:01+00:00",
"source_file": "4-Classification/3-Classifiers-2/README.md",
"language_code": "en"
}
-->
# Cuisine classifiers 2
In this second classification lesson, you will explore additional methods for classifying numeric data. You will also learn about the implications of choosing one classifier over another.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
### Prerequisite
We assume that you have completed the previous lessons and have a cleaned dataset in your `data` folder named _cleaned_cuisines.csv_ in the root of this 4-lesson folder.
### Preparation
We have preloaded your _notebook.ipynb_ file with the cleaned dataset and divided it into X and y dataframes, ready for the model-building process.
## A classification map
Previously, you learned about the various options available for classifying data using Microsoft's cheat sheet. Scikit-learn provides a similar but more detailed cheat sheet that can help you further narrow down your choice of estimators (another term for classifiers):
![ML Map from Scikit-learn](../../../../4-Classification/3-Classifiers-2/images/map.png)
> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read documentation.
### The plan
This map is very useful once you have a clear understanding of your data, as you can follow its paths to make a decision:
- We have >50 samples
- We want to predict a category
- We have labeled data
- We have fewer than 100K samples
- ✨ We can choose a Linear SVC
- If that doesn't work, since we have numeric data:
- We can try a ✨ KNeighbors Classifier
- If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers
This is a very helpful guide to follow.
## Exercise - split the data
Following this path, we should start by importing some libraries for use.
1. Import the necessary libraries:
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
import numpy as np
```
1. Split your training and test data:
```python
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
```
## Linear SVC classifier
Support-Vector Classification (SVC) is part of the Support-Vector Machines family of ML techniques (learn more about these below). In this method, you choose a 'kernel' that determines how the data is separated. The 'C' parameter controls 'regularization': larger values let the model fit the training data more closely. The kernel can be one of [several](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC); here, we set it to 'linear' to ensure we use linear SVC. Probability defaults to 'false'; here, we set it to 'true' to gather probability estimates. We set the random state to '0' so the shuffling used for those probability estimates is reproducible.
### Exercise - apply a linear SVC
Start by creating an array of classifiers. You will add to this array progressively as we test.
1. Start with a Linear SVC:
```python
C = 10
# Create different classifiers.
classifiers = {
'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0)
}
```
2. Train your model using the Linear SVC and print out a report:
```python
n_classifiers = len(classifiers)
for index, (name, classifier) in enumerate(classifiers.items()):
classifier.fit(X_train, np.ravel(y_train))
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
print(classification_report(y_test,y_pred))
```
The result is quite good:
```output
Accuracy (train) for Linear SVC: 78.6%
precision recall f1-score support
chinese 0.71 0.67 0.69 242
indian 0.88 0.86 0.87 234
japanese 0.79 0.74 0.76 254
korean 0.85 0.81 0.83 242
thai 0.71 0.86 0.78 227
accuracy 0.79 1199
macro avg 0.79 0.79 0.79 1199
weighted avg 0.79 0.79 0.79 1199
```
## K-Neighbors classifier
K-Neighbors belongs to the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, a prediction for a new point is made by looking at a predefined number of nearby training points (its 'neighbors') and generalizing from their labels.
### Exercise - apply the K-Neighbors classifier
The previous classifier performed well with the data, but perhaps we can achieve better accuracy. Try a K-Neighbors classifier.
1. Add a line to your classifier array (add a comma after the Linear SVC item):
```python
'KNN classifier': KNeighborsClassifier(C),
```
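Note that the first positional argument of `KNeighborsClassifier` is `n_neighbors`, so `KNeighborsClassifier(C)` simply reuses the value 10 defined earlier as the number of neighbors. Writing it out explicitly (an equivalent line, shown only for clarity) makes that easier to read:
```python
'KNN classifier': KNeighborsClassifier(n_neighbors=10),
```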
The result is slightly worse:
```output
Accuracy (train) for KNN classifier: 73.8%
precision recall f1-score support
chinese 0.64 0.67 0.66 242
indian 0.86 0.78 0.82 234
japanese 0.66 0.83 0.74 254
korean 0.94 0.58 0.72 242
thai 0.71 0.82 0.76 227
accuracy 0.74 1199
macro avg 0.76 0.74 0.74 1199
weighted avg 0.76 0.74 0.74 1199
```
✅ Learn about [K-Neighbors](https://scikit-learn.org/stable/modules/neighbors.html#neighbors)
## Support Vector Classifier
Support-Vector Classifiers are part of the [Support-Vector Machine](https://wikipedia.org/wiki/Support-vector_machine) family of ML methods used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. Subsequent data is mapped into this space so their category can be predicted.
### Exercise - apply a Support Vector Classifier
Let's aim for slightly better accuracy with a Support Vector Classifier.
1. Add a comma after the K-Neighbors item, and then add this line:
```python
'SVC': SVC(),
```
The result is quite good!
```output
Accuracy (train) for SVC: 83.2%
precision recall f1-score support
chinese 0.79 0.74 0.76 242
indian 0.88 0.90 0.89 234
japanese 0.87 0.81 0.84 254
korean 0.91 0.82 0.86 242
thai 0.74 0.90 0.81 227
accuracy 0.83 1199
macro avg 0.84 0.83 0.83 1199
weighted avg 0.84 0.83 0.83 1199
```
✅ Learn about [Support-Vectors](https://scikit-learn.org/stable/modules/svm.html#svm)
## Ensemble Classifiers
Let's follow the path to the very end, even though the previous test performed well. Let's try some 'Ensemble Classifiers,' specifically Random Forest and AdaBoost:
```python
'RFST': RandomForestClassifier(n_estimators=100),
'ADA': AdaBoostClassifier(n_estimators=100)
```
The result is excellent, especially for Random Forest:
```output
Accuracy (train) for RFST: 84.5%
precision recall f1-score support
chinese 0.80 0.77 0.78 242
indian 0.89 0.92 0.90 234
japanese 0.86 0.84 0.85 254
korean 0.88 0.83 0.85 242
thai 0.80 0.87 0.83 227
accuracy 0.84 1199
macro avg 0.85 0.85 0.84 1199
weighted avg 0.85 0.84 0.84 1199
Accuracy (train) for ADA: 72.4%
precision recall f1-score support
chinese 0.64 0.49 0.56 242
indian 0.91 0.83 0.87 234
japanese 0.68 0.69 0.69 254
korean 0.73 0.79 0.76 242
thai 0.67 0.83 0.74 227
accuracy 0.72 1199
macro avg 0.73 0.73 0.72 1199
weighted avg 0.73 0.72 0.72 1199
```
✅ Learn about [Ensemble Classifiers](https://scikit-learn.org/stable/modules/ensemble.html)
This Machine Learning method "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Forest and AdaBoost.
- [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter specifies the number of trees.
- [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) fits a classifier to a dataset and then fits copies of that classifier to the same dataset. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.
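If you want to experiment before tackling the challenge below, a minimal sketch of adding differently configured ensembles to the `classifiers` dictionary might look like this (hypothetical parameter values; the imports and train/test variables come from earlier in this lesson):
```python
# hypothetical tweaks - try your own values and compare the classification reports
classifiers['RFST (deeper)'] = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=0)
classifiers['ADA (slower learning)'] = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
```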
---
## 🚀Challenge
Each of these techniques has numerous parameters that you can adjust. Research the default parameters for each one and consider how tweaking these parameters might affect the model's quality.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
There is a lot of terminology in these lessons, so take a moment to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful terms!
## Assignment
[Parameter play](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "58dfdaf79fb73f7d34b22bdbacf57329",
"translation_date": "2025-09-06T10:57:16+00:00",
"source_file": "4-Classification/3-Classifiers-2/assignment.md",
"language_code": "en"
}
-->
# Parameter Play
## Instructions
There are many parameters that are set by default when working with these classifiers. Intellisense in VS Code can help you explore them. Choose one of the ML Classification Techniques covered in this lesson and retrain models by adjusting various parameter values. Create a notebook that explains why certain changes improve the model's quality while others worsen it. Provide a detailed explanation in your response.
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | --------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------- | ----------------------------- |
| | A notebook is provided with a fully developed classifier, its parameters adjusted, and changes explained in textboxes | A notebook is partially provided or poorly explained | A notebook contains errors or issues |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:57:21+00:00",
"source_file": "4-Classification/3-Classifiers-2/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,329 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "61bdec27ed2da8b098cd9065405d9bb0",
"translation_date": "2025-09-06T10:56:37+00:00",
"source_file": "4-Classification/4-Applied/README.md",
"language_code": "en"
}
-->
# Build a Cuisine Recommender Web App
In this lesson, you will create a classification model using techniques learned in previous lessons and the delicious cuisine dataset used throughout this series. Additionally, you will develop a small web app to utilize a saved model, leveraging Onnx's web runtime.
Recommendation systems are one of the most practical applications of machine learning, and today you'll take your first step in building one!
[![Presenting this web app](https://img.youtube.com/vi/17wdM9AHMfg/0.jpg)](https://youtu.be/17wdM9AHMfg "Applied ML")
> 🎥 Click the image above for a video: Jen Looper builds a web app using classified cuisine data
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
In this lesson, you will learn:
- How to build a model and save it as an Onnx model
- How to use Netron to inspect the model
- How to use your model in a web app for inference
## Build your model
Building applied ML systems is an essential part of integrating these technologies into business systems. By using Onnx, you can incorporate models into web applications, enabling offline usage if necessary.
In a [previous lesson](../../3-Web-App/1-Web-App/README.md), you created a regression model about UFO sightings, "pickled" it, and used it in a Flask app. While this architecture is valuable, it is a full-stack Python app, and your requirements might call for a JavaScript-based application.
In this lesson, you'll build a basic JavaScript-based system for inference. First, you need to train a model and convert it for use with Onnx.
## Exercise - Train a Classification Model
Start by training a classification model using the cleaned cuisines dataset we've worked with before.
1. Begin by importing the necessary libraries:
```python
!pip install skl2onnx
import pandas as pd
```
You'll need '[skl2onnx](https://onnx.ai/sklearn-onnx/)' to convert your Scikit-learn model to Onnx format.
2. Process your data as you did in previous lessons by reading a CSV file using `read_csv()`:
```python
data = pd.read_csv('../data/cleaned_cuisines.csv')
data.head()
```
3. Remove the first two unnecessary columns and save the remaining data as 'X':
```python
X = data.iloc[:,2:]
X.head()
```
4. Save the labels as 'y':
```python
y = data[['cuisine']]
y.head()
```
### Start the Training Process
We'll use the SVC classifier, which provides good accuracy.
1. Import the relevant libraries from Scikit-learn:
```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report
```
2. Split the data into training and test sets:
```python
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
```
3. Build an SVC classification model as you did in the previous lesson:
```python
model = SVC(kernel='linear', C=10, probability=True,random_state=0)
model.fit(X_train,y_train.values.ravel())
```
4. Test your model by calling `predict()`:
```python
y_pred = model.predict(X_test)
```
5. Print a classification report to evaluate the model's performance:
```python
print(classification_report(y_test,y_pred))
```
As seen before, the accuracy is strong:
```output
precision recall f1-score support
chinese 0.72 0.69 0.70 257
indian 0.91 0.87 0.89 243
japanese 0.79 0.77 0.78 239
korean 0.83 0.79 0.81 236
thai 0.72 0.84 0.78 224
accuracy 0.79 1199
macro avg 0.79 0.79 0.79 1199
weighted avg 0.79 0.79 0.79 1199
```
### Convert Your Model to Onnx
Ensure the conversion uses the correct tensor number. This dataset includes 380 ingredients, so you'll need to specify that number in `FloatTensorType`.
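Rather than hard-coding 380, you can read the feature count from your data; a small sketch, assuming `X` is the features dataframe created above:
```python
from skl2onnx.common.data_types import FloatTensorType

num_features = X.shape[1]   # 380 ingredient columns in this dataset
initial_type = [('float_input', FloatTensorType([None, num_features]))]
```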
1. Convert the model using a tensor number of 380:
```python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, 380]))]
options = {id(model): {'nocl': True, 'zipmap': False}}
```
2. Save the Onnx model as a file named **model.onnx**:
```python
onx = convert_sklearn(model, initial_types=initial_type, options=options)
with open("./model.onnx", "wb") as f:
f.write(onx.SerializeToString())
```
> Note: You can pass [options](https://onnx.ai/sklearn-onnx/parameterized.html) in your conversion script. In this case, we set 'nocl' to True and 'zipmap' to False. Since this is a classification model, you can remove ZipMap, which produces a list of dictionaries (not needed). `nocl` refers to class information being included in the model. Reduce the model's size by setting `nocl` to 'True'.
Running the entire notebook will now create an Onnx model and save it in the current folder.
## View Your Model
Onnx models aren't easily viewable in Visual Studio Code, but there's excellent free software called [Netron](https://github.com/lutzroeder/Netron) that researchers use to visualize models. Open your model.onnx file in Netron to see your simple model, including its 380 inputs and classifier:
![Netron visual](../../../../4-Classification/4-Applied/images/netron.png)
Netron is a useful tool for inspecting models.
Now you're ready to use this model in a web app. Let's build an app to help you decide which cuisine you can prepare based on the leftover ingredients in your refrigerator, as determined by your model.
## Build a Recommender Web Application
You can use your model directly in a web app. This architecture allows you to run it locally and even offline if needed. Start by creating an `index.html` file in the same folder as your `model.onnx` file.
1. In the file _index.html_, add the following markup:
```html
<!DOCTYPE html>
<html>
<head>
<title>Cuisine Matcher</title>
</head>
<body>
...
</body>
</html>
```
2. Within the `body` tags, add markup to display a list of checkboxes representing various ingredients:
```html
<h1>Check your refrigerator. What can you create?</h1>
<div id="wrapper">
<div class="boxCont">
<input type="checkbox" value="4" class="checkbox">
<label>apple</label>
</div>
<div class="boxCont">
<input type="checkbox" value="247" class="checkbox">
<label>pear</label>
</div>
<div class="boxCont">
<input type="checkbox" value="77" class="checkbox">
<label>cherry</label>
</div>
<div class="boxCont">
<input type="checkbox" value="126" class="checkbox">
<label>fenugreek</label>
</div>
<div class="boxCont">
<input type="checkbox" value="302" class="checkbox">
<label>sake</label>
</div>
<div class="boxCont">
<input type="checkbox" value="327" class="checkbox">
<label>soy sauce</label>
</div>
<div class="boxCont">
<input type="checkbox" value="112" class="checkbox">
<label>cumin</label>
</div>
</div>
<div style="padding-top:10px">
<button onClick="startInference()">What kind of cuisine can you make?</button>
</div>
```
Each checkbox is assigned a value corresponding to the index of the ingredient in the dataset. For example, apple occupies the fifth column in the alphabetic list, so its value is '4' (counting starts at 0). Refer to the [ingredients spreadsheet](../../../../4-Classification/data/ingredient_indexes.csv) to find an ingredient's index.
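If you'd rather look an index up programmatically than in the spreadsheet, a small sketch in the training notebook (assuming `X` is the features dataframe from the training step) would be:
```python
# column positions in X match the checkbox values used in the markup above
print(list(X.columns).index('apple'))   # expected: 4
print(list(X.columns).index('cumin'))   # expected: 112
```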
After the closing `</div>` tag, add a script block to call the model.
3. First, import the [Onnx Runtime](https://www.onnxruntime.ai/):
```html
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.9.0/dist/ort.min.js"></script>
```
> Onnx Runtime enables running Onnx models across various hardware platforms, offering optimizations and an API for usage.
4. Once the runtime is set up, call it:
```html
<script>
const ingredients = Array(380).fill(0);
const checks = [...document.querySelectorAll('.checkbox')];
checks.forEach(check => {
check.addEventListener('change', function() {
// toggle the state of the ingredient
// based on the checkbox's value (1 or 0)
ingredients[check.value] = check.checked ? 1 : 0;
});
});
function testCheckboxes() {
// validate if at least one checkbox is checked
return checks.some(check => check.checked);
}
async function startInference() {
let atLeastOneChecked = testCheckboxes()
if (!atLeastOneChecked) {
alert('Please select at least one ingredient.');
return;
}
try {
// create a new session and load the model.
const session = await ort.InferenceSession.create('./model.onnx');
const input = new ort.Tensor(new Float32Array(ingredients), [1, 380]);
const feeds = { float_input: input };
// feed inputs and run
const results = await session.run(feeds);
// read from results
alert('You can enjoy ' + results.label.data[0] + ' cuisine today!')
} catch (e) {
console.log(`failed to inference ONNX model`);
console.error(e);
}
}
</script>
```
In this code, several things happen:
1. An array of 380 possible values (1 or 0) is created to send to the model for inference, depending on whether an ingredient checkbox is checked.
2. An array of checkboxes is collected, and a `change` listener is attached to each one. When a checkbox is toggled, the `ingredients` array is updated to reflect the selected ingredient.
3. A `testCheckboxes` function checks if any checkbox is selected.
4. The `startInference` function is triggered when the button is pressed. If any checkbox is checked, inference begins.
5. The inference routine includes:
1. Setting up an asynchronous model load
2. Creating a Tensor structure to send to the model
3. Creating 'feeds' that match the `float_input` input created during model training (use Netron to verify the name)
4. Sending these 'feeds' to the model and awaiting a response
## Test Your Application
Open a terminal in Visual Studio Code in the folder containing your index.html file. Ensure [http-server](https://www.npmjs.com/package/http-server) is installed globally, then type `http-server` at the prompt. A local web server will start; open the localhost address it prints in your browser to view your web app. Check which cuisine is recommended based on the selected ingredients:
![ingredient web app](../../../../4-Classification/4-Applied/images/web-app.png)
Congratulations! You've created a recommendation web app with a few fields. Take some time to expand this system.
## 🚀Challenge
Your web app is quite basic, so enhance it using ingredients and their indexes from the [ingredient_indexes](../../../../4-Classification/data/ingredient_indexes.csv) dataset. What flavor combinations work to create a specific national dish?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
This lesson briefly introduced the concept of creating a recommendation system for food ingredients. This area of ML applications is rich with examples. Explore more about how these systems are built:
- https://www.sciencedirect.com/topics/computer-science/recommendation-engine
- https://www.technologyreview.com/2014/08/25/171547/the-ultimate-challenge-for-recommendation-engines/
- https://www.technologyreview.com/2015/03/23/168831/everything-is-a-recommendation/
## Assignment
[Build a new recommender](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "799ed651e2af0a7cad17c6268db11578",
"translation_date": "2025-09-06T10:56:56+00:00",
"source_file": "4-Classification/4-Applied/assignment.md",
"language_code": "en"
}
-->
# Build a recommender
## Instructions
Based on the exercises in this lesson, you now understand how to create a JavaScript-based web application using Onnx Runtime and a converted Onnx model. Try building a new recommender using data from these lessons or other sources (make sure to give proper credit). For example, you could design a pet recommender based on different personality traits or a music genre recommender tailored to someone's mood. Let your creativity shine!
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | --------------------------------------------------------------------- | ------------------------------------- | --------------------------------- |
| | A web app and notebook are provided, both well-documented and functional | One of the two is missing or has issues | Both are either missing or have issues |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,41 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "74e809ffd1e613a1058bbc3e9600859e",
"translation_date": "2025-09-06T10:55:48+00:00",
"source_file": "4-Classification/README.md",
"language_code": "en"
}
-->
# Getting started with classification
## Regional topic: Delicious Asian and Indian Cuisines 🍜
In Asia and India, food traditions are incredibly diverse and absolutely delicious! Let's explore data about regional cuisines to better understand their ingredients.
![Thai food seller](../../../4-Classification/images/thai-food.jpg)
> Photo by <a href="https://unsplash.com/@changlisheng?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Lisheng Chang</a> on <a href="https://unsplash.com/s/photos/asian-food?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## What you will learn
In this section, you will build on your previous study of Regression and explore other classifiers that can help you gain deeper insights into the data.
> There are helpful low-code tools available to assist you in working with classification models. Check out [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-classification-model-azure-machine-learning-designer/?WT.mc_id=academic-77952-leestott)
## Lessons
1. [Introduction to classification](1-Introduction/README.md)
2. [More classifiers](2-Classifiers-1/README.md)
3. [Yet other classifiers](3-Classifiers-2/README.md)
4. [Applied ML: build a web app](4-Applied/README.md)
## Credits
"Getting started with classification" was created with ♥️ by [Cassie Breviu](https://www.twitter.com/cassiebreviu) and [Jen Looper](https://www.twitter.com/jenlooper)
The dataset on delicious cuisines was sourced from [Kaggle](https://www.kaggle.com/hoandan/asian-and-indian-cuisines).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,347 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "730225ea274c9174fe688b21d421539d",
"translation_date": "2025-09-06T10:50:03+00:00",
"source_file": "5-Clustering/1-Visualize/README.md",
"language_code": "en"
}
-->
# Introduction to clustering
Clustering is a type of [Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning) that assumes a dataset is unlabelled or that its inputs are not paired with predefined outputs. It uses various algorithms to analyze unlabeled data and group it based on patterns identified within the data.
[![No One Like You by PSquare](https://img.youtube.com/vi/ty2advRiWJM/0.jpg)](https://youtu.be/ty2advRiWJM "No One Like You by PSquare")
> 🎥 Click the image above for a video. While studying machine learning with clustering, enjoy some Nigerian Dance Hall tracks—this is a highly rated song from 2014 by PSquare.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
### Introduction
[Clustering](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_124) is incredibly useful for exploring data. Let's see if it can help uncover trends and patterns in how Nigerian audiences consume music.
✅ Take a moment to think about the applications of clustering. In everyday life, clustering happens when you sort a pile of laundry into family members' clothes 🧦👕👖🩲. In data science, clustering is used to analyze user preferences or identify characteristics in any unlabeled dataset. Clustering, in essence, helps bring order to chaos—like organizing a sock drawer.
[![Introduction to ML](https://img.youtube.com/vi/esmzYhuFnds/0.jpg)](https://youtu.be/esmzYhuFnds "Introduction to Clustering")
> 🎥 Click the image above for a video: MIT's John Guttag introduces clustering.
In a professional context, clustering can be used for tasks like market segmentation—for example, identifying which age groups purchase specific items. It can also be used for anomaly detection, such as identifying fraud in a dataset of credit card transactions. Another application might be detecting tumors in medical scans.
✅ Take a moment to think about how you might have encountered clustering in real-world scenarios, such as in banking, e-commerce, or business.
> 🎓 Interestingly, cluster analysis originated in the fields of Anthropology and Psychology in the 1930s. Can you imagine how it might have been applied back then?
Alternatively, clustering can be used to group search results—for example, by shopping links, images, or reviews. It's particularly useful for large datasets that need to be reduced for more detailed analysis, making it a valuable tool for understanding data before building other models.
✅ Once your data is organized into clusters, you can assign it a cluster ID. This technique is useful for preserving a dataset's privacy, as you can refer to a data point by its cluster ID rather than by more identifiable information. Can you think of other reasons why you might use a cluster ID instead of specific elements of the cluster for identification?
Deepen your understanding of clustering techniques in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-cluster-models?WT.mc_id=academic-77952-leestott).
## Getting started with clustering
[Scikit-learn offers a wide range](https://scikit-learn.org/stable/modules/clustering.html) of methods for clustering. The method you choose will depend on your specific use case. According to the documentation, each method has its own advantages. Here's a simplified table of the methods supported by Scikit-learn and their ideal use cases:
| Method name | Use case |
| :--------------------------- | :--------------------------------------------------------------------- |
| K-Means | General purpose, inductive |
| Affinity propagation | Many, uneven clusters, inductive |
| Mean-shift | Many, uneven clusters, inductive |
| Spectral clustering | Few, even clusters, transductive |
| Ward hierarchical clustering | Many, constrained clusters, transductive |
| Agglomerative clustering | Many, constrained, non-Euclidean distances, transductive |
| DBSCAN | Non-flat geometry, uneven clusters, transductive |
| OPTICS | Non-flat geometry, uneven clusters with variable density, transductive |
| Gaussian mixtures | Flat geometry, inductive |
| BIRCH | Large dataset with outliers, inductive |
> 🎓 How we create clusters depends heavily on how we group data points together. Let's break down some key terms:
>
> 🎓 ['Transductive' vs. 'inductive'](https://wikipedia.org/wiki/Transduction_(machine_learning))
>
> Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules, which are then applied to test cases.
>
> Example: Imagine you have a dataset that's only partially labeled. Some items are 'records,' some are 'CDs,' and others are blank. Your task is to label the blanks. Using an inductive approach, you'd train a model to identify 'records' and 'CDs' and apply those labels to the unlabeled data. This approach might struggle to classify items that are actually 'cassettes.' A transductive approach, however, groups similar items together and applies labels to the groups. In this case, clusters might represent 'round musical items' and 'square musical items.'
>
> 🎓 ['Non-flat' vs. 'flat' geometry](https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering)
>
> Derived from mathematical terminology, non-flat vs. flat geometry refers to how distances between points are measured—either 'flat' ([Euclidean](https://wikipedia.org/wiki/Euclidean_geometry)) or 'non-flat' (non-Euclidean).
>
>'Flat' refers to Euclidean geometry (often taught as 'plane' geometry), while 'non-flat' refers to non-Euclidean geometry. In machine learning, these methods are used to measure distances between points in clusters. [Euclidean distances](https://wikipedia.org/wiki/Euclidean_distance) are measured as the length of a straight line between two points. [Non-Euclidean distances](https://wikipedia.org/wiki/Non-Euclidean_geometry) are measured along a curve. If your data doesn't exist on a plane when visualized, you may need a specialized algorithm to handle it.
>
![Flat vs Nonflat Geometry Infographic](../../../../5-Clustering/1-Visualize/images/flat-nonflat.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
>
> 🎓 ['Distances'](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)
>
> Clusters are defined by their distance matrix, which measures the distances between points. Euclidean clusters are defined by the average of the point values and have a 'centroid' or center point. Distances are measured relative to this centroid. Non-Euclidean distances use 'clustroids,' the point closest to other points, which can be defined in various ways.
>
> 🎓 ['Constrained'](https://wikipedia.org/wiki/Constrained_clustering)
>
> [Constrained Clustering](https://web.cs.ucdavis.edu/~davidson/Publications/ICDMTutorial.pdf) introduces 'semi-supervised' learning into this unsupervised method. Relationships between points are flagged as 'cannot link' or 'must-link,' imposing rules on the dataset.
>
> Example: If an algorithm is applied to unlabelled or semi-labelled data, the resulting clusters may be of poor quality. For instance, clusters might group 'round musical items,' 'square musical items,' 'triangular items,' and 'cookies.' Adding constraints like "the item must be made of plastic" or "the item must produce music" can help the algorithm make better choices.
>
> 🎓 'Density'
>
> Data that is 'noisy' is considered 'dense.' The distances between points in its clusters may vary, requiring the use of appropriate clustering methods. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) compares K-Means clustering and HDBSCAN algorithms for analyzing noisy datasets with uneven cluster density.
## Clustering algorithms
There are over 100 clustering algorithms, and their application depends on the nature of the data. Let's explore some of the major ones:
- **Hierarchical clustering**. Objects are grouped based on their proximity to nearby objects rather than distant ones. Clusters are formed based on the distances between their members. Scikit-learn's agglomerative clustering is hierarchical.
![Hierarchical clustering Infographic](../../../../5-Clustering/1-Visualize/images/hierarchical.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
- **Centroid clustering**. This popular algorithm requires selecting 'k,' the number of clusters to form. The algorithm then determines the center point of each cluster and groups data around it. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a well-known example. The center is determined by the nearest mean, hence the name. The squared distance from each point to its cluster center is minimized. (A short sketch of this method appears after this list.)
![Centroid clustering Infographic](../../../../5-Clustering/1-Visualize/images/centroid.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
- **Distribution-based clustering**. Based on statistical modeling, this method assigns data points to clusters based on the probability of their belonging. Gaussian mixture methods fall under this category.
- **Density-based clustering**. Data points are grouped based on their density or proximity to one another. Points far from the group are considered outliers or noise. DBSCAN, Mean-shift, and OPTICS are examples of this type.
- **Grid-based clustering**. For multi-dimensional datasets, a grid is created, and data is divided among the grid's cells, forming clusters.
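To make centroid clustering a little more concrete, here is a minimal, self-contained sketch using scikit-learn's `KMeans` on toy numeric data; the next lesson applies this properly to the song dataset:
```python
from sklearn.cluster import KMeans
import numpy as np

# toy 2D points standing in for real features
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two centroids
```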
## Exercise - cluster your data
Clustering is greatly enhanced by effective visualization, so let's start by visualizing our music data. This exercise will help us determine the most suitable clustering method for this dataset.
1. Open the [_notebook.ipynb_](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/notebook.ipynb) file in this folder.
1. Import the `Seaborn` package for better data visualization.
```python
!pip install seaborn
```
1. Append the song data from [_nigerian-songs.csv_](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/data/nigerian-songs.csv). Load a dataframe with song data. Prepare to explore this data by importing the libraries and displaying the data:
```python
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("../data/nigerian-songs.csv")
df.head()
```
Check the first few rows of data:
| | name | album | artist | artist_top_genre | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
| --- | ------------------------ | ---------------------------- | ------------------- | ---------------- | ------------ | ------ | ---------- | ------------ | ------------ | ------ | ---------------- | -------- | -------- | ----------- | ------- | -------------- |
| 0 | Sparky | Mandy & The Jungle | Cruel Santino | alternative r&b | 2019 | 144000 | 48 | 0.666 | 0.851 | 0.42 | 0.534 | 0.11 | -6.699 | 0.0829 | 133.015 | 5 |
| 1 | shuga rush | EVERYTHING YOU HEARD IS TRUE | Odunsi (The Engine) | afropop | 2020 | 89488 | 30 | 0.71 | 0.0822 | 0.683 | 0.000169 | 0.101 | -5.64 | 0.36 | 129.993 | 3 |
| 2 | LITT! | LITT! | AYLØ | indie r&b | 2018 | 207758 | 40 | 0.836 | 0.272 | 0.564 | 0.000537 | 0.11 | -7.127 | 0.0424 | 130.005 | 4 |
| 3 | Confident / Feeling Cool | Enjoy Your Life | Lady Donli | nigerian pop | 2019 | 175135 | 14 | 0.894 | 0.798 | 0.611 | 0.000187 | 0.0964 | -4.961 | 0.113 | 111.087 | 4 |
| 4 | wanted you | rare. | Odunsi (The Engine) | afropop | 2018 | 152049 | 25 | 0.702 | 0.116 | 0.833 | 0.91 | 0.348 | -6.044 | 0.0447 | 105.115 | 4 |
1. Get some information about the dataframe by calling `info()`:
```python
df.info()
```
The output looks like this:
```output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 530 non-null object
1 album 530 non-null object
2 artist 530 non-null object
3 artist_top_genre 530 non-null object
4 release_date 530 non-null int64
5 length 530 non-null int64
6 popularity 530 non-null int64
7 danceability 530 non-null float64
8 acousticness 530 non-null float64
9 energy 530 non-null float64
10 instrumentalness 530 non-null float64
11 liveness 530 non-null float64
12 loudness 530 non-null float64
13 speechiness 530 non-null float64
14 tempo 530 non-null float64
15 time_signature 530 non-null int64
dtypes: float64(8), int64(4), object(4)
memory usage: 66.4+ KB
```
1. Double-check for null values by calling `isnull()` and verifying the sum is 0:
```python
df.isnull().sum()
```
Everything looks good:
```output
name 0
album 0
artist 0
artist_top_genre 0
release_date 0
length 0
popularity 0
danceability 0
acousticness 0
energy 0
instrumentalness 0
liveness 0
loudness 0
speechiness 0
tempo 0
time_signature 0
dtype: int64
```
1. Describe the data:
```python
df.describe()
```
| | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
| ----- | ------------ | ----------- | ---------- | ------------ | ------------ | -------- | ---------------- | -------- | --------- | ----------- | ---------- | -------------- |
| count | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 |
| mean | 2015.390566 | 222298.1698 | 17.507547 | 0.741619 | 0.265412 | 0.760623 | 0.016305 | 0.147308 | -4.953011 | 0.130748 | 116.487864 | 3.986792 |
| std | 3.131688 | 39696.82226 | 18.992212 | 0.117522 | 0.208342 | 0.148533 | 0.090321 | 0.123588 | 2.464186 | 0.092939 | 23.518601 | 0.333701 |
| min | 1998 | 89488 | 0 | 0.255 | 0.000665 | 0.111 | 0 | 0.0283 | -19.362 | 0.0278 | 61.695 | 3 |
| 25% | 2014 | 199305 | 0 | 0.681 | 0.089525 | 0.669 | 0 | 0.07565 | -6.29875 | 0.0591 | 102.96125 | 4 |
| 50% | 2016 | 218509 | 13 | 0.761 | 0.2205 | 0.7845 | 0.000004 | 0.1035 | -4.5585 | 0.09795 | 112.7145 | 4 |
| 75% | 2017 | 242098.5 | 31 | 0.8295 | 0.403 | 0.87575 | 0.000234 | 0.164 | -3.331 | 0.177 | 125.03925 | 4 |
| max | 2020 | 511738 | 73 | 0.966 | 0.954 | 0.995 | 0.91 | 0.811 | 0.582 | 0.514 | 206.007 | 5 |
> 🤔 If clustering is an unsupervised method that doesn't require labeled data, why are we showing this data with labels? During the data exploration phase, labels are helpful, but they aren't necessary for clustering algorithms to work. You could remove the column headers and refer to the data by column number instead.
Take a look at the general values in the data. Note that popularity can be '0', which indicates songs with no ranking. We'll remove those shortly.
1. Use a barplot to identify the most popular genres:
```python
import seaborn as sns
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
![most popular](../../../../5-Clustering/1-Visualize/images/popular.png)
✅ If you'd like to see more top values, change the top `[:5]` to a larger value, or remove it to see everything.
When the top genre is listed as 'Missing', it means Spotify didn't classify it. Let's filter it out.
1. Remove missing data by filtering it out:
```python
df = df[df['artist_top_genre'] != 'Missing']
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
Now check the genres again:
![most popular](../../../../5-Clustering/1-Visualize/images/all-genres.png)
1. The top three genres dominate this dataset. Let's focus on `afro dancehall`, `afropop`, and `nigerian pop`. Additionally, filter the dataset to remove entries with a popularity value of 0 (indicating they weren't classified with popularity and can be considered noise for our purposes):
```python
df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
df = df[(df['popularity'] > 0)]
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
1. Perform a quick test to see if the data has any strong correlations:
```python
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
```
![correlations](../../../../5-Clustering/1-Visualize/images/correlation.png)
The only strong correlation is between `energy` and `loudness`, which isn't surprising since loud music is often energetic. Otherwise, the correlations are relatively weak. It'll be interesting to see what a clustering algorithm can uncover in this data.
> 🎓 Remember, correlation does not imply causation! We have evidence of correlation but no proof of causation. An [amusing website](https://tylervigen.com/spurious-correlations) provides visuals that emphasize this point.
Is there any convergence in this dataset around a song's perceived popularity and danceability? A FacetGrid shows concentric circles aligning, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre?
✅ Try different data points (energy, loudness, speechiness) and explore more or different musical genres. What can you discover? Refer to the `df.describe()` table to understand the general spread of the data points.
### Exercise - Data Distribution
Are these three genres significantly different in their perception of danceability based on popularity?
1. Examine the data distribution for popularity and danceability in our top three genres along a given x and y axis:
```python
sns.set_theme(style="ticks")
g = sns.jointplot(
data=df,
x="popularity", y="danceability", hue="artist_top_genre",
kind="kde",
)
```
You can observe concentric circles around a general point of convergence, showing the distribution of points.
> 🎓 This example uses a KDE (Kernel Density Estimate) graph, which represents the data using a continuous probability density curve. This helps interpret data when working with multiple distributions.
In general, the three genres align loosely in terms of popularity and danceability. Identifying clusters in this loosely-aligned data will be challenging:
![distribution](../../../../5-Clustering/1-Visualize/images/distribution.png)
1. Create a scatter plot:
```python
sns.FacetGrid(df, hue="artist_top_genre", height=5) \
.map(plt.scatter, "popularity", "danceability") \
.add_legend()
```
A scatterplot of the same axes shows a similar pattern of convergence:
![Facetgrid](../../../../5-Clustering/1-Visualize/images/facetgrid.png)
Scatterplots are useful for visualizing clusters of data, making them essential for clustering tasks. In the next lesson, we'll use k-means clustering to identify groups in this data that overlap in interesting ways.
---
## 🚀Challenge
To prepare for the next lesson, create a chart about the various clustering algorithms you might encounter and use in a production environment. What types of problems is clustering designed to solve?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Before applying clustering algorithms, it's important to understand the nature of your dataset. Learn more about this topic [here](https://www.kdnuggets.com/2019/10/right-clustering-algorithm.html).
[This helpful article](https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/) explains how different clustering algorithms behave with various data shapes.
## Assignment
[Research other visualizations for clustering](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "589fa015a5e7d9e67bd629f7d47b53de",
"translation_date": "2025-09-06T10:50:46+00:00",
"source_file": "5-Clustering/1-Visualize/assignment.md",
"language_code": "en"
}
-->
# Research other visualizations for clustering
## Instructions
In this lesson, you've explored some visualization techniques to help you prepare your data for clustering. Scatterplots, in particular, are effective for identifying groups of objects. Investigate alternative methods and libraries for creating scatterplots, and document your findings in a notebook. You can use the data provided in this lesson, data from other lessons, or data you source independently (make sure to credit the source in your notebook). Create scatterplots with the data and explain your observations.
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------- | ----------------------------------- |
| | A notebook is provided with five well-documented scatterplots | A notebook is provided with fewer than five scatterplots and is less thoroughly documented | An incomplete notebook is provided |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:50:50+00:00",
"source_file": "5-Clustering/1-Visualize/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,261 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "7cdd17338d9bbd7e2171c2cd462eb081",
"translation_date": "2025-09-06T10:50:53+00:00",
"source_file": "5-Clustering/2-K-Means/README.md",
"language_code": "en"
}
-->
# K-Means Clustering
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
In this lesson, you'll learn how to create clusters using Scikit-learn and the Nigerian music dataset you imported earlier. We'll cover the basics of K-Means for clustering. Remember, as you learned in the previous lesson, there are many ways to work with clusters, and the method you choose depends on your data. We'll try K-Means since it's the most common clustering technique. Let's dive in!
Key terms you'll learn:
- Silhouette scoring
- Elbow method
- Inertia
- Variance
## Introduction
[K-Means Clustering](https://wikipedia.org/wiki/K-means_clustering) is a technique from the field of signal processing. It is used to partition data into 'k' clusters based on a series of observations. Each observation is assigned to the cluster whose 'mean', or center point, is nearest to it.
The clusters can be visualized as [Voronoi diagrams](https://wikipedia.org/wiki/Voronoi_diagram), which consist of a point (or 'seed') and its corresponding region.
![voronoi diagram](../../../../5-Clustering/2-K-Means/images/voronoi.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
The K-Means clustering process [follows a three-step procedure](https://scikit-learn.org/stable/modules/clustering.html#k-means):
1. The algorithm selects k-number of center points by sampling from the dataset. Then it loops:
1. Assigns each sample to the nearest centroid.
2. Creates new centroids by calculating the mean value of all samples assigned to the previous centroids.
3. Calculates the difference between the new and old centroids and repeats until the centroids stabilize.
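To make those three steps concrete, here's a minimal NumPy sketch of the loop (purely illustrative; `X_toy` is a made-up 2D array, and in this lesson you'll use Scikit-learn's `KMeans` rather than writing your own):

```python
import numpy as np

def simple_kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids by sampling from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned samples
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids have stabilized
        centroids = new_centroids
    return labels, centroids

# Toy example with two obvious groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = simple_kmeans(X_toy, k=2)
```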
One limitation of K-Means is that you need to define 'k,' the number of centroids. Luckily, the 'elbow method' can help estimate a good starting value for 'k.' You'll try it shortly.
## Prerequisite
You'll work in this lesson's [_notebook.ipynb_](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/2-K-Means/notebook.ipynb) file, which includes the data import and preliminary cleaning you completed in the previous lesson.
## Exercise - Preparation
Start by revisiting the songs dataset.
1. Create a boxplot by calling `boxplot()` for each column:
```python
plt.figure(figsize=(20,20), dpi=200)
plt.subplot(4,3,1)
sns.boxplot(x = 'popularity', data = df)
plt.subplot(4,3,2)
sns.boxplot(x = 'acousticness', data = df)
plt.subplot(4,3,3)
sns.boxplot(x = 'energy', data = df)
plt.subplot(4,3,4)
sns.boxplot(x = 'instrumentalness', data = df)
plt.subplot(4,3,5)
sns.boxplot(x = 'liveness', data = df)
plt.subplot(4,3,6)
sns.boxplot(x = 'loudness', data = df)
plt.subplot(4,3,7)
sns.boxplot(x = 'speechiness', data = df)
plt.subplot(4,3,8)
sns.boxplot(x = 'tempo', data = df)
plt.subplot(4,3,9)
sns.boxplot(x = 'time_signature', data = df)
plt.subplot(4,3,10)
sns.boxplot(x = 'danceability', data = df)
plt.subplot(4,3,11)
sns.boxplot(x = 'length', data = df)
plt.subplot(4,3,12)
sns.boxplot(x = 'release_date', data = df)
```
This data is a bit noisy: by observing each column as a boxplot, you can identify outliers.
![outliers](../../../../5-Clustering/2-K-Means/images/boxplots.png)
You could go through the dataset and remove these outliers, but that would leave you with very minimal data.
1. For now, decide which columns to use for your clustering exercise. Choose ones with similar ranges and encode the `artist_top_genre` column as numeric data:
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X = df.loc[:, ('artist_top_genre','popularity','danceability','acousticness','loudness','energy')]
y = df['artist_top_genre']
X['artist_top_genre'] = le.fit_transform(X['artist_top_genre'])
y = le.transform(y)
```
1. Next, determine how many clusters to target. You know there are 3 song genres in the dataset, so let's try 3:
```python
from sklearn.cluster import KMeans
nclusters = 3
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X)
# Predict the cluster for each data point
y_cluster_kmeans = km.predict(X)
y_cluster_kmeans
```
You'll see an array printed out with predicted clusters (0, 1, or 2) for each row in the dataframe.
1. Use this array to calculate a 'silhouette score':
```python
from sklearn import metrics
score = metrics.silhouette_score(X, y_cluster_kmeans)
score
```
## Silhouette Score
Aim for a silhouette score closer to 1. This score ranges from -1 to 1. A score of 1 indicates that the cluster is dense and well-separated from other clusters. A value near 0 suggests overlapping clusters with samples close to the decision boundary of neighboring clusters. [(Source)](https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam)
Our score is **0.53**, which is moderate. This suggests that our data isn't particularly well-suited for this type of clustering, but let's proceed.
### Exercise - Build a Model
1. Import `KMeans` and begin the clustering process.
```python
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
```
Here's an explanation of some key parts:
> 🎓 range: These are the iterations of the clustering process.
> 🎓 random_state: "Determines random number generation for centroid initialization." [Source](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)
> 🎓 WCSS: "Within-cluster sums of squares" measures the squared average distance of all points within a cluster to the cluster centroid. [Source](https://medium.com/@ODSC/unsupervised-learning-evaluating-clusters-bd47eed175ce)
> 🎓 Inertia: K-Means algorithms aim to choose centroids that minimize 'inertia,' "a measure of how internally coherent clusters are." [Source](https://scikit-learn.org/stable/modules/clustering.html). The value is appended to the wcss variable during each iteration.
> 🎓 k-means++: In [Scikit-learn](https://scikit-learn.org/stable/modules/clustering.html#k-means), you can use the 'k-means++' optimization, which "initializes the centroids to be (generally) distant from each other, leading to likely better results than random initialization."
### Elbow Method
Earlier, you assumed that 3 clusters would be appropriate because of the 3 song genres. But is that correct?
1. Use the 'elbow method' to confirm.
```python
plt.figure(figsize=(10,5))
sns.lineplot(x=range(1, 11), y=wcss, marker='o', color='red')
plt.title('Elbow')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```
Use the `wcss` variable you built earlier to create a chart showing the 'bend' in the elbow, which indicates the optimal number of clusters. Perhaps it **is** 3!
![elbow method](../../../../5-Clustering/2-K-Means/images/elbow.png)
## Exercise - Display the Clusters
1. Repeat the process, this time setting three clusters, and display the clusters as a scatterplot:
```python
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3)
kmeans.fit(X)
labels = kmeans.predict(X)
plt.scatter(df['popularity'],df['danceability'],c = labels)
plt.xlabel('popularity')
plt.ylabel('danceability')
plt.show()
```
1. Check the model's accuracy:
```python
labels = kmeans.labels_
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
print('Accuracy score: {0:0.2f}'.format(correct_labels/float(y.size)))
```
The model's accuracy isn't great, and the shape of the clusters gives you a clue as to why.
![clusters](../../../../5-Clustering/2-K-Means/images/clusters.png)
The data is too imbalanced, poorly correlated, and has too much variance between column values to cluster effectively. In fact, the clusters that form are likely heavily influenced or skewed by the three genre categories we defined earlier. This was a learning experience!
According to Scikit-learn's documentation, a model like this one, with poorly defined clusters, has a 'variance' problem:
![problem models](../../../../5-Clustering/2-K-Means/images/problems.png)
> Infographic from Scikit-learn
## Variance
Variance is defined as "the average of the squared differences from the Mean" [(Source)](https://www.mathsisfun.com/data/standard-deviation.html). In the context of this clustering problem, it means that the numbers in our dataset diverge too much from the mean.
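You can check this directly in pandas: computing the per-column variance of the feature dataframe (`X` from earlier) shows how differently scaled the columns are, which is exactly what throws off the distance calculations in K-Means.

```python
# Per-column variance of the clustering features; columns on very different
# scales (e.g. loudness vs. danceability) will dominate K-Means distances
print(X.var(numeric_only=True))
```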
✅ This is a good time to think about ways to address this issue. Should you tweak the data further? Use different columns? Try a different algorithm? Hint: Consider [scaling your data](https://www.mygreatlearning.com/blog/learning-data-science-with-k-means-clustering/) to normalize it and test other columns.
> Try this '[variance calculator](https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php)' to better understand the concept.
---
## 🚀Challenge
Spend some time with this notebook, tweaking parameters. Can you improve the model's accuracy by cleaning the data further (e.g., removing outliers)? You can use weights to give more importance to certain data samples. What else can you do to create better clusters?
Hint: Try scaling your data. There's commented code in the notebook that adds standard scaling to make the data columns more similar in range. You'll find that while the silhouette score decreases, the 'kink' in the elbow graph becomes smoother. This is because leaving the data unscaled allows data with less variance to have more influence. Read more about this issue [here](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226).
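As a starting point, a minimal sketch of that scaling step might look like this (one possible approach, assuming `X` is the feature dataframe built earlier):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize each column to mean 0 and unit variance so that
# wide-ranged columns (like loudness) don't dominate the distances
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=42)
labels_scaled = kmeans.fit_predict(X_scaled)
```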
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Check out a K-Means Simulator [like this one](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/). This tool lets you visualize sample data points and determine their centroids. You can adjust the data's randomness, number of clusters, and number of centroids. Does this help you better understand how data can be grouped?
Also, review [this handout on K-Means](https://stanford.edu/~cpiech/cs221/handouts/kmeans.html) from Stanford.
## Assignment
[Experiment with different clustering methods](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b8e17eff34ad1680eba2a5d3cf9ffc41",
"translation_date": "2025-09-06T10:51:17+00:00",
"source_file": "5-Clustering/2-K-Means/assignment.md",
"language_code": "en"
}
-->
# Try different clustering methods
## Instructions
In this lesson, you learned about K-Means clustering. However, K-Means might not always be the best fit for your data. Create a notebook using data either from these lessons or from another source (make sure to credit your source) and demonstrate a different clustering method that does NOT use K-Means. What insights did you gain?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------------------------------------------------------------- | -------------------------------------------------------------------- | ---------------------------- |
| | A notebook is presented with a well-documented clustering model | A notebook is presented without good documentation and/or incomplete | Incomplete work is submitted |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:51:22+00:00",
"source_file": "5-Clustering/2-K-Means/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,42 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b28a3a4911584062772c537b653ebbc7",
"translation_date": "2025-09-06T10:49:54+00:00",
"source_file": "5-Clustering/README.md",
"language_code": "en"
}
-->
# Clustering models for machine learning
Clustering is a machine learning task that aims to identify objects that are similar to each other and group them into clusters. What sets clustering apart from other machine learning approaches is that it happens automatically. In fact, it's fair to say it's the opposite of supervised learning.
## Regional topic: clustering models for a Nigerian audience's musical taste 🎧
Nigeria's diverse population has equally diverse musical preferences. Using data scraped from Spotify (inspired by [this article](https://towardsdatascience.com/country-wise-visual-analysis-of-music-taste-using-spotify-api-seaborn-in-python-77f5b749b421)), let's explore some music that's popular in Nigeria. This dataset includes information about various songs, such as their 'danceability' score, 'acousticness', loudness, 'speechiness', popularity, and energy. It will be fascinating to uncover patterns in this data!
![A turntable](../../../5-Clustering/images/turntable.jpg)
> Photo by <a href="https://unsplash.com/@marcelalaskoski?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marcela Laskoski</a> on <a href="https://unsplash.com/s/photos/nigerian-music?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In this series of lessons, you'll learn new ways to analyze data using clustering techniques. Clustering is especially useful when your dataset doesn't have labels. If your dataset does have labels, classification techniques like the ones you learned in earlier lessons might be more appropriate. However, when you want to group unlabeled data, clustering is an excellent way to uncover patterns.
> There are helpful low-code tools available to assist you in working with clustering models. Consider using [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-clustering-model-azure-machine-learning-designer/?WT.mc_id=academic-77952-leestott).
## Lessons
1. [Introduction to clustering](1-Visualize/README.md)
2. [K-Means clustering](2-K-Means/README.md)
## Credits
These lessons were created with 🎶 by [Jen Looper](https://www.twitter.com/jenlooper), with valuable reviews by [Rishit Dagli](https://rishit_dagli) and [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan).
The [Nigerian Songs](https://www.kaggle.com/sootersaalu/nigerian-songs-spotify) dataset was sourced from Kaggle, based on data scraped from Spotify.
Useful K-Means examples that contributed to the development of this lesson include this [iris exploration](https://www.kaggle.com/bburns/iris-exploration-pca-k-means-and-gmm-clustering), this [introductory notebook](https://www.kaggle.com/prashant111/k-means-clustering-with-python), and this [hypothetical NGO example](https://www.kaggle.com/ankandash/pca-k-means-clustering-hierarchical-clustering).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,179 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1c2ec40cf55c98a028a359c27ef7e45a",
"translation_date": "2025-09-06T11:02:08+00:00",
"source_file": "6-NLP/1-Introduction-to-NLP/README.md",
"language_code": "en"
}
-->
# Introduction to natural language processing
This lesson provides a brief history and key concepts of *natural language processing*, a subfield of *computational linguistics*.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Introduction
NLP, as it is commonly called, is one of the most prominent areas where machine learning has been applied and integrated into production software.
✅ Can you think of software you use daily that likely incorporates NLP? Consider your word processing programs or mobile apps you use regularly.
You will learn about:
- **The concept of languages**. How languages evolved and the major areas of study.
- **Definitions and concepts**. You will also explore definitions and concepts related to how computers process text, including parsing, grammar, and identifying nouns and verbs. This lesson includes coding tasks and introduces several important concepts that you will learn to code in subsequent lessons.
## Computational linguistics
Computational linguistics is a field of research and development spanning decades, focused on how computers can work with, understand, translate, and communicate using languages. Natural language processing (NLP) is a related field that specifically examines how computers can process 'natural', or human, languages.
### Example - phone dictation
If you've ever dictated to your phone instead of typing or asked a virtual assistant a question, your speech was converted into text and then processed or *parsed* from the language you spoke. The identified keywords were then processed into a format the phone or assistant could understand and act upon.
![comprehension](../../../../6-NLP/1-Introduction-to-NLP/images/comprehension.png)
> Real linguistic comprehension is hard! Image by [Jen Looper](https://twitter.com/jenlooper)
### How is this technology made possible?
This is possible because someone wrote a computer program to enable it. A few decades ago, science fiction writers predicted that people would primarily speak to their computers, and the computers would always understand exactly what they meant. Unfortunately, this turned out to be a much harder problem than many imagined. While the problem is better understood today, achieving 'perfect' natural language processing—especially when it comes to understanding the meaning of a sentence—remains a significant challenge. This is particularly difficult when trying to interpret humor or detect emotions like sarcasm in a sentence.
You might recall school lessons where teachers covered grammar components in a sentence. In some countries, grammar and linguistics are taught as a dedicated subject, while in others, these topics are integrated into language learning: either your first language in primary school (learning to read and write) or perhaps a second language in high school. Don't worry if you're not an expert at distinguishing nouns from verbs or adverbs from adjectives!
If you struggle with the difference between the *simple present* and *present progressive*, you're not alone. This is challenging for many people, even native speakers of a language. The good news is that computers excel at applying formal rules, and you'll learn to write code that can *parse* a sentence as well as a human. The greater challenge you'll explore later is understanding the *meaning* and *sentiment* of a sentence.
## Prerequisites
For this lesson, the main prerequisite is being able to read and understand the language of this lesson. There are no math problems or equations to solve. While the original author wrote this lesson in English, it has been translated into other languages, so you might be reading a translation. Examples include several different languages (to compare grammar rules across languages). These examples are *not* translated, but the explanatory text is, so the meaning should be clear.
For the coding tasks, you'll use Python, and the examples are based on Python 3.8.
In this section, you will need and use:
- **Python 3 comprehension**. Understanding the Python 3 programming language, including input, loops, file reading, and arrays.
- **Visual Studio Code + extension**. We'll use Visual Studio Code and its Python extension. You can also use a Python IDE of your choice.
- **TextBlob**. [TextBlob](https://github.com/sloria/TextBlob) is a simplified text processing library for Python. Follow the instructions on the TextBlob site to install it on your system (install the corpora as well, as shown below):
```bash
pip install -U textblob
python -m textblob.download_corpora
```
> 💡 Tip: You can run Python directly in VS Code environments. Check the [docs](https://code.visualstudio.com/docs/languages/python?WT.mc_id=academic-77952-leestott) for more information.
## Talking to machines
The history of trying to make computers understand human language spans decades, and one of the earliest scientists to explore natural language processing was *Alan Turing*.
### The 'Turing test'
When Turing researched *artificial intelligence* in the 1950s, he proposed a conversational test where a human and computer (via typed correspondence) would interact, and the human participant would try to determine whether they were conversing with another human or a computer.
If, after a certain length of conversation, the human could not distinguish whether the responses came from a computer or another human, could the computer be said to be *thinking*?
### The inspiration - 'the imitation game'
This idea was inspired by a party game called *The Imitation Game*, where an interrogator in one room tries to determine which of two people (in another room) is male and which is female. The interrogator sends notes and attempts to ask questions that reveal the gender of the mystery person. Meanwhile, the players in the other room try to mislead or confuse the interrogator while appearing to answer honestly.
### Developing Eliza
In the 1960s, an MIT scientist named *Joseph Weizenbaum* developed [*Eliza*](https://wikipedia.org/wiki/ELIZA), a computer 'therapist' that asked humans questions and gave the impression of understanding their answers. However, while Eliza could parse a sentence and identify certain grammatical constructs and keywords to provide reasonable responses, it could not truly *understand* the sentence. For example, if Eliza was presented with a sentence like "**I am** <u>sad</u>", it might rearrange and substitute words to form the response "How long have **you been** <u>sad</u>?"
This gave the impression that Eliza understood the statement and was asking a follow-up question, but in reality, it was simply changing the tense and adding some words. If Eliza couldn't identify a keyword it had a response for, it would provide a random response applicable to many different statements. Eliza could be easily tricked; for instance, if a user wrote "**You are** a <u>bicycle</u>", it might respond with "How long have **I been** a <u>bicycle</u>?" instead of a more logical reply.
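A toy sketch of that kind of pattern substitution (an illustration of the idea, not the actual ELIZA program) might look like this:

```python
import re

def eliza_like_response(statement):
    # Mirror back statements of the form "I am <something>"
    match = re.match(r"i am (.*)", statement.strip().rstrip("."), re.IGNORECASE)
    if match:
        return f"How long have you been {match.group(1)}?"
    # No keyword pattern matched: fall back to a generic reply
    return "Please tell me more."

print(eliza_like_response("I am sad"))  # How long have you been sad?
```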
[![Chatting with Eliza](https://img.youtube.com/vi/RMK9AphfLco/0.jpg)](https://youtu.be/RMK9AphfLco "Chatting with Eliza")
> 🎥 Click the image above for a video about the original ELIZA program
> Note: You can read the original description of [Eliza](https://cacm.acm.org/magazines/1966/1/13317-elizaa-computer-program-for-the-study-of-natural-language-communication-between-man-and-machine/abstract) published in 1966 if you have an ACM account. Alternatively, read about Eliza on [Wikipedia](https://wikipedia.org/wiki/ELIZA).
## Exercise - coding a basic conversational bot
A conversational bot, like Eliza, is a program that elicits user input and appears to understand and respond intelligently. Unlike Eliza, our bot will not have multiple rules to simulate intelligent conversation. Instead, it will have one simple ability: to keep the conversation going with random responses that could fit almost any trivial conversation.
### The plan
Steps to build a conversational bot:
1. Print instructions advising the user how to interact with the bot.
2. Start a loop:
1. Accept user input.
2. If the user asks to exit, then exit.
3. Process user input and determine a response (in this case, the response is a random choice from a list of generic responses).
4. Print the response.
3. Loop back to step 2.
### Building the bot
Let's create the bot. We'll start by defining some phrases.
1. Create this bot yourself in Python with the following random responses:
```python
random_responses = ["That is quite interesting, please tell me more.",
"I see. Do go on.",
"Why do you say that?",
"Funny weather we've been having, isn't it?",
"Let's change the subject.",
"Did you catch the game last night?"]
```
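If you want a scaffold for the loop itself, here's one minimal sketch (one possible shape, not the official solution) that uses the `random_responses` list above:

```python
import random

print("Hello, I am Marvin, the simple robot.")
print("You can end this conversation at any time by typing 'bye'")
print("After typing each answer, press 'enter'")
print("How are you today?")

while True:
    user_input = input("> ")
    if user_input.lower() == "bye":
        # The user asked to exit
        break
    # The bot doesn't understand the input; it just picks a canned reply
    print(random.choice(random_responses))

print("It was nice talking to you, goodbye!")
```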
Here is some sample output to guide you (user input is on the lines starting with `>`):
```output
Hello, I am Marvin, the simple robot.
You can end this conversation at any time by typing 'bye'
After typing each answer, press 'enter'
How are you today?
> I am good thanks
That is quite interesting, please tell me more.
> today I went for a walk
Did you catch the game last night?
> I did, but my team lost
Funny weather we've been having, isn't it?
> yes but I hope next week is better
Let's change the subject.
> ok, lets talk about music
Why do you say that?
> because I like music!
Why do you say that?
> bye
It was nice talking to you, goodbye!
```
One possible solution to the task is [here](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/1-Introduction-to-NLP/solution/bot.py).
✅ Stop and consider:
1. Do you think the random responses would 'trick' someone into thinking the bot actually understood them?
2. What features would the bot need to be more effective?
3. If a bot could truly 'understand' the meaning of a sentence, would it also need to 'remember' the meaning of previous sentences in a conversation?
---
## 🚀Challenge
Choose one of the "stop and consider" elements above and either try to implement it in code or write a solution on paper using pseudocode.
In the next lesson, you'll explore other approaches to parsing natural language and machine learning.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Explore the references below for further reading.
### References
1. Schubert, Lenhart, "Computational Linguistics", *The Stanford Encyclopedia of Philosophy* (Spring 2020 Edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/spr2020/entries/computational-linguistics/>.
2. Princeton University "About WordNet." [WordNet](https://wordnet.princeton.edu/). Princeton University. 2010.
## Assignment
[Search for a bot](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1d7583e8046dacbb0c056d5ba0a71b16",
"translation_date": "2025-09-06T11:02:29+00:00",
"source_file": "6-NLP/1-Introduction-to-NLP/assignment.md",
"language_code": "en"
}
-->
# Search for a bot
## Instructions
Bots are everywhere. Your task: find one and interact with it! You can encounter them on websites, in banking apps, or even over the phone—for instance, when you contact financial services for advice or account details. Study the bot and see if you can confuse it. If you manage to confuse the bot, why do you think that happened? Write a brief paper describing your experience.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------------------------------------------------------- | -------------------------------------------- | --------------------- |
| | A full page paper is written, explaining the presumed bot architecture and outlining your experience with it | A paper is incomplete or not well researched | No paper is submitted |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,228 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "5f3cb462e3122e1afe7ab0050ccf2bd3",
"translation_date": "2025-09-06T11:00:38+00:00",
"source_file": "6-NLP/2-Tasks/README.md",
"language_code": "en"
}
-->
# Common natural language processing tasks and techniques
For most *natural language processing* tasks, the text to be processed must be broken down, analyzed, and the results stored or cross-referenced with rules and datasets. These tasks allow the programmer to derive the _meaning_, _intent_, or simply the _frequency_ of terms and words in a text.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
Let's explore common techniques used in processing text. Combined with machine learning, these techniques help you analyze large amounts of text efficiently. Before applying ML to these tasks, however, let's understand the challenges faced by an NLP specialist.
## Tasks common to NLP
There are various ways to analyze a text you are working on. These tasks help you understand the text and draw conclusions. Typically, these tasks are performed in a sequence.
### Tokenization
The first step most NLP algorithms take is splitting the text into tokens, or words. While this sounds simple, accounting for punctuation and different languages' word and sentence delimiters can make it challenging. You may need to use different methods to determine boundaries.
![tokenization](../../../../6-NLP/2-Tasks/images/tokenization.png)
> Tokenizing a sentence from **Pride and Prejudice**. Infographic by [Jen Looper](https://twitter.com/jenlooper)
### Embeddings
[Word embeddings](https://wikipedia.org/wiki/Word_embedding) are a way to convert text data into numerical form. Embeddings are designed so that words with similar meanings or words often used together are grouped closely.
![word embeddings](../../../../6-NLP/2-Tasks/images/embedding.png)
> "I have the highest respect for your nerves, they are my old friends." - Word embeddings for a sentence in **Pride and Prejudice**. Infographic by [Jen Looper](https://twitter.com/jenlooper)
✅ Try [this interesting tool](https://projector.tensorflow.org/) to experiment with word embeddings. Clicking on one word shows clusters of similar words: 'toy' clusters with 'disney', 'lego', 'playstation', and 'console'.
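Behind the scenes, 'closeness' between embedded words is usually measured with cosine similarity. Here's a toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions learned from data):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made vectors purely for illustration
embeddings = {
    "toy":    np.array([0.9, 0.1, 0.2]),
    "lego":   np.array([0.8, 0.2, 0.1]),
    "nerves": np.array([0.1, 0.9, 0.7]),
}

print(cosine_similarity(embeddings["toy"], embeddings["lego"]))    # high: related words
print(cosine_similarity(embeddings["toy"], embeddings["nerves"]))  # lower: unrelated words
```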
### Parsing & Part-of-speech Tagging
Every tokenized word can be tagged as a part of speech, such as a noun, verb, or adjective. For example, the sentence `the quick red fox jumped over the lazy brown dog` might be POS tagged as fox = noun, jumped = verb.
![parsing](../../../../6-NLP/2-Tasks/images/parse.png)
> Parsing a sentence from **Pride and Prejudice**. Infographic by [Jen Looper](https://twitter.com/jenlooper)
Parsing involves identifying relationships between words in a sentence—for instance, `the quick red fox jumped` is an adjective-noun-verb sequence that is separate from the `lazy brown dog` sequence.
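With TextBlob (introduced below), getting part-of-speech tags is a single call; a quick sketch:

```python
from textblob import TextBlob

blob = TextBlob("the quick red fox jumped over the lazy brown dog")
# .tags returns (word, tag) pairs using Penn Treebank tags,
# e.g. ('fox', 'NN') for a noun and ('jumped', 'VBD') for a past-tense verb
print(blob.tags)
```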
### Word and Phrase Frequencies
A useful technique for analyzing large texts is building a dictionary of every word or phrase of interest and tracking how often it appears. For example, in the phrase `the quick red fox jumped over the lazy brown dog`, the word "the" appears twice.
Consider an example text where we count word frequencies. Rudyard Kipling's poem *The Winners* contains the following verse:
```output
What the moral? Who rides may read.
When the night is thick and the tracks are blind
A friend at a pinch is a friend, indeed,
But a fool to wait for the laggard behind.
Down to Gehenna or up to the Throne,
He travels the fastest who travels alone.
```
Counting case-insensitively, the phrase `a friend` has a frequency of 2, `the` has a frequency of 6, and `travels` appears twice.
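Counting frequencies like this takes only Python's standard library; a small sketch over the verse above:

```python
from collections import Counter

verse = """What the moral? Who rides may read.
When the night is thick and the tracks are blind
A friend at a pinch is a friend, indeed,
But a fool to wait for the laggard behind.
Down to Gehenna or up to the Throne,
He travels the fastest who travels alone."""

# Lowercase and strip punctuation so the count is case-insensitive
words = [w.strip(".,?") for w in verse.lower().split()]
counts = Counter(words)
print(counts["the"], counts["travels"])  # 6 2
```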
### N-grams
Text can be divided into sequences of words of a fixed length: single words (unigrams), pairs of words (bigrams), triplets (trigrams), or any number of words (n-grams).
For example, the phrase `the quick red fox jumped over the lazy brown dog` with an n-gram score of 2 produces the following n-grams:
1. the quick
2. quick red
3. red fox
4. fox jumped
5. jumped over
6. over the
7. the lazy
8. lazy brown
9. brown dog
It can be visualized as a sliding box over the sentence. For n-grams of 3 words, the n-gram is highlighted in bold in each sentence:
1. **the quick red** fox jumped over the lazy brown dog
2. the **quick red fox** jumped over the lazy brown dog
3. the quick **red fox jumped** over the lazy brown dog
4. the quick red **fox jumped over** the lazy brown dog
5. the quick red fox **jumped over the** lazy brown dog
6. the quick red fox jumped **over the lazy** brown dog
7. the quick red fox jumped over **the lazy brown** dog
8. the quick red fox jumped over the **lazy brown dog**
![n-grams sliding window](../../../../6-NLP/2-Tasks/images/n-grams.gif)
> N-gram value of 3: Infographic by [Jen Looper](https://twitter.com/jenlooper)
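TextBlob can produce these sequences for you; a short sketch that prints the bigrams listed above:

```python
from textblob import TextBlob

blob = TextBlob("the quick red fox jumped over the lazy brown dog")
# ngrams(n=2) returns one WordList per sliding window of two words
for gram in blob.ngrams(n=2):
    print(" ".join(gram))
```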
### Noun phrase Extraction
In most sentences, there is a noun that serves as the subject or object. In English, it is often preceded by 'a', 'an', or 'the'. Identifying the subject or object of a sentence by extracting the noun phrase is a common NLP task when trying to understand the meaning of a sentence.
✅ In the sentence "I cannot fix on the hour, or the spot, or the look or the words, which laid the foundation. It is too long ago. I was in the middle before I knew that I had begun.", can you identify the noun phrases?
In the sentence `the quick red fox jumped over the lazy brown dog`, there are 2 noun phrases: **quick red fox** and **lazy brown dog**.
### Sentiment analysis
A sentence or text can be analyzed for sentiment, or how *positive* or *negative* it is. Sentiment is measured in terms of *polarity* and *subjectivity*. Polarity ranges from -1.0 to 1.0 (most negative to most positive), while subjectivity ranges from 0.0 (most objective) to 1.0 (most subjective).
✅ Later you'll learn that there are different ways to determine sentiment using machine learning, but one approach is to use a list of words and phrases categorized as positive or negative by a human expert and apply that model to text to calculate a polarity score. Can you see how this would work in some cases but not in others?
### Inflection
Inflection allows you to take a word and determine its singular or plural form.
### Lemmatization
A *lemma* is the root or base form of a word. For example, *flew*, *flies*, and *flying* all have the lemma *fly*.
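Both of these are available through TextBlob's `Word` class; a quick sketch (exact outputs depend on the installed corpora):

```python
from textblob import Word

# Inflection: switch between singular and plural forms
print(Word("fox").pluralize())      # foxes
print(Word("dogs").singularize())   # dog

# Lemmatization: reduce a word to its base form ('v' hints that it's a verb)
print(Word("flew").lemmatize("v"))  # fly
print(Word("flies").lemmatize("v")) # fly
```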
There are also useful databases available for NLP researchers, such as:
### WordNet
[WordNet](https://wordnet.princeton.edu/) is a database of words, synonyms, antonyms, and other details for many words in various languages. It is incredibly useful for building translations, spell checkers, or any type of language tool.
## NLP Libraries
Fortunately, you don't have to build all these techniques from scratch. There are excellent Python libraries available that make NLP more accessible to developers who aren't specialized in natural language processing or machine learning. The next lessons will include more examples, but here are some useful examples to help you with the next task.
### Exercise - using `TextBlob` library
Let's use a library called TextBlob, which contains helpful APIs for tackling these types of tasks. TextBlob "stands on the giant shoulders of [NLTK](https://nltk.org) and [pattern](https://github.com/clips/pattern), and plays nicely with both." It incorporates a significant amount of ML into its API.
> Note: A useful [Quick Start](https://textblob.readthedocs.io/en/dev/quickstart.html#quickstart) guide is available for TextBlob and is recommended for experienced Python developers.
When attempting to identify *noun phrases*, TextBlob offers several extractors to find them.
1. Take a look at `ConllExtractor`.
```python
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor
# import and create a Conll extractor to use later
extractor = ConllExtractor()
# later when you need a noun phrase extractor:
user_input = input("> ")
user_input_blob = TextBlob(user_input, np_extractor=extractor) # note non-default extractor specified
np = user_input_blob.noun_phrases
```
> What's happening here? [ConllExtractor](https://textblob.readthedocs.io/en/dev/api_reference.html?highlight=Conll#textblob.en.np_extractors.ConllExtractor) is "A noun phrase extractor that uses chunk parsing trained with the ConLL-2000 training corpus." ConLL-2000 refers to the 2000 Conference on Computational Natural Language Learning. Each year, the conference hosted a workshop to tackle a challenging NLP problem, and in 2000, it focused on noun chunking. A model was trained on the Wall Street Journal, using "sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens)." You can review the procedures [here](https://www.clips.uantwerpen.be/conll2000/chunking/) and the [results](https://ifarm.nl/erikt/research/np-chunking.html).
### Challenge - improving your bot with NLP
In the previous lesson, you built a simple Q&A bot. Now, you'll make Marvin a bit more empathetic by analyzing your input for sentiment and printing a response to match the sentiment. You'll also need to identify a `noun_phrase` and ask about it.
Steps to build a better conversational bot:
1. Print instructions advising the user how to interact with the bot.
2. Start a loop:
1. Accept user input.
2. If the user asks to exit, then exit.
3. Process user input and determine an appropriate sentiment response.
4. If a noun phrase is detected in the sentiment, pluralize it and ask for more input on that topic.
5. Print a response.
3. Loop back to step 2.
Here is the code snippet to determine sentiment using TextBlob. Note that there are only four *gradients* of sentiment response (you can add more if you like):
```python
if user_input_blob.polarity <= -0.5:
response = "Oh dear, that sounds bad. "
elif user_input_blob.polarity <= 0:
response = "Hmm, that's not great. "
elif user_input_blob.polarity <= 0.5:
response = "Well, that sounds positive. "
elif user_input_blob.polarity <= 1:
response = "Wow, that sounds great. "
```
Here is some sample output to guide you (user input starts with >):
```output
Hello, I am Marvin, the friendly robot.
You can end this conversation at any time by typing 'bye'
After typing each answer, press 'enter'
How are you today?
> I am ok
Well, that sounds positive. Can you tell me more?
> I went for a walk and saw a lovely cat
Well, that sounds positive. Can you tell me more about lovely cats?
> cats are the best. But I also have a cool dog
Wow, that sounds great. Can you tell me more about cool dogs?
> I have an old hounddog but he is sick
Hmm, that's not great. Can you tell me more about old hounddogs?
> bye
It was nice talking to you, goodbye!
```
One possible solution to the task is [here](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/2-Tasks/solution/bot.py)
✅ Knowledge Check
1. Do you think the empathetic responses would 'trick' someone into thinking the bot actually understood them?
2. Does identifying the noun phrase make the bot more 'believable'?
3. Why would extracting a 'noun phrase' from a sentence be a useful thing to do?
---
Implement the bot in the prior knowledge check and test it on a friend. Can it trick them? Can you make your bot more 'believable'?
## 🚀Challenge
Take a task from the prior knowledge check and try to implement it. Test the bot on a friend. Can it trick them? Can you make your bot more 'believable'?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
In the next few lessons, you will learn more about sentiment analysis. Research this fascinating technique in articles such as those on [KDNuggets](https://www.kdnuggets.com/tag/nlp).
## Assignment
[Make a bot talk back](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "2efc4c2aba5ed06c780c05539c492ae3",
"translation_date": "2025-09-06T11:01:02+00:00",
"source_file": "6-NLP/2-Tasks/assignment.md",
"language_code": "en"
}
-->
# Make a Bot talk back
## Instructions
In the previous lessons, you programmed a basic bot to chat with. This bot gives random responses until you say 'bye'. Can you make the responses a bit less random and trigger specific answers when you say certain things, like 'why' or 'how'? Think about how machine learning could make this process less manual as you enhance your bot. You can use libraries like NLTK or TextBlob to simplify your tasks.
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | --------------------------------------------- | ------------------------------------------------ | ----------------------- |
| | A new bot.py file is provided and documented | A new bot file is provided but contains errors | No file is provided |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,200 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "be03c8182982b87ced155e4e9d1438e8",
"translation_date": "2025-09-06T11:02:34+00:00",
"source_file": "6-NLP/3-Translation-Sentiment/README.md",
"language_code": "en"
}
-->
# Translation and sentiment analysis with ML
In the previous lessons, you learned how to create a basic bot using `TextBlob`, a library that incorporates machine learning behind the scenes to perform basic NLP tasks like extracting noun phrases. Another significant challenge in computational linguistics is accurately _translating_ a sentence from one spoken or written language to another.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
Translation is a very complex problem, made even harder by the fact that there are thousands of languages, each with its own unique grammar rules. One approach is to convert the formal grammar rules of one language, such as English, into a structure that is not dependent on any specific language, and then translate it by converting it back into another language. This approach involves the following steps:
1. **Identification**: Identify or tag the words in the input language as nouns, verbs, etc.
2. **Create translation**: Generate a direct translation of each word in the format of the target language.
### Example sentence, English to Irish
In 'English,' the sentence _I feel happy_ consists of three words in the following order:
- **subject** (I)
- **verb** (feel)
- **adjective** (happy)
However, in the 'Irish' language, the same sentence follows a very different grammatical structure—emotions like "*happy*" or "*sad*" are expressed as being *upon* you.
The English phrase `I feel happy` in Irish would be `Tá athas orm`. A *literal* translation would be `Happy is upon me`.
An Irish speaker translating to English would say `I feel happy`, not `Happy is upon me`, because they understand the meaning of the sentence, even though the words and sentence structure differ.
The formal order for the sentence in Irish is:
- **verb** (Tá or is)
- **adjective** (athas, or happy)
- **subject** (orm, or upon me)
## Translation
A simple translation program might translate words individually, ignoring the sentence structure.
✅ If you've learned a second (or third or more) language as an adult, you might have started by thinking in your native language, translating a concept word by word in your head to the second language, and then speaking out your translation. This is similar to what simple translation computer programs do. It's important to move beyond this phase to achieve fluency!
Simple translation often results in poor (and sometimes amusing) mistranslations: `I feel happy` translates literally to `Mise bhraitheann athas` in Irish. That means (literally) `me feel happy` and is not a valid Irish sentence. Even though English and Irish are spoken on two closely neighboring islands, they are very different languages with distinct grammar structures.
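A toy version of such a word-by-word translator, using the literal Irish mapping from the example above (a made-up dictionary, purely to show why the approach breaks down):

```python
# A tiny word-for-word dictionary based on the example above (illustrative only)
word_dict = {"i": "mise", "feel": "bhraitheann", "happy": "athas"}

def naive_translate(sentence):
    # Translate each word in isolation, ignoring grammar and word order
    return " ".join(word_dict.get(w, w) for w in sentence.lower().split())

print(naive_translate("I feel happy"))  # "mise bhraitheann athas" - not a valid Irish sentence
```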
> You can watch some videos about Irish linguistic traditions, such as [this one](https://www.youtube.com/watch?v=mRIaLSdRMMs).
### Machine learning approaches
So far, you've learned about the formal rules approach to natural language processing. Another approach is to ignore the meaning of the words and _instead use machine learning to detect patterns_. This can work in translation if you have a large amount of text (a *corpus*) or texts (*corpora*) in both the source and target languages.
For example, consider the case of *Pride and Prejudice*, a famous English novel written by Jane Austen in 1813. If you compare the book in English with a human translation of the book in *French*, you could identify phrases in one language that are _idiomatically_ translated into the other. You'll try this shortly.
For instance, when an English phrase like `I have no money` is translated literally into French, it might become `Je n'ai pas de monnaie`. "Monnaie" is a tricky French 'false cognate,' as 'money' and 'monnaie' are not synonymous. A better translation that a human might make would be `Je n'ai pas d'argent`, because it better conveys the meaning that you have no money (rather than 'loose change,' which is the meaning of 'monnaie').
![monnaie](../../../../6-NLP/3-Translation-Sentiment/images/monnaie.png)
> Image by [Jen Looper](https://twitter.com/jenlooper)
If a machine learning model has access to enough human translations to build a model, it can improve the accuracy of translations by identifying common patterns in texts that have been previously translated by expert human speakers of both languages.
### Exercise - translation
You can use `TextBlob` to translate sentences. Try the famous first line of **Pride and Prejudice**:
```python
from textblob import TextBlob
blob = TextBlob(
"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife!"
)
print(blob.translate(to="fr"))
```
`TextBlob` does a pretty good job at the translation: "C'est une vérité universellement reconnue, qu'un homme célibataire en possession d'une bonne fortune doit avoir besoin d'une femme!".
It can be argued that TextBlob's translation is far more precise, in fact, than the 1932 French translation of the book by V. Leconte and Ch. Pressoir:
"C'est une vérité universelle qu'un célibataire pourvu d'une belle fortune doit avoir envie de se marier, et, si peu que l'on sache de son sentiment à cet egard, lorsqu'il arrive dans une nouvelle résidence, cette idée est si bien fixée dans l'esprit de ses voisins qu'ils le considèrent sur-le-champ comme la propriété légitime de l'une ou l'autre de leurs filles."
In this case, the translation informed by machine learning does a better job than the human translator, who unnecessarily adds words to the original author's text for 'clarity.'
> What's happening here? Why is TextBlob so good at translation? Well, behind the scenes, it's using Google Translate, a sophisticated AI capable of analyzing millions of phrases to predict the best strings for the task at hand. There's nothing manual happening here, and you need an internet connection to use `blob.translate`.
✅ Try some more sentences. Which is better, machine learning or human translation? In which cases?
## Sentiment analysis
Another area where machine learning excels is sentiment analysis. A non-machine learning approach to sentiment analysis involves identifying words and phrases that are 'positive' or 'negative.' Then, given a new piece of text, the total value of positive, negative, and neutral words is calculated to determine the overall sentiment.
This approach can be easily fooled, as you may have seen in the Marvin task—the sentence `Great, that was a wonderful waste of time, I'm glad we are lost on this dark road` is sarcastic and negative, but the simple algorithm detects 'great,' 'wonderful,' and 'glad' as positive and 'waste,' 'lost,' and 'dark' as negative. The overall sentiment is skewed by these conflicting words.
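A minimal sketch of that word-counting approach, with a tiny hand-picked word list purely for illustration, shows how the sarcastic sentence slips through:

```python
positive_words = {"great", "wonderful", "glad"}
negative_words = {"waste", "lost", "dark"}

def naive_sentiment(sentence):
    words = sentence.lower().replace(",", "").split()
    # Score = (# positive words) - (# negative words)
    return sum(w in positive_words for w in words) - sum(w in negative_words for w in words)

sentence = "Great, that was a wonderful waste of time, I'm glad we are lost on this dark road"
print(naive_sentiment(sentence))  # 0 - a 'neutral' score for a clearly sarcastic, negative sentence
```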
✅ Pause for a moment and think about how we convey sarcasm as human speakers. Tone inflection plays a significant role. Try saying the phrase "Well, that film was awesome" in different ways to see how your voice conveys meaning.
### ML approaches
The machine learning approach involves manually gathering negative and positive bodies of text—tweets, movie reviews, or anything where a human has provided both a score *and* a written opinion. NLP techniques can then be applied to these opinions and scores, allowing patterns to emerge (e.g., positive movie reviews might frequently include the phrase 'Oscar worthy,' while negative reviews might use 'disgusting' more often).
> ⚖️ **Example**: Imagine you work in a politician's office, and a new law is being debated. Constituents might write emails either supporting or opposing the law. If you're tasked with reading and sorting these emails into two piles, *for* and *against*, you might feel overwhelmed by the sheer volume. Wouldn't it be helpful if a bot could read and sort them for you?
>
> One way to achieve this is by using machine learning. You would train the model with a sample of *for* and *against* emails. The model would associate certain phrases and words with the respective categories, *but it wouldn't understand the content itself*. You could test the model with emails it hadn't seen before and compare its conclusions to your own. Once satisfied with the model's accuracy, you could process future emails without reading each one.
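As a rough illustration of that workflow, here is a minimal sketch using Scikit-learn with a tiny, invented sample of labeled emails (real training data would need many more examples):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny, invented sample of labeled constituent emails (illustration only)
emails = [
    "I fully support this law, please vote yes",
    "This bill will help our community, I am for it",
    "I strongly oppose this law, vote no",
    "This bill is a disaster, I am against it",
]
labels = ["for", "for", "against", "against"]

# Bag-of-words features + Naive Bayes: the model associates words with categories,
# but it does not understand the content itself
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Classify an email the model has not seen before
print(model.predict(["Please vote no on this terrible bill"]))  # ['against']
```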
✅ Does this process sound similar to methods you've used in previous lessons?
## Exercise - sentimental sentences
Sentiment is measured with a *polarity* score ranging from -1 to 1, where -1 represents the most negative sentiment and 1 represents the most positive. Sentiment is also measured with a *subjectivity* score ranging from 0 (fully objective) to 1 (fully subjective).
Take another look at Jane Austen's *Pride and Prejudice*. The text is available here at [Project Gutenberg](https://www.gutenberg.org/files/1342/1342-h/1342-h.htm). The sample below shows a short program that analyzes the sentiment of the first and last sentences of the book and displays their sentiment polarity and subjectivity/objectivity scores.
You should use the `TextBlob` library (described above) to determine `sentiment` (you don't need to write your own sentiment calculator) for the following task.
```python
from textblob import TextBlob
quote1 = """It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."""
quote2 = """Darcy, as well as Elizabeth, really loved them; and they were both ever sensible of the warmest gratitude towards the persons who, by bringing her into Derbyshire, had been the means of uniting them."""
sentiment1 = TextBlob(quote1).sentiment
sentiment2 = TextBlob(quote2).sentiment
print(quote1 + " has a sentiment of " + str(sentiment1))
print(quote2 + " has a sentiment of " + str(sentiment2))
```
You see the following output:
```output
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. has a sentiment of Sentiment(polarity=0.20952380952380953, subjectivity=0.27142857142857146)
Darcy, as well as Elizabeth, really loved them; and they were
both ever sensible of the warmest gratitude towards the persons
who, by bringing her into Derbyshire, had been the means of
uniting them. has a sentiment of Sentiment(polarity=0.7, subjectivity=0.8)
```
## Challenge - check sentiment polarity
Your task is to determine, using sentiment polarity, whether *Pride and Prejudice* contains more absolutely positive sentences than absolutely negative ones. For this task, you may assume that a polarity score of 1 or -1 represents absolute positivity or negativity, respectively.
**Steps:**
1. Download a [copy of Pride and Prejudice](https://www.gutenberg.org/files/1342/1342-h/1342-h.htm) from Project Gutenberg as a .txt file. Remove the metadata at the beginning and end of the file, leaving only the original text.
2. Open the file in Python and extract the contents as a string.
3. Create a TextBlob using the book string.
4. Analyze each sentence in the book in a loop:
1. If the polarity is 1 or -1, store the sentence in an array or list of positive or negative messages.
5. At the end, print out all the positive sentences and negative sentences (separately) and the count of each.
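If you get stuck, here is a minimal sketch of those steps; it assumes you saved the cleaned Gutenberg text as `pride_and_prejudice.txt` (a hypothetical filename):

```python
from textblob import TextBlob

# Assumes the cleaned Gutenberg text was saved as 'pride_and_prejudice.txt'
with open("pride_and_prejudice.txt", encoding="utf-8") as f:
    book = TextBlob(f.read())

positive_sentences, negative_sentences = [], []
for sentence in book.sentences:
    polarity = sentence.sentiment.polarity
    if polarity == 1:
        positive_sentences.append(str(sentence))
    elif polarity == -1:
        negative_sentences.append(str(sentence))

print("\n".join(positive_sentences))
print("\n".join(negative_sentences))
print(f"{len(positive_sentences)} absolutely positive sentences")
print(f"{len(negative_sentences)} absolutely negative sentences")
```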
Here is a sample [solution](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/3-Translation-Sentiment/solution/notebook.ipynb).
✅ Knowledge Check
1. Sentiment is based on the words used in the sentence, but does the code *understand* the words?
2. Do you think the sentiment polarity is accurate? In other words, do you *agree* with the scores?
1. Specifically, do you agree or disagree with the absolute **positive** polarity of the following sentences?
* “What an excellent father you have, girls!” said she, when the door was shut.
* “Your examination of Mr. Darcy is over, I presume,” said Miss Bingley; “and pray what is the result?” “I am perfectly convinced by it that Mr. Darcy has no defect.
* How wonderfully these sort of things occur!
* I have the greatest dislike in the world to that sort of thing.
* Charlotte is an excellent manager, I dare say.
* “This is delightful indeed!
* I am so happy!
* Your idea of the ponies is delightful.
2. The next 3 sentences were scored with an absolute positive sentiment, but upon closer reading, they are not positive sentences. Why did the sentiment analysis think they were positive sentences?
* Happy shall I be, when his stay at Netherfield is over!” “I wish I could say anything to comfort you,” replied Elizabeth; “but it is wholly out of my power.
* If I could but see you as happy!
* Our distress, my dear Lizzy, is very great.
3. Do you agree or disagree with the absolute **negative** polarity of the following sentences?
- Everybody is disgusted with his pride.
- “I should like to know how he behaves among strangers.” “You shall hear then—but prepare yourself for something very dreadful.
- The pause was to Elizabeth's feelings dreadful.
- It would be dreadful!
✅ Any fan of Jane Austen will understand that she often uses her books to critique the more absurd aspects of English Regency society. Elizabeth Bennet, the main character in *Pride and Prejudice*, is a sharp social observer (like the author), and her language is often heavily nuanced. Even Mr. Darcy (the love interest in the story) notes Elizabeth's playful and teasing use of language: "I have had the pleasure of your acquaintance long enough to know that you find great enjoyment in occasionally professing opinions which in fact are not your own."
---
## 🚀Challenge
Can you make Marvin even better by extracting other features from the user input?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
There are many ways to determine sentiment in text. Consider the business applications that could benefit from this approach. Reflect on how it might fail or produce unintended results. Learn more about advanced, enterprise-grade systems for sentiment analysis, such as [Azure Text Analysis](https://docs.microsoft.com/azure/cognitive-services/Text-Analytics/how-tos/text-analytics-how-to-sentiment-analysis?tabs=version-3-1&WT.mc_id=academic-77952-leestott). Try analyzing some of the sentences from *Pride and Prejudice* mentioned earlier and see if it can pick up on subtle nuances.
## Assignment
[Poetic license](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "9d2a734deb904caff310d1a999c6bd7a",
"translation_date": "2025-09-06T11:03:03+00:00",
"source_file": "6-NLP/3-Translation-Sentiment/assignment.md",
"language_code": "en"
}
-->
# Poetic license
## Instructions
In [this notebook](https://www.kaggle.com/jenlooper/emily-dickinson-word-frequency), you will find over 500 poems by Emily Dickinson that have been previously analyzed for sentiment using Azure text analytics. Using this dataset, apply the techniques discussed in the lesson to conduct your own analysis. Does the sentiment suggested by a poem align with the decision made by the more advanced Azure service? Why or why not, in your opinion? Did anything surprise you during the analysis?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | -------------------------------------------------------------------------- | ------------------------------------------------------- | ------------------------ |
| | A notebook is presented with a thorough analysis of sample outputs from the author | The notebook is incomplete or lacks proper analysis | No notebook is presented |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T11:03:12+00:00",
"source_file": "6-NLP/3-Translation-Sentiment/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "81db6ff2cf6e62fbe2340b094bb9509e",
"translation_date": "2025-09-06T11:03:08+00:00",
"source_file": "6-NLP/3-Translation-Sentiment/solution/R/README.md",
"language_code": "en"
}
-->
this is a temporary placeholder
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,417 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "8d32dadeda93c6fb5c43619854882ab1",
"translation_date": "2025-09-06T11:01:06+00:00",
"source_file": "6-NLP/4-Hotel-Reviews-1/README.md",
"language_code": "en"
}
-->
# Sentiment analysis with hotel reviews - processing the data
In this section, you'll apply techniques from previous lessons to perform exploratory data analysis on a large dataset. Once you understand the relevance of the various columns, you'll learn:
- How to remove unnecessary columns
- How to calculate new data based on existing columns
- How to save the resulting dataset for use in the final challenge
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
### Introduction
Up to this point, you've learned that text data is quite different from numerical data. If the text is written or spoken by humans, it can be analyzed to uncover patterns, frequencies, sentiment, and meaning. This lesson introduces you to a real dataset with a real challenge: **[515K Hotel Reviews Data in Europe](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe)**, which is licensed under [CC0: Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/). The dataset was scraped from Booking.com using public sources and created by Jiashen Liu.
### Preparation
You will need:
- The ability to run .ipynb notebooks using Python 3
- pandas
- NLTK, [which you should install locally](https://www.nltk.org/install.html)
- The dataset, available on Kaggle: [515K Hotel Reviews Data in Europe](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe). It is approximately 230 MB unzipped. Download it to the root `/data` folder associated with these NLP lessons.
## Exploratory data analysis
This challenge assumes you're building a hotel recommendation bot using sentiment analysis and guest review scores. The dataset includes reviews of 1493 hotels across 6 cities.
Using Python, the hotel reviews dataset, and NLTK's sentiment analysis, you could explore:
- What are the most frequently used words and phrases in reviews?
- Do the official *tags* describing a hotel correlate with review scores (e.g., are negative reviews more common for *Family with young children* than for *Solo traveler*, possibly indicating the hotel is better suited for *Solo travelers*?)
- Do the NLTK sentiment scores align with the numerical scores given by hotel reviewers?
#### Dataset
Let's examine the dataset you've downloaded and saved locally. Open the file in an editor like VS Code or Excel.
The dataset headers are as follows:
*Hotel_Address, Additional_Number_of_Scoring, Review_Date, Average_Score, Hotel_Name, Reviewer_Nationality, Negative_Review, Review_Total_Negative_Word_Counts, Total_Number_of_Reviews, Positive_Review, Review_Total_Positive_Word_Counts, Total_Number_of_Reviews_Reviewer_Has_Given, Reviewer_Score, Tags, days_since_review, lat, lng*
Here they are grouped for easier analysis:
##### Hotel columns
- `Hotel_Name`, `Hotel_Address`, `lat` (latitude), `lng` (longitude)
- Using *lat* and *lng*, you could plot a map with Python showing hotel locations, perhaps color-coded for negative and positive reviews.
- `Hotel_Address` is not particularly useful and could be replaced with a country for easier sorting and searching.
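As the *lat*/*lng* bullet above suggests, a quick map takes only a few lines; this is a rough sketch assuming the dataframe `df` loaded in the exercise later in this lesson, with matplotlib installed:

```python
import matplotlib.pyplot as plt

# One point per hotel, colored by its average score (rows with NaN coordinates are not drawn)
hotels = df.drop_duplicates(subset=["Hotel_Name"])
plt.scatter(hotels["lng"], hotels["lat"], c=hotels["Average_Score"], cmap="RdYlGn", s=10)
plt.colorbar(label="Average_Score")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Hotel locations by average score")
plt.show()
```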
**Hotel Meta-review columns**
- `Average_Score`
- According to the dataset creator, this column represents the *Average Score of the hotel, calculated based on the latest comment in the last year*. While this calculation method seems unusual, we'll take it at face value for now.
✅ Can you think of another way to calculate the average score using the other columns in this dataset?
- `Total_Number_of_Reviews`
- The total number of reviews the hotel has received. It's unclear (without writing code) whether this refers to the reviews in the dataset.
- `Additional_Number_of_Scoring`
- The number of review scores submitted without a written positive or negative review.
**Review columns**
- `Reviewer_Score`
- A numerical value with at most one decimal place, ranging between 2.5 and 10.
- It's unclear why 2.5 is the lowest possible score.
- `Negative_Review`
- If no negative review was written, this field contains "**No Negative**."
- Sometimes, reviewers write positive comments in the Negative_Review column (e.g., "there is nothing bad about this hotel").
- `Review_Total_Negative_Word_Counts`
- Higher negative word counts generally indicate a lower score (without sentiment analysis).
- `Positive_Review`
- If no positive review was written, this field contains "**No Positive**."
- Sometimes, reviewers write negative comments in the Positive_Review column (e.g., "there is nothing good about this hotel at all").
- `Review_Total_Positive_Word_Counts`
- Higher positive word counts generally indicate a higher score (without sentiment analysis).
- `Review_Date` and `days_since_review`
- You could apply a freshness or staleness measure to reviews (e.g., older reviews might be less accurate due to changes in hotel management, renovations, or new amenities like a pool).
- `Tags`
- Short descriptors selected by reviewers to describe their type of stay (e.g., solo or family), room type, length of stay, and how the review was submitted.
- Unfortunately, these tags are problematic; see the section below discussing their usefulness.
**Reviewer columns**
- `Total_Number_of_Reviews_Reviewer_Has_Given`
- This could be a factor in a recommendation model. For example, prolific reviewers with hundreds of reviews might tend to be more negative. However, reviewers are not identified with unique codes, so their reviews cannot be linked. There are 30 reviewers with 100 or more reviews, but it's unclear how this could aid the recommendation model.
- `Reviewer_Nationality`
- Some might assume certain nationalities are more likely to give positive or negative reviews due to cultural tendencies. Be cautious about incorporating such anecdotal views into your models. These are stereotypes, and each reviewer is an individual whose review reflects their personal experience, filtered through factors like previous hotel stays, travel distance, and temperament. It's hard to justify attributing review scores solely to nationality.
##### Examples
| Average Score | Total Number Reviews | Reviewer Score | Negative Review | Positive Review | Tags |
| -------------- | -------------------- | -------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- | ----------------------------------------------------------------------------------------- |
| 7.8 | 1945 | 2.5 | This is currently not a hotel but a construction site. I was terrorized from early morning and all day with unacceptable building noise while resting after a long trip and working in the room. People were working all day, i.e., with jackhammers in the adjacent rooms. I asked for a room change, but no silent room was available. To make things worse, I was overcharged. I checked out in the evening since I had to leave for an early flight and received an appropriate bill. A day later, the hotel made another charge without my consent in excess of the booked price. It's a terrible place. Don't punish yourself by booking here. | Nothing. Terrible place. Stay away. | Business trip, Couple, Standard Double Room, Stayed 2 nights |
As you can see, this guest had a very negative experience. The hotel has a decent average score of 7.8 and 1945 reviews, but this reviewer gave it a 2.5 and wrote 115 words about their dissatisfaction. They wrote only 7 words in the Positive_Review column, warning others to avoid the hotel. If we only counted words instead of analyzing their sentiment, we might misinterpret the reviewer's intent. Interestingly, the lowest possible score is 2.5, not 0, which raises questions about the scoring system.
##### Tags
At first glance, using `Tags` to categorize data seems logical. However, these tags are not standardized. For example, one hotel might use *Single room*, *Twin room*, and *Double room*, while another uses *Deluxe Single Room*, *Classic Queen Room*, and *Executive King Room*. These might refer to the same types of rooms, but the variations make standardization difficult.
Options for handling this:
1. Attempt to standardize all terms, which is challenging due to unclear mappings (e.g., *Classic single room* maps to *Single room*, but *Superior Queen Room with Courtyard Garden or City View* is harder to map).
2. Use an NLP approach to measure the frequency of terms like *Solo*, *Business Traveller*, or *Family with young kids* and factor them into the recommendation.
Tags usually contain 5-6 comma-separated values, including *Type of trip*, *Type of guests*, *Type of room*, *Number of nights*, and *Type of device used to submit the review*. However, since some reviewers leave fields blank, the values are not always in the same order.
For example, filtering for *Family with* yields over 80,000 results containing phrases like "Family with young children" or "Family with older children." This shows the `Tags` column has potential but requires effort to make it useful.
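That filter is a one-liner with pandas; as a sketch (assuming the dataframe `df` loaded in the exercise below):

```python
# Count reviews whose Tags mention a family stay (matches both young and older children)
family_reviews = df[df["Tags"].str.contains("Family with", na=False)]
print("Reviews tagged 'Family with ...':", len(family_reviews))
```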
##### Average hotel score
There are some oddities in the dataset that are worth noting:
The dataset includes the following columns related to average scores and reviews:
1. Hotel_Name
2. Additional_Number_of_Scoring
3. Average_Score
4. Total_Number_of_Reviews
5. Reviewer_Score
The hotel with the most reviews in the dataset is *Britannia International Hotel Canary Wharf*, with 4789 reviews out of 515,000. However, its `Total_Number_of_Reviews` value is 9086. Adding the `Additional_Number_of_Scoring` value (2682) to 4789 gives 7471, which is still 1615 short of the `Total_Number_of_Reviews`.
The `Average_Score` column description from Kaggle states it is "*Average Score of the hotel, calculated based on the latest comment in the last year*." This calculation method seems unhelpful, but we can calculate our own average based on the review scores in the dataset. For example, the average score for this hotel is listed as 7.1, but the calculated average (based on reviewer scores in the dataset) is 6.8. This discrepancy might be due to scores from `Additional_Number_of_Scoring` reviews, but there's no way to confirm this.
To further complicate matters, the hotel with the second-highest number of reviews has a calculated average score of 8.12, while the dataset lists it as 8.1. Is this coincidence or a discrepancy?
Given these inconsistencies, we'll write a short program to explore the dataset and determine the best way to use (or not use) these values.
> 🚨 A note of caution
>
> When working with this dataset, you'll be writing code to calculate something based on the text without needing to read or analyze the text yourself. This is the core of NLP—interpreting meaning or sentiment without requiring human intervention. However, you might come across some negative reviews. I strongly recommend against reading them; it's unnecessary. Some of these reviews are trivial or irrelevant, like "The weather wasn't great," which is beyond the hotel's control—or anyone's, for that matter. But there is also a darker side to some reviews. Occasionally, negative reviews contain racist, sexist, or ageist remarks. This is unfortunate but not surprising, given that the dataset was scraped from a public website. Some reviewers leave comments that you might find offensive, uncomfortable, or upsetting. It's better to let the code assess the sentiment than to expose yourself to potentially distressing content. Such reviews are written by a minority, but they exist nonetheless.
## Exercise - Data exploration
### Load the data
Enough of visually examining the data—now it's time to write some code and get answers! This section uses the pandas library. Your first task is to ensure you can load and read the CSV data. The pandas library has a fast CSV loader, and the result is stored in a dataframe, as you've seen in previous lessons. The CSV file we are loading contains over half a million rows but only 17 columns. Pandas provides many powerful ways to interact with a dataframe, including the ability to perform operations on every row.
From this point onward in the lesson, you'll encounter code snippets, explanations of the code, and discussions about the results. Use the included _notebook.ipynb_ for your code.
Let's start by loading the data file you'll be working with:
```python
# Load the hotel reviews from CSV
import pandas as pd
import time
# importing time so the start and end time can be used to calculate file loading time
print("Loading data file now, this could take a while depending on file size")
start = time.time()
# df is 'DataFrame' - make sure you downloaded the file to the data folder
df = pd.read_csv('../../data/Hotel_Reviews.csv')
end = time.time()
print("Loading took " + str(round(end - start, 2)) + " seconds")
```
Now that the data is loaded, you can perform some operations on it. Keep this code at the top of your program for the next part.
## Explore the data
In this case, the data is already *clean*, meaning it is ready to work with and does not contain characters in other languages that might confuse algorithms expecting only English characters.
✅ You might encounter data that requires initial processing to format it before applying NLP techniques, but not this time. If you had to, how would you handle non-English characters?
Take a moment to ensure that once the data is loaded, you can explore it with code. It's tempting to focus on the `Negative_Review` and `Positive_Review` columns, as they contain natural text for your NLP algorithms to process. But wait! Before diving into NLP and sentiment analysis, follow the code below to verify whether the values in the dataset match the values you calculate using pandas.
## Dataframe operations
Your first task in this lesson is to check whether the following assertions are correct by writing code to examine the dataframe (without modifying it).
> As with many programming tasks, there are multiple ways to approach this, but a good rule of thumb is to choose the simplest and easiest method, especially if it will be easier to understand when revisiting the code later. With dataframes, the comprehensive API often provides efficient ways to achieve your goals.
Treat the following questions as coding tasks and try to answer them without looking at the solution.
1. Print the *shape* of the dataframe you just loaded (the shape refers to the number of rows and columns).
2. Calculate the frequency count for reviewer nationalities:
1. How many distinct values exist in the `Reviewer_Nationality` column, and what are they?
2. Which reviewer nationality is the most common in the dataset (print the country and the number of reviews)?
3. What are the next top 10 most frequently found nationalities, along with their frequency counts?
3. What is the most frequently reviewed hotel for each of the top 10 reviewer nationalities?
4. How many reviews are there per hotel (frequency count of hotels) in the dataset?
5. While the dataset includes an `Average_Score` column for each hotel, you can also calculate an average score (by averaging all reviewer scores in the dataset for each hotel). Add a new column to your dataframe called `Calc_Average_Score` that contains this calculated average.
6. Do any hotels have the same (rounded to 1 decimal place) `Average_Score` and `Calc_Average_Score`?
1. Try writing a Python function that takes a Series (row) as an argument and compares the values, printing a message when the values are not equal. Then use the `.apply()` method to process every row with the function.
7. Calculate and print how many rows have `Negative_Review` values of "No Negative."
8. Calculate and print how many rows have `Positive_Review` values of "No Positive."
9. Calculate and print how many rows have `Positive_Review` values of "No Positive" **and** `Negative_Review` values of "No Negative."
### Code answers
1. Print the *shape* of the dataframe you just loaded (the shape refers to the number of rows and columns).
```python
print("The shape of the data (rows, cols) is " + str(df.shape))
> The shape of the data (rows, cols) is (515738, 17)
```
2. Calculate the frequency count for reviewer nationalities:
1. How many distinct values exist in the `Reviewer_Nationality` column, and what are they?
2. Which reviewer nationality is the most common in the dataset (print the country and the number of reviews)?
```python
# value_counts() creates a Series with an index and values: in this case, each nationality and the number of times it occurs in Reviewer_Nationality
nationality_freq = df["Reviewer_Nationality"].value_counts()
print("There are " + str(nationality_freq.size) + " different nationalities")
# print first and last rows of the Series. Change to nationality_freq.to_string() to print all of the data
print(nationality_freq)
There are 227 different nationalities
United Kingdom 245246
United States of America 35437
Australia 21686
Ireland 14827
United Arab Emirates 10235
...
Comoros 1
Palau 1
Northern Mariana Islands 1
Cape Verde 1
Guinea 1
Name: Reviewer_Nationality, Length: 227, dtype: int64
```
3. What are the next top 10 most frequently found nationalities, along with their frequency counts?
```python
print("The highest frequency reviewer nationality is " + str(nationality_freq.index[0]).strip() + " with " + str(nationality_freq[0]) + " reviews.")
# Notice there is a leading space on the values, strip() removes that for printing
# What are the top 10 most common nationalities and their frequencies?
print("The next 10 highest frequency reviewer nationalities are:")
print(nationality_freq[1:11].to_string())
The highest frequency reviewer nationality is United Kingdom with 245246 reviews.
The next 10 highest frequency reviewer nationalities are:
United States of America 35437
Australia 21686
Ireland 14827
United Arab Emirates 10235
Saudi Arabia 8951
Netherlands 8772
Switzerland 8678
Germany 7941
Canada 7894
France 7296
```
3. What is the most frequently reviewed hotel for each of the top 10 reviewer nationalities?
```python
# What was the most frequently reviewed hotel for the top 10 nationalities
# Normally with pandas you would avoid an explicit loop, but we wanted to show creating a new dataframe using criteria (don't do this with large amounts of data because it could be very slow)
for nat in nationality_freq[:10].index:
# First, extract all the rows that match the criteria into a new dataframe
nat_df = df[df["Reviewer_Nationality"] == nat]
# Now get the hotel freq
freq = nat_df["Hotel_Name"].value_counts()
print("The most reviewed hotel for " + str(nat).strip() + " was " + str(freq.index[0]) + " with " + str(freq[0]) + " reviews.")
The most reviewed hotel for United Kingdom was Britannia International Hotel Canary Wharf with 3833 reviews.
The most reviewed hotel for United States of America was Hotel Esther a with 423 reviews.
The most reviewed hotel for Australia was Park Plaza Westminster Bridge London with 167 reviews.
The most reviewed hotel for Ireland was Copthorne Tara Hotel London Kensington with 239 reviews.
The most reviewed hotel for United Arab Emirates was Millennium Hotel London Knightsbridge with 129 reviews.
The most reviewed hotel for Saudi Arabia was The Cumberland A Guoman Hotel with 142 reviews.
The most reviewed hotel for Netherlands was Jaz Amsterdam with 97 reviews.
The most reviewed hotel for Switzerland was Hotel Da Vinci with 97 reviews.
The most reviewed hotel for Germany was Hotel Da Vinci with 86 reviews.
The most reviewed hotel for Canada was St James Court A Taj Hotel London with 61 reviews.
```
4. How many reviews are there per hotel (frequency count of hotels) in the dataset?
```python
# First create a new dataframe based on the old one, removing the unneeded columns
hotel_freq_df = df.drop(["Hotel_Address", "Additional_Number_of_Scoring", "Review_Date", "Average_Score", "Reviewer_Nationality", "Negative_Review", "Review_Total_Negative_Word_Counts", "Positive_Review", "Review_Total_Positive_Word_Counts", "Total_Number_of_Reviews_Reviewer_Has_Given", "Reviewer_Score", "Tags", "days_since_review", "lat", "lng"], axis = 1)
# Group the rows by Hotel_Name, count them and put the result in a new column Total_Reviews_Found
hotel_freq_df['Total_Reviews_Found'] = hotel_freq_df.groupby('Hotel_Name').transform('count')
# Get rid of all the duplicated rows
hotel_freq_df = hotel_freq_df.drop_duplicates(subset = ["Hotel_Name"])
display(hotel_freq_df)
```
| Hotel_Name | Total_Number_of_Reviews | Total_Reviews_Found |
| :----------------------------------------: | :---------------------: | :-----------------: |
| Britannia International Hotel Canary Wharf | 9086 | 4789 |
| Park Plaza Westminster Bridge London | 12158 | 4169 |
| Copthorne Tara Hotel London Kensington | 7105 | 3578 |
| ... | ... | ... |
| Mercure Paris Porte d Orleans | 110 | 10 |
| Hotel Wagner | 135 | 10 |
| Hotel Gallitzinberg | 173 | 8 |
You may notice that the *counted in the dataset* results do not match the value in `Total_Number_of_Reviews`. It is unclear whether this value represents the total number of reviews the hotel received (of which not all were scraped) or some other calculation. Because of this ambiguity, `Total_Number_of_Reviews` is not used in the model.
5. While the dataset includes an `Average_Score` column for each hotel, you can also calculate an average score (by averaging all reviewer scores in the dataset for each hotel). Add a new column to your dataframe called `Calc_Average_Score` that contains this calculated average. Print the columns `Hotel_Name`, `Average_Score`, and `Calc_Average_Score`.
```python
# define a function that takes a row and performs some calculation with it
def get_difference_review_avg(row):
return row["Average_Score"] - row["Calc_Average_Score"]
# 'mean' is mathematical word for 'average'
df['Calc_Average_Score'] = round(df.groupby('Hotel_Name').Reviewer_Score.transform('mean'), 1)
# Add a new column with the difference between the two average scores
df["Average_Score_Difference"] = df.apply(get_difference_review_avg, axis = 1)
# Create a df without all the duplicates of Hotel_Name (so only 1 row per hotel)
review_scores_df = df.drop_duplicates(subset = ["Hotel_Name"])
# Sort the dataframe to find the lowest and highest average score difference
review_scores_df = review_scores_df.sort_values(by=["Average_Score_Difference"])
display(review_scores_df[["Average_Score_Difference", "Average_Score", "Calc_Average_Score", "Hotel_Name"]])
```
You may also wonder about the `Average_Score` value and why it sometimes differs from the calculated average score. Since we can't determine why some values match while others differ, it's safest to use the review scores we have to calculate the average ourselves. That said, the differences are usually very small. Here are the hotels with the greatest deviation between the dataset average and the calculated average:
| Average_Score_Difference | Average_Score | Calc_Average_Score | Hotel_Name |
| :----------------------: | :-----------: | :----------------: | ------------------------------------------: |
| -0.8 | 7.7 | 8.5 | Best Western Hotel Astoria |
| -0.7 | 8.8 | 9.5 | Hotel Stendhal Place Vend me Paris MGallery |
| -0.7 | 7.5 | 8.2 | Mercure Paris Porte d Orleans |
| -0.7 | 7.9 | 8.6 | Renaissance Paris Vendome Hotel |
| -0.5 | 7.0 | 7.5 | Hotel Royal Elys es |
| ... | ... | ... | ... |
| 0.7 | 7.5 | 6.8 | Mercure Paris Op ra Faubourg Montmartre |
| 0.8 | 7.1 | 6.3 | Holiday Inn Paris Montparnasse Pasteur |
| 0.9 | 6.8 | 5.9 | Villa Eugenie |
| 0.9 | 8.6 | 7.7 | MARQUIS Faubourg St Honor Relais Ch teaux |
| 1.3 | 7.2 | 5.9 | Kube Hotel Ice Bar |
With only one hotel having a score difference greater than 1, we can likely ignore the difference and use the calculated average score.
6. Calculate and print how many rows have `Negative_Review` values of "No Negative."
7. Calculate and print how many rows have `Positive_Review` values of "No Positive."
8. Calculate and print how many rows have `Positive_Review` values of "No Positive" **and** `Negative_Review` values of "No Negative."
```python
# with lambdas:
start = time.time()
no_negative_reviews = df.apply(lambda x: True if x['Negative_Review'] == "No Negative" else False , axis=1)
print("Number of No Negative reviews: " + str(len(no_negative_reviews[no_negative_reviews == True].index)))
no_positive_reviews = df.apply(lambda x: True if x['Positive_Review'] == "No Positive" else False , axis=1)
print("Number of No Positive reviews: " + str(len(no_positive_reviews[no_positive_reviews == True].index)))
both_no_reviews = df.apply(lambda x: True if x['Negative_Review'] == "No Negative" and x['Positive_Review'] == "No Positive" else False , axis=1)
print("Number of both No Negative and No Positive reviews: " + str(len(both_no_reviews[both_no_reviews == True].index)))
end = time.time()
print("Lambdas took " + str(round(end - start, 2)) + " seconds")
Number of No Negative reviews: 127890
Number of No Positive reviews: 35946
Number of both No Negative and No Positive reviews: 127
Lambdas took 9.64 seconds
```
## Another way
Another way to count items without using Lambdas is to use the sum function to count rows:
```python
# without lambdas (using a mixture of notations to show you can use both)
start = time.time()
no_negative_reviews = sum(df.Negative_Review == "No Negative")
print("Number of No Negative reviews: " + str(no_negative_reviews))
no_positive_reviews = sum(df["Positive_Review"] == "No Positive")
print("Number of No Positive reviews: " + str(no_positive_reviews))
both_no_reviews = sum((df.Negative_Review == "No Negative") & (df.Positive_Review == "No Positive"))
print("Number of both No Negative and No Positive reviews: " + str(both_no_reviews))
end = time.time()
print("Sum took " + str(round(end - start, 2)) + " seconds")
Number of No Negative reviews: 127890
Number of No Positive reviews: 35946
Number of both No Negative and No Positive reviews: 127
Sum took 0.19 seconds
```
You may have noticed that there are 127 rows with both "No Negative" and "No Positive" values for the columns `Negative_Review` and `Positive_Review`, respectively. This means the reviewer gave the hotel a numerical score but chose not to write either a positive or negative review. Fortunately, this is a small number of rows (127 out of 515,738, or 0.02%), so it likely won't skew the model or results significantly. However, you might not have expected a dataset of reviews to include rows without any reviews, so it's worth exploring the data to uncover such anomalies.
Now that you've explored the dataset, the next lesson will focus on filtering the data and adding sentiment analysis.
---
## 🚀Challenge
This lesson demonstrates, as we've seen in previous lessons, how critically important it is to understand your data and its quirks before performing operations on it. Text-based data, in particular, requires careful scrutiny. Explore various text-heavy datasets and see if you can identify areas that might introduce bias or skewed sentiment into a model.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Take [this Learning Path on NLP](https://docs.microsoft.com/learn/paths/explore-natural-language-processing/?WT.mc_id=academic-77952-leestott) to discover tools to try when building speech and text-heavy models.
## Assignment
[NLTK](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,19 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "bf39bceb833cd628f224941dca8041df",
"translation_date": "2025-09-06T11:01:55+00:00",
"source_file": "6-NLP/4-Hotel-Reviews-1/assignment.md",
"language_code": "en"
}
-->
# NLTK
## Instructions
NLTK is a well-known library used in computational linguistics and NLP. Take this opportunity to go through the '[NLTK book](https://www.nltk.org/book/)' and try out its exercises. In this ungraded assignment, you will explore this library in greater depth.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T11:02:02+00:00",
"source_file": "6-NLP/4-Hotel-Reviews-1/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "81db6ff2cf6e62fbe2340b094bb9509e",
"translation_date": "2025-09-06T11:01:58+00:00",
"source_file": "6-NLP/4-Hotel-Reviews-1/solution/R/README.md",
"language_code": "en"
}
-->
this is a temporary placeholder
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,384 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "2c742993fe95d5bcbb2846eda3d442a1",
"translation_date": "2025-09-06T11:03:14+00:00",
"source_file": "6-NLP/5-Hotel-Reviews-2/README.md",
"language_code": "en"
}
-->
# Sentiment analysis with hotel reviews
Now that you've explored the dataset in detail, it's time to filter the columns and apply NLP techniques to gain new insights about the hotels.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
### Filtering & Sentiment Analysis Operations
As you may have noticed, the dataset has some issues. Certain columns contain irrelevant information, while others seem inaccurate. Even if they are accurate, it's unclear how the values were calculated, making it impossible to verify them independently.
## Exercise: Additional Data Processing
Clean the data further by adding useful columns, modifying values in others, and removing unnecessary columns.
1. Initial column processing
1. Remove `lat` and `lng`.
2. Replace `Hotel_Address` values with the following format: if the address contains the city and country, simplify it to just the city and country.
These are the only cities and countries in the dataset:
Amsterdam, Netherlands
Barcelona, Spain
London, United Kingdom
Milan, Italy
Paris, France
Vienna, Austria
```python
def replace_address(row):
if "Netherlands" in row["Hotel_Address"]:
return "Amsterdam, Netherlands"
elif "Barcelona" in row["Hotel_Address"]:
return "Barcelona, Spain"
elif "United Kingdom" in row["Hotel_Address"]:
return "London, United Kingdom"
elif "Milan" in row["Hotel_Address"]:
return "Milan, Italy"
elif "France" in row["Hotel_Address"]:
return "Paris, France"
elif "Vienna" in row["Hotel_Address"]:
return "Vienna, Austria"
# Replace all the addresses with a shortened, more useful form
df["Hotel_Address"] = df.apply(replace_address, axis = 1)
# The sum of the value_counts() should add up to the total number of reviews
print(df["Hotel_Address"].value_counts())
```
Now you can query data at the country level:
```python
display(df.groupby("Hotel_Address").agg({"Hotel_Name": "nunique"}))
```
| Hotel_Address | Hotel_Name |
| :--------------------- | :--------: |
| Amsterdam, Netherlands | 105 |
| Barcelona, Spain | 211 |
| London, United Kingdom | 400 |
| Milan, Italy | 162 |
| Paris, France | 458 |
| Vienna, Austria | 158 |
2. Process Hotel Meta-review columns
1. Remove `Additional_Number_of_Scoring`.
2. Replace `Total_Number_of_Reviews` with the actual total number of reviews for each hotel in the dataset.
3. Replace `Average_Score` with a score calculated by you.
```python
# Drop `Additional_Number_of_Scoring`
df.drop(["Additional_Number_of_Scoring"], axis = 1, inplace=True)
# Replace `Total_Number_of_Reviews` and `Average_Score` with our own calculated values
# Counting one non-null column per hotel gives the number of reviews for that hotel
df.Total_Number_of_Reviews = df.groupby('Hotel_Name')['Reviewer_Score'].transform('count')
df.Average_Score = round(df.groupby('Hotel_Name').Reviewer_Score.transform('mean'), 1)
```
3. Process review columns
1. Remove `Review_Total_Negative_Word_Counts`, `Review_Total_Positive_Word_Counts`, `Review_Date`, and `days_since_review`.
2. Keep `Reviewer_Score`, `Negative_Review`, and `Positive_Review` as they are.
3. Retain `Tags` for now.
- Additional filtering will be applied to the tags in the next section, after which they will be removed.
4. Process reviewer columns
1. Remove `Total_Number_of_Reviews_Reviewer_Has_Given`.
2. Retain `Reviewer_Nationality`.
### Tag Columns
The `Tags` column is problematic because it contains a list (in text form) stored in the column. The order and number of subsections in this column are inconsistent. With 515,000 rows and 1,427 hotels, each with slightly different options for reviewers, it's difficult for a human to identify the relevant phrases. This is where NLP excels. By scanning the text, you can identify the most common phrases and count their occurrences.
However, we are not interested in single words but rather multi-word phrases (e.g., *Business trip*). Running a multi-word frequency distribution algorithm on such a large dataset (6,762,646 words) could take a significant amount of time. Without analyzing the data, it might seem like a necessary step. But exploratory data analysis can help reduce the processing time. For example, based on a sample of tags like `[' Business trip ', ' Solo traveler ', ' Single Room ', ' Stayed 5 nights ', ' Submitted from a mobile device ']`, you can determine whether it's possible to simplify the process. Fortunately, it is—but you'll need to follow a few steps to identify the relevant tags.
### Filtering Tags
The goal of the dataset is to add sentiment and columns that help you choose the best hotel (for yourself or for a client who needs a hotel recommendation bot). You need to decide which tags are useful for the final dataset. Here's one interpretation (though different goals might lead to different choices):
1. The type of trip is relevant and should be retained.
2. The type of guest group is important and should be retained.
3. The type of room, suite, or studio the guest stayed in is irrelevant (most hotels offer similar rooms).
4. The device used to submit the review is irrelevant.
5. The number of nights stayed *might* be relevant if longer stays indicate satisfaction, but it's probably not significant.
In summary, **keep two types of tags and discard the rest**.
First, you need to reformat the tags before counting them. This involves removing square brackets and quotes. There are several ways to do this, but you'll want the fastest method since processing a large dataset can be time-consuming. Fortunately, pandas provides an efficient way to handle these steps.
```Python
# Remove opening and closing brackets
df.Tags = df.Tags.str.strip("[']")
# remove all quotes too
df.Tags = df.Tags.str.replace(" ', '", ",", regex = False)
```
Each tag will look like this: `Business trip, Solo traveler, Single Room, Stayed 5 nights, Submitted from a mobile device`.
Next, you'll encounter a problem: some reviews have 5 tags, others have 3, and some have 6. This inconsistency is due to how the dataset was created and is difficult to fix. While this could lead to inaccurate counts, you can use the varying order of tags to your advantage. Since each tag is multi-word and separated by commas, the simplest solution is to create 6 temporary columns, each containing one tag based on its position. You can then merge these columns into one and use the `value_counts()` method to count occurrences. This will reveal 2,428 unique tags. Here's a small sample:
| Tag | Count |
| ------------------------------ | ------ |
| Leisure trip | 417778 |
| Submitted from a mobile device | 307640 |
| Couple | 252294 |
| Stayed 1 night | 193645 |
| Stayed 2 nights | 133937 |
| Solo traveler | 108545 |
| Stayed 3 nights | 95821 |
| Business trip | 82939 |
| Group | 65392 |
| Family with young children | 61015 |
| Stayed 4 nights | 47817 |
| Double Room | 35207 |
| Standard Double Room | 32248 |
| Superior Double Room | 31393 |
| Family with older children | 26349 |
| Deluxe Double Room | 24823 |
| Double or Twin Room | 22393 |
| Stayed 5 nights | 20845 |
| Standard Double or Twin Room | 17483 |
| Classic Double Room | 16989 |
| Superior Double or Twin Room | 13570 |
| 2 rooms | 12393 |
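One way to produce counts like these, as a rough sketch (the `tag_` column names are illustrative, not from the lesson notebooks):

```python
# Split the reformatted Tags string into up to 6 positional columns
tag_df = df["Tags"].str.split(",", n=5, expand=True)
tag_df.columns = ["tag_" + str(i + 1) for i in range(tag_df.shape[1])]

# Stack the positional columns into one Series (missing tags are dropped),
# trim stray whitespace, then count each unique tag
tag_counts = tag_df.stack().str.strip().value_counts()
print(tag_counts.head(20))
```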
Some tags, like `Submitted from a mobile device`, are irrelevant and can be ignored. However, since counting phrases is a fast operation, you can leave them in and disregard them later.
### Removing Length-of-Stay Tags
The first step is to remove length-of-stay tags from consideration. This slightly reduces the total number of tags. Note that you are not removing these tags from the dataset, just excluding them from the analysis.
| Length of stay | Count |
| ---------------- | ------ |
| Stayed 1 night | 193645 |
| Stayed 2 nights | 133937 |
| Stayed 3 nights | 95821 |
| Stayed 4 nights | 47817 |
| Stayed 5 nights | 20845 |
| Stayed 6 nights | 9776 |
| Stayed 7 nights | 7399 |
| Stayed 8 nights | 2502 |
| Stayed 9 nights | 1293 |
| ... | ... |
Similarly, there are many types of rooms, suites, and studios. These are all essentially the same and irrelevant to your analysis, so exclude them as well.
| Type of room | Count |
| ----------------------------- | ----- |
| Double Room | 35207 |
| Standard Double Room | 32248 |
| Superior Double Room | 31393 |
| Deluxe Double Room | 24823 |
| Double or Twin Room | 22393 |
| Standard Double or Twin Room | 17483 |
| Classic Double Room | 16989 |
| Superior Double or Twin Room | 13570 |
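Continuing the sketch above, one way to exclude the length-of-stay, room-type, and device tags from `tag_counts` (the patterns are illustrative, not exhaustive):

```python
# Drop length-of-stay tags, anything that looks like a room/suite/studio, and device tags
irrelevant = (
    tag_counts.index.str.startswith("Stayed")
    | tag_counts.index.str.contains("room|suite|studio", case=False)
    | (tag_counts.index == "Submitted from a mobile device")
)
useful_tags = tag_counts[~irrelevant]
print(useful_tags.head(10))
```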
Finally, after minimal processing, you'll be left with the following *useful* tags:
| Tag | Count |
| --------------------------------------------- | ------ |
| Leisure trip | 417778 |
| Couple | 252294 |
| Solo traveler | 108545 |
| Business trip | 82939 |
| Group (combined with Travellers with friends) | 67535 |
| Family with young children | 61015 |
| Family with older children | 26349 |
| With a pet | 1405 |
You might combine `Travellers with friends` and `Group` into one category, as shown above. The code for identifying the correct tags can be found in [the Tags notebook](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/5-Hotel-Reviews-2/solution/1-notebook.ipynb).
The final step is to create new columns for each of these tags. For every review row, if the `Tag` column matches one of the new columns, assign a value of 1; otherwise, assign 0. This will allow you to count how many reviewers chose a hotel for reasons like business, leisure, or bringing a pet—valuable information for hotel recommendations.
```python
# Process the Tags into new columns
# The file Hotel_Reviews_Tags.py, identifies the most important tags
# Leisure trip, Couple, Solo traveler, Business trip, Group combined with Travelers with friends,
# Family with young children, Family with older children, With a pet
df["Leisure_trip"] = df.Tags.apply(lambda tag: 1 if "Leisure trip" in tag else 0)
df["Couple"] = df.Tags.apply(lambda tag: 1 if "Couple" in tag else 0)
df["Solo_traveler"] = df.Tags.apply(lambda tag: 1 if "Solo traveler" in tag else 0)
df["Business_trip"] = df.Tags.apply(lambda tag: 1 if "Business trip" in tag else 0)
df["Group"] = df.Tags.apply(lambda tag: 1 if "Group" in tag or "Travelers with friends" in tag else 0)
df["Family_with_young_children"] = df.Tags.apply(lambda tag: 1 if "Family with young children" in tag else 0)
df["Family_with_older_children"] = df.Tags.apply(lambda tag: 1 if "Family with older children" in tag else 0)
df["With_a_pet"] = df.Tags.apply(lambda tag: 1 if "With a pet" in tag else 0)
```
### Save Your File
Finally, save the updated dataset with a new name.
```python
df.drop(["Review_Total_Negative_Word_Counts", "Review_Total_Positive_Word_Counts", "days_since_review", "Total_Number_of_Reviews_Reviewer_Has_Given"], axis = 1, inplace=True)
# Saving new data file with calculated columns
print("Saving results to Hotel_Reviews_Filtered.csv")
df.to_csv(r'../../data/Hotel_Reviews_Filtered.csv', index = False)
```
## Sentiment Analysis Operations
In this final section, you'll apply sentiment analysis to the review columns and save the results in the dataset.
## Exercise: Load and Save the Filtered Data
Make sure to load the filtered dataset saved in the previous section, **not** the original dataset.
```python
import time
import pandas as pd
import nltk as nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
# Load the filtered hotel reviews from CSV
df = pd.read_csv('../../data/Hotel_Reviews_Filtered.csv')
# Your code will be added here
# Finally remember to save the hotel reviews with new NLP data added
print("Saving results to Hotel_Reviews_NLP.csv")
df.to_csv(r'../../data/Hotel_Reviews_NLP.csv', index = False)
```
### Removing Stop Words
Running sentiment analysis on the Negative and Positive review columns can take a long time. On a high-performance test laptop, it took 12 to 14 minutes, depending on the sentiment library used. This is relatively slow, so it's worth exploring ways to speed up the process.
Removing stop words—common English words that don't affect sentiment—is the first step. By eliminating these words, sentiment analysis should run faster without losing accuracy. Stop words don't influence sentiment but do slow down processing.
For example, the longest negative review in the dataset was 395 words. After removing stop words, it was reduced to 195 words.
Removing stop words is a quick operation. On the test device, it took 3.3 seconds to process the stop words in two review columns across 515,000 rows. The exact time may vary depending on your device's CPU, RAM, storage type (SSD or HDD), and other factors. Given the short processing time, it's worth doing if it speeds up sentiment analysis.
```python
from nltk.corpus import stopwords
# Load the hotel reviews from CSV
df = pd.read_csv("../../data/Hotel_Reviews_Filtered.csv")
# Remove stop words - can be slow for a lot of text!
# Ryan Han (ryanxjhan on Kaggle) has a great post measuring performance of different stop words removal approaches
# https://www.kaggle.com/ryanxjhan/fast-stop-words-removal # using the approach that Ryan recommends
start = time.time()
cache = set(stopwords.words("english"))
def remove_stopwords(review):
text = " ".join([word for word in review.split() if word not in cache])
return text
# Remove the stop words from both columns
df.Negative_Review = df.Negative_Review.apply(remove_stopwords)
df.Positive_Review = df.Positive_Review.apply(remove_stopwords)
```
### Performing Sentiment Analysis
Now calculate sentiment scores for both the negative and positive review columns, storing the results in two new columns. You can test the sentiment analysis by comparing the sentiment scores to the reviewer's score for the same review. For example, if the sentiment analysis assigns a positive score to both the negative and positive reviews, but the reviewer gave the hotel the lowest possible score, there may be a mismatch between the review text and the score. Alternatively, the sentiment analyzer might have misinterpreted the sentiment.
Some sentiment scores will inevitably be incorrect. For instance, sarcasm in a review—e.g., "Of course I LOVED sleeping in a room with no heating"—might be interpreted as positive sentiment by the analyzer, even though a human reader would recognize the sarcasm.
NLTK offers various sentiment analyzers to experiment with, allowing you to swap them out and assess whether the sentiment analysis becomes more or less accurate. In this example, the VADER sentiment analysis tool is utilized.
> Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Create the vader sentiment analyser (there are others in NLTK you can try too)
vader_sentiment = SentimentIntensityAnalyzer()
# Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
# There are 3 possibilities of input for a review:
# It could be "No Negative", in which case, return 0
# It could be "No Positive", in which case, return 0
# It could be a review, in which case calculate the sentiment
def calc_sentiment(review):
if review == "No Negative" or review == "No Positive":
return 0
return vader_sentiment.polarity_scores(review)["compound"]
```
Later in your program, when you're ready to calculate sentiment, you can apply it to each review as shown below:
```python
# Add a negative sentiment and positive sentiment column
print("Calculating sentiment columns for both positive and negative reviews")
start = time.time()
df["Negative_Sentiment"] = df.Negative_Review.apply(calc_sentiment)
df["Positive_Sentiment"] = df.Positive_Review.apply(calc_sentiment)
end = time.time()
print("Calculating sentiment took " + str(round(end - start, 2)) + " seconds")
```
On my computer, this process takes roughly 120 seconds, though the time may vary depending on the machine. If you'd like to print the results and check whether the sentiment aligns with the review:
```python
df = df.sort_values(by=["Negative_Sentiment"], ascending=True)
print(df[["Negative_Review", "Negative_Sentiment"]])
df = df.sort_values(by=["Positive_Sentiment"], ascending=True)
print(df[["Positive_Review", "Positive_Sentiment"]])
```
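To surface the kind of text-versus-score mismatch described earlier, you could also filter the dataframe directly. The thresholds below (3 for the reviewer score, 0.5 for the sentiment scores) are arbitrary values chosen for illustration, not part of the lesson:

```python
# Reviews where the reviewer gave a very low score but both review columns still read as positive
mismatches = df[(df["Reviewer_Score"] <= 3)
                & (df["Negative_Sentiment"] > 0.5)
                & (df["Positive_Sentiment"] > 0.5)]
print(mismatches[["Reviewer_Score", "Negative_Sentiment", "Positive_Sentiment", "Negative_Review"]].head())
```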
The final step before using the file in the challenge is to save it! Additionally, you might want to reorder all your new columns to make them more user-friendly (this is purely a cosmetic adjustment for easier handling).
```python
# Reorder the columns (This is cosmetic, but to make it easier to explore the data later)
df = df.reindex(["Hotel_Name", "Hotel_Address", "Total_Number_of_Reviews", "Average_Score", "Reviewer_Score", "Negative_Sentiment", "Positive_Sentiment", "Reviewer_Nationality", "Leisure_trip", "Couple", "Solo_traveler", "Business_trip", "Group", "Family_with_young_children", "Family_with_older_children", "With_a_pet", "Negative_Review", "Positive_Review"], axis=1)
print("Saving results to Hotel_Reviews_NLP.csv")
df.to_csv(r"../data/Hotel_Reviews_NLP.csv", index = False)
```
You should execute the complete code from [the analysis notebook](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/5-Hotel-Reviews-2/solution/3-notebook.ipynb) (after running [the filtering notebook](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/5-Hotel-Reviews-2/solution/1-notebook.ipynb) to generate the Hotel_Reviews_Filtered.csv file).
To summarize, the steps are:
1. The original dataset file **Hotel_Reviews.csv** is explored in the previous lesson using [the explorer notebook](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/4-Hotel-Reviews-1/solution/notebook.ipynb).
2. **Hotel_Reviews.csv** is filtered using [the filtering notebook](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/5-Hotel-Reviews-2/solution/1-notebook.ipynb), resulting in **Hotel_Reviews_Filtered.csv**.
3. **Hotel_Reviews_Filtered.csv** is processed using [the sentiment analysis notebook](https://github.com/microsoft/ML-For-Beginners/blob/main/6-NLP/5-Hotel-Reviews-2/solution/3-notebook.ipynb), producing **Hotel_Reviews_NLP.csv**.
4. Use **Hotel_Reviews_NLP.csv** in the NLP Challenge below.
### Conclusion
At the beginning, you had a dataset with columns and data, but not all of it was usable or verifiable. You've explored the data, filtered out unnecessary parts, transformed tags into meaningful information, calculated averages, added sentiment columns, and hopefully gained valuable insights into processing natural text.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Challenge
Now that your dataset has been analyzed for sentiment, try applying strategies you've learned in this curriculum (such as clustering) to identify patterns related to sentiment.
## Review & Self Study
Take [this Learn module](https://docs.microsoft.com/en-us/learn/modules/classify-user-feedback-with-the-text-analytics-api/?WT.mc_id=academic-77952-leestott) to deepen your understanding and explore sentiment analysis using different tools.
## Assignment
[Experiment with a different dataset](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "daf144daa552da6a7d442aff6f3e77d8",
"translation_date": "2025-09-06T11:03:48+00:00",
"source_file": "6-NLP/5-Hotel-Reviews-2/assignment.md",
"language_code": "en"
}
-->
# Try a different dataset
## Instructions
Now that you've learned how to use NLTK to analyze sentiment in text, experiment with a different dataset. You'll likely need to perform some data preprocessing, so create a notebook and document your thought process. What insights do you gain?
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | ---------------------------------------------------------------------------------------------------------------- | ----------------------------------------- | ---------------------- |
| | A comprehensive notebook and dataset are provided, with well-documented cells explaining the sentiment analysis | The notebook lacks clear explanations | The notebook has major issues |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T11:03:56+00:00",
"source_file": "6-NLP/5-Hotel-Reviews-2/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "81db6ff2cf6e62fbe2340b094bb9509e",
"translation_date": "2025-09-06T11:03:52+00:00",
"source_file": "6-NLP/5-Hotel-Reviews-2/solution/R/README.md",
"language_code": "en"
}
-->
this is a temporary placeholder
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,38 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1eb379dc2d0c9940b320732d16083778",
"translation_date": "2025-09-06T11:00:30+00:00",
"source_file": "6-NLP/README.md",
"language_code": "en"
}
-->
# Getting started with natural language processing
Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken and written—commonly referred to as natural language. It is a branch of artificial intelligence (AI). NLP has been around for over 50 years and has its origins in the field of linguistics. The entire field focuses on enabling machines to comprehend and process human language. This capability can then be applied to tasks such as spell checking or machine translation. NLP has numerous practical applications across various domains, including medical research, search engines, and business intelligence.
## Regional topic: European languages and literature and romantic hotels of Europe ❤️
In this part of the curriculum, you'll explore one of the most prevalent applications of machine learning: natural language processing (NLP). Rooted in computational linguistics, this area of artificial intelligence serves as the connection between humans and machines through voice or text-based communication.
Throughout these lessons, we'll cover the fundamentals of NLP by creating small conversational bots to understand how machine learning enhances the intelligence of these interactions. You'll take a journey back in time, engaging in conversations with Elizabeth Bennett and Mr. Darcy from Jane Austen's timeless novel, **Pride and Prejudice**, published in 1813. Afterward, you'll deepen your understanding by exploring sentiment analysis using hotel reviews from Europe.
![Pride and Prejudice book and tea](../../../6-NLP/images/p&p.jpg)
> Photo by <a href="https://unsplash.com/@elaineh?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Elaine Howlin</a> on <a href="https://unsplash.com/s/photos/pride-and-prejudice?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## Lessons
1. [Introduction to natural language processing](1-Introduction-to-NLP/README.md)
2. [Common NLP tasks and techniques](2-Tasks/README.md)
3. [Translation and sentiment analysis with machine learning](3-Translation-Sentiment/README.md)
4. [Preparing your data](4-Hotel-Reviews-1/README.md)
5. [NLTK for Sentiment Analysis](5-Hotel-Reviews-2/README.md)
## Credits
These natural language processing lessons were written with ☕ by [Stephen Howell](https://twitter.com/Howell_MSFT)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "ee0670655c89e4719319764afb113624",
"translation_date": "2025-09-06T11:02:05+00:00",
"source_file": "6-NLP/data/README.md",
"language_code": "en"
}
-->
Download the hotel review data to this folder.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,199 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "662b509c39eee205687726636d0a8455",
"translation_date": "2025-09-06T10:48:49+00:00",
"source_file": "7-TimeSeries/1-Introduction/README.md",
"language_code": "en"
}
-->
# Introduction to time series forecasting
![Summary of time series in a sketchnote](../../../../sketchnotes/ml-timeseries.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
In this lesson and the next, you'll learn about time series forecasting, an intriguing and valuable skill for machine learning scientists that is less commonly discussed compared to other topics. Time series forecasting is like a "crystal ball": by analyzing past behavior of a variable, such as price, you can predict its potential future value.
[![Introduction to time series forecasting](https://img.youtube.com/vi/cBojo1hsHiI/0.jpg)](https://youtu.be/cBojo1hsHiI "Introduction to time series forecasting")
> 🎥 Click the image above to watch a video about time series forecasting
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
This field is both fascinating and practical, offering significant value to businesses due to its direct applications in pricing, inventory management, and supply chain optimization. While deep learning techniques are increasingly being used to improve predictions, time series forecasting remains heavily influenced by traditional machine learning methods.
> Penn State offers a helpful time series curriculum [here](https://online.stat.psu.edu/stat510/lesson/1)
## Introduction
Imagine you manage a network of smart parking meters that collect data on usage frequency and duration over time.
> What if you could predict the future value of a meter based on its past performance, using principles of supply and demand?
Forecasting the right time to act in order to achieve your goals is a challenge that time series forecasting can address. While charging more during peak times might not make people happy, it could be an effective way to generate revenue for street maintenance!
Let's dive into some types of time series algorithms and start working with a notebook to clean and prepare data. The dataset you'll analyze comes from the GEFCom2014 forecasting competition. It includes three years of hourly electricity load and temperature data from 2012 to 2014. Using historical patterns in electricity load and temperature, you can predict future electricity load values.
In this example, you'll learn how to forecast one time step ahead using only historical load data. But before we begin, it's important to understand the underlying concepts.
## Some definitions
When you encounter the term "time series," it's important to understand its use in various contexts.
🎓 **Time series**
In mathematics, "a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time." An example of a time series is the daily closing value of the [Dow Jones Industrial Average](https://wikipedia.org/wiki/Time_series). Time series plots and statistical modeling are often used in signal processing, weather forecasting, earthquake prediction, and other fields where events occur and data points can be tracked over time.
🎓 **Time series analysis**
Time series analysis involves examining the time series data mentioned above. This data can take various forms, including "interrupted time series," which identifies patterns before and after a disruptive event. The type of analysis depends on the nature of the data, which can consist of numbers or characters.
The analysis employs various methods, such as frequency-domain and time-domain approaches, linear and nonlinear techniques, and more. [Learn more](https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm) about the different ways to analyze this type of data.
🎓 **Time series forecasting**
Time series forecasting uses a model to predict future values based on patterns observed in past data. While regression models can be used to explore time series data, with time indices as x variables on a plot, specialized models are better suited for this type of analysis.
Time series data consists of ordered observations, unlike data analyzed through linear regression. The most common model is ARIMA, which stands for "Autoregressive Integrated Moving Average."
[ARIMA models](https://online.stat.psu.edu/stat510/lesson/1/1.1) "relate the present value of a series to past values and past prediction errors." These models are particularly useful for analyzing time-domain data, where observations are ordered chronologically.
> There are several types of ARIMA models, which you can explore [here](https://people.duke.edu/~rnau/411arim.htm) and will learn about in the next lesson.
In the next lesson, you'll build an ARIMA model using [Univariate Time Series](https://itl.nist.gov/div898/handbook/pmc/section4/pmc44.htm), which focuses on a single variable that changes over time. An example of this type of data is [this dataset](https://itl.nist.gov/div898/handbook/pmc/section4/pmc4411.htm) that records monthly CO2 concentrations at the Mauna Loa Observatory:
| CO2 | YearMonth | Year | Month |
| :----: | :-------: | :---: | :---: |
| 330.62 | 1975.04 | 1975 | 1 |
| 331.40 | 1975.13 | 1975 | 2 |
| 331.87 | 1975.21 | 1975 | 3 |
| 333.18 | 1975.29 | 1975 | 4 |
| 333.92 | 1975.38 | 1975 | 5 |
| 333.43 | 1975.46 | 1975 | 6 |
| 331.85 | 1975.54 | 1975 | 7 |
| 330.01 | 1975.63 | 1975 | 8 |
| 328.51 | 1975.71 | 1975 | 9 |
| 328.41 | 1975.79 | 1975 | 10 |
| 329.25 | 1975.88 | 1975 | 11 |
| 330.97 | 1975.96 | 1975 | 12 |
✅ Identify the variable that changes over time in this dataset.
## Time Series data characteristics to consider
When analyzing time series data, you may notice [certain characteristics](https://online.stat.psu.edu/stat510/lesson/1/1.1) that need to be addressed to better understand its patterns. If you think of time series data as providing a "signal" you want to analyze, these characteristics can be considered "noise." You'll often need to reduce this "noise" using statistical techniques.
Here are some key concepts to understand when working with time series data:
🎓 **Trends**
Trends refer to measurable increases or decreases over time. [Read more](https://machinelearningmastery.com/time-series-trends-in-python) about how to identify and, if necessary, remove trends from your time series.
🎓 **[Seasonality](https://machinelearningmastery.com/time-series-seasonality-with-python/)**
Seasonality refers to periodic fluctuations, such as holiday sales spikes. [Learn more](https://itl.nist.gov/div898/handbook/pmc/section4/pmc443.htm) about how different types of plots reveal seasonality in data.
🎓 **Outliers**
Outliers are data points that deviate significantly from the standard variance.
🎓 **Long-run cycle**
Independent of seasonality, data may exhibit long-term cycles, such as economic downturns lasting over a year.
🎓 **Constant variance**
Some data show consistent fluctuations over time, like daily and nightly energy usage.
🎓 **Abrupt changes**
Data may display sudden changes that require further analysis. For example, the abrupt closure of businesses due to COVID caused significant shifts in data.
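If you'd like to see how trend and seasonality can be separated in practice, here is a minimal sketch using `seasonal_decompose` from `statsmodels` on a synthetic hourly series. The data and the daily period of 24 are illustrative assumptions, not the lesson's dataset:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic hourly data: an upward trend plus a repeating daily cycle
index = pd.date_range("2014-01-01", periods=24 * 60, freq="H")
values = 2000 + 0.5 * np.arange(len(index)) + 300 * np.sin(2 * np.pi * index.hour / 24)
series = pd.Series(values, index=index)

# Split the series into trend, seasonal, and residual components
decomposition = seasonal_decompose(series, model="additive", period=24)
decomposition.plot()
plt.show()
```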
✅ Here's a [sample time series plot](https://www.kaggle.com/kashnitsky/topic-9-part-1-time-series-analysis-in-python) showing daily in-game currency spent over several years. Can you identify any of the characteristics listed above in this data?
![In-game currency spend](../../../../7-TimeSeries/1-Introduction/images/currency.png)
## Exercise - getting started with power usage data
Let's begin creating a time series model to predict future power usage based on past usage.
> The data in this example comes from the GEFCom2014 forecasting competition. It includes three years of hourly electricity load and temperature data from 2012 to 2014.
>
> Tao Hong, Pierre Pinson, Shu Fan, Hamidreza Zareipour, Alberto Troccoli, and Rob J. Hyndman, "Probabilistic energy forecasting: Global Energy Forecasting Competition 2014 and beyond," International Journal of Forecasting, vol.32, no.3, pp 896-913, July-September, 2016.
1. In the `working` folder of this lesson, open the _notebook.ipynb_ file. Start by adding libraries to load and visualize data:
```python
import os
import matplotlib.pyplot as plt
from common.utils import load_data
%matplotlib inline
```
Note: You're using files from the included `common` folder, which sets up your environment and handles data downloading.
2. Next, examine the data as a dataframe by calling `load_data()` and `head()`:
```python
data_dir = './data'
energy = load_data(data_dir)[['load']]
energy.head()
```
You'll see two columns representing date and load:
| | load |
| :-----------------: | :----: |
| 2012-01-01 00:00:00 | 2698.0 |
| 2012-01-01 01:00:00 | 2558.0 |
| 2012-01-01 02:00:00 | 2444.0 |
| 2012-01-01 03:00:00 | 2402.0 |
| 2012-01-01 04:00:00 | 2403.0 |
3. Now, plot the data by calling `plot()`:
```python
energy.plot(y='load', subplots=True, figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
```
![energy plot](../../../../7-TimeSeries/1-Introduction/images/energy-plot.png)
4. Next, plot the first week of July 2014 by providing the date range as input to `energy` in `[from date]: [to date]` format:
```python
energy['2014-07-01':'2014-07-07'].plot(y='load', subplots=True, figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
```
![july](../../../../7-TimeSeries/1-Introduction/images/july-2014.png)
A beautiful plot! Examine these plots and see if you can identify any of the characteristics listed above. What insights can you gather by visualizing the data?
In the next lesson, you'll create an ARIMA model to generate forecasts.
---
## 🚀Challenge
Make a list of industries and fields that could benefit from time series forecasting. Can you think of applications in the arts? Econometrics? Ecology? Retail? Industry? Finance? What other areas?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
Although not covered here, neural networks are sometimes used to enhance traditional time series forecasting methods. Read more about them [in this article](https://medium.com/microsoftazure/neural-networks-for-forecasting-financial-and-economic-time-series-6aca370ff412).
## Assignment
[Visualize additional time series](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "d1781b0b92568ea1d119d0a198b576b4",
"translation_date": "2025-09-06T10:49:16+00:00",
"source_file": "7-TimeSeries/1-Introduction/assignment.md",
"language_code": "en"
}
-->
# Visualize some more Time Series
## Instructions
You've started learning about Time Series Forecasting by exploring the type of data that requires this specialized modeling. You've visualized some data related to energy. Now, search for other datasets that could benefit from Time Series Forecasting. Find three examples (consider [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/catalog/?WT.mc_id=academic-77952-leestott)) and create a notebook to visualize them. Make notes in the notebook about any unique characteristics they exhibit (seasonality, sudden changes, or other trends).
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | ----------------------------------------------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| | Three datasets are visualized and explained in a notebook | Two datasets are visualized and explained in a notebook | Few datasets are visualized or explained in a notebook, or the data provided is insufficient |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:49:25+00:00",
"source_file": "7-TimeSeries/1-Introduction/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "81db6ff2cf6e62fbe2340b094bb9509e",
"translation_date": "2025-09-06T10:49:22+00:00",
"source_file": "7-TimeSeries/1-Introduction/solution/R/README.md",
"language_code": "en"
}
-->
this is a temporary placeholder
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,406 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "917dbf890db71a322f306050cb284749",
"translation_date": "2025-09-06T10:48:08+00:00",
"source_file": "7-TimeSeries/2-ARIMA/README.md",
"language_code": "en"
}
-->
# Time series forecasting with ARIMA
In the previous lesson, you explored time series forecasting and worked with a dataset showing variations in electrical load over time.
[![Introduction to ARIMA](https://img.youtube.com/vi/IUSk-YDau10/0.jpg)](https://youtu.be/IUSk-YDau10 "Introduction to ARIMA")
> 🎥 Click the image above to watch a video: A brief introduction to ARIMA models. The example uses R, but the concepts apply universally.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Introduction
In this lesson, you'll learn how to build models using [ARIMA: *A*uto*R*egressive *I*ntegrated *M*oving *A*verage](https://wikipedia.org/wiki/Autoregressive_integrated_moving_average). ARIMA models are particularly effective for analyzing data with [non-stationarity](https://wikipedia.org/wiki/Stationary_process).
## General concepts
To work with ARIMA, you need to understand a few key concepts:
- 🎓 **Stationarity**: In statistics, stationarity refers to data whose distribution remains constant over time. Non-stationary data, on the other hand, exhibits trends or fluctuations that need to be transformed for analysis. For example, seasonality can cause fluctuations in data, which can be addressed through 'seasonal differencing.'
- 🎓 **[Differencing](https://wikipedia.org/wiki/Autoregressive_integrated_moving_average#Differencing)**: Differencing is a statistical technique used to transform non-stationary data into stationary data by removing trends. "Differencing eliminates changes in the level of a time series, removing trends and seasonality, and stabilizing the mean of the time series." [Paper by Shixiong et al](https://arxiv.org/abs/1904.07632)
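As a concrete illustration of differencing, the sketch below builds a toy trending series with pandas and removes the trend with a first difference. The data is synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

# A toy non-stationary series: a linear trend plus random noise
rng = np.random.default_rng(0)
trend_series = pd.Series(10 + 0.5 * np.arange(100) + rng.normal(0, 1, 100))

# First-order differencing removes the trend, leaving a series with a roughly constant mean
differenced = trend_series.diff().dropna()
print("Mean of original:   ", round(trend_series.mean(), 2))
print("Mean of differenced:", round(differenced.mean(), 2))  # close to the slope, 0.5
```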
## ARIMA in the context of time series
Let's break down the components of ARIMA to understand how it models time series data and enables predictions.
- **AR - AutoRegressive**: Autoregressive models analyze past values (lags) in your data to make predictions. For example, if you have monthly sales data for pencils, each month's sales total is considered an 'evolving variable.' The model is built by regressing the variable of interest on its lagged (previous) values. [Wikipedia](https://wikipedia.org/wiki/Autoregressive_integrated_moving_average)
- **I - Integrated**: Unlike ARMA models, the 'I' in ARIMA refers to its *[integrated](https://wikipedia.org/wiki/Order_of_integration)* aspect. Integration involves applying differencing steps to eliminate non-stationarity.
- **MA - Moving Average**: The [moving-average](https://wikipedia.org/wiki/Moving-average_model) component of the model uses current and past values of lags to determine the output variable.
In summary, ARIMA is designed to fit time series data as closely as possible for effective modeling and forecasting.
## Exercise - Build an ARIMA model
Navigate to the [_/working_](https://github.com/microsoft/ML-For-Beginners/tree/main/7-TimeSeries/2-ARIMA/working) folder in this lesson and locate the [_notebook.ipynb_](https://github.com/microsoft/ML-For-Beginners/blob/main/7-TimeSeries/2-ARIMA/working/notebook.ipynb) file.
1. Run the notebook to load the `statsmodels` Python library, which is required for ARIMA models.
1. Import the necessary libraries.
1. Next, load additional libraries for data visualization:
```python
import os
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime as dt
import math
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.preprocessing import MinMaxScaler
from common.utils import load_data, mape
from IPython.display import Image
%matplotlib inline
pd.options.display.float_format = '{:,.2f}'.format
np.set_printoptions(precision=2)
warnings.filterwarnings("ignore") # specify to ignore warning messages
```
1. Load the data from the `/data/energy.csv` file into a Pandas dataframe and inspect it:
```python
energy = load_data('./data')[['load']]
energy.head(10)
```
1. Plot the energy data from January 2012 to December 2014. This data should look familiar from the previous lesson:
```python
energy.plot(y='load', subplots=True, figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
```
Now, let's build a model!
### Create training and testing datasets
After loading the data, split it into training and testing sets. The model will be trained on the training set and evaluated for accuracy using the testing set. Ensure the testing set covers a later time period than the training set to avoid data leakage.
1. Assign the period from November 1 to December 29, 2014 to the training set. The testing set will cover December 30 and 31, 2014:
```python
train_start_dt = '2014-11-01 00:00:00'
test_start_dt = '2014-12-30 00:00:00'
```
Since the data represents daily energy consumption, it exhibits a strong seasonal pattern, but recent days' consumption is most similar to current consumption.
1. Visualize the differences:
```python
energy[(energy.index < test_start_dt) & (energy.index >= train_start_dt)][['load']].rename(columns={'load':'train'}) \
.join(energy[test_start_dt:][['load']].rename(columns={'load':'test'}), how='outer') \
.plot(y=['train', 'test'], figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
```
![training and testing data](../../../../7-TimeSeries/2-ARIMA/images/train-test.png)
Using a relatively small training window should suffice.
> Note: The function used to fit the ARIMA model performs in-sample validation during fitting, so validation data is omitted.
### Prepare the data for training
Prepare the data for training by filtering and scaling it. Filter the dataset to include only the required time periods and columns, and scale the data to fit within the range 0 to 1.
1. Filter the dataset to include only the specified time periods and the 'load' column along with the date:
```python
train = energy.copy()[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']]
test = energy.copy()[energy.index >= test_start_dt][['load']]
print('Training data shape: ', train.shape)
print('Test data shape: ', test.shape)
```
Check the shape of the filtered data:
```output
Training data shape: (1416, 1)
Test data shape: (48, 1)
```
1. Scale the data to fit within the range (0, 1):
```python
scaler = MinMaxScaler()
train['load'] = scaler.fit_transform(train)
train.head(10)
```
1. Compare the original data to the scaled data:
```python
energy[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']].rename(columns={'load':'original load'}).plot.hist(bins=100, fontsize=12)
train.rename(columns={'load':'scaled load'}).plot.hist(bins=100, fontsize=12)
plt.show()
```
![original](../../../../7-TimeSeries/2-ARIMA/images/original.png)
> The original data
![scaled](../../../../7-TimeSeries/2-ARIMA/images/scaled.png)
> The scaled data
1. Scale the test data using the same approach:
```python
test['load'] = scaler.transform(test)
test.head()
```
### Implement ARIMA
Now it's time to implement ARIMA using the `statsmodels` library.
Follow these steps:
1. Define the model by calling `SARIMAX()` and specifying the model parameters: p, d, q, and P, D, Q.
2. Train the model on the training data using the `fit()` function.
3. Make predictions using the `forecast()` function, specifying the number of steps (the `horizon`) to forecast.
> 🎓 What do these parameters mean? ARIMA models use three parameters to capture key aspects of time series data: seasonality, trend, and noise.
>
> - `p`: the auto-regressive component, which incorporates past values.
> - `d`: the integrated component, which determines the level of differencing to apply.
> - `q`: the moving-average component.
> Note: For seasonal data (like this dataset), use a seasonal ARIMA model (SARIMA) with additional parameters: `P`, `D`, and `Q`, which correspond to the seasonal components of `p`, `d`, and `q`.
1. Set the horizon value to 3 hours:
```python
# Specify the number of steps to forecast ahead
HORIZON = 3
print('Forecasting horizon:', HORIZON, 'hours')
```
Selecting optimal ARIMA parameters can be challenging. Consider using the `auto_arima()` function from the [`pyramid` library](https://alkaline-ml.com/pmdarima/0.9.0/modules/generated/pyramid.arima.auto_arima.html).
1. For now, manually select parameters to find a suitable model:
```python
order = (4, 1, 0)
seasonal_order = (1, 1, 0, 24)
model = SARIMAX(endog=train, order=order, seasonal_order=seasonal_order)
results = model.fit()
print(results.summary())
```
A table of results is displayed.
Congratulations! You've built your first model. Next, evaluate its performance.
### Evaluate your model
Evaluate your model using `walk forward` validation. In practice, time series models are retrained whenever new data becomes available, enabling the best forecast at each time step.
Using this technique:
1. Train the model on the training set.
2. Predict the next time step.
3. Compare the prediction to the actual value.
4. Expand the training set to include the actual value and repeat the process.
> Note: Keep the training set window fixed for efficiency. When adding a new observation to the training set, remove the oldest observation.
This approach provides a robust evaluation of model performance but requires significant computational resources. It's ideal for small datasets or simple models but may be challenging at scale.
Walk-forward validation is the gold standard for time series model evaluation and is recommended for your projects.
1. Create a test data point for each HORIZON step:
```python
test_shifted = test.copy()
for t in range(1, HORIZON+1):
test_shifted['load+'+str(t)] = test_shifted['load'].shift(-t, freq='H')
test_shifted = test_shifted.dropna(how='any')
test_shifted.head(5)
```
| date | time | load | load+1 | load+2 |
| ---------- | -------- | ---- | ------ | ------ |
| 2014-12-30 | 00:00:00 | 0.33 | 0.29 | 0.27 |
| 2014-12-30 | 01:00:00 | 0.29 | 0.27 | 0.27 |
| 2014-12-30 | 02:00:00 | 0.27 | 0.27 | 0.30 |
| 2014-12-30 | 03:00:00 | 0.27 | 0.30 | 0.41 |
| 2014-12-30 | 04:00:00 | 0.30 | 0.41 | 0.57 |
The data shifts horizontally based on the horizon point.
1. Use a sliding window approach to make predictions on the test data in a loop:
```python
%%time
training_window = 720 # dedicate 30 days (720 hours) for training
train_ts = train['load']
test_ts = test_shifted
history = [x for x in train_ts]
history = history[(-training_window):]
predictions = list()
order = (2, 1, 0)
seasonal_order = (1, 1, 0, 24)
for t in range(test_ts.shape[0]):
model = SARIMAX(endog=history, order=order, seasonal_order=seasonal_order)
model_fit = model.fit()
yhat = model_fit.forecast(steps = HORIZON)
predictions.append(yhat)
obs = list(test_ts.iloc[t])
# move the training window
history.append(obs[0])
history.pop(0)
print(test_ts.index[t])
print(t+1, ': predicted =', yhat, 'expected =', obs)
```
Observe the training process:
```output
2014-12-30 00:00:00
1 : predicted = [0.32 0.29 0.28] expected = [0.32945389435989236, 0.2900626678603402, 0.2739480752014323]
2014-12-30 01:00:00
2 : predicted = [0.3 0.29 0.3 ] expected = [0.2900626678603402, 0.2739480752014323, 0.26812891674127126]
2014-12-30 02:00:00
3 : predicted = [0.27 0.28 0.32] expected = [0.2739480752014323, 0.26812891674127126, 0.3025962399283795]
```
1. Compare predictions to actual load values:
```python
eval_df = pd.DataFrame(predictions, columns=['t+'+str(t) for t in range(1, HORIZON+1)])
eval_df['timestamp'] = test.index[0:len(test.index)-HORIZON+1]
eval_df = pd.melt(eval_df, id_vars='timestamp', value_name='prediction', var_name='h')
eval_df['actual'] = np.array(np.transpose(test_ts)).ravel()
eval_df[['prediction', 'actual']] = scaler.inverse_transform(eval_df[['prediction', 'actual']])
eval_df.head()
```
Output:
| | date | time | h | prediction | actual |
| --- | ---------- | --------- | --- | ---------- | -------- |
| 0 | 2014-12-30 | 00:00:00 | t+1 | 3,008.74 | 3,023.00 |
| 1 | 2014-12-30 | 01:00:00 | t+1 | 2,955.53 | 2,935.00 |
| 2 | 2014-12-30 | 02:00:00 | t+1 | 2,900.17 | 2,899.00 |
| 3 | 2014-12-30 | 03:00:00 | t+1 | 2,917.69 | 2,886.00 |
| 4 | 2014-12-30 | 04:00:00 | t+1 | 2,946.99 | 2,963.00 |
Examine the hourly predictions compared to actual load values. How accurate are they?
### Check model accuracy
Assess your model's accuracy by calculating its mean absolute percentage error (MAPE) across all predictions. MAPE expresses accuracy as a ratio: the absolute difference between the actual and predicted values is divided by the actual value. "The absolute value in this calculation is summed for every forecasted point in time and divided by the number of fitted points n." [wikipedia](https://wikipedia.org/wiki/Mean_absolute_percentage_error)
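In equation form, the standard definition of MAPE is:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$$

where $A_t$ is the actual value and $F_t$ is the forecasted value at time $t$.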
1. Express the equation in code:
```python
if(HORIZON > 1):
eval_df['APE'] = (eval_df['prediction'] - eval_df['actual']).abs() / eval_df['actual']
print(eval_df.groupby('h')['APE'].mean())
```
1. Calculate the MAPE for one step:
```python
print('One step forecast MAPE: ', (mape(eval_df[eval_df['h'] == 't+1']['prediction'], eval_df[eval_df['h'] == 't+1']['actual']))*100, '%')
```
```output
One step forecast MAPE:  0.5570581332313952 %
```
1. Print the MAPE for the multi-step forecast:
```python
print('Multi-step forecast MAPE: ', mape(eval_df['prediction'], eval_df['actual'])*100, '%')
```
```output
Multi-step forecast MAPE: 1.1460048657704118 %
```
A lower number is better: keep in mind that a forecast with a MAPE of 10 means it's off by 10%.
1. However, as always, it's easier to understand this kind of accuracy measurement visually, so let's plot it:
```python
if(HORIZON == 1):
## Plotting single step forecast
eval_df.plot(x='timestamp', y=['actual', 'prediction'], style=['r', 'b'], figsize=(15, 8))
else:
## Plotting multi step forecast
plot_df = eval_df[(eval_df.h=='t+1')][['timestamp', 'actual']]
for t in range(1, HORIZON+1):
plot_df['t+'+str(t)] = eval_df[(eval_df.h=='t+'+str(t))]['prediction'].values
fig = plt.figure(figsize=(15, 8))
ax = plt.plot(plot_df['timestamp'], plot_df['actual'], color='red', linewidth=4.0)
ax = fig.add_subplot(111)
for t in range(1, HORIZON+1):
x = plot_df['timestamp'][(t-1):]
y = plot_df['t+'+str(t)][0:len(x)]
ax.plot(x, y, color='blue', linewidth=4*math.pow(.9,t), alpha=math.pow(0.8,t))
ax.legend(loc='best')
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
```
![a time series model](../../../../7-TimeSeries/2-ARIMA/images/accuracy.png)
🏆 A very nice plot, showing a model with good accuracy. Well done!
---
## 🚀Challenge
Explore different ways to test the accuracy of a Time Series Model. In this lesson, we covered MAPE, but are there other methods you could use? Research them and make notes. A helpful document can be found [here](https://otexts.com/fpp2/accuracy.html)
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
This lesson only introduces the basics of Time Series Forecasting with ARIMA. Take some time to expand your knowledge by exploring [this repository](https://microsoft.github.io/forecasting/) and its various model types to learn other approaches to building Time Series models.
## Assignment
[A new ARIMA model](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1c814013e10866dfd92cdb32caaae3ac",
"translation_date": "2025-09-06T10:48:38+00:00",
"source_file": "7-TimeSeries/2-ARIMA/assignment.md",
"language_code": "en"
}
-->
# A new ARIMA model
## Instructions
Now that you have built an ARIMA model, create a new one using fresh data (try one of [these datasets from Duke](http://www2.stat.duke.edu/~mw/ts_data_sets.html)). Document your work in a notebook, visualize the data and your model, and evaluate its accuracy using MAPE.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- | ----------------------------------- |
| | A notebook is presented with a new ARIMA model built, tested, and explained with visualizations and accuracy stated. | The notebook presented is not annotated or contains bugs | An incomplete notebook is presented |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:48:46+00:00",
"source_file": "7-TimeSeries/2-ARIMA/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "81db6ff2cf6e62fbe2340b094bb9509e",
"translation_date": "2025-09-06T10:48:43+00:00",
"source_file": "7-TimeSeries/2-ARIMA/solution/R/README.md",
"language_code": "en"
}
-->
this is a temporary placeholder
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,393 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "482bccabe1df958496ea71a3667995cd",
"translation_date": "2025-09-06T10:49:28+00:00",
"source_file": "7-TimeSeries/3-SVR/README.md",
"language_code": "en"
}
-->
# Time Series Forecasting with Support Vector Regressor
In the previous lesson, you learned how to use the ARIMA model for time series predictions. Now, you'll explore the Support Vector Regressor model, which is used to predict continuous data.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Introduction
In this lesson, you'll learn how to build models using [**SVM**: **S**upport **V**ector **M**achine](https://en.wikipedia.org/wiki/Support-vector_machine) for regression, specifically **SVR: Support Vector Regressor**.
### SVR in the context of time series [^1]
Before diving into the importance of SVR for time series prediction, here are some key concepts to understand:
- **Regression:** A supervised learning technique used to predict continuous values based on input data. The goal is to fit a curve (or line) in the feature space that aligns with the maximum number of data points. [Learn more](https://en.wikipedia.org/wiki/Regression_analysis).
- **Support Vector Machine (SVM):** A supervised machine learning model used for classification, regression, and outlier detection. The model creates a hyperplane in the feature space, which serves as a boundary for classification or as the best-fit line for regression. SVM often uses a Kernel function to transform the dataset into a higher-dimensional space for better separability. [Learn more](https://en.wikipedia.org/wiki/Support-vector_machine).
- **Support Vector Regressor (SVR):** A type of SVM designed to find the best-fit line (or hyperplane) that aligns with the maximum number of data points.
### Why SVR? [^1]
In the previous lesson, you explored ARIMA, a highly effective statistical linear method for forecasting time series data. However, time series data often exhibit *non-linearity*, which linear models like ARIMA cannot capture. SVR's ability to handle non-linear data makes it a powerful tool for time series forecasting.
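To make the non-linearity point concrete, here is a small sketch (toy sine-wave data and arbitrary hyperparameter values, chosen only for illustration) comparing an ordinary linear regression with an RBF-kernel SVR:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# A toy non-linear signal that a straight line cannot follow
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel()

linear = LinearRegression().fit(X, y)
svr = SVR(kernel="rbf", C=10, gamma=0.5).fit(X, y)

print("Linear regression R^2:", round(linear.score(X, y), 3))  # near 0 on a sine wave
print("RBF SVR R^2:          ", round(svr.score(X, y), 3))     # close to 1
```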
## Exercise - Build an SVR Model
The initial steps for data preparation are similar to those in the previous lesson on [ARIMA](https://github.com/microsoft/ML-For-Beginners/tree/main/7-TimeSeries/2-ARIMA).
Open the [_/working_](https://github.com/microsoft/ML-For-Beginners/tree/main/7-TimeSeries/3-SVR/working) folder in this lesson and locate the [_notebook.ipynb_](https://github.com/microsoft/ML-For-Beginners/blob/main/7-TimeSeries/3-SVR/working/notebook.ipynb) file. [^2]
1. Run the notebook and import the necessary libraries: [^2]
```python
import sys
sys.path.append('../../')
```
```python
import os
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime as dt
import math
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler
from common.utils import load_data, mape
```
2. Load the data from the `/data/energy.csv` file into a Pandas dataframe and inspect it: [^2]
```python
energy = load_data('../../data')[['load']]
```
3. Plot all available energy data from January 2012 to December 2014: [^2]
```python
energy.plot(y='load', subplots=True, figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
```
![full data](../../../../7-TimeSeries/3-SVR/images/full-data.png)
Now, let's build our SVR model.
### Create Training and Testing Datasets
Once the data is loaded, separate it into training and testing sets. Reshape the data to create a time-step-based dataset required for SVR. Train the model on the training set, then evaluate its accuracy on the training set, testing set, and the full dataset to assess overall performance. Ensure the test set covers a later time period than the training set, so the model doesn't learn from future information [^2] (a form of data leakage).
1. Assign the period from November 1 to December 29, 2014 to the training set. The test set will cover December 30 and 31, 2014: [^2]
```python
train_start_dt = '2014-11-01 00:00:00'
test_start_dt = '2014-12-30 00:00:00'
```
2. Visualize the differences: [^2]
```python
energy[(energy.index < test_start_dt) & (energy.index >= train_start_dt)][['load']].rename(columns={'load':'train'}) \
.join(energy[test_start_dt:][['load']].rename(columns={'load':'test'}), how='outer') \
.plot(y=['train', 'test'], figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
```
![training and testing data](../../../../7-TimeSeries/3-SVR/images/train-test.png)
### Prepare the Data for Training
Filter and scale the data to prepare it for training. Filter the dataset to include only the required time periods and columns, and scale the data to fit within the range 0 to 1.
1. Filter the original dataset to include only the specified time periods for each set, and include only the 'load' column and the date: [^2]
```python
train = energy.copy()[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']]
test = energy.copy()[energy.index >= test_start_dt][['load']]
print('Training data shape: ', train.shape)
print('Test data shape: ', test.shape)
```
```output
Training data shape: (1416, 1)
Test data shape: (48, 1)
```
2. Scale the training data to the range (0, 1): [^2]
```python
scaler = MinMaxScaler()
train['load'] = scaler.fit_transform(train)
```
3. Scale the testing data: [^2]
```python
test['load'] = scaler.transform(test)
```
### Create Data with Time-Steps [^1]
For SVR, transform the input data into the format `[batch, timesteps]`. Reshape the `train_data` and `test_data` to include a new dimension for timesteps.
```python
# Converting to numpy arrays
train_data = train.values
test_data = test.values
```
For this example, set `timesteps = 5`. The model's inputs will be data from the first 4 timesteps, and the output will be data from the 5th timestep.
```python
timesteps=5
```
Convert training data to a 2D tensor using nested list comprehension:
```python
train_data_timesteps=np.array([[j for j in train_data[i:i+timesteps]] for i in range(0,len(train_data)-timesteps+1)])[:,:,0]
train_data_timesteps.shape
```
```output
(1412, 5)
```
Convert testing data to a 2D tensor:
```python
test_data_timesteps=np.array([[j for j in test_data[i:i+timesteps]] for i in range(0,len(test_data)-timesteps+1)])[:,:,0]
test_data_timesteps.shape
```
```output
(44, 5)
```
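If the nested list comprehension looks opaque, here is the same windowing applied to a tiny toy series so you can see exactly what it produces; the numbers are purely illustrative:

```python
import numpy as np

toy = np.arange(1, 8).reshape(-1, 1)   # a column vector: 1, 2, ..., 7
timesteps = 5

windows = np.array([[j for j in toy[i:i + timesteps]]
                    for i in range(0, len(toy) - timesteps + 1)])[:, :, 0]
print(windows)
# [[1 2 3 4 5]
#  [2 3 4 5 6]
#  [3 4 5 6 7]]
```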
Select inputs and outputs from training and testing data:
```python
x_train, y_train = train_data_timesteps[:,:timesteps-1],train_data_timesteps[:,[timesteps-1]]
x_test, y_test = test_data_timesteps[:,:timesteps-1],test_data_timesteps[:,[timesteps-1]]
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
```
```output
(1412, 4) (1412, 1)
(44, 4) (44, 1)
```
### Implement SVR [^1]
Now, implement SVR. For more details, refer to [this documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html). Follow these steps:
1. Define the model by calling `SVR()` and specifying hyperparameters: kernel, gamma, C, and epsilon.
2. Train the model using the `fit()` function.
3. Make predictions using the `predict()` function.
Create an SVR model using the [RBF kernel](https://scikit-learn.org/stable/modules/svm.html#parameters-of-the-rbf-kernel), with hyperparameters gamma, C, and epsilon set to 0.5, 10, and 0.05, respectively.
```python
model = SVR(kernel='rbf',gamma=0.5, C=10, epsilon = 0.05)
```
#### Train the Model on Training Data [^1]
```python
model.fit(x_train, y_train[:,0])
```
```output
SVR(C=10, cache_size=200, coef0=0.0, degree=3, epsilon=0.05, gamma=0.5,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
```
#### Make Model Predictions [^1]
```python
y_train_pred = model.predict(x_train).reshape(-1,1)
y_test_pred = model.predict(x_test).reshape(-1,1)
print(y_train_pred.shape, y_test_pred.shape)
```
```output
(1412, 1) (44, 1)
```
Your SVR model is ready! Now, let's evaluate it.
### Evaluate Your Model [^1]
To evaluate the model, first scale the data back to its original range. Then, assess performance by plotting the original and predicted time series and calculating the MAPE.
Scale the predicted and original output:
```python
# Scaling the predictions
y_train_pred = scaler.inverse_transform(y_train_pred)
y_test_pred = scaler.inverse_transform(y_test_pred)
print(len(y_train_pred), len(y_test_pred))
```
```python
# Scaling the original values
y_train = scaler.inverse_transform(y_train)
y_test = scaler.inverse_transform(y_test)
print(len(y_train), len(y_test))
```
#### Evaluate Model Performance on Training and Testing Data [^1]
Extract timestamps from the dataset for the x-axis of the plot. Note that the first ```timesteps-1``` values are used as input for the first output, so the timestamps for the output start after that.
```python
train_timestamps = energy[(energy.index < test_start_dt) & (energy.index >= train_start_dt)].index[timesteps-1:]
test_timestamps = energy[test_start_dt:].index[timesteps-1:]
print(len(train_timestamps), len(test_timestamps))
```
```output
1412 44
```
Plot predictions for training data:
```python
plt.figure(figsize=(25,6))
plt.plot(train_timestamps, y_train, color = 'red', linewidth=2.0, alpha = 0.6)
plt.plot(train_timestamps, y_train_pred, color = 'blue', linewidth=0.8)
plt.legend(['Actual','Predicted'])
plt.xlabel('Timestamp')
plt.title("Training data prediction")
plt.show()
```
![training data prediction](../../../../7-TimeSeries/3-SVR/images/train-data-predict.png)
Print MAPE for training data:
```python
print('MAPE for training data: ', mape(y_train_pred, y_train)*100, '%')
```
```output
MAPE for training data: 1.7195710200875551 %
```
Plot predictions for testing data:
```python
plt.figure(figsize=(10,3))
plt.plot(test_timestamps, y_test, color = 'red', linewidth=2.0, alpha = 0.6)
plt.plot(test_timestamps, y_test_pred, color = 'blue', linewidth=0.8)
plt.legend(['Actual','Predicted'])
plt.xlabel('Timestamp')
plt.show()
```
![testing data prediction](../../../../7-TimeSeries/3-SVR/images/test-data-predict.png)
Print MAPE for testing data:
```python
print('MAPE for testing data: ', mape(y_test_pred, y_test)*100, '%')
```
```output
MAPE for testing data: 1.2623790187854018 %
```
🏆 Excellent results on the testing dataset!
### Evaluate Model Performance on Full Dataset [^1]
```python
# Extracting load values as numpy array
data = energy.copy().values
# Scaling
data = scaler.transform(data)
# Transforming to 2D tensor as per model input requirement
data_timesteps=np.array([[j for j in data[i:i+timesteps]] for i in range(0,len(data)-timesteps+1)])[:,:,0]
print("Tensor shape: ", data_timesteps.shape)
# Selecting inputs and outputs from data
X, Y = data_timesteps[:,:timesteps-1],data_timesteps[:,[timesteps-1]]
print("X shape: ", X.shape,"\nY shape: ", Y.shape)
```
```output
Tensor shape: (26300, 5)
X shape: (26300, 4)
Y shape: (26300, 1)
```
```python
# Make model predictions
Y_pred = model.predict(X).reshape(-1,1)
# Inverse scale and reshape
Y_pred = scaler.inverse_transform(Y_pred)
Y = scaler.inverse_transform(Y)
```
```python
plt.figure(figsize=(30,8))
plt.plot(Y, color = 'red', linewidth=2.0, alpha = 0.6)
plt.plot(Y_pred, color = 'blue', linewidth=0.8)
plt.legend(['Actual','Predicted'])
plt.xlabel('Timestamp')
plt.show()
```
![full data prediction](../../../../7-TimeSeries/3-SVR/images/full-data-predict.png)
```python
print('MAPE: ', mape(Y_pred, Y)*100, '%')
```
```output
MAPE: 2.0572089029888656 %
```
🏆 Great plots, showing a model with strong accuracy. Well done!
---
## 🚀Challenge
- Experiment with different hyperparameters (gamma, C, epsilon) and evaluate their impact on the testing data. Learn more about these hyperparameters [here](https://scikit-learn.org/stable/modules/svm.html#parameters-of-the-rbf-kernel).
- Try different kernel functions and analyze their performance on the dataset. Refer to [this document](https://scikit-learn.org/stable/modules/svm.html#kernel-functions).
- Test different values for `timesteps` to see how the model performs with varying look-back periods.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
This lesson introduced SVR for time series forecasting. For more information on SVR, check out [this blog](https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/). The [scikit-learn documentation](https://scikit-learn.org/stable/modules/svm.html) provides a detailed explanation of SVMs, [SVRs](https://scikit-learn.org/stable/modules/svm.html#regression), and kernel functions.
## Assignment
[A new SVR model](assignment.md)
## Credits
[^1]: Text, code, and output in this section contributed by [@AnirbanMukherjeeXD](https://github.com/AnirbanMukherjeeXD)
[^2]: Text, code, and output in this section sourced from [ARIMA](https://github.com/microsoft/ML-For-Beginners/tree/main/7-TimeSeries/2-ARIMA)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,27 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "94aa2fc6154252ae30a3f3740299707a",
"translation_date": "2025-09-06T10:49:49+00:00",
"source_file": "7-TimeSeries/3-SVR/assignment.md",
"language_code": "en"
}
-->
# A new SVR model
## Instructions [^1]
Now that you have created an SVR model, build a new one using fresh data (consider one of [these datasets from Duke](http://www2.stat.duke.edu/~mw/ts_data_sets.html)). Document your work in a notebook, visualize the data and your model, and evaluate its accuracy using appropriate plots and MAPE. Additionally, experiment with adjusting various hyperparameters and try different values for the timesteps.
## Rubric [^1]
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | ------------------------------------------------------------ | --------------------------------------------------------- | ----------------------------------- |
| | A notebook is provided with an SVR model that is built, tested, and explained, including visualizations and stated accuracy. | The notebook provided lacks annotations or contains errors. | An incomplete notebook is submitted |
[^1]:The text in this section is adapted from the [assignment from ARIMA](https://github.com/microsoft/ML-For-Beginners/tree/main/7-TimeSeries/2-ARIMA/assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,37 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "61342603bad8acadbc6b2e4e3aab3f66",
"translation_date": "2025-09-06T10:48:01+00:00",
"source_file": "7-TimeSeries/README.md",
"language_code": "en"
}
-->
# Introduction to time series forecasting
What is time series forecasting? It's the process of predicting future events by analyzing past trends.
## Regional topic: worldwide electricity usage ✨
In these two lessons, you will explore time series forecasting, a relatively lesser-known area of machine learning that is highly valuable for applications in industry, business, and other fields. While neural networks can enhance the effectiveness of these models, we will focus on classical machine learning approaches to predict future outcomes based on historical data.
Our regional focus is on global electricity usage, an intriguing dataset that helps us learn how to forecast future power consumption by analyzing past load patterns. This type of forecasting can be incredibly useful in a business context.
![electric grid](../../../7-TimeSeries/images/electric-grid.jpg)
Photo by [Peddi Sai hrithik](https://unsplash.com/@shutter_log?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) of electrical towers on a road in Rajasthan on [Unsplash](https://unsplash.com/s/photos/electric-india?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)
## Lessons
1. [Introduction to time series forecasting](1-Introduction/README.md)
2. [Building ARIMA time series models](2-ARIMA/README.md)
3. [Building Support Vector Regressor for time series forecasting](3-SVR/README.md)
## Credits
"Introduction to time series forecasting" was created with ⚡️ by [Francesca Lazzeri](https://twitter.com/frlazzeri) and [Jen Looper](https://twitter.com/jenlooper). The notebooks were originally published in the [Azure "Deep Learning For Time Series" repo](https://github.com/Azure/DeepLearningForTimeSeriesForecasting) authored by Francesca Lazzeri. The SVR lesson was written by [Anirban Mukherjee](https://github.com/AnirbanMukherjeeXD)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,91 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "911efd5e595089000cb3c16fce1beab8",
"translation_date": "2025-09-06T10:59:08+00:00",
"source_file": "8-Reinforcement/1-QLearning/README.md",
"language_code": "en"
}
-->
## Visualizing the Learned Policy
After running the learning algorithm, we can visualize the Q-Table to see the learned policy. The arrows (or circles) in each cell will indicate the preferred direction of movement based on the Q-Table values. This visualization helps us understand how the agent has learned to navigate the environment.
For example, the updated Q-Table might look like this:
![Peter's Learned Policy](../../../../8-Reinforcement/1-QLearning/images/learned_policy.png)
In this visualization:
- The arrows point in the direction of the action with the highest Q-Table value for each state.
- The agent is more likely to follow these directions to reach the apple while avoiding the wolf and other obstacles. (A plotting sketch follows this list.)
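If you want to draw such a policy picture yourself, a matplotlib quiver plot is one option. The sketch below makes two assumptions that are not part of the lesson's code: that `Q` is an array of shape `(width, height, n_actions)`, and that `action_vectors` maps each action index to a `(dx, dy)` board step; adjust both to match your own definitions:
```python
import numpy as np
import matplotlib.pyplot as plt

# Assumptions: Q has shape (width, height, n_actions), and action_vectors maps each
# action index to a (dx, dy) board step, e.g. [(0, -1), (0, 1), (-1, 0), (1, 0)].
def plot_policy(Q, action_vectors):
    width, height, _ = Q.shape
    U = np.zeros((height, width))   # arrow x-components, one per board cell
    V = np.zeros((height, width))   # arrow y-components, one per board cell
    for x in range(width):
        for y in range(height):
            best = np.argmax(Q[x, y])        # action with the highest Q-value in this cell
            dx, dy = action_vectors[best]
            U[y, x], V[y, x] = dx, -dy       # flip dy so "up" on the board points up on the plot
    plt.quiver(U, V)
    plt.gca().invert_yaxis()                 # put row 0 at the top, matching the board layout
    plt.title("Learned policy")
    plt.show()
```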
## Testing the Learned Policy
Once the Q-Table is trained, we can test the learned policy by letting the agent navigate the environment using the Q-Table values. Instead of randomly choosing actions, the agent will now select the action with the highest Q-Table value at each state.
Run the following code to test the learned policy: (code block 9)
```python
def qpolicy_strict(m):
        x,y = m.human                    # agent's current position on the board
        v = probs(Q[x,y])                # Q-Table values for this state, as probabilities
        a = list(actions)[np.argmax(v)]  # always pick the action with the highest value
        return a

walk(m,qpolicy_strict)
```
This code will simulate the agent's movement based on the learned policy. You can observe how efficiently the agent reaches the apple compared to the random walk strategy.
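To quantify that comparison rather than just eyeball it, you could average the path length over many runs. This sketch assumes that `walk(m, policy)` returns the number of steps taken (adapt it if your version only animates) and that the random-walk policy from earlier in the lesson is available as `random_policy`:
```python
import numpy as np

# Assumptions: walk(m, policy) returns the number of steps taken to reach the apple,
# and random_policy is the random-walk policy defined earlier in the lesson.
def average_path_length(policy, runs=100):
    return np.mean([walk(m, policy) for _ in range(runs)])

print("Random walk average steps:", average_path_length(random_policy))
print("Q-Learning average steps :", average_path_length(qpolicy_strict))
```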
## Results and Observations
After training the agent using Q-Learning:
- The agent should be able to reach the apple in significantly fewer steps compared to the random walk strategy.
- The learned policy will guide the agent to avoid the wolf and other obstacles while maximizing the reward.
You can also visualize the agent's movement during the test run:
![Peter's Learned Movement](../../../../8-Reinforcement/1-QLearning/images/learned_movement.gif)
Notice how the agent's movements are more purposeful and efficient compared to the random walk.
## Summary
In this lesson, we explored the basics of reinforcement learning and implemented the Q-Learning algorithm to train an agent to navigate an environment. Here's what we covered:
- The concepts of states, actions, rewards, and policies in reinforcement learning.
- How to define a reward function to guide the agent's learning process.
- The Bellman equation and its role in updating the Q-Table.
- The balance between exploration and exploitation during training.
- How to implement and visualize the Q-Learning algorithm in Python.
By the end of this lesson, you should have a solid understanding of how reinforcement learning works and how Q-Learning can be used to solve problems in a simulated environment.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Assignment
1. Modify the reward function to include penalties for stepping into water or grass. How does this affect the agent's learning process and the resulting policy?
2. Experiment with different values of the discount factor (γ) and learning rate (α). How do these parameters influence the agent's behavior and the speed of learning?
3. Create a new environment with a different layout (e.g., more obstacles, multiple apples, or multiple wolves). Train the agent in this new environment and observe how it adapts.
By completing these assignments, you'll gain a deeper understanding of how to fine-tune reinforcement learning algorithms and apply them to various scenarios.
If you plot the average path length over the training epochs, the learnings can be summarized as:
- **Average path length increases**. Initially, the average path length increases. This is likely because, when we know nothing about the environment, the agent is prone to getting stuck in unfavorable states, such as water or encountering a wolf. As the agent gathers more knowledge and begins to use it, it can explore the environment for longer periods, but it still doesn't have a clear understanding of where the apples are located.
- **Path length decreases as we learn more**. Once the agent has learned enough, it becomes easier to achieve the goal, and the path length starts to decrease. However, since the agent is still exploring, it occasionally deviates from the optimal path to investigate new possibilities, which can make the path longer than necessary.
- **Abrupt length increase**. Another observation from the graph is that, at some point, the path length increases abruptly. This highlights the stochastic nature of the process, where the Q-Table coefficients can be "spoiled" by being overwritten with new values. Ideally, this should be minimized by reducing the learning rate (e.g., toward the end of training, adjusting Q-Table values by only small amounts).
Overall, it's important to note that the success and quality of the learning process depend heavily on parameters such as the learning rate, learning rate decay, and discount factor. These are often referred to as **hyperparameters**, to distinguish them from **parameters**, which are optimized during training (e.g., Q-Table coefficients). The process of finding the best hyperparameter values is called **hyperparameter optimization**, which is a topic worthy of its own discussion.
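One simple way to act on the observation above about reducing the learning rate is to decay `alpha` across epochs. This is a minimal sketch, assuming a training loop like the one used in this lesson; the decay rate and floor are illustrative values, not prescribed ones:
```python
alpha = 1.0         # start with large Q-Table updates
alpha_min = 0.05    # keep a small amount of learning even late in training
decay = 0.9995      # per-epoch multiplicative decay (illustrative, tune for your epoch count)

for epoch in range(10000):
    # ... run one training episode and apply the usual Q-Table update using `alpha` ...
    alpha = max(alpha_min, alpha * decay)   # smaller adjustments as the Q-Table stabilizes
```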
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Assignment
[A More Realistic World](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,41 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "68394b2102d3503882e5e914bd0ff5c1",
"translation_date": "2025-09-06T10:59:31+00:00",
"source_file": "8-Reinforcement/1-QLearning/assignment.md",
"language_code": "en"
}
-->
# A More Realistic World
In our scenario, Peter could move around almost endlessly without feeling tired or hungry. In a more realistic world, he would need to sit down and rest occasionally, as well as eat to sustain himself. Let's make our world more realistic by implementing the following rules:
1. Moving from one location to another causes Peter to lose **energy** and gain **fatigue**.
2. Peter can regain energy by eating apples.
3. Peter can reduce fatigue by resting under a tree or on the grass (i.e., stepping into a board location with a tree or grass - green field).
4. Peter needs to locate and defeat the wolf.
5. To defeat the wolf, Peter must have specific levels of energy and fatigue; otherwise, he will lose the battle.
## Instructions
Use the original [notebook.ipynb](../../../../8-Reinforcement/1-QLearning/notebook.ipynb) notebook as the starting point for your solution.
Modify the reward function described above according to the game's rules, run the reinforcement learning algorithm to determine the best strategy for winning the game, and compare the results of random walk with your algorithm in terms of the number of games won and lost.
> **Note**: In this new world, the state is more complex and includes not only Peter's position but also his fatigue and energy levels. You can choose to represent the state as a tuple (Board, energy, fatigue), define a class for the state (you may also want to derive it from `Board`), or even modify the original `Board` class inside [rlboard.py](../../../../8-Reinforcement/1-QLearning/rlboard.py).
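If you go the tuple or dataclass route, a minimal sketch of the richer state might look like the following; the field names and starting values are purely illustrative and are not part of the starter notebook:
```python
from dataclasses import dataclass

@dataclass(frozen=True)       # frozen makes instances hashable, so they can key a Q-Table dictionary
class PeterState:
    position: tuple           # (x, y) cell on the board
    energy: int               # lost when moving, regained by eating apples
    fatigue: int              # gained when moving, reduced by resting on grass or under a tree

# Example of constructing one such state
start = PeterState(position=(0, 0), energy=10, fatigue=0)
```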
In your solution, ensure that the code responsible for the random walk strategy is retained, and compare the results of your algorithm with the random walk strategy at the end.
> **Note**: You may need to adjust hyperparameters to make the algorithm work, especially the number of epochs. Since the game's success (defeating the wolf) is a rare event, you should expect a much longer training time.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| | A notebook is presented with the definition of new world rules, Q-Learning algorithm, and some textual explanations. Q-Learning significantly improves results compared to random walk. | A notebook is presented, Q-Learning is implemented and improves results compared to random walk, but not significantly; or the notebook is poorly documented and the code is not well-structured. | Some attempt to redefine the world's rules is made, but the Q-Learning algorithm does not work, or the reward function is not fully defined. |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T10:59:43+00:00",
"source_file": "8-Reinforcement/1-QLearning/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "81db6ff2cf6e62fbe2340b094bb9509e",
"translation_date": "2025-09-06T10:59:39+00:00",
"source_file": "8-Reinforcement/1-QLearning/solution/R/README.md",
"language_code": "en"
}
-->
this is a temporary placeholder
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,332 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "107d5bb29da8a562e7ae72262d251a75",
"translation_date": "2025-09-06T10:59:45+00:00",
"source_file": "8-Reinforcement/2-Gym/README.md",
"language_code": "en"
}
-->
## Prerequisites
In this lesson, we will use a library called **OpenAI Gym** to simulate different **environments**. You can run the code for this lesson locally (e.g., using Visual Studio Code), in which case the simulation will open in a new window. If you're running the code online, you may need to make some adjustments, as described [here](https://towardsdatascience.com/rendering-openai-gym-envs-on-binder-and-google-colab-536f99391cc7).
## OpenAI Gym
In the previous lesson, the rules of the game and the state were defined by the `Board` class that we created ourselves. Here, we will use a specialized **simulation environment** to simulate the physics of the balancing pole. One of the most popular simulation environments for training reinforcement learning algorithms is called [Gym](https://gym.openai.com/), maintained by [OpenAI](https://openai.com/). Using Gym, we can create various **environments**, ranging from cartpole simulations to Atari games.
> **Note**: You can explore other environments available in OpenAI Gym [here](https://gym.openai.com/envs/#classic_control).
First, let's install Gym and import the required libraries (code block 1):
```python
import sys
!{sys.executable} -m pip install gym
import gym
import matplotlib.pyplot as plt
import numpy as np
import random
```
## Exercise - Initialize a CartPole Environment
To work on the cartpole balancing problem, we need to initialize the corresponding environment. Each environment is associated with:
- **Observation space**, which defines the structure of the information we receive from the environment. For the cartpole problem, we receive the position of the pole, velocity, and other values.
- **Action space**, which defines the possible actions. In this case, the action space is discrete and consists of two actions: **left** and **right**. (code block 2)
1. To initialize the environment, type the following code:
```python
env = gym.make("CartPole-v1")
print(env.action_space)
print(env.observation_space)
print(env.action_space.sample())
```
To understand how the environment works, let's run a short simulation for 100 steps. At each step, we provide an action to be taken—here, we randomly select an action from `action_space`.
1. Run the code below and observe the results.
✅ It's recommended to run this code on a local Python installation! (code block 3)
```python
env.reset()
for i in range(100):
env.render()
env.step(env.action_space.sample())
env.close()
```
You should see something similar to this image:
![non-balancing cartpole](../../../../8-Reinforcement/2-Gym/images/cartpole-nobalance.gif)
1. During the simulation, we need to gather observations to decide on the next action. The `step` function returns the current observations, a reward value, and a flag (`done`) indicating whether the simulation should continue or stop: (code block 4)
```python
env.reset()
done = False
while not done:
env.render()
obs, rew, done, info = env.step(env.action_space.sample())
print(f"{obs} -> {rew}")
env.close()
```
You will see output similar to this in the notebook:
```text
[ 0.03403272 -0.24301182 0.02669811 0.2895829 ] -> 1.0
[ 0.02917248 -0.04828055 0.03248977 0.00543839] -> 1.0
[ 0.02820687 0.14636075 0.03259854 -0.27681916] -> 1.0
[ 0.03113408 0.34100283 0.02706215 -0.55904489] -> 1.0
[ 0.03795414 0.53573468 0.01588125 -0.84308041] -> 1.0
...
[ 0.17299878 0.15868546 -0.20754175 -0.55975453] -> 1.0
[ 0.17617249 0.35602306 -0.21873684 -0.90998894] -> 1.0
```
The observation vector returned at each step contains the following values:
- Position of the cart
- Velocity of the cart
- Angle of the pole
- Rotation rate of the pole
1. Retrieve the minimum and maximum values of these numbers: (code block 5)
```python
print(env.observation_space.low)
print(env.observation_space.high)
```
You may also notice that the reward value at each simulation step is always 1. This is because the goal is to survive as long as possible, i.e., to keep the pole reasonably vertical for the longest time.
✅ The CartPole simulation is considered solved if we achieve an average reward of 195 over 100 consecutive trials.
## State Discretization
In Q-Learning, we need to build a Q-Table that defines the actions for each state. To do this, the state must be **discrete**, meaning it should consist of a finite number of discrete values. Therefore, we need to **discretize** our observations, mapping them to a finite set of states.
There are a few ways to achieve this:
- **Divide into bins**: If we know the range of a value, we can divide it into a number of **bins** and replace the value with the bin number it belongs to. This can be done using the numpy [`digitize`](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) method. This approach gives precise control over the state size, as it depends on the number of bins chosen for discretization.
✅ Alternatively, we can use linear interpolation to map values to a finite interval (e.g., from -20 to 20) and then convert them to integers by rounding. This approach offers less control over the state size, especially if the exact ranges of input values are unknown. For example, in our case, 2 out of 4 values lack upper/lower bounds, which could result in an infinite number of states.
In this example, we'll use the second approach. As you'll notice later, despite undefined upper/lower bounds, these values rarely exceed certain finite intervals, making states with extreme values very rare.
1. Here's a function that takes the observation from our model and produces a tuple of 4 integer values: (code block 6)
```python
    def discretize(x):
        return tuple((x/np.array([0.25, 0.25, 0.01, 0.1])).astype(int))
```
1. Let's also explore another discretization method using bins: (code block 7)
```python
def create_bins(i,num):
return np.arange(num+1)*(i[1]-i[0])/num+i[0]
print("Sample bins for interval (-5,5) with 10 bins\n",create_bins((-5,5),10))
ints = [(-5,5),(-2,2),(-0.5,0.5),(-2,2)] # intervals of values for each parameter
nbins = [20,20,10,10] # number of bins for each parameter
bins = [create_bins(ints[i],nbins[i]) for i in range(4)]
def discretize_bins(x):
return tuple(np.digitize(x[i],bins[i]) for i in range(4))
```
1. Now, run a short simulation and observe the discrete environment values. Feel free to try both `discretize` and `discretize_bins` to see if there's a difference.
    `discretize_bins` returns a 0-based bin number, so input values near 0 map to the middle bin (number 10). `discretize` does not constrain the output range, so its values can be negative and an input near 0 maps directly to 0. (code block 8)
```python
env.reset()
done = False
while not done:
#env.render()
obs, rew, done, info = env.step(env.action_space.sample())
#print(discretize_bins(obs))
print(discretize(obs))
env.close()
```
✅ Uncomment the line starting with `env.render` if you want to visualize the environment's execution. Otherwise, you can run it in the background for faster execution. We'll use this "invisible" execution during the Q-Learning process.
## The Q-Table Structure
In the previous lesson, the state was a simple pair of numbers ranging from 0 to 8, making it convenient to represent the Q-Table as a numpy tensor with a shape of 8x8x2. If we use bin discretization, the size of our state vector is also known, so we can use a similar approach and represent the state as an array with a shape of 20x20x10x10x2 (where 2 corresponds to the action space dimension, and the first dimensions represent the number of bins chosen for each parameter in the observation space).
However, sometimes the precise dimensions of the observation space are unknown. In the case of the `discretize` function, we can't guarantee that the state will remain within certain limits, as some original values are unbounded. Therefore, we'll use a slightly different approach and represent the Q-Table as a dictionary.
1. Use the pair *(state, action)* as the dictionary key, with the corresponding Q-Table entry value as the value. (code block 9)
```python
Q = {}
actions = (0,1)
def qvalues(state):
return [Q.get((state,a),0) for a in actions]
```
Here, we also define a function `qvalues()` that returns a list of Q-Table values for a given state corresponding to all possible actions. If the entry isn't present in the Q-Table, it defaults to 0.
## Let's Start Q-Learning
Now it's time to teach Peter how to balance!
1. First, set some hyperparameters: (code block 10)
```python
# hyperparameters
alpha = 0.3
gamma = 0.9
epsilon = 0.90
```
Here:
- `alpha` is the **learning rate**, which determines how much we adjust the current Q-Table values at each step. In the previous lesson, we started with 1 and gradually decreased `alpha` during training. In this example, we'll keep it constant for simplicity, but you can experiment with adjusting `alpha` later.
- `gamma` is the **discount factor**, which indicates how much we prioritize future rewards over immediate rewards.
- `epsilon` is the **exploration/exploitation factor**, which decides whether to favor exploration or exploitation. In our algorithm, we'll select the next action based on Q-Table values in `epsilon` percent of cases, and choose a random action in the remaining cases. This helps explore areas of the search space that haven't been visited yet.
✅ In terms of balancing, choosing a random action (exploration) acts like a random push in the wrong direction, forcing the pole to learn how to recover balance from these "mistakes."
### Improve the Algorithm
We can make two improvements to the algorithm from the previous lesson:
- **Calculate average cumulative reward** over multiple simulations. We'll print progress every 5000 iterations and average the cumulative reward over that period. If we achieve more than 195 points, we can consider the problem solved, exceeding the required quality.
- **Track maximum average cumulative reward**, `Qmax`, and store the Q-Table corresponding to that result. During training, you'll notice that the average cumulative reward sometimes drops, so we want to preserve the Q-Table values corresponding to the best model observed.
1. Collect all cumulative rewards from each simulation in the `rewards` vector for later plotting. (code block 11)
```python
def probs(v,eps=1e-4):
v = v-v.min()+eps
v = v/v.sum()
return v
Qmax = 0
cum_rewards = []
rewards = []
for epoch in range(100000):
obs = env.reset()
done = False
cum_reward=0
# == do the simulation ==
while not done:
s = discretize(obs)
if random.random()<epsilon:
            # exploitation - choose the action according to Q-Table probabilities
v = probs(np.array(qvalues(s)))
a = random.choices(actions,weights=v)[0]
else:
            # exploration - randomly choose the action
a = np.random.randint(env.action_space.n)
obs, rew, done, info = env.step(a)
cum_reward+=rew
ns = discretize(obs)
Q[(s,a)] = (1 - alpha) * Q.get((s,a),0) + alpha * (rew + gamma * max(qvalues(ns)))
cum_rewards.append(cum_reward)
rewards.append(cum_reward)
# == Periodically print results and calculate average reward ==
if epoch%5000==0:
print(f"{epoch}: {np.average(cum_rewards)}, alpha={alpha}, epsilon={epsilon}")
if np.average(cum_rewards) > Qmax:
Qmax = np.average(cum_rewards)
Qbest = Q
cum_rewards=[]
```
What you'll notice from the results:
- **Close to the goal**: We're very close to achieving the goal of 195 cumulative rewards over 100+ consecutive runs, or we may have already achieved it! Even if the numbers are slightly lower, we can't be certain because we're averaging over 5000 runs, while the formal criteria require only 100 runs.
- **Reward drops**: Sometimes the reward starts to drop, indicating that we might overwrite well-learned Q-Table values with worse ones.
This observation becomes clearer when we plot the training progress.
## Plotting Training Progress
During training, we collected cumulative reward values at each iteration in the `rewards` vector. Here's how it looks when plotted against the iteration number:
```python
plt.plot(rewards)
```
![raw progress](../../../../8-Reinforcement/2-Gym/images/train_progress_raw.png)
This graph doesn't provide much insight due to the stochastic nature of the training process, which causes session lengths to vary significantly. To make the graph more meaningful, we can calculate the **running average** over a series of experiments, say 100. This can be done conveniently using `np.convolve`: (code block 12)
```python
def running_average(x,window):
return np.convolve(x,np.ones(window)/window,mode='valid')
plt.plot(running_average(rewards,100))
```
![training progress](../../../../8-Reinforcement/2-Gym/images/train_progress_runav.png)
## Adjusting Hyperparameters
To make learning more stable, we can adjust some hyperparameters during training. Specifically:
- **Learning rate (`alpha`)**: Start with values close to 1 and gradually decrease it. Over time, as the Q-Table values become more reliable, adjustments should be smaller to avoid overwriting good values completely.
- **Exploration factor (`epsilon`)**: Gradually increase `epsilon` to explore less and exploit more. It might be better to start with a lower `epsilon` value and increase it to nearly 1 over time. (A sketch of both schedules follows this list.)
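A minimal sketch of both schedules, assuming the same training loop as code block 11; the exact decay and growth rates are illustrative and will need tuning:
```python
alpha = 1.0             # learning rate: start high, decay toward a floor
alpha_min = 0.05
epsilon = 0.3           # exploration/exploitation factor: start low, grow toward a ceiling
epsilon_max = 0.98

for epoch in range(100000):
    # ... run one simulation episode and apply the Q-Table update, as in code block 11 ...
    alpha = max(alpha_min, alpha * 0.99995)         # smaller Q-Table adjustments over time
    epsilon = min(epsilon_max, epsilon * 1.00005)   # rely on the Q-Table more as it improves
```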
> **Task 1**: Experiment with the hyperparameter values and see if you can achieve a higher cumulative reward. Are you reaching above 195?
> **Task 2**: To formally solve the problem, you need to achieve an average reward of 195 across 100 consecutive runs. Track this during training to ensure the problem is officially solved!
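One way to track that criterion, assuming you keep the `rewards` list from code block 11 (a sketch, not the only possible bookkeeping):
```python
window = 100
solved_at = None
for i in range(window, len(rewards) + 1):
    if np.mean(rewards[i - window:i]) >= 195:   # the lesson's "solved" threshold
        solved_at = i
        break

if solved_at is not None:
    print(f"Solved at episode {solved_at}: mean reward over the previous 100 episodes >= 195")
else:
    print("Not solved yet - keep training or adjust the hyperparameters")
```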
## Seeing the result in action
It's fascinating to observe how the trained model performs. Let's run the simulation and use the same action selection strategy as during training, sampling based on the probability distribution in the Q-Table: (code block 13)
```python
obs = env.reset()
done = False
while not done:
s = discretize(obs)
env.render()
v = probs(np.array(qvalues(s)))
a = random.choices(actions,weights=v)[0]
obs,_,done,_ = env.step(a)
env.close()
```
You should see something similar to this:
![a balancing cartpole](../../../../8-Reinforcement/2-Gym/images/cartpole-balance.gif)
---
## 🚀Challenge
> **Task 3**: In this example, we used the final version of the Q-Table, which might not be the optimal one. Remember, we saved the best-performing Q-Table in the `Qbest` variable! Try running the same example using the best-performing Q-Table by copying `Qbest` into `Q` and see whether there's any noticeable difference.
> **Task 4**: In this example, we didn't always select the best action at each step but instead sampled actions based on the corresponding probability distribution. Would it be better to always choose the best action—the one with the highest Q-Table value? This can be achieved using the `np.argmax` function to identify the action number with the highest Q-Table value. Implement this strategy and check if it improves the balancing performance.
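A possible sketch of that greedy variant, adapting code block 13 so that only the action-selection line changes:
```python
obs = env.reset()
done = False
while not done:
    s = discretize(obs)
    env.render()
    a = np.argmax(qvalues(s))        # always take the action with the highest Q-value
    obs, _, done, _ = env.step(a)
env.close()
```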
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Assignment
[Train a Mountain Car](assignment.md)
## Conclusion
We've now learned how to train agents to achieve strong results simply by providing them with a reward function that defines the desired state of the game and allowing them to intelligently explore the search space. We successfully applied the Q-Learning algorithm in both discrete and continuous environments, though with discrete actions.
It's also crucial to study scenarios where the action space is continuous and the observation space is more complex, such as an image from an Atari game screen. In such cases, more advanced machine learning techniques, like neural networks, are often required to achieve good results. These advanced topics will be covered in our upcoming, more advanced AI course.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,57 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "1f2b7441745eb52e25745423b247016b",
"translation_date": "2025-09-06T11:00:15+00:00",
"source_file": "8-Reinforcement/2-Gym/assignment.md",
"language_code": "en"
}
-->
# Train Mountain Car
[OpenAI Gym](http://gym.openai.com) is designed so that all environments share the same API—i.e., the same methods `reset`, `step`, and `render`, as well as the same abstractions for **action space** and **observation space**. This makes it possible to adapt the same reinforcement learning algorithms to different environments with minimal code changes.
## A Mountain Car Environment
The [Mountain Car environment](https://gym.openai.com/envs/MountainCar-v0/) involves a car stuck in a valley:
The goal is to get out of the valley and reach the flag by performing one of the following actions at each step:
| Value | Meaning |
|---|---|
| 0 | Accelerate to the left |
| 1 | Do not accelerate |
| 2 | Accelerate to the right |
The main challenge of this problem is that the car's engine is not powerful enough to climb the mountain in a single attempt. Therefore, the only way to succeed is to drive back and forth to build up momentum.
The observation space consists of just two values:
| Num | Observation | Min | Max |
|-----|--------------|-----|-----|
| 0 | Car Position | -1.2| 0.6 |
| 1 | Car Velocity | -0.07 | 0.07 |
The reward system for the mountain car is somewhat tricky:
* A reward of 0 is given if the agent reaches the flag (position = 0.5) at the top of the mountain.
* A reward of -1 is given if the agent's position is less than 0.5.
The episode ends if the car's position exceeds 0.5 or if the episode length exceeds 200 steps.
## Instructions
Adapt our reinforcement learning algorithm to solve the mountain car problem. Start with the existing [notebook.ipynb](../../../../8-Reinforcement/2-Gym/notebook.ipynb) code, substitute the new environment, modify the state discretization functions, and try to train the existing algorithm with minimal code changes. Optimize the results by adjusting hyperparameters.
> **Note**: Adjusting hyperparameters will likely be necessary to make the algorithm converge.
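As a starting point for the discretization change, note that the observation here has only two components. A sketch of a bin-based `discretize` for Mountain Car, reusing the pattern from the CartPole lesson, might look like this (the bounds come from the table above; the bin counts are a choice you will want to tune):
```python
import numpy as np

# Bounds taken from the observation table above; the bin counts are tunable choices
ints = [(-1.2, 0.6), (-0.07, 0.07)]   # position range, velocity range
nbins = [20, 20]

def create_bins(interval, num):
    return np.arange(num + 1) * (interval[1] - interval[0]) / num + interval[0]

bins = [create_bins(ints[i], nbins[i]) for i in range(2)]

def discretize(x):
    # Map a (position, velocity) observation to a pair of bin indices
    return tuple(np.digitize(x[i], bins[i]) for i in range(2))
```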
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | The Q-Learning algorithm is successfully adapted from the CartPole example with minimal code modifications and is able to solve the problem of capturing the flag in under 200 steps. | A new Q-Learning algorithm is adopted from the Internet but is well-documented; or the existing algorithm is adapted but does not achieve the desired results. | The student was unable to successfully adopt any algorithm but made substantial progress toward a solution (e.g., implemented state discretization, Q-Table data structure, etc.). |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a39c15d63f3b2795ee2284a82b986b93",
"translation_date": "2025-09-06T11:00:27+00:00",
"source_file": "8-Reinforcement/2-Gym/solution/Julia/README.md",
"language_code": "en"
}
-->
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,15 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "81db6ff2cf6e62fbe2340b094bb9509e",
"translation_date": "2025-09-06T11:00:24+00:00",
"source_file": "8-Reinforcement/2-Gym/solution/R/README.md",
"language_code": "en"
}
-->
this is a temporary placeholder
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,67 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "20ca019012b1725de956681d036d8b18",
"translation_date": "2025-09-06T10:58:42+00:00",
"source_file": "8-Reinforcement/README.md",
"language_code": "en"
}
-->
# Introduction to reinforcement learning
Reinforcement learning, or RL, is considered one of the fundamental paradigms of machine learning, alongside supervised learning and unsupervised learning. RL focuses on decision-making: making the right decisions or, at the very least, learning from them.
Imagine you have a simulated environment, like the stock market. What happens if you implement a specific regulation? Does it lead to positive or negative outcomes? If something negative occurs, you need to take this _negative reinforcement_, learn from it, and adjust your approach. If the outcome is positive, you should build on that _positive reinforcement_.
![peter and the wolf](../../../8-Reinforcement/images/peter.png)
> Peter and his friends need to escape the hungry wolf! Image by [Jen Looper](https://twitter.com/jenlooper)
## Regional topic: Peter and the Wolf (Russia)
[Peter and the Wolf](https://en.wikipedia.org/wiki/Peter_and_the_Wolf) is a musical fairy tale written by the Russian composer [Sergei Prokofiev](https://en.wikipedia.org/wiki/Sergei_Prokofiev). It tells the story of a young pioneer, Peter, who bravely ventures out of his house into a forest clearing to confront a wolf. In this section, we will train machine learning algorithms to help Peter:
- **Explore** the surrounding area and create an optimal navigation map.
- **Learn** how to use a skateboard and maintain balance on it to move around more quickly.
[![Peter and the Wolf](https://img.youtube.com/vi/Fmi5zHg4QSM/0.jpg)](https://www.youtube.com/watch?v=Fmi5zHg4QSM)
> 🎥 Click the image above to listen to Peter and the Wolf by Prokofiev
## Reinforcement learning
In earlier sections, you encountered two types of machine learning problems:
- **Supervised learning**, where we have datasets that provide example solutions to the problem we aim to solve. [Classification](../4-Classification/README.md) and [regression](../2-Regression/README.md) are examples of supervised learning tasks.
- **Unsupervised learning**, where we lack labeled training data. A primary example of unsupervised learning is [Clustering](../5-Clustering/README.md).
In this section, we will introduce a new type of learning problem that does not rely on labeled training data. There are several types of such problems:
- **[Semi-supervised learning](https://wikipedia.org/wiki/Semi-supervised_learning)**, where we have a large amount of unlabeled data that can be used to pre-train the model.
- **[Reinforcement learning](https://wikipedia.org/wiki/Reinforcement_learning)**, where an agent learns how to behave by conducting experiments in a simulated environment.
### Example - computer game
Imagine you want to teach a computer to play a game, such as chess or [Super Mario](https://wikipedia.org/wiki/Super_Mario). For the computer to play the game, it needs to predict which move to make in each game state. While this might seem like a classification problem, it is not—because we do not have a dataset containing states and corresponding actions. Although we might have some data, like records of chess matches or gameplay footage of Super Mario, it is unlikely that this data will sufficiently cover the vast number of possible states.
Instead of relying on existing game data, **Reinforcement Learning** (RL) is based on the idea of *letting the computer play* the game repeatedly and observing the outcomes. To apply Reinforcement Learning, we need two key components:
- **An environment** and **a simulator** that allow the computer to play the game multiple times. This simulator defines all the game rules, as well as possible states and actions.
- **A reward function**, which evaluates how well the computer performed during each move or game.
The primary difference between RL and other types of machine learning is that in RL, we typically do not know whether we have won or lost until the game is over. Therefore, we cannot determine whether a specific move is good or bad on its own—we only receive feedback (a reward) at the end of the game. Our goal is to design algorithms that enable us to train a model under these uncertain conditions. In this section, we will explore one RL algorithm called **Q-learning**.
## Lessons
1. [Introduction to reinforcement learning and Q-Learning](1-QLearning/README.md)
2. [Using a gym simulation environment](2-Gym/README.md)
## Credits
"Introduction to Reinforcement Learning" was written with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,151 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "83320d6b6994909e35d830cebf214039",
"translation_date": "2025-09-06T10:51:39+00:00",
"source_file": "9-Real-World/1-Applications/README.md",
"language_code": "en"
}
-->
# Postscript: Machine Learning in the Real World
![Summary of machine learning in the real world in a sketchnote](../../../../sketchnotes/ml-realworld.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
In this curriculum, you've learned various ways to prepare data for training and create machine learning models. You've built a series of classic regression, clustering, classification, natural language processing, and time series models. Congratulations! Now, you might be wondering what it's all for... what are the real-world applications for these models?
While AI, often powered by deep learning, has captured much of the industry's attention, classical machine learning models still have valuable applications. In fact, you might already be using some of these applications today! In this lesson, you'll explore how eight different industries and domains use these types of models to make their applications more efficient, reliable, intelligent, and valuable to users.
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## 💰 Finance
The finance sector offers numerous opportunities for machine learning. Many problems in this field are well-suited to being modeled and solved using ML.
### Credit Card Fraud Detection
We learned about [k-means clustering](../../5-Clustering/2-K-Means/README.md) earlier in the course, but how can it be applied to credit card fraud detection?
K-means clustering is useful in a fraud detection technique called **outlier detection**. Outliers, or deviations in data patterns, can indicate whether a credit card is being used normally or if something suspicious is happening. As detailed in the paper linked below, you can use k-means clustering to group credit card transactions and identify clusters with unusual patterns. These clusters can then be analyzed to distinguish fraudulent transactions from legitimate ones.
[Reference](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.680.1195&rep=rep1&type=pdf)
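To illustrate the general idea (this is a toy sketch, not the paper's method), transactions that sit far from their cluster center can be flagged for review. The feature matrix `X` below is random placeholder data standing in for numeric transaction features:
```python
import numpy as np
from sklearn.cluster import KMeans

# X: assumed array of shape (n_transactions, n_features), e.g. amount, hour of day, distance from home
X = np.random.rand(500, 3)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the transactions farthest from their cluster center as potential outliers
threshold = np.percentile(distances, 99)
suspicious = np.where(distances > threshold)[0]
print("Transactions to review:", suspicious)
```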
### Wealth Management
In wealth management, individuals or firms manage investments on behalf of clients, aiming to sustain and grow wealth over time. Choosing high-performing investments is critical.
Statistical regression is a valuable tool for evaluating investment performance. [Linear regression](../../2-Regression/1-Tools/README.md) can help assess how a fund performs relative to a benchmark. It can also determine whether the results are statistically significant and how they might impact a client's portfolio. Multiple regression can further enhance the analysis by accounting for additional risk factors. For an example of how regression can be used to evaluate fund performance, see the paper linked below.
[Reference](http://www.brightwoodventures.com/evaluating-fund-performance-using-regression/)
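As a rough illustration of the idea (a sketch with made-up returns, not the paper's analysis), an ordinary least squares fit of fund returns against benchmark returns yields a slope (sensitivity to the benchmark), an intercept (excess return), and p-values indicating statistical significance:
```python
import numpy as np
import statsmodels.api as sm

# Assumed toy data: 60 months of fund and benchmark returns
benchmark = np.random.normal(0.005, 0.04, 60)
fund = 0.002 + 1.1 * benchmark + np.random.normal(0, 0.01, 60)

X = sm.add_constant(benchmark)   # adds the intercept (excess-return) term
model = sm.OLS(fund, X).fit()
print(model.summary())           # slope, intercept, and p-values for significance
```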
## 🎓 Education
The education sector is another fascinating area for ML applications. Challenges include detecting cheating on tests or essays and addressing bias, whether intentional or unintentional, in grading.
### Predicting Student Behavior
[Coursera](https://coursera.com), an online course provider, has a tech blog where they share insights into their engineering decisions. In one case study, they used regression to explore correlations between low NPS (Net Promoter Score) ratings and course retention or drop-off.
[Reference](https://medium.com/coursera-engineering/controlled-regression-quantifying-the-impact-of-course-quality-on-learner-retention-31f956bd592a)
### Mitigating Bias
[Grammarly](https://grammarly.com), a writing assistant that checks for spelling and grammar errors, uses advanced [natural language processing systems](../../6-NLP/README.md) in its products. Their tech blog features a case study on addressing gender bias in machine learning, a topic you explored in our [introductory fairness lesson](../../1-Introduction/3-fairness/README.md).
[Reference](https://www.grammarly.com/blog/engineering/mitigating-gender-bias-in-autocorrect/)
## 👜 Retail
The retail sector can greatly benefit from ML, from enhancing the customer journey to optimizing inventory management.
### Personalizing the Customer Journey
Wayfair, a company specializing in home goods, prioritizes helping customers find products that suit their tastes and needs. In this article, their engineers explain how they use ML and NLP to deliver relevant search results. Their Query Intent Engine employs techniques like entity extraction, classifier training, opinion extraction, and sentiment tagging on customer reviews. This is a classic example of NLP in online retail.
[Reference](https://www.aboutwayfair.com/tech-innovation/how-we-use-machine-learning-and-natural-language-processing-to-empower-search)
### Inventory Management
Innovative companies like [StitchFix](https://stitchfix.com), a clothing subscription service, rely heavily on ML for recommendations and inventory management. Their data scientists collaborate with merchandising teams to predict successful clothing designs using genetic algorithms.
[Reference](https://www.zdnet.com/article/how-stitch-fix-uses-machine-learning-to-master-the-science-of-styling/)
## 🏥 Health Care
The health care sector can use ML to optimize research and logistics, such as managing patient readmissions or preventing disease spread.
### Managing Clinical Trials
Toxicity in clinical trials is a significant concern for drug developers. This study used random forest to create a [classifier](../../4-Classification/README.md) that distinguishes between groups of drugs based on clinical trial outcomes.
[Reference](https://www.sciencedirect.com/science/article/pii/S2451945616302914)
### Hospital Readmission Management
Hospital readmissions are costly. This paper describes how ML clustering algorithms can predict readmission potential and identify groups of readmissions with common causes.
[Reference](https://healthmanagement.org/c/healthmanagement/issuearticle/hospital-readmissions-and-machine-learning)
### Disease Management
The recent pandemic highlighted how ML can help combat disease spread. This article discusses using ARIMA, logistic curves, linear regression, and SARIMA to predict the spread of a virus and its outcomes, aiding in preparation and response.
[Reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7979218/)
## 🌲 Ecology and Green Tech
Nature and ecology involve sensitive systems where accurate measurement and timely action are crucial, such as in forest fire management or monitoring animal populations.
### Forest Management
[Reinforcement Learning](../../8-Reinforcement/README.md) can predict patterns in nature, such as forest fires. Researchers in Canada used RL to model wildfire dynamics from satellite images, treating the fire as an agent in a Markov Decision Process.
[Reference](https://www.frontiersin.org/articles/10.3389/fict.2018.00006/full)
### Motion Sensing of Animals
While deep learning is often used for tracking animal movements, classical ML techniques are still valuable for preprocessing data. For example, this paper analyzed sheep postures using various classifier algorithms.
[Reference](https://druckhaus-hofmann.de/gallery/31-wj-feb-2020.pdf)
### ⚡️ Energy Management
In our lessons on [time series forecasting](../../7-TimeSeries/README.md), we discussed smart parking meters. This article explores how clustering, regression, and time series forecasting were used to predict energy usage in Ireland based on smart metering.
[Reference](https://www-cdn.knime.com/sites/default/files/inline-images/knime_bigdata_energy_timeseries_whitepaper.pdf)
## 💼 Insurance
The insurance sector uses ML to build and optimize financial and actuarial models.
### Volatility Management
MetLife, a life insurance provider, uses ML to analyze and mitigate volatility in financial models. This article includes examples of binary and ordinal classification visualizations, as well as forecasting visualizations.
[Reference](https://investments.metlife.com/content/dam/metlifecom/us/investments/insights/research-topics/macro-strategy/pdf/MetLifeInvestmentManagement_MachineLearnedRanking_070920.pdf)
## 🎨 Arts, Culture, and Literature
In the arts, challenges like detecting fake news or optimizing museum operations can benefit from ML.
### Fake News Detection
Detecting fake news is a pressing issue. This article describes a system that combines ML techniques like Naive Bayes, SVM, Random Forest, SGD, and Logistic Regression to combat misinformation.
[Reference](https://www.irjet.net/archives/V7/i6/IRJET-V7I6688.pdf)
### Museum ML
Museums are leveraging ML to catalog collections and predict visitor interests. For example, the Art Institute of Chicago uses ML to optimize visitor experiences and predict attendance with high accuracy.
[Reference](https://www.chicagobusiness.com/article/20180518/ISSUE01/180519840/art-institute-of-chicago-uses-data-to-make-exhibit-choices)
## 🏷 Marketing
### Customer Segmentation
Effective marketing strategies often rely on customer segmentation. This article discusses how clustering algorithms can support differentiated marketing, improving brand recognition and revenue.
[Reference](https://ai.inqline.com/machine-learning-for-marketing-customer-segmentation/)
## 🚀 Challenge
Identify another sector that benefits from some of the techniques you learned in this curriculum, and discover how it uses ML.
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
The Wayfair data science team has some fascinating videos about how they apply ML in their organization. It's definitely [worth checking out](https://www.youtube.com/channel/UCe2PjkQXqOuwkW1gw6Ameuw/videos)!
## Assignment
[A ML scavenger hunt](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,27 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "fdebfcd0a3f12c9e2b436ded1aa79885",
"translation_date": "2025-09-06T10:52:06+00:00",
"source_file": "9-Real-World/1-Applications/assignment.md",
"language_code": "en"
}
-->
# A ML Scavenger Hunt
## Instructions
In this lesson, you learned about various real-world applications that were addressed using classical machine learning. While advancements in deep learning, new AI techniques and tools, and the use of neural networks have accelerated the development of solutions in these areas, classical machine learning techniques covered in this curriculum remain highly valuable.
For this assignment, imagine you are participating in a hackathon. Use the knowledge gained from the curriculum to propose a solution using classical machine learning to tackle a problem in one of the sectors discussed in this lesson. Prepare a presentation explaining how you plan to implement your idea. Bonus points if you can collect sample data and build a machine learning model to support your concept!
## Rubric
| Criteria | Outstanding | Satisfactory | Needs Improvement |
| -------- | ------------------------------------------------------------------ | ------------------------------------------------ | ---------------------- |
| | A PowerPoint presentation is provided - bonus for creating a model | A basic, non-innovative presentation is provided | The work is incomplete |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,185 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "df2b538e8fbb3e91cf0419ae2f858675",
"translation_date": "2025-09-06T10:52:12+00:00",
"source_file": "9-Real-World/2-Debugging-ML-Models/README.md",
"language_code": "en"
}
-->
# Postscript: Model Debugging in Machine Learning using Responsible AI dashboard components
## [Pre-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Introduction
Machine learning plays a significant role in our daily lives. AI is increasingly integrated into critical systems that impact individuals and society, such as healthcare, finance, education, and employment. For example, models are used in decision-making tasks like diagnosing health conditions or detecting fraud. However, as AI advances and adoption accelerates, societal expectations and regulations are evolving in response. We often encounter situations where AI systems fail to meet expectations, reveal new challenges, or face regulatory scrutiny. Therefore, it is crucial to analyze these models to ensure they deliver fair, reliable, inclusive, transparent, and accountable outcomes for everyone.
In this curriculum, we will explore practical tools to assess whether a model has responsible AI issues. Traditional machine learning debugging techniques often rely on quantitative metrics like aggregated accuracy or average error loss. But what happens when the data used to build these models lacks representation of certain demographics, such as race, gender, political views, or religion—or disproportionately represents them? What if the model's output favors one demographic over another? This can lead to over- or under-representation of sensitive feature groups, resulting in fairness, inclusiveness, or reliability issues. Additionally, machine learning models are often considered "black boxes," making it difficult to understand and explain their predictions. These are challenges faced by data scientists and AI developers when they lack adequate tools to debug and assess a model's fairness or trustworthiness.
In this lesson, you will learn how to debug your models using the approaches below; a minimal setup sketch follows the list:
- **Error Analysis**: Identify areas in your data distribution where the model has high error rates.
- **Model Overview**: Compare performance metrics across different data cohorts to uncover disparities.
- **Data Analysis**: Investigate over- or under-representation in your data that may skew the model to favor certain demographics.
- **Feature Importance**: Understand which features drive your model's predictions at both global and local levels.
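Although this lesson walks through these components in the RAI dashboard UI, the sketch below shows roughly how a trained model is wired into the dashboard using the open-source `responsibleai` and `raiwidgets` packages. The dataset and model here are placeholders, and exact APIs may differ between package versions, so treat this as a rough guide rather than the lesson's prescribed setup.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

# Placeholder tabular dataset: features plus a binary "target" column
data = load_breast_cancer(as_frame=True)
df = data.frame
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Any scikit-learn classifier works here; a random forest is just an example
model = RandomForestClassifier(random_state=42)
model.fit(train_df.drop(columns=["target"]), train_df["target"])

# Collect insights and register the components used in this lesson
rai_insights = RAIInsights(model, train_df, test_df, "target", task_type="classification")
rai_insights.error_analysis.add()    # Error Analysis component
rai_insights.explainer.add()         # Feature Importance component
rai_insights.compute()

ResponsibleAIDashboard(rai_insights)  # launches the dashboard locally
```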
## Prerequisite
Before starting, review [Responsible AI tools for developers](https://www.microsoft.com/ai/ai-lab-responsible-ai-dashboard).
> ![Gif on Responsible AI Tools](../../../../9-Real-World/2-Debugging-ML-Models/images/rai-overview.gif)
## Error Analysis
Traditional model performance metrics, such as accuracy, are often based on correct versus incorrect predictions. For instance, a model with 89% accuracy and an error loss of 0.001 might seem to perform well. However, errors are not always evenly distributed across the dataset. While the model may achieve 89% accuracy overall, certain regions of the data might have a failure rate of 42%. These failure patterns can lead to fairness or reliability issues, especially if they affect important data demographics. Understanding where the model performs well or poorly is essential. High error rates in specific data regions may highlight critical demographic groups.
![Analyze and debug model errors](../../../../9-Real-World/2-Debugging-ML-Models/images/ea-error-distribution.png)
The Error Analysis component in the RAI dashboard visualizes how model errors are distributed across various cohorts using a tree diagram. This helps identify features or areas with high error rates in your dataset. By pinpointing where most inaccuracies occur, you can investigate the root causes. You can also create data cohorts to analyze why the model performs well in one cohort but poorly in another.
![Error Analysis](../../../../9-Real-World/2-Debugging-ML-Models/images/ea-error-cohort.png)
The tree map uses visual indicators to quickly locate problem areas. For example, darker red nodes indicate higher error rates.
A heat map is another visualization tool that allows users to investigate error rates using one or two features, helping identify contributors to model errors across the dataset or specific cohorts.
![Error Analysis Heatmap](../../../../9-Real-World/2-Debugging-ML-Models/images/ea-heatmap.png)
Use error analysis to:
- Gain a deeper understanding of how model errors are distributed across the dataset and feature dimensions.
- Break down aggregate performance metrics to automatically discover erroneous cohorts and inform targeted mitigation strategies.
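Even without the dashboard, you can get a feel for error analysis by computing error rates per cohort yourself. Here is a minimal pandas sketch, with illustrative column names and a cohort definition borrowed from an example later in this lesson:

```python
import pandas as pd

# Illustrative test-set results: true labels, predicted labels,
# and one feature used to define a cohort.
results = pd.DataFrame({
    "y_true":           [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred":           [1, 0, 0, 1, 1, 0, 0, 0],
    "time_in_hospital": [2, 1, 8, 9, 2, 3, 10, 1],
})

results["error"] = (results["y_true"] != results["y_pred"]).astype(int)
long_stay = results["time_in_hospital"] >= 7  # cohort split

print("Overall error rate:", results["error"].mean())
print(results.groupby(long_stay)["error"].mean())  # error rate per cohort
```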
## Model Overview
Evaluating a machine learning model's performance requires a comprehensive understanding of its behavior. This involves reviewing multiple metrics, such as error rate, accuracy, recall, precision, or MAE (Mean Absolute Error), to identify disparities. A single metric may look strong while others reveal weaknesses. Comparing metrics across the entire dataset or specific cohorts can highlight areas where the model performs well or poorly. This is particularly important for sensitive features (e.g., race, gender, or age) to uncover potential fairness issues. For example, higher error rates in cohorts with sensitive features may indicate bias.
The Model Overview component in the RAI dashboard enables users to analyze performance metrics for data cohorts and compare the model's behavior across different groups.
![Dataset cohorts - model overview in RAI dashboard](../../../../9-Real-World/2-Debugging-ML-Models/images/model-overview-dataset-cohorts.png)
The feature-based analysis functionality allows users to focus on specific data subgroups within a feature to identify anomalies at a granular level. For instance, the dashboard can automatically generate cohorts for a user-selected feature (e.g., *"time_in_hospital < 3"* or *"time_in_hospital >= 7"*), enabling users to isolate features and determine their influence on erroneous outcomes.
![Feature cohorts - model overview in RAI dashboard](../../../../9-Real-World/2-Debugging-ML-Models/images/model-overview-feature-cohorts.png)
The Model Overview component supports two types of disparity metrics:
**Disparity in model performance**: These metrics calculate differences in performance values across data subgroups. Examples include:
- Disparity in accuracy rate
- Disparity in error rate
- Disparity in precision
- Disparity in recall
- Disparity in mean absolute error (MAE)
**Disparity in selection rate**: This metric measures differences in favorable predictions among subgroups. For example, disparity in loan approval rates. Selection rate refers to the fraction of data points classified as 1 (in binary classification) or the distribution of prediction values (in regression).
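For intuition, both kinds of disparity can be approximated with a few lines of pandas once you have true labels, predictions, and a sensitive feature. A minimal sketch with illustrative column names:

```python
import pandas as pd

# Illustrative test-set results with a sensitive feature.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M", "F", "M", "F"],
    "y_true": [1, 0, 1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 1, 1, 1, 0, 0, 0],
})

df["correct"] = (df["y_true"] == df["y_pred"]).astype(int)
accuracy = df.groupby("gender")["correct"].mean()
selection_rate = df.groupby("gender")["y_pred"].mean()  # fraction predicted as the favorable class (1)

print("Disparity in accuracy rate:", accuracy.max() - accuracy.min())
print("Disparity in selection rate:", selection_rate.max() - selection_rate.min())
```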
## Data Analysis
> "If you torture the data long enough, it will confess to anything" - Ronald Coase
This statement may sound extreme, but data can indeed be manipulated to support any conclusion—sometimes unintentionally. As humans, we all have biases, and it can be difficult to recognize when bias is introduced into data. Ensuring fairness in AI and machine learning remains a complex challenge.
Traditional model performance metrics often overlook data bias. High accuracy scores do not necessarily reflect underlying biases in the dataset. For example, if a dataset contains 27% women and 73% men in executive positions, a job advertising AI model trained on this data may disproportionately target men for senior-level positions. This imbalance skews the model's predictions, revealing a fairness issue and gender bias.
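A quick representation check like the one below can surface this kind of imbalance before training; the column name and values are illustrative:

```python
import pandas as pd

# Illustrative training data for the executive-positions example above.
executives = pd.DataFrame({"gender": ["M"] * 73 + ["F"] * 27})

# Share of each group; a model trained on this data can inherit the skew.
print(executives["gender"].value_counts(normalize=True))
```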
The Data Analysis component in the RAI dashboard helps identify over- and under-representation in datasets. It allows users to diagnose errors and fairness issues caused by data imbalances or lack of representation. Users can visualize datasets based on predicted and actual outcomes, error groups, and specific features. Discovering underrepresented data groups can also reveal that the model is not learning effectively, leading to inaccuracies. Data bias not only raises fairness concerns but also indicates that the model lacks inclusiveness and reliability.
![Data Analysis component on RAI Dashboard](../../../../9-Real-World/2-Debugging-ML-Models/images/dataanalysis-cover.png)
Use data analysis to:
- Explore dataset statistics by applying filters to slice data into different dimensions (cohorts).
- Understand dataset distribution across cohorts and feature groups.
- Determine whether fairness, error analysis, and causality findings (from other dashboard components) are influenced by dataset distribution.
- Identify areas where additional data collection is needed to address representation issues, label noise, feature noise, label bias, and similar factors.
## Model Interpretability
Machine learning models are often "black boxes," making it challenging to understand which features drive predictions. Transparency is essential to explain why a model makes certain predictions. For example, if an AI system predicts that a diabetic patient is at risk of readmission within 30 days, it should provide supporting data for its prediction. Transparency helps clinicians and hospitals make informed decisions and ensures accountability with health regulations. When machine learning models impact people's lives, understanding and explaining their behavior is crucial. Model interpretability addresses questions such as:
- Model debugging: Why did my model make this mistake? How can I improve it?
- Human-AI collaboration: How can I understand and trust the models decisions?
- Regulatory compliance: Does my model meet legal requirements?
The Feature Importance component in the RAI dashboard helps debug and understand how a model makes predictions. It is a valuable tool for machine learning professionals and decision-makers to explain the features influencing a model's behavior for regulatory compliance. Users can explore global and local explanations to validate which features drive predictions. Global explanations identify top features affecting overall predictions, while local explanations focus on features influencing individual cases. Local explanations are particularly useful for debugging or auditing specific cases to understand why a model made accurate or inaccurate predictions.
![Feature Importance component of the RAI dashboard](../../../../9-Real-World/2-Debugging-ML-Models/images/9-feature-importance.png)
- Global explanations: For example, what features influence the overall behavior of a diabetes hospital readmission model?
- Local explanations: For example, why was a diabetic patient over 60 years old with prior hospitalizations predicted to be readmitted or not readmitted within 30 days?
In debugging a model's performance across cohorts, Feature Importance reveals the impact of features on predictions. It helps identify anomalies by comparing feature influence on erroneous predictions. The component can show which feature values positively or negatively influenced outcomes. For instance, if a model made an inaccurate prediction, the component allows users to drill down and pinpoint the features responsible. This level of detail aids debugging, provides transparency, and ensures accountability in audits. Additionally, it can help identify fairness issues. For example, if sensitive features like ethnicity or gender significantly influence predictions, this may indicate bias.
![Feature importance](../../../../9-Real-World/2-Debugging-ML-Models/images/9-features-influence.png)
Use interpretability to:
- Assess the trustworthiness of AI predictions by understanding key features driving outcomes.
- Debug models by identifying whether they rely on meaningful features or false correlations.
- Detect potential fairness issues by examining whether predictions are based on sensitive features or features correlated with them.
- Build user trust by generating local explanations to illustrate outcomes.
- Conduct regulatory audits to validate models and monitor their impact on humans.
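Outside the dashboard, scikit-learn's permutation importance offers a quick approximation of global feature importance: it measures how much a model's score drops when each feature is shuffled. A minimal sketch with illustrative feature names:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical readmission data with illustrative feature names.
X = pd.DataFrame({
    "age":              [65, 72, 40, 55, 80, 33, 68, 59],
    "time_in_hospital": [2, 9, 1, 4, 12, 2, 7, 3],
    "prior_admissions": [1, 3, 0, 1, 4, 0, 2, 1],
})
y = [0, 1, 0, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.3f}")  # larger drop in score = more important feature
```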
## Conclusion
The RAI dashboard components are practical tools for building machine learning models that are less harmful and more trustworthy. They help prevent threats to human rights, discrimination, exclusion from opportunities, and risks of physical or psychological harm. Additionally, they foster trust in model decisions by generating local explanations to illustrate outcomes. Potential harms can be categorized as:
- **Allocation**: Favoring one gender or ethnicity over another.
- **Quality of service**: Training data for a specific scenario while neglecting real-world complexity, leading to poor service performance.
- **Stereotyping**: Associating a group with predefined attributes.
- **Denigration**: Unfairly criticizing or labeling something or someone.
- **Over- or under-representation**: Certain groups are not visible in specific professions, and any service or function that continues to promote this imbalance contributes to harm.
### Azure RAI Dashboard
[Azure RAI Dashboard](https://learn.microsoft.com/en-us/azure/machine-learning/concept-responsible-ai-dashboard?WT.mc_id=aiml-90525-ruyakubu) is built on open-source tools developed by leading academic institutions and organizations, including Microsoft. These tools help data scientists and AI developers better understand model behavior and identify and address undesirable issues in AI models.
- Learn how to use the various components by reviewing the RAI dashboard [documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-responsible-ai-dashboard?WT.mc_id=aiml-90525-ruyakubu).
- Explore some RAI dashboard [sample notebooks](https://github.com/Azure/RAI-vNext-Preview/tree/main/examples/notebooks) to debug more responsible AI scenarios in Azure Machine Learning.
---
## 🚀 Challenge
To prevent statistical or data biases from being introduced in the first place, we should:
- Ensure diversity in backgrounds and perspectives among the people working on systems.
- Invest in datasets that represent the diversity of our society.
- Develop better methods for detecting and correcting bias when it occurs.
Consider real-world scenarios where unfairness is evident in model development and application. What additional factors should we take into account?
## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
## Review & Self Study
In this lesson, you've learned about practical tools for incorporating responsible AI into machine learning.
Watch this workshop to explore the topics further:
- Responsible AI Dashboard: A comprehensive solution for operationalizing RAI in practice by Besmira Nushi and Mehrnoosh Sameki
[![Responsible AI Dashboard: A comprehensive solution for operationalizing RAI in practice](https://img.youtube.com/vi/f1oaDNl3djg/0.jpg)](https://www.youtube.com/watch?v=f1oaDNl3djg "Responsible AI Dashboard: A comprehensive solution for operationalizing RAI in practice")
> 🎥 Click the image above to watch the video: Responsible AI Dashboard: A comprehensive solution for operationalizing RAI in practice by Besmira Nushi and Mehrnoosh Sameki
Refer to the following resources to learn more about responsible AI and how to create more trustworthy models:
- Microsoft's RAI dashboard tools for debugging ML models: [Responsible AI tools resources](https://aka.ms/rai-dashboard)
- Explore the Responsible AI toolkit: [GitHub](https://github.com/microsoft/responsible-ai-toolbox)
- Microsoft's RAI resource center: [Responsible AI Resources - Microsoft AI](https://www.microsoft.com/ai/responsible-ai-resources?activetab=pivot1%3aprimaryr4)
- Microsoft's FATE research group: [FATE: Fairness, Accountability, Transparency, and Ethics in AI - Microsoft Research](https://www.microsoft.com/research/theme/fate/)
## Assignment
[Explore RAI Dashboard](assignment.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the definitive source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,25 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "91c6a180ef08e20cc15acfd2d6d6e164",
"translation_date": "2025-09-06T10:52:52+00:00",
"source_file": "9-Real-World/2-Debugging-ML-Models/assignment.md",
"language_code": "en"
}
-->
# Explore Responsible AI (RAI) dashboard
## Instructions
In this lesson, you learned about the RAI dashboard, a set of tools based on "open-source" technologies designed to assist data scientists in tasks such as error analysis, data exploration, fairness evaluation, model interpretability, counterfactual/what-if assessments, and causal analysis for AI systems. For this assignment, explore some of the sample [notebooks](https://github.com/Azure/RAI-vNext-Preview/tree/main/examples/notebooks) provided for the RAI dashboard and summarize your findings in a paper or presentation.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A paper or PowerPoint presentation is submitted, discussing the components of the RAI dashboard, the notebook that was executed, and the conclusions derived from running it | A paper is submitted without conclusions | No paper is submitted |
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,32 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "5e069a0ac02a9606a69946c2b3c574a9",
"translation_date": "2025-09-06T10:51:33+00:00",
"source_file": "9-Real-World/README.md",
"language_code": "en"
}
-->
# Postscript: Real-world applications of classical machine learning
In this section of the curriculum, you'll explore some practical applications of classical machine learning. We've searched extensively to find whitepapers and articles showcasing how these techniques are applied, while deliberately minimizing the focus on neural networks, deep learning, and AI. Discover how machine learning is utilized in business systems, environmental projects, finance, arts and culture, and more.
![chess](../../../9-Real-World/images/chess.jpg)
> Photo by <a href="https://unsplash.com/@childeye?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Alexis Fauvet</a> on <a href="https://unsplash.com/s/photos/artificial-intelligence?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## Lesson
1. [Real-World Applications for ML](1-Applications/README.md)
2. [Model Debugging in Machine Learning using Responsible AI dashboard components](2-Debugging-ML-Models/README.md)
## Credits
"Real-World Applications" was created by a team including [Jen Looper](https://twitter.com/jenlooper) and [Ornella Altunyan](https://twitter.com/ornelladotcom).
"Model Debugging in Machine Learning using Responsible AI dashboard components" was authored by [Ruth Yakubu](https://twitter.com/ruthieyakubu)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,23 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "c06b12caf3c901eb3156e3dd5b0aea56",
"translation_date": "2025-09-06T10:44:18+00:00",
"source_file": "CODE_OF_CONDUCT.md",
"language_code": "en"
}
-->
# Microsoft Open Source Code of Conduct
This project follows the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- For questions or concerns, contact [opencode@microsoft.com](mailto:opencode@microsoft.com)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,24 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "977ec5266dfd78ad1ce2bd8d46fccbda",
"translation_date": "2025-09-06T10:43:56+00:00",
"source_file": "CONTRIBUTING.md",
"language_code": "en"
}
-->
# Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA), which confirms that you have the rights to, and indeed do, grant us permission to use your contribution. For more details, visit https://cla.microsoft.com.
> Important: When translating text in this repository, please ensure you do not use machine translation. Translations will be verified by the community, so only volunteer for translations in languages you are fluent in.
When you submit a pull request, a CLA-bot will automatically check whether you need to provide a CLA and will update the PR accordingly (e.g., with labels or comments). Simply follow the instructions provided by the bot. You only need to complete this process once across all repositories that use our CLA.
This project follows the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,177 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "612aef7a03204260e940131b09691977",
"translation_date": "2025-09-06T10:42:53+00:00",
"source_file": "README.md",
"language_code": "en"
}
-->
[![GitHub license](https://img.shields.io/github/license/microsoft/ML-For-Beginners.svg)](https://github.com/microsoft/ML-For-Beginners/blob/master/LICENSE)
[![GitHub contributors](https://img.shields.io/github/contributors/microsoft/ML-For-Beginners.svg)](https://GitHub.com/microsoft/ML-For-Beginners/graphs/contributors/)
[![GitHub issues](https://img.shields.io/github/issues/microsoft/ML-For-Beginners.svg)](https://GitHub.com/microsoft/ML-For-Beginners/issues/)
[![GitHub pull-requests](https://img.shields.io/github/issues-pr/microsoft/ML-For-Beginners.svg)](https://GitHub.com/microsoft/ML-For-Beginners/pulls/)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)
[![GitHub watchers](https://img.shields.io/github/watchers/microsoft/ML-For-Beginners.svg?style=social&label=Watch)](https://GitHub.com/microsoft/ML-For-Beginners/watchers/)
[![GitHub forks](https://img.shields.io/github/forks/microsoft/ML-For-Beginners.svg?style=social&label=Fork)](https://GitHub.com/microsoft/ML-For-Beginners/network/)
[![GitHub stars](https://img.shields.io/github/stars/microsoft/ML-For-Beginners.svg?style=social&label=Star)](https://GitHub.com/microsoft/ML-For-Beginners/stargazers/)
### 🌐 Multi-Language Support
#### Supported via GitHub Action (Automated & Always Up-to-Date)
[French](../fr/README.md) | [Spanish](../es/README.md) | [German](../de/README.md) | [Russian](../ru/README.md) | [Arabic](../ar/README.md) | [Persian (Farsi)](../fa/README.md) | [Urdu](../ur/README.md) | [Chinese (Simplified)](../zh/README.md) | [Chinese (Traditional, Macau)](../mo/README.md) | [Chinese (Traditional, Hong Kong)](../hk/README.md) | [Chinese (Traditional, Taiwan)](../tw/README.md) | [Japanese](../ja/README.md) | [Korean](../ko/README.md) | [Hindi](../hi/README.md) | [Bengali](../bn/README.md) | [Marathi](../mr/README.md) | [Nepali](../ne/README.md) | [Punjabi (Gurmukhi)](../pa/README.md) | [Portuguese (Portugal)](../pt/README.md) | [Portuguese (Brazil)](../br/README.md) | [Italian](../it/README.md) | [Polish](../pl/README.md) | [Turkish](../tr/README.md) | [Greek](../el/README.md) | [Thai](../th/README.md) | [Swedish](../sv/README.md) | [Danish](../da/README.md) | [Norwegian](../no/README.md) | [Finnish](../fi/README.md) | [Dutch](../nl/README.md) | [Hebrew](../he/README.md) | [Vietnamese](../vi/README.md) | [Indonesian](../id/README.md) | [Malay](../ms/README.md) | [Tagalog (Filipino)](../tl/README.md) | [Swahili](../sw/README.md) | [Hungarian](../hu/README.md) | [Czech](../cs/README.md) | [Slovak](../sk/README.md) | [Romanian](../ro/README.md) | [Bulgarian](../bg/README.md) | [Serbian (Cyrillic)](../sr/README.md) | [Croatian](../hr/README.md) | [Slovenian](../sl/README.md) | [Ukrainian](../uk/README.md) | [Burmese (Myanmar)](../my/README.md)
#### Join the Community
[![Azure AI Discord](https://dcbadge.limes.pink/api/server/kzRShWzttr)](https://discord.gg/kzRShWzttr)
# Machine Learning for Beginners - A Curriculum
> 🌍 Travel around the world as we explore Machine Learning by means of world cultures 🌍
Cloud Advocates at Microsoft are excited to present a 12-week, 26-lesson curriculum all about **Machine Learning**. In this curriculum, you'll dive into what's often referred to as **classic machine learning**, primarily using Scikit-learn as a library and steering clear of deep learning, which is covered in our [AI for Beginners' curriculum](https://aka.ms/ai4beginners). Pair these lessons with our ['Data Science for Beginners' curriculum](https://aka.ms/ds4beginners) for a comprehensive learning experience!
Join us on a journey around the globe as we apply these classic techniques to datasets from various regions of the world. Each lesson includes pre- and post-lesson quizzes, step-by-step instructions, solutions, assignments, and more. Our project-based approach ensures that you learn by doing, a proven method for retaining new skills.
**✍️ A big thank you to our authors** Jen Looper, Stephen Howell, Francesca Lazzeri, Tomomi Imura, Cassie Breviu, Dmitry Soshnikov, Chris Noring, Anirban Mukherjee, Ornella Altunyan, Ruth Yakubu, and Amy Boyd
**🎨 Special thanks to our illustrators** Tomomi Imura, Dasani Madipalli, and Jen Looper
**🙏 Heartfelt gratitude 🙏 to our Microsoft Student Ambassador authors, reviewers, and contributors**, including Rishit Dagli, Muhammad Sakib Khan Inan, Rohan Raj, Alexandru Petrescu, Abhishek Jaiswal, Nawrin Tabassum, Ioan Samuila, and Snigdha Agarwal
**🤩 Extra thanks to Microsoft Student Ambassadors Eric Wanjau, Jasleen Sondhi, and Vidushi Gupta for our R lessons!**
# Getting Started
Follow these steps:
1. **Fork the Repository**: Click the "Fork" button at the top-right corner of this page.
2. **Clone the Repository**: `git clone https://github.com/microsoft/ML-For-Beginners.git`
> [Find all additional resources for this course in our Microsoft Learn collection](https://learn.microsoft.com/en-us/collections/qrqzamz1nn2wx3?WT.mc_id=academic-77952-bethanycheum)
**[Students](https://aka.ms/student-page)**, to use this curriculum, fork the entire repository to your GitHub account and work through the exercises individually or in a group:
- Start with a pre-lecture quiz.
- Read the lesson and complete the activities, pausing to reflect during each knowledge check.
- Try building the projects by understanding the lessons rather than simply running the solution code (though the solution code is available in the `/solution` folders for each project-based lesson).
- Take the post-lecture quiz.
- Complete the challenge.
- Finish the assignment.
- After completing a lesson group, visit the [Discussion Board](https://github.com/microsoft/ML-For-Beginners/discussions) and "learn out loud" by filling out the appropriate PAT rubric. A 'PAT' is a Progress Assessment Tool that helps you reflect on your learning. You can also engage with others' PATs to foster collaborative learning.
> For further study, we recommend exploring these [Microsoft Learn](https://docs.microsoft.com/en-us/users/jenlooper-2911/collections/k7o7tg1gp306q4?WT.mc_id=academic-77952-leestott) modules and learning paths.
**Teachers**, we've [included some suggestions](for-teachers.md) on how to use this curriculum.
---
## Video walkthroughs
Some lessons are available as short-form videos. You can find these embedded in the lessons or on the [ML for Beginners playlist on the Microsoft Developer YouTube channel](https://aka.ms/ml-beginners-videos) by clicking the image below.
[![ML for beginners banner](../../images/ml-for-beginners-video-banner.png)](https://aka.ms/ml-beginners-videos)
---
## Meet the Team
[![Promo video](../../images/ml.gif)](https://youtu.be/Tj1XWrDSYJU)
**Gif by** [Mohit Jaisal](https://linkedin.com/in/mohitjaisal)
> 🎥 Click the image above for a video about the project and the team behind it!
---
## Pedagogy
We've designed this curriculum with two key principles: it's **hands-on and project-based**, and it includes **frequent quizzes**. Additionally, the curriculum follows a common **theme** to maintain cohesion.
By aligning the content with projects, we make the learning process more engaging and improve concept retention. Low-stakes quizzes before a lesson help set the stage for learning, while post-lesson quizzes reinforce understanding. This curriculum is flexible and fun, and you can complete it in full or in part. The projects start small and grow in complexity over the 12-week cycle. A postscript on real-world ML applications is included, which can be used as extra credit or for discussion.
> Check out our [Code of Conduct](CODE_OF_CONDUCT.md), [Contributing](CONTRIBUTING.md), and [Translation](TRANSLATIONS.md) guidelines. We welcome your constructive feedback!
## Each lesson includes
- Optional sketchnote
- Optional supplemental video
- Video walkthrough (for some lessons)
- [Pre-lecture warmup quiz](https://ff-quizzes.netlify.app/en/ml/)
- Written lesson
- For project-based lessons, step-by-step guides to build the project
- Knowledge checks
- A challenge
- Supplemental reading
- Assignment
- [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
> **A note about languages**: These lessons are primarily written in Python, but many are also available in R. To complete an R lesson, go to the `/solution` folder and look for R lessons. They include an .rmd extension, which represents an **R Markdown** file. R Markdown allows you to combine code, its output, and your thoughts in a single document. These files can be rendered into formats like PDF, HTML, or Word.
> **A note about quizzes**: All quizzes are located in the [Quiz App folder](../../quiz-app), with 52 quizzes in total (three questions each). They are linked within the lessons, but you can also run the quiz app locally. Follow the instructions in the `quiz-app` folder to host it locally or deploy it to Azure.
| Lesson Number | Topic | Lesson Grouping | Learning Objectives | Linked Lesson | Author |
| :-----------: | :------------------------------------------------------------: | :-------------------------------------------------: | ------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------: |
| 01 | Introduction to machine learning | [Introduction](1-Introduction/README.md) | Learn the basic concepts behind machine learning | [Lesson](1-Introduction/1-intro-to-ML/README.md) | Muhammad |
| 02 | The History of machine learning | [Introduction](1-Introduction/README.md) | Learn the history underlying this field | [Lesson](1-Introduction/2-history-of-ML/README.md) | Jen and Amy |
| 03 | Fairness and machine learning | [Introduction](1-Introduction/README.md) | What are the important philosophical issues around fairness that students should consider when building and applying ML models? | [Lesson](1-Introduction/3-fairness/README.md) | Tomomi |
| 04 | Machine Learning Techniques | [Introduction](1-Introduction/README.md) | What techniques do ML researchers use to create ML models? | [Lesson](1-Introduction/4-techniques-of-ML/README.md) | Chris and Jen |
| 05 | Introduction to Regression | [Regression](2-Regression/README.md) | Get started with Python and Scikit-learn for regression models | <ul><li>[Python](2-Regression/1-Tools/README.md)</li><li>[R](../../2-Regression/1-Tools/solution/R/lesson_1.html)</li></ul> | <ul><li>Jen</li><li>Eric Wanjau</li></ul> |
| 06 | North American Pumpkin Prices 🎃 | [Regression](2-Regression/README.md) | Visualize and clean data to prepare for ML | <ul><li>[Python](2-Regression/2-Data/README.md)</li><li>[R](../../2-Regression/2-Data/solution/R/lesson_2.html)</li></ul> | <ul><li>Jen</li><li>Eric Wanjau</li></ul> |
| 07 | North American Pumpkin Prices 🎃 | [Regression](2-Regression/README.md) | Build linear and polynomial regression models | <ul><li>[Python](2-Regression/3-Linear/README.md)</li><li>[R](../../2-Regression/3-Linear/solution/R/lesson_3.html)</li></ul> | <ul><li>Jen and Dmitry</li><li>Eric Wanjau</li></ul> |
| 08 | North American Pumpkin Prices 🎃 | [Regression](2-Regression/README.md) | Build a logistic regression model | <ul><li>[Python](2-Regression/4-Logistic/README.md) </li><li>[R](../../2-Regression/4-Logistic/solution/R/lesson_4.html)</li></ul> | <ul><li>Jen</li><li>Eric Wanjau</li></ul> |
| 09 | A Web App 🔌 | [Web App](3-Web-App/README.md) | Build a web app to use your trained model | [Python](3-Web-App/1-Web-App/README.md) | Jen |
| 10 | Introduction to Classification | [Classification](4-Classification/README.md) | Clean, prepare, and visualize your data; introduction to classification | <ul><li> [Python](4-Classification/1-Introduction/README.md) </li><li>[R](../../4-Classification/1-Introduction/solution/R/lesson_10.html)</li></ul> | <ul><li>Jen and Cassie</li><li>Eric Wanjau</li></ul> |
| 11 | Delicious Asian and Indian Cuisines 🍜 | [Classification](4-Classification/README.md) | Introduction to classifiers | <ul><li> [Python](4-Classification/2-Classifiers-1/README.md)</li><li>[R](../../4-Classification/2-Classifiers-1/solution/R/lesson_11.html)</li></ul> | <ul><li>Jen and Cassie</li><li>Eric Wanjau</li></ul> |
| 12 | Delicious Asian and Indian Cuisines 🍜 | [Classification](4-Classification/README.md) | More classifiers | <ul><li> [Python](4-Classification/3-Classifiers-2/README.md)</li><li>[R](../../4-Classification/3-Classifiers-2/solution/R/lesson_12.html)</li></ul> | <ul><li>Jen and Cassie</li><li>Eric Wanjau</li></ul> |
| 13 | Delicious Asian and Indian Cuisines 🍜 | [Classification](4-Classification/README.md) | Build a recommender web app using your model | [Python](4-Classification/4-Applied/README.md) | Jen |
| 14 | Introduction to Clustering | [Clustering](5-Clustering/README.md) | Clean, prepare, and visualize your data; introduction to clustering | <ul><li> [Python](5-Clustering/1-Visualize/README.md)</li><li>[R](../../5-Clustering/1-Visualize/solution/R/lesson_14.html)</li></ul> | <ul><li>Jen</li><li>Eric Wanjau</li></ul> |
| 15 | Exploring Nigerian Musical Tastes 🎧 | [Clustering](5-Clustering/README.md) | Explore the K-Means clustering method | <ul><li> [Python](5-Clustering/2-K-Means/README.md)</li><li>[R](../../5-Clustering/2-K-Means/solution/R/lesson_15.html)</li></ul> | <ul><li>Jen</li><li>Eric Wanjau</li></ul> |
| 16 | Introduction to Natural Language Processing ☕️ | [Natural language processing](6-NLP/README.md) | Learn the basics of NLP by building a simple bot | [Python](6-NLP/1-Introduction-to-NLP/README.md) | Stephen |
| 17 | Common NLP Tasks ☕️ | [Natural language processing](6-NLP/README.md) | Deepen your NLP knowledge by understanding common tasks required when working with language structures | [Python](6-NLP/2-Tasks/README.md) | Stephen |
| 18 | Translation and Sentiment Analysis ♥️ | [Natural language processing](6-NLP/README.md) | Translation and sentiment analysis with Jane Austen | [Python](6-NLP/3-Translation-Sentiment/README.md) | Stephen |
| 19 | Romantic Hotels of Europe ♥️ | [Natural language processing](6-NLP/README.md) | Sentiment analysis with hotel reviews 1 | [Python](6-NLP/4-Hotel-Reviews-1/README.md) | Stephen |
| 20 | Romantic Hotels of Europe ♥️ | [Natural language processing](6-NLP/README.md) | Sentiment analysis with hotel reviews 2 | [Python](6-NLP/5-Hotel-Reviews-2/README.md) | Stephen |
| 21 | Introduction to Time Series Forecasting | [Time series](7-TimeSeries/README.md) | Introduction to time series forecasting | [Python](7-TimeSeries/1-Introduction/README.md) | Francesca |
| 22 | ⚡️ World Power Usage ⚡️ - Time Series Forecasting with ARIMA | [Time series](7-TimeSeries/README.md) | Time series forecasting with ARIMA | [Python](7-TimeSeries/2-ARIMA/README.md) | Francesca |
| 23 | ⚡️ World Power Usage ⚡️ - Time Series Forecasting with SVR | [Time series](7-TimeSeries/README.md) | Time series forecasting with Support Vector Regressor | [Python](7-TimeSeries/3-SVR/README.md) | Anirban |
| 24 | Introduction to Reinforcement Learning | [Reinforcement learning](8-Reinforcement/README.md) | Introduction to reinforcement learning with Q-Learning | [Python](8-Reinforcement/1-QLearning/README.md) | Dmitry |
| 25 | Help Peter Avoid the Wolf! 🐺 | [Reinforcement learning](8-Reinforcement/README.md) | Reinforcement learning Gym | [Python](8-Reinforcement/2-Gym/README.md) | Dmitry |
| Postscript | Real-World ML Scenarios and Applications | [ML in the Wild](9-Real-World/README.md) | Interesting and revealing real-world applications of classical ML | [Lesson](9-Real-World/1-Applications/README.md) | Team |
| Postscript | Model Debugging in ML using RAI Dashboard | [ML in the Wild](9-Real-World/README.md) | Model debugging in machine learning using Responsible AI dashboard components | [Lesson](9-Real-World/2-Debugging-ML-Models/README.md) | Ruth Yakubu |
> [Find all additional resources for this course in our Microsoft Learn collection](https://learn.microsoft.com/en-us/collections/qrqzamz1nn2wx3?WT.mc_id=academic-77952-bethanycheum)
## Offline Access
You can run this documentation offline using [Docsify](https://docsify.js.org/#/). Fork this repo, [install Docsify](https://docsify.js.org/#/quickstart) on your local machine, and then in the root folder of this repo, type `docsify serve`. The website will be served on port 3000 on your localhost: `localhost:3000`.
## PDFs
Find a PDF of the curriculum with links [here](https://microsoft.github.io/ML-For-Beginners/pdf/readme.pdf).
## 🎒 Other Courses
Our team produces other courses! Check out:
- [Generative AI for Beginners](https://aka.ms/genai-beginners)
- [Generative AI for Beginners .NET](https://github.com/microsoft/Generative-AI-for-beginners-dotnet)
- [Generative AI with JavaScript](https://github.com/microsoft/generative-ai-with-javascript)
- [Generative AI with Java](https://github.com/microsoft/Generative-AI-for-beginners-java)
- [AI for Beginners](https://aka.ms/ai-beginners)
- [Data Science for Beginners](https://aka.ms/datascience-beginners)
- [ML for Beginners](https://aka.ms/ml-beginners)
- [Cybersecurity for Beginners](https://github.com/microsoft/Security-101)
- [Web Dev for Beginners](https://aka.ms/webdev-beginners)
- [IoT for Beginners](https://aka.ms/iot-beginners)
- [XR Development for Beginners](https://github.com/microsoft/xr-development-for-beginners)
- [Mastering GitHub Copilot for Paired Programming](https://github.com/microsoft/Mastering-GitHub-Copilot-for-Paired-Programming)
- [Mastering GitHub Copilot for C#/.NET Developers](https://github.com/microsoft/mastering-github-copilot-for-dotnet-csharp-developers)
- [Choose Your Own Copilot Adventure](https://github.com/microsoft/CopilotAdventures)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,51 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "5e1b8da31aae9cca3d53ad243fa3365a",
"translation_date": "2025-09-06T10:44:02+00:00",
"source_file": "SECURITY.md",
"language_code": "en"
}
-->
## Security
Microsoft prioritizes the security of its software products and services, including all source code repositories managed through our GitHub organizations, such as [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
If you believe you have identified a security vulnerability in any Microsoft-owned repository that aligns with [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/previous-versions/tn-archive/cc751383(v=technet.10)?WT.mc_id=academic-77952-leestott), please report it to us using the process outlined below.
## Reporting Security Issues
**Do not report security vulnerabilities through public GitHub issues.**
Instead, report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
If you prefer to submit your report without logging in, you can email [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message using our PGP key, which can be downloaded from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
You should receive a response within 24 hours. If you do not, please follow up via email to confirm that we received your original message. Additional details can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the following information (as much as you can provide) to help us better understand the nature and scope of the potential issue:
* Type of issue (e.g., buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of the source file(s) related to the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration needed to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if available)
* Impact of the issue, including how an attacker might exploit it
Providing this information will help us process your report more efficiently.
If you are submitting a report for a bug bounty, more detailed reports may result in a higher bounty award. For more information about our active programs, visit the [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft adheres to the principles of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,24 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "872be8bc1b93ef1dd9ac3d6e8f99f6ab",
"translation_date": "2025-09-06T10:43:52+00:00",
"source_file": "SUPPORT.md",
"language_code": "en"
}
-->
# Support
## How to report issues and seek assistance
This project utilizes GitHub Issues to manage bug reports and feature requests. Before submitting a new issue, please check the existing ones to avoid duplicates. If you need to report a new issue, submit your bug or feature request as a new Issue.
For assistance or questions regarding the use of this project, please create an issue.
## Microsoft Support Policy
Support for this repository is restricted to the resources mentioned above.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,57 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "68dd06c685f6ce840e0acfa313352e7c",
"translation_date": "2025-09-06T10:51:25+00:00",
"source_file": "docs/_sidebar.md",
"language_code": "en"
}
-->
- Introduction
- [Introduction to Machine Learning](../1-Introduction/1-intro-to-ML/README.md)
- [History of Machine Learning](../1-Introduction/2-history-of-ML/README.md)
- [ML and Fairness](../1-Introduction/3-fairness/README.md)
- [Techniques of ML](../1-Introduction/4-techniques-of-ML/README.md)
- Regression
- [Tools of the Trade](../2-Regression/1-Tools/README.md)
- [Data](../2-Regression/2-Data/README.md)
- [Linear Regression](../2-Regression/3-Linear/README.md)
- [Logistic Regression](../2-Regression/4-Logistic/README.md)
- Build a Web App
- [Web App](../3-Web-App/1-Web-App/README.md)
- Classification
- [Introduction to Classification](../4-Classification/1-Introduction/README.md)
- [Classifiers 1](../4-Classification/2-Classifiers-1/README.md)
- [Classifiers 2](../4-Classification/3-Classifiers-2/README.md)
- [Applied ML](../4-Classification/4-Applied/README.md)
- Clustering
- [Visualize Your Data](../5-Clustering/1-Visualize/README.md)
- [K-Means](../5-Clustering/2-K-Means/README.md)
- NLP
- [Introduction to NLP](../6-NLP/1-Introduction-to-NLP/README.md)
- [NLP Tasks](../6-NLP/2-Tasks/README.md)
- [Translation and Sentiment](../6-NLP/3-Translation-Sentiment/README.md)
- [Hotel Reviews 1](../6-NLP/4-Hotel-Reviews-1/README.md)
- [Hotel Reviews 2](../6-NLP/5-Hotel-Reviews-2/README.md)
- Time Series Forecasting
- [Introduction to Time Series Forecasting](../7-TimeSeries/1-Introduction/README.md)
- [ARIMA](../7-TimeSeries/2-ARIMA/README.md)
- [SVR](../7-TimeSeries/3-SVR/README.md)
- Reinforcement Learning
- [Q-Learning](../8-Reinforcement/1-QLearning/README.md)
- [Gym](../8-Reinforcement/2-Gym/README.md)
- Real World ML
- [Applications](../9-Real-World/1-Applications/README.md)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,37 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "b37de02054fa6c0438ede6fabe1fdfb8",
"translation_date": "2025-09-06T10:44:11+00:00",
"source_file": "for-teachers.md",
"language_code": "en"
}
-->
## For Educators
Would you like to use this curriculum in your classroom? Feel free to do so!
In fact, you can use it directly within GitHub by leveraging GitHub Classroom.
To get started, fork this repository. Then create a separate repository for each lesson by extracting each folder into its own repository, so that [GitHub Classroom](https://classroom.github.com/classrooms) can handle each lesson individually.
These [detailed instructions](https://github.blog/2020-03-18-set-up-your-digital-classroom-with-github-classroom/) will guide you on how to set up your classroom.
## Using the repository as is
If you'd prefer to use this repository as it currently exists, without utilizing GitHub Classroom, that's also an option. You'll need to coordinate with your students on which lesson to work through together.
In an online setting (Zoom, Teams, or similar platforms), you could create breakout rooms for quizzes and mentor students to prepare them for learning. Then, invite students to complete the quizzes and submit their answers as 'issues' at a designated time. You could follow the same approach for assignments if you want students to collaborate openly.
If you'd rather use a more private format, ask your students to fork the curriculum, lesson by lesson, into their own private GitHub repositories and grant you access. This way, they can complete quizzes and assignments privately and submit them to you via issues on your classroom repository.
There are many ways to adapt this for an online classroom environment. Let us know what works best for you!
## Please give us your thoughts!
We want this curriculum to be effective for you and your students. Please share your [feedback](https://forms.microsoft.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR2humCsRZhxNuI79cm6n0hRUQzRVVU9VVlU5UlFLWTRLWlkyQUxORTg5WS4u).
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

@ -0,0 +1,127 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "6d130dffca5db70d7e615f926cb1ad4c",
"translation_date": "2025-09-06T10:55:37+00:00",
"source_file": "quiz-app/README.md",
"language_code": "en"
}
-->
# Quizzes
These quizzes are the pre- and post-lecture quizzes for the ML curriculum at https://aka.ms/ml-beginners
## Project setup
```
npm install
```
### Compiles and hot-reloads for development
```
npm run serve
```
### Compiles and minifies for production
```
npm run build
```
### Lints and fixes files
```
npm run lint
```
### Customize configuration
See [Configuration Reference](https://cli.vuejs.org/config/).
Credits: Thanks to the original version of this quiz app: https://github.com/arpan45/simple-quiz-vue
## Deploying to Azure
Here's a step-by-step guide to help you get started:
1. Fork the GitHub Repository
Ensure your static web app code is in your GitHub repository. Fork this repository.
2. Create an Azure Static Web App
- Create an [Azure account](http://azure.microsoft.com)
- Go to the [Azure portal](https://portal.azure.com)
- Click on “Create a resource” and search for “Static Web App”.
- Click “Create”.
3. Configure the Static Web App
- Basics:
- Subscription: Select your Azure subscription.
- Resource Group: Create a new resource group or use an existing one.
- Name: Provide a name for your static web app.
- Region: Choose the region closest to your users.
- #### Deployment Details:
- Source: Select “GitHub”.
- GitHub Account: Authorize Azure to access your GitHub account.
- Organization: Select your GitHub organization.
- Repository: Choose the repository containing your static web app.
- Branch: Select the branch you want to deploy from.
- #### Build Details:
- Build Presets: Choose the framework your app is built with (e.g., React, Angular, Vue, etc.).
- App Location: Specify the folder containing your app code (e.g., / if it's in the root).
- API Location: If you have an API, specify its location (optional).
- Output Location: Specify the folder where the build output is generated (e.g., build or dist).
4. Review and Create
Review your settings and click “Create”. Azure will set up the necessary resources and create a GitHub Actions workflow in your repository.
5. GitHub Actions Workflow
Azure will automatically create a GitHub Actions workflow file in your repository (.github/workflows/azure-static-web-apps-<name>.yml). This workflow will handle the build and deployment process.
6. Monitor the Deployment
Go to the “Actions” tab in your GitHub repository.
You should see a workflow running. This workflow will build and deploy your static web app to Azure.
Once the workflow completes, your app will be live on the provided Azure URL.
### Example Workflow File
Here's an example of what the GitHub Actions workflow file might look like:
```
name: Azure Static Web Apps CI/CD

on:
  push:
    branches:
      - main
  pull_request:
    types: [opened, synchronize, reopened, closed]
    branches:
      - main

jobs:
  build_and_deploy_job:
    runs-on: ubuntu-latest
    name: Build and Deploy Job
    steps:
      - uses: actions/checkout@v2
      - name: Build And Deploy
        id: builddeploy
        uses: Azure/static-web-apps-deploy@v1
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN }}
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          action: "upload"
          app_location: "/quiz-app" # App source code path
          api_location: "" # API source code path - optional
          output_location: "dist" # Built app content directory - optional
```
### Additional Resources
- [Azure Static Web Apps Documentation](https://learn.microsoft.com/azure/static-web-apps/getting-started)
- [GitHub Actions Documentation](https://docs.github.com/actions/use-cases-and-examples/deploying/deploying-to-azure-static-web-app)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,122 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "fba3b94d88bfb9b81369b869a1e9a20f",
"translation_date": "2025-09-06T10:58:18+00:00",
"source_file": "sketchnotes/LICENSE.md",
"language_code": "en"
}
-->
Rights, then the database in which you hold Sui Generis Database Rights (but not its individual contents) is considered Adapted Material,
including for the purposes of Section 3(b); and
c. You must comply with the conditions in Section 3(a) if you share all or a substantial portion of the contents of the database.
For clarity, this Section 4 supplements and does not replace your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
---
**Section 5 -- Disclaimer of Warranties and Limitation of Liability**
a. UNLESS OTHERWISE AGREED SEPARATELY BY THE LICENSOR, TO THE EXTENT POSSIBLE, THE LICENSOR PROVIDES THE LICENSED MATERIAL "AS IS" AND "AS AVAILABLE," AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND REGARDING THE LICENSED MATERIAL, WHETHER EXPRESS, IMPLIED, STATUTORY, OR OTHERWISE. THIS INCLUDES, BUT IS NOT LIMITED TO, WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT FULLY OR PARTIALLY PERMITTED, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, UNDER NO CIRCUMSTANCES WILL THE LICENSOR BE LIABLE TO YOU UNDER ANY LEGAL THEORY (INCLUDING, BUT NOT LIMITED TO, NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, COSTS, EXPENSES, OR DAMAGES ARISING FROM THIS PUBLIC LICENSE OR USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR DAMAGES. WHERE LIMITATIONS OF LIABILITY ARE NOT FULLY OR PARTIALLY PERMITTED, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability outlined above shall be interpreted in a way that, to the extent possible, most closely resembles an absolute disclaimer and waiver of all liability.
---
**Section 6 -- Term and Termination**
a. This Public License remains in effect for the duration of the Copyright and Similar Rights licensed here. However, if you fail to comply with this Public License, your rights under this Public License terminate automatically.
b. If your right to use the Licensed Material is terminated under Section 6(a), it may be reinstated:
1. Automatically as of the date the violation is remedied, provided it is remedied within 30 days of your discovery of the violation; or
2. Upon explicit reinstatement by the Licensor.
For clarity, this Section 6(b) does not affect any rights the Licensor may have to seek remedies for your violations of this Public License.
c. For clarity, the Licensor may also offer the Licensed Material under separate terms or conditions or cease distributing the Licensed Material at any time; however, doing so does not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 remain in effect even after termination of this Public License.
---
**Section 7 -- Other Terms and Conditions**
a. The Licensor is not bound by any additional or different terms or conditions communicated by you unless explicitly agreed upon.
b. Any arrangements, understandings, or agreements regarding the Licensed Material that are not stated here are separate from and independent of the terms and conditions of this Public License.
---
**Section 8 -- Interpretation**
a. For clarity, this Public License does not, and should not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically adjusted to the minimum extent necessary to make it enforceable. If the provision cannot be adjusted, it shall be removed from this Public License without affecting the enforceability of the remaining terms and conditions.
c. No term or condition of this Public License will be waived, and no failure to comply will be consented to unless explicitly agreed upon by the Licensor.
d. Nothing in this Public License constitutes or may be interpreted as a limitation on, or waiver of, any privileges and immunities that apply to the Licensor or you, including those arising from the legal processes of any jurisdiction or authority.
---
Creative Commons is not a party to its public licenses. However, Creative Commons may choose to apply one of its public licenses to material it publishes, and in such cases, it will be considered the “Licensor.” The text of the Creative Commons public licenses is dedicated to the public domain under the CC0 Public Domain Dedication. Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at creativecommons.org/policies, Creative Commons does not authorize the use of the trademark "Creative Commons" or any other trademark or logo of Creative Commons without prior written consent, including, but not limited to, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning the use of licensed material. For clarity, this paragraph is not part of the public licenses.
Creative Commons can be contacted at creativecommons.org.
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
@ -0,0 +1,21 @@
<!--
CO_OP_TRANSLATOR_METADATA:
{
"original_hash": "a88d5918c1b9da69a40d917a0840c497",
"translation_date": "2025-09-06T10:57:50+00:00",
"source_file": "sketchnotes/README.md",
"language_code": "en"
}
-->
All the sketchnotes for the curriculum can be downloaded here.
🖨 For high-resolution printing, TIFF versions are available at [this repo](https://github.com/girliemac/a-picture-is-worth-a-1000-words/tree/main/ml/tiff).
🎨 Created by: [Tomomi Imura](https://github.com/girliemac) (Twitter: [@girlie_mac](https://twitter.com/girlie_mac))
[![CC BY-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/4.0/)
---
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.