@@ -0,0 +1,38 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

**Smartphone (please complete the following information):**
- Device: [e.g. iPhone6]
- OS: [e.g. iOS8.1]
- Browser [e.g. stock browser, safari]
- Version [e.g. 22]

**Additional context**
Add any other context about the problem here.
@@ -0,0 +1,20 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
@@ -0,0 +1,37 @@
## Track translation progress by opening a draft PR using this template and checking off the translations completed

Each lesson includes a translation of the README.md and the Assignment.md file, if available. Please mark a lesson complete only when both of those files have been translated.

- [ ] 1
- [ ] 1-1
- [ ] 1-2
- [ ] 1-3
- [ ] 2
- [ ] 2-1
- [ ] 2-2
- [ ] 2-3
- [ ] 2-4
- [ ] 3
- [ ] 3-1
- [ ] 3-2
- [ ] 3-3
- [ ] 4
- [ ] 4-1
- [ ] 5
- [ ] 5-1
- [ ] 5-2
- [ ] 5-3
- [ ] 6
- [ ] 6-1
- [ ] 6-2
- [ ] 6-3
- [ ] 6-4
- [ ] 6-5
- [ ] 6-6
- [ ] 7
- [ ] 7-1
- [ ] 7-2
- [ ] 7-3
- [ ] 7-4

- [ ] Quiz (add a file in the quiz-app with all localizations)
@@ -0,0 +1,45 @@
name: Azure Static Web Apps CI/CD

on:
  push:
    branches:
      - main
  pull_request:
    types: [opened, synchronize, reopened, closed]
    branches:
      - main

jobs:
  build_and_deploy_job:
    if: github.event_name == 'push' || (github.event_name == 'pull_request' && github.event.action != 'closed')
    runs-on: ubuntu-latest
    name: Build and Deploy Job
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true
      - name: Build And Deploy
        id: builddeploy
        uses: Azure/static-web-apps-deploy@v0.0.1-preview
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN_JOLLY_SEA_0A877260F }}
          repo_token: ${{ secrets.GITHUB_TOKEN }} # Used for GitHub integrations (e.g. PR comments)
          action: "upload"
          ###### Repository/Build Configurations - These values can be configured to match your app requirements. ######
          # For more information regarding Static Web App workflow configurations, please visit: https://aka.ms/swaworkflowconfig
          app_location: "/quiz-app" # App source code path
          api_location: "api" # Api source code path - optional
          output_location: "dist" # Built app content directory - optional
          ###### End of Repository/Build Configurations ######

  close_pull_request_job:
    if: github.event_name == 'pull_request' && github.event.action == 'closed'
    runs-on: ubuntu-latest
    name: Close Pull Request Job
    steps:
      - name: Close Pull Request
        id: closepullrequest
        uses: Azure/static-web-apps-deploy@v0.0.1-preview
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN_JOLLY_SEA_0A877260F }}
          action: "close"
@@ -0,0 +1,14 @@
# Contributing

This project welcomes contributions and suggestions. Most contributions require you to
agree to a Contributor License Agreement (CLA) declaring that you have the right to,
and actually do, grant us the rights to use your contribution. For details, visit
https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need
to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the
instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
@@ -0,0 +1,55 @@
# [Lesson Topic]

Add a sketchnote if possible/appropriate

![Embed a video here if available](video-url)

## [Pre-lecture quiz](link-to-quiz-app)

Describe what we will learn

### Introduction

Describe what will be covered

> Notes

### Prerequisite

What steps should have been covered before this lesson?

### Preparation

Preparatory steps to start this lesson

---

[Step through content in blocks]

## [Topic 1]

### Task:

Work together to progressively enhance your codebase to build the project with shared code:

```html
code blocks
```

✅ Knowledge Check - use this moment to stretch students' knowledge with open questions

## [Topic 2]

## [Topic 3]

## 🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

Optional: add a screenshot of the completed lesson's UI if appropriate

## [Post-lecture quiz](link-to-quiz-app)

## Review & Self Study

**Assignment**: [Assignment Name](assignment.md)
@@ -0,0 +1,9 @@
# [Assignment Name]

## Instructions

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
|          |           |          |                   |
@@ -0,0 +1,13 @@
# Getting Started with Classification

In this section of the curriculum you will learn how to classify data using Machine Learning.

## Lessons

1. [Visualize your Data and Prepare it for Use](1-Data/README.md)
2. [Build a Discriminative Model](2-Discriminative/README.md)
3. [Build a Generative Model](3-Generative/README.md)
4. [Applied ML: Build a Web App](4-Applied/README.md)

## Credits

"Getting Started with Classification" was written with ♥️ by [Cassie Breviu](https://github.com/cassieview)
@@ -0,0 +1,305 @@
# Introduction to Clustering

[![No One Like You by PSquare](https://img.youtube.com/vi/ty2advRiWJM/0.jpg)](https://youtu.be/ty2advRiWJM "No One Like You by PSquare")

> While you're studying Machine Learning with Clustering, enjoy some Nigerian Dance Hall tracks - this is a highly rated song from 2014 by PSquare.

## [Pre-lecture quiz](link-to-quiz-app)

### Introduction

Clustering is a type of [Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning) that presumes that a dataset is unlabeled. It uses various algorithms to sort through unlabeled data and provide groupings according to patterns it discerns in the data.

> TODO infographic

[Clustering](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_124) is very useful for data exploration. Let's see if it can help discover trends and patterns in the way Nigerian audiences consume music.

✅ Take a minute to think about the uses of clustering. In real life, clustering happens whenever you have a pile of laundry and need to sort out your family members' clothes 🧦👕👖🩲. In data science, clustering happens when trying to analyze a user's preferences, or determine the characteristics of any unlabeled dataset. Clustering, in a way, helps make sense of chaos.

In a professional setting, clustering can be used to determine things like market segmentation - what age groups buy what items, for example. Another use would be anomaly detection, perhaps to detect fraud in a dataset of credit card transactions. Or you might use clustering to identify tumors in a batch of medical scans.

✅ Think a minute about how you might have encountered clustering 'in the wild', in a banking, e-commerce, or business setting.

> 🎓 Interestingly, Cluster Analysis originated in the fields of Anthropology and Psychology in the 1930s. Can you imagine how it might have been used?

Alternately, you could use it for grouping search results - by shopping links, images, or reviews, for example. Clustering is useful when you have a large dataset that you want to reduce and on which you want to perform more granular analysis, so the technique can be used to learn about data before other models are constructed.

✅ Once your data is organized in clusters, you assign it a cluster Id. This technique can be useful for preserving a dataset's privacy; you can refer to a data point by its cluster Id rather than by more revealing identifiable data. Can you think of other reasons why you'd refer to a cluster Id rather than other elements of the cluster to identify it?

## Getting started with clustering

[Scikit-Learn offers a large array](https://scikit-learn.org/stable/modules/clustering.html) of methods to perform clustering. The type you choose will depend on your use case. According to the documentation, each method has various benefits. Here is a simplified table of the methods supported by Scikit-Learn and their appropriate use cases:

| Method name                  | Use case                                                               |
| :--------------------------- | :--------------------------------------------------------------------- |
| K-Means                      | general purpose, inductive                                             |
| Affinity propagation         | many, uneven clusters, inductive                                       |
| Mean-shift                   | many, uneven clusters, inductive                                       |
| Spectral clustering          | few, even clusters, transductive                                       |
| Ward hierarchical clustering | many, constrained clusters, transductive                               |
| Agglomerative clustering     | many, constrained, non Euclidean distances, transductive               |
| DBSCAN                       | non-flat geometry, uneven clusters, transductive                       |
| OPTICS                       | non-flat geometry, uneven clusters with variable density, transductive |
| Gaussian mixtures            | flat geometry, inductive                                               |
| BIRCH                        | large dataset with outliers, inductive                                 |

> 🎓 How we create clusters has a lot to do with how we gather up the data points into groups. Let's unpack some vocabulary:
>
> 🎓 ['Transductive' vs. 'inductive'](https://wikipedia.org/wiki/Transduction_(machine_learning))
>
> Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules, which are only then applied to test cases.
>
> An example: Imagine you have a dataset that is only partially labeled. Some things are 'records', some 'cds', and some are blank. Your job is to provide labels for the blanks. If you choose an inductive approach, you'd train a model looking for 'records' and 'cds', and apply those labels to your unlabeled data. This approach will have trouble classifying things that are actually 'cassettes'. A transductive approach, on the other hand, handles this unknown data more effectively as it works to group similar items together and then applies a label to a group. In this case, clusters might reflect 'round musical things' and 'square musical things'.
>
> 🎓 ['Non-flat' vs. 'flat' geometry](https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering)
>
> Derived from mathematical terminology, non-flat vs. flat geometry refers to the measure of distances between points by either 'flat' ([Euclidean](https://wikipedia.org/wiki/Euclidean_geometry)) or 'non-flat' (non-Euclidean) geometrical methods.
>
> 'Flat' in this context refers to Euclidean geometry (parts of which are taught as 'plane' geometry), and non-flat refers to non-Euclidean geometry. What does geometry have to do with machine learning? Well, as two fields that are rooted in mathematics, there must be a common way to measure distances between points in clusters, and that can be done in a 'flat' or 'non-flat' way, depending on the nature of the data. If your data, visualized, seems to not exist on a plane, you might need to use a specialized algorithm to handle it.
>
> Infographic: like the last one here https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering
>
> 🎓 ['Distances'](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)
>
> Clusters are defined by their distance matrix, i.e. the distances between points. This distance can be measured a few ways. Euclidean clusters are defined by the average of the point values, and contain a 'centroid' or center point. Distances are thus measured by the distance to that centroid. Non-Euclidean distances refer to 'clustroids', the point closest to other points. Clustroids in turn can be defined in various ways.
>
> 🎓 ['Constrained'](https://wikipedia.org/wiki/Constrained_clustering)
>
> [Constrained Clustering](https://web.cs.ucdavis.edu/~davidson/Publications/ICDMTutorial.pdf) introduces 'semi-supervised' learning into this unsupervised method. The relationships between points are flagged as 'cannot link' or 'must-link', so some rules are forced on the dataset.
>
> An example: If an algorithm is set free on a batch of unlabelled or semi-labelled data, the clusters it produces may be of poor quality. In the example above, the clusters might group 'round music things', 'square music things', 'triangular things', and 'cookies'. If given some constraints, or rules to follow ("the item must be made of plastic", "the item needs to be able to produce music"), this can help 'constrain' the algorithm to make better choices.
>
> 🎓 'Density'
>
> Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, or 'crowded', and thus this data needs to be analyzed with the appropriate clustering method. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) demonstrates the difference between using K-Means clustering vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster density.
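To make the 'transductive vs. inductive' distinction above concrete, here is a minimal sketch (assuming scikit-learn and NumPy are installed; the points are invented). K-Means is inductive: the fitted model can label brand-new points with `predict()`. Agglomerative clustering is transductive: it only labels the points it was fit on, and has no `predict()` at all.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Two well-separated blobs of 2-D points
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])

# Inductive: K-Means learns cluster centers - a general rule
# it can then apply to unseen data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.predict([[10.5, 10.5]]))  # labels a brand-new point

# Transductive: Agglomerative clustering only labels the data it saw;
# there is no predict() method for new points
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X)
print(labels)
```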
### Clustering Algorithms

There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones:

**Hierarchical clustering**

If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Scikit-Learn's Agglomerative clustering is hierarchical.

TODO: infographic

**Centroid clustering**

This popular algorithm requires the choice of 'k', the number of clusters to form, after which the algorithm determines the center point of each cluster and gathers data around that point. [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) is a popular version of centroid clustering. Each center is determined by the nearest mean, thus the name. The squared distance from the cluster center is minimized.
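As a loose sketch of this idea (scikit-learn's `KMeans` on invented points, not the lesson's dataset): we choose k=2 up front, and the algorithm places two centers, each at the mean of the points assigned to it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points, around x=1 and x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # which cluster each point landed in
print(kmeans.cluster_centers_)  # each center is the mean of its points
```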
TODO: infographic

**Distribution-based clustering**

Based in statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian Mixture methods belong to this type.
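A brief sketch with scikit-learn's `GaussianMixture` (on made-up 1-D data, with two assumed components): unlike the hard assignments of K-Means, each point gets a probability of belonging to each cluster.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two 1-D Gaussian blobs, centered near 0 and near 10
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
# A point halfway between the blobs gets a soft, probabilistic assignment
probs = gm.predict_proba([[5.0]])
print(probs)  # membership probabilities for each component, summing to 1
```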
**Density-based clustering**

Data points are assigned to clusters based on their density, or their grouping around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering.
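For instance, a sketch with scikit-learn's `DBSCAN` on toy points (the values here are invented): no cluster count is chosen up front, and the isolated point comes back labeled `-1`, meaning noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away straggler
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8],
              [25, 25]])

db = DBSCAN(eps=2, min_samples=2).fit(X)
print(db.labels_)  # the straggler at (25, 25) is labeled -1 (noise)
```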
**Grid-based clustering**

For multi-dimensional datasets, a grid is created and the data is divided amongst the grid's cells, thereby creating clusters.
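Scikit-Learn's table above has no grid-based method, but the core idea is simple enough to sketch in plain NumPy (the cell size here is an arbitrary choice): each point maps to the grid cell it falls into, and points sharing a cell form the seed of a cluster.

```python
import numpy as np

points = np.array([[0.2, 0.3], [0.4, 0.1], [5.1, 5.2], [5.3, 5.4]])

cell_size = 1.0
# Integer cell coordinates: points in the same cell get the same pair
cells = np.floor(points / cell_size).astype(int)
print(cells)  # the first two points share a cell, as do the last two
```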
### Preparing the data

Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which of the methods of clustering we can most effectively use, given the nature of this data.

Open the notebook.ipynb file in this folder. Install the Seaborn package for good data visualization:

```python
!pip install seaborn
```

Next, open the song data .csv file and load up a dataframe with some data about the songs. Get ready to explore this data by importing the libraries and dumping out the data:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("../../data/nigerian-songs.csv")
df.head()
```
Check the first few lines of data:

| | name | album | artist | artist_top_genre | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
| --- | ------------------------ | ---------------------------- | ------------------- | ---------------- | ------------ | ------ | ---------- | ------------ | ------------ | ------ | ---------------- | -------- | -------- | ----------- | ------- | -------------- |
| 0 | Sparky | Mandy & The Jungle | Cruel Santino | alternative r&b | 2019 | 144000 | 48 | 0.666 | 0.851 | 0.42 | 0.534 | 0.11 | -6.699 | 0.0829 | 133.015 | 5 |
| 1 | shuga rush | EVERYTHING YOU HEARD IS TRUE | Odunsi (The Engine) | afropop | 2020 | 89488 | 30 | 0.71 | 0.0822 | 0.683 | 0.000169 | 0.101 | -5.64 | 0.36 | 129.993 | 3 |
| 2 | LITT! | LITT! | AYLØ | indie r&b | 2018 | 207758 | 40 | 0.836 | 0.272 | 0.564 | 0.000537 | 0.11 | -7.127 | 0.0424 | 130.005 | 4 |
| 3 | Confident / Feeling Cool | Enjoy Your Life | Lady Donli | nigerian pop | 2019 | 175135 | 14 | 0.894 | 0.798 | 0.611 | 0.000187 | 0.0964 | -4.961 | 0.113 | 111.087 | 4 |
| 4 | wanted you | rare. | Odunsi (The Engine) | afropop | 2018 | 152049 | 25 | 0.702 | 0.116 | 0.833 | 0.91 | 0.348 | -6.044 | 0.0447 | 105.115 | 4 |

Get some information about the dataframe:

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              530 non-null    object
 1   album             530 non-null    object
 2   artist            530 non-null    object
 3   artist_top_genre  530 non-null    object
 4   release_date      530 non-null    int64
 5   length            530 non-null    int64
 6   popularity        530 non-null    int64
 7   danceability      530 non-null    float64
 8   acousticness      530 non-null    float64
 9   energy            530 non-null    float64
 10  instrumentalness  530 non-null    float64
 11  liveness          530 non-null    float64
 12  loudness          530 non-null    float64
 13  speechiness       530 non-null    float64
 14  tempo             530 non-null    float64
 15  time_signature    530 non-null    int64
dtypes: float64(8), int64(4), object(4)
memory usage: 66.4+ KB
```
Double-check for null values:

```python
df.isnull().sum()
```

Looking good:

```
name                0
album               0
artist              0
artist_top_genre    0
release_date        0
length              0
popularity          0
danceability        0
acousticness        0
energy              0
instrumentalness    0
liveness            0
loudness            0
speechiness         0
tempo               0
time_signature      0
dtype: int64
```

Describe the data:

```python
df.describe()
```

| | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
| ----- | ------------ | ----------- | ---------- | ------------ | ------------ | -------- | ---------------- | -------- | --------- | ----------- | ---------- | -------------- |
| count | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 |
| mean | 2015.390566 | 222298.1698 | 17.507547 | 0.741619 | 0.265412 | 0.760623 | 0.016305 | 0.147308 | -4.953011 | 0.130748 | 116.487864 | 3.986792 |
| std | 3.131688 | 39696.82226 | 18.992212 | 0.117522 | 0.208342 | 0.148533 | 0.090321 | 0.123588 | 2.464186 | 0.092939 | 23.518601 | 0.333701 |
| min | 1998 | 89488 | 0 | 0.255 | 0.000665 | 0.111 | 0 | 0.0283 | -19.362 | 0.0278 | 61.695 | 3 |
| 25% | 2014 | 199305 | 0 | 0.681 | 0.089525 | 0.669 | 0 | 0.07565 | -6.29875 | 0.0591 | 102.96125 | 4 |
| 50% | 2016 | 218509 | 13 | 0.761 | 0.2205 | 0.7845 | 0.000004 | 0.1035 | -4.5585 | 0.09795 | 112.7145 | 4 |
| 75% | 2017 | 242098.5 | 31 | 0.8295 | 0.403 | 0.87575 | 0.000234 | 0.164 | -3.331 | 0.177 | 125.03925 | 4 |
| max | 2020 | 511738 | 73 | 0.966 | 0.954 | 0.995 | 0.91 | 0.811 | 0.582 | 0.514 | 206.007 | 5 |

Look at the general values of the data. Note that popularity can be '0', which shows songs that have no ranking. Let's remove those shortly.
Use a barplot to find out the most popular genres:

```python
import seaborn as sns

top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index, y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres', color='blue')
```

![most popular](images/popular.png)

✅ If you'd like to see more top values, change the top `[:5]` to a bigger value, or remove it to see all.

Note, when the top genre is described as 'Missing', that means that Spotify did not classify it, so let's get rid of it:

```python
df = df[df['artist_top_genre'] != 'Missing']
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index, y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres', color='blue')
```

Now recheck the most popular genres:

![most popular](images/popular.png)

By far, the top three genres dominate this dataset, so let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, also filtering the dataset to remove anything with a 0 popularity value (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes):

```python
df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
df = df[(df['popularity'] > 0)]
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index, y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres', color='blue')
```

Do a quick test to see if the data correlates in any particularly strong way:

```python
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
```

![correlations](images/correlation.png)

The only strong correlation is between `energy` and `loudness`, which is not too surprising, given that loud music is usually pretty energetic. Otherwise, the correlations are relatively weak. It will be interesting to see what a clustering algorithm can make of this data.
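Rather than eyeballing the heatmap, you can also rank pairings numerically. The sketch below uses a small synthetic frame (so it runs stand-alone; the column names mimic the lesson's data, and with the real dataset you'd apply the same two lines to `df`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
energy = rng.random(100)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": 0.9 * energy + 0.1 * rng.random(100),  # tracks energy closely
    "tempo": rng.random(100),                          # unrelated
})

corr = toy.corr(numeric_only=True)
# The column most strongly correlated with energy, ignoring energy itself
print(corr["energy"].drop("energy").abs().idxmax())
```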
|
||||
|
||||
Is there any convergence in this dataset around a song's perceived popularity and danceability? A FacetGrid shows that there are concentric circles that line up, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre?
|
||||
|
||||
✅ Try different datapoints (energy, loudness, speechiness) and more or different musical genres. What can you discover? Take a look at the `df.describe()` table to see the general spread of the data points.
|
||||
|
||||
### Data Distribution
|
||||
|
||||
Are these three genres significantly different in the perception of their danceability, based on their popularity? Examine our top three genres data distribution for popularity and danceability along a given x and y axis.
|
||||
|
||||
```python
|
||||
sns.set_theme(style="ticks")
|
||||
|
||||
g = sns.jointplot(
|
||||
data=df,
|
||||
x="popularity", y="danceability", hue="artist_top_genre",
|
||||
kind="kde",
|
||||
)
|
||||
```
|
||||
|
||||
You can discover concentric circles around a general point of convergence, showing the distribution of points. In general, the three genres align loosely in terms of their popularity and danceability. Determining clusters in this loosely-aligned data will be interesting:
|
||||
|
||||
![distribution](images/distribution.png)
|
||||
|
||||
A scatterplot of the same axes shows a similar pattern of convergence:
|
||||
|
||||
```python
|
||||
sns.FacetGrid(df, hue="artist_top_genre", size=5) \
|
||||
.map(plt.scatter, "popularity", "danceability") \
|
||||
.add_legend()
|
||||
```
|
||||
|
||||
In general, for clustering, you can use scatterplots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in this data that seem to overlap in interesting ways.
## 🚀Challenge
## [Post-lecture quiz](link-to-quiz-app)
## Review & Self Study
Before you apply clustering algorithms, as we have learned, it's a good idea to understand the nature of your dataset. Read more on this topic [here](https://www.kdnuggets.com/2019/10/right-clustering-algorithm.html).
[This helpful article](https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/) walks you through the different ways that various clustering algorithms behave, given different data shapes.
In the next lesson, you will make use of the most popular clustering method, K-Means. Take a look at Stanford's K-Means Simulator [here](https://stanford.edu/class/engr108/visualizations/kmeans/kmeans.html). You can use this tool to visualize sample data points and determine their centroids. With fresh data, click 'update' to see how long it takes to find convergence. You can edit the data's randomness, the number of clusters, and the number of centroids. Does this help you get an idea of how the data can be grouped?
**Assignment**: [Assignment Name](assignment.md)
# [Assignment Name]

## Instructions

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
|          |           |          |                   |
{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python383jvsc74a57bd0e134e05457d34029b6460cd73bbf1ed73f339b5b6d98c95be70b69eba114fe95",
   "display_name": "Python 3.8.3 64-bit (conda)"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "# Nigerian Music scraped from Spotify - an analysis"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ]
}
# [Lesson Topic]

Add a sketchnote if possible/appropriate

![Embed a video here if available](video-url)

https://www.youtube.com/watch?v=hDmNF9JG3lo

https://stanford.edu/~cpiech/cs221/handouts/kmeans.html

## [Pre-lecture quiz](link-to-quiz-app)

Describe what we will learn

### Introduction

Describe what will be covered

> Notes

### Prerequisite

What steps should have been covered before this lesson?

### Preparation

Preparatory steps to start this lesson

---

[Step through content in blocks]

## [Topic 1]

### Task:

Work together to progressively enhance your codebase to build the project with shared code:

```html
code blocks
```

✅ Knowledge Check - use this moment to stretch students' knowledge with open questions

## [Topic 2]

## [Topic 3]

## 🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

Optional: add a screenshot of the completed lesson's UI if appropriate

## [Post-lecture quiz](link-to-quiz-app)

## Review & Self Study

**Assignment**: [Assignment Name](assignment.md)
# [Assignment Name]

## Instructions

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
|          |           |          |                   |
{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": 3
  },
  "orig_nbformat": 2
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "# Nigerian Music scraped from Spotify - an analysis"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}
# Clustering Models for Machine Learning
## Regional topic: Clustering models for a Nigerian audience's musical taste

Nigeria's diverse audience has diverse musical tastes. Using data scraped from Spotify (inspired by [this article](https://towardsdatascience.com/country-wise-visual-analysis-of-music-taste-using-spotify-api-seaborn-in-python-77f5b749b421)), let's look at some music popular in Nigeria. This dataset includes data about various songs' 'danceability' score, 'acousticness', loudness, 'speechiness', popularity and energy. It will be interesting to discover patterns in this data!

![A turntable](./images/turntable.jpg)

Photo by <a href="https://unsplash.com/@marcelalaskoski?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marcela Laskoski</a> on <a href="https://unsplash.com/s/photos/nigerian-music?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

In this series of lessons, you will discover new ways to analyze data using clustering techniques. Clustering is particularly useful when your dataset lacks labels. If it does have labels, then classification techniques such as those you learned in previous lessons are more useful. But in cases where you are looking to group unlabelled data, clustering is a great way to discover patterns.

## Lessons

1. [Introduction to Clustering](1-Visualize/README.md)
2. [K-Means Clustering](2-K-Means/README.md)

## Credits

These lessons were written with ♥️ by [Jen Looper](https://www.twitter.com/jenlooper) with helpful reviews by Muhammad Sakib Khan Inan.

The [Nigerian Songs](https://www.kaggle.com/sootersaalu/nigerian-songs-spotify) dataset was sourced from Kaggle as scraped from Spotify.
# Introduction to Machine Learning

Add a sketchnote if possible/appropriate

[![ML, AI, Deep Learning - What's the difference?](https://img.youtube.com/vi/lTd9RSxS9ZE/0.jpg)](https://youtu.be/lTd9RSxS9ZE "ML, AI, Deep Learning - What's the difference?")

> Click this image to watch a video discussing the difference between Machine Learning, AI, and Deep Learning.

## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/1/)

Describe what we will learn

### Introduction

Describe what will be covered

> Notes

### Prerequisite

What steps should have been covered before this lesson?

### Preparation

Preparatory steps to start this lesson

---

[Step through content in blocks]

## [Topic 1]

### Task:

Work together to progressively enhance your codebase to build the project with shared code:

```html
code blocks
```

✅ Knowledge Check - use this moment to stretch students' knowledge with open questions

## [Topic 2]

## [Topic 3]

## 🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

Optional: add a screenshot of the completed lesson's UI if appropriate

## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/2/)

## Review & Self Study

**Assignment**: [Assignment Name](assignment.md)
# Assignment Name

## Instructions

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
|          |           |          |                   |
# Introduction to Machine Learning

Add a sketchnote if possible/appropriate

![Embed a video here if available](video-url)

[Check out this podcast where Amy Boyd discusses the evolution of AI](http://runasradio.com/Shows/Show/739)

## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/3/)

Describe what we will learn

### Introduction

Describe what will be covered

> Notes

### Prerequisite

What steps should have been covered before this lesson?

### Preparation

Preparatory steps to start this lesson

---

[Step through content in blocks]

## [Topic 1]

### Task:

Work together to progressively enhance your codebase to build the project with shared code:

```html
code blocks
```

✅ Knowledge Check - use this moment to stretch students' knowledge with open questions

## [Topic 2]

## [Topic 3]

## 🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

Optional: add a screenshot of the completed lesson's UI if appropriate

## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/4/)

## Review & Self Study

**Assignment**: [Assignment Name](assignment.md)
# [Assignment Name]

## Instructions

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
|          |           |          |                   |
# The Ethics of Machine Learning

Add a sketchnote if possible/appropriate

> ✅ Learn more about Responsible AI by following this [Learning Path](https://docs.microsoft.com/en-us/learn/modules/responsible-ai-principles/?WT.mc_id=academic-15963-cxa)

[![Microsoft's Approach to Responsible AI](https://img.youtube.com/vi/dnC8-uUZXSc/0.jpg)](https://youtu.be/dnC8-uUZXSc "Microsoft's Approach to Responsible AI")
> Video: Microsoft's Approach to Responsible AI

## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/5/)

Describe what we will learn

### Introduction

Describe what will be covered

> Notes

### Prerequisite

What steps should have been covered before this lesson?

### Preparation

Preparatory steps to start this lesson

---

[![Eric Horvitz discusses Ethical AI](https://img.youtube.com/vi/tL7t2O5Iu8E/0.jpg)](https://youtu.be/tL7t2O5Iu8E "Eric Horvitz, Technical Fellow and Director of Microsoft Research Labs, talks about some of the benefits AI and machine learning are bringing and why it is essential for companies to establish ethical principles to make sure AI is properly governed.")
> Video: Eric Horvitz, Technical Fellow and Director of Microsoft Research Labs, talks about some of the benefits AI and machine learning are bringing and why it is essential for companies to establish ethical principles to make sure AI is properly governed.

[Step through content in blocks]

## [Topic 1]

### Task:

Work together to progressively enhance your codebase to build the project with shared code:

```html
code blocks
```

✅ Knowledge Check - use this moment to stretch students' knowledge with open questions

## [Topic 2]

## [Topic 3]

## 🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

Optional: add a screenshot of the completed lesson's UI if appropriate

## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/6/)

## Review & Self Study

**Assignment**: [Assignment Name](assignment.md)
# Assignment
# Introduction to Machine Learning

In this section of the curriculum, you will be introduced to the basic concepts underlying the field of machine learning, what it is, and its history.

### Lessons

1. [Introduction to Machine Learning](1-intro-to-ML/README.md)
1. [The History of Machine Learning](2-history-of-ML/README.md)
1. [Ethics and Machine Learning](3-ethics/README.md)

### Credits

"Introduction to Machine Learning" was written with ♥️ by [Name](Twitter)

"The History of Machine Learning" was written with ♥️ by [Name](Twitter)

"Ethics and Machine Learning" was written with ♥️ by [Name](Twitter)
# Introduction to Natural Language Processing

Add a sketchnote if possible/appropriate

[![NLP 101](https://img.youtube.com/vi/C75SiVhXjRM/0.jpg)](https://youtu.be/C75SiVhXjRM "NLP 101")

## [Pre-lecture quiz](link-to-quiz-app)

## Introduction
This lesson covers a brief history and important concepts of *Computational Linguistics*, focusing on *Natural Language Processing*. NLP, as it is commonly known, is one of the best-known areas where machine learning has been applied and used in production software.
✅ Can you think of software that you use every day that probably has some NLP embedded? What about your word processing programs or mobile apps that you use regularly?
You will learn how ideas about language developed and what the major areas of study have been. You will also learn definitions and concepts about how computers process text, including parsing, grammar, and identifying nouns and verbs. There are some coding tasks in this lesson, and several important concepts are introduced that you will learn to code in later lessons.
Computational linguistics is an area of research and development, spanning many decades, that studies how computers can work with, and even understand, translate, and communicate with languages. Natural Language Processing (NLP) is a related field focused on how computers can process 'natural', or human, languages. If you have ever dictated to your phone instead of typing, or asked a virtual assistant a question, your speech was converted into text and then processed, or *parsed*, from the language you spoke. The detected keywords were then processed into a format that the phone or assistant could understand and act on.
This is possible because someone wrote a computer program to do it. A few decades ago, some science fiction writers predicted that people would mostly speak to their computers, and the computers would always understand exactly what they meant. Sadly, it turned out to be a harder problem than many imagined, and while it is a much better understood problem today, there are significant challenges in achieving 'perfect' natural language processing when it comes to understanding the meaning of a sentence. This is a particularly hard problem when it comes to understanding humour or detecting emotions such as sarcasm in a sentence.
At this point, you may be remembering school classes where the teacher covered the parts of grammar in a sentence. In some countries, students are taught grammar and linguistics as a dedicated subject, but in many, these topics are included as part of learning a language: either your first language in primary school (learning to read and write) and perhaps a second language in post-primary, or high school. Don't worry if you are not an expert at differentiating nouns from verbs or adverbs from adjectives!
If you struggle with the difference between the *simple present* and *present progressive*, you are not alone. This is a challenging thing for many people, even native speakers of a language. The good news is that computers are really good at applying formal rules, and you will learn to write code that can *parse* a sentence as well as a human. The greater challenge you will examine later is understanding the *meaning* and *sentiment* of a sentence.

## Prerequisites
For this lesson, the main prerequisite is being able to read and understand the language of this lesson. There are no math problems or equations to solve. While the original author wrote this lesson in English, it is also translated into other languages, so you could be reading a translation. There are examples where a number of different languages are used (to compare the different grammar rules of different languages). These are *not* translated, but the explanatory text is, so the meaning should be clear.
For the coding tasks, you will use Python; the examples use Python 3.8.

In this section, you will need:

* Python 3 programming language comprehension
  * this lesson uses input, loops, file reading, arrays
* Visual Studio Code with its Python extension
  * (*or the Python IDE of your choice*)
* [TextBlob](https://github.com/sloria/TextBlob), a simplified text processing library for Python
  * Follow the instructions on the TextBlob site to install it on your system (install the corpora as well, as shown below)
```bash
pip install -U textblob
python -m textblob.download_corpora
```

> 💡 Tip: You can run Python directly in VS Code environments. Check the [docs](https://code.visualstudio.com/docs/languages/python?WT.mc_id=academic-15963-cxa) for more information.
## Conversing with Eliza
The history of trying to make computers understand human language goes back decades, and one of the earliest scientists to consider natural language processing was *Alan Turing*. When Turing was researching *Artificial Intelligence* in the 1950s, he considered whether a conversational test could be given to a human and a computer (via typed correspondence), where the human in the conversation was not sure whether they were conversing with another human or a computer. If, after a certain length of conversation, the human could not determine whether or not the answers came from a computer, could the computer be said to be *thinking*?
[![Chatting with Eliza](https://img.youtube.com/vi/QD8mQXaUFG4/0.jpg)](https://youtu.be/QD8mQXaUFG4 "Chatting with Eliza")
The idea for this came from a party game called *The Imitation Game*, where an interrogator is alone in a room and tasked with determining which of two people (in another room) is male and which is female. The interrogator can send notes, and must try to think of questions where the written answers reveal the gender of the mystery person. Of course, the players in the other room are trying to trick the interrogator by answering questions in such a way as to mislead or confuse the interrogator, whilst also giving the appearance of answering honestly.
In the 1960s an MIT scientist called *Joseph Weizenbaum* developed [*Eliza*](https://en.wikipedia.org/wiki/ELIZA), a computer 'therapist' that would ask the human questions and give the appearance of understanding their answers. However, while Eliza could parse a sentence and identify certain grammatical constructs and keywords so as to give a reasonable answer, it could not be said to *understand* the sentence. If Eliza was presented with a sentence following the format "**I am** <u>sad</u>", it might rearrange and substitute words in the sentence to form the response "How long have **you been** <u>sad</u>?". This gave the impression that Eliza understood the statement and was asking a follow-on question, whereas in reality, it was changing the tense and adding some words. If Eliza could not identify a keyword that it had a response for, it would instead give a random response that could be applicable to many different statements. Eliza could be easily tricked; for instance, if a user wrote "**You are** a <u>bicycle</u>", it might respond with "How long have **I been** a <u>bicycle</u>?", instead of a more reasoned response.
> Note: You can read the original description of [Eliza](https://cacm.acm.org/magazines/1966/1/13317-elizaa-computer-program-for-the-study-of-natural-language-communication-between-man-and-machine/abstract), published in 1966, if you have an ACM account. Alternatively, read about Eliza on [Wikipedia](https://en.wikipedia.org/wiki/ELIZA).
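Eliza's trick of spotting a pattern like "I am ..." and reflecting it back with the pronouns flipped can be sketched in a few lines. This is a hypothetical toy, not Weizenbaum's actual rule set:

```python
import re

def eliza_reflect(statement):
    # Match the pattern "I am <something>" and turn it into a follow-on question,
    # swapping first person for second person as Eliza did
    match = re.match(r"I am (.+)", statement, re.IGNORECASE)
    if match:
        return f"How long have you been {match.group(1)}?"
    # No keyword matched: fall back to a generic, widely applicable response
    return "Please, go on."

print(eliza_reflect("I am sad"))            # How long have you been sad?
print(eliza_reflect("Nice weather today"))  # Please, go on.
```

Note that nothing here understands the sentence; it is pure pattern matching, which is exactly why such a bot is easy to trick.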
### Task: Coding a basic conversational bot
A conversational bot, like Eliza, is a program that elicits user input and seems to understand and respond intelligently. Unlike Eliza, our bot will not have several rules giving it the appearance of having an intelligent conversation. Instead, our bot will have one ability only: to keep the conversation going with random responses that might work in almost any trivial conversation.
Your steps when building a conversational bot:

1. Print instructions advising the user how to interact with the bot
2. Start a loop
   1. Accept user input
   2. If the user has asked to exit, then exit
   3. Process the user input and determine a response (in this case, the response is a random choice from a list of possible generic responses)
   4. Print the response
3. Loop back to step 2
Create this bot yourself in Python with the following random responses:
```python
random_responses = ["That is quite interesting, please tell me more.",
                    "I see. Do go on.",
                    "Why do you say that?",
                    "Funny weather we've been having, isn't it?",
                    "Let's change the subject.",
                    "Did you catch the game last night?"]
```

Here is some sample output to guide you (user input is on the lines starting with `>`):
```
Hello, I am Marvin, the simple robot.
You can end this conversation at any time by typing 'bye'
After typing each answer, press 'enter'
How are you today?
> I am good thanks
That is quite interesting, please tell me more.
> today I went for a walk
Did you catch the game last night?
> I did, but my team lost
Funny weather we've been having, isn't it?
> yes but I hope next week is better
Let's change the subject.
> ok, lets talk about music
Why do you say that?
> because I like music!
Why do you say that?
> bye
It was nice talking to you, goodbye!
```

One possible solution to the task is [here](solution/bot.py).
✅ Stop and consider:

1. Do you think the random responses would 'trick' someone into thinking that the bot actually understood them?
2. What features would the bot need to be more effective?
3. If a bot could really 'understand' the meaning of a sentence, would it need to 'remember' the meaning of previous sentences in a conversation too?
## 🚀Challenge
Choose one of the "stop and consider" elements above and either try to implement it in code or write a solution on paper using pseudocode.
In the next lesson, you'll learn about a number of other approaches to parsing natural language and machine learning.
### [Post-lecture quiz](link-to-quiz-app)
## Review & Self Study

Take a look at the references below as further reading opportunities.

### References
1. Schubert, Lenhart, "Computational Linguistics", *The Stanford Encyclopedia of Philosophy* (Spring 2020 Edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/spr2020/entries/computational-linguistics/>.
2. Princeton University. "About WordNet." [WordNet](https://wordnet.princeton.edu/). Princeton University. 2010.
**Assignment**: [Make a Bot talk back](assignment.md)
# Make a Bot talk back

## Instructions

In this lesson, you programmed a basic bot with whom to chat. This bot gives random answers until you say 'bye'. Can you make the answers a little less random, triggering specific answers when you say specific things, like 'why' or 'how'? Think a bit about how machine learning might make this type of work less manual as you extend your bot. You can use the NLTK or TextBlob libraries to make your tasks easier.
## Rubric

| Criteria | Exemplary                                     | Adequate                                         | Needs Improvement       |
| -------- | --------------------------------------------- | ------------------------------------------------ | ----------------------- |
|          | A new bot.py file is presented and documented | A new bot file is presented but it contains bugs | A file is not presented |
import random

# This list contains the random responses (you can add your own or translate them into your own language too)
random_responses = ["That is quite interesting, please tell me more.",
                    "I see. Do go on.",
                    "Why do you say that?",
                    "Funny weather we've been having, isn't it?",
                    "Let's change the subject.",
                    "Did you catch the game last night?"]

print("Hello, I am Marvin, the simple robot.")
print("You can end this conversation at any time by typing 'bye'")
print("After typing each answer, press 'enter'")
print("How are you today?")

while True:
    # wait for the user to enter some text
    user_input = input("> ")
    if user_input.lower() == "bye":
        # if they typed in 'bye' (or even BYE, ByE, byE etc.), break out of the loop
        break
    else:
        # pick one random response from the list
        response = random.choice(random_responses)
        print(response)

print("It was nice talking to you, goodbye!")
# Search for a bot

## Instructions

Bots are everywhere. Your assignment: find one and adopt it! You can find them on web sites, in banking applications, and on the phone, for example when you call financial services companies for advice or account information. Analyze the bot and see if you can confuse it. If you can confuse the bot, why do you think that happened? Write a short paper about your experience.

## Rubric

| Criteria | Exemplary                                                                                                    | Adequate                                     | Needs Improvement     |
| -------- | ------------------------------------------------------------------------------------------------------------ | -------------------------------------------- | --------------------- |
|          | A full page paper is written, explaining the presumed bot architecture and outlining your experience with it | A paper is incomplete or not well researched | No paper is submitted |
import random
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor
extractor = ConllExtractor()

def main():
    print("Hello, I am Marvin, the friendly robot.")
    print("You can end this conversation at any time by typing 'bye'")
    print("After typing each answer, press 'enter'")
    print("How are you today?")

    while True:
        # wait for the user to enter some text
        user_input = input("> ")

        if user_input.lower() == "bye":
            # if they typed in 'bye' (or even BYE, ByE, byE etc.), break out of the loop
            break
        else:
            # Create a TextBlob based on the user input. Then extract the noun phrases
            user_input_blob = TextBlob(user_input, np_extractor=extractor)
            np = user_input_blob.noun_phrases
            response = ""
            if user_input_blob.polarity <= -0.5:
                response = "Oh dear, that sounds bad. "
            elif user_input_blob.polarity <= 0:
                response = "Hmm, that's not great. "
            elif user_input_blob.polarity <= 0.5:
                response = "Well, that sounds positive. "
            elif user_input_blob.polarity <= 1:
                response = "Wow, that sounds great. "

            if len(np) != 0:
                # There was at least one noun phrase detected, so ask about that and pluralise it
                # e.g. cat -> cats or mouse -> mice
                response = response + "Can you tell me more about " + np[0].pluralize() + "?"
            else:
                response = response + "Can you tell me more?"
            print(response)

    print("It was nice talking to you, goodbye!")

# Start the program
main()
# Poetic license

## Instructions

In [this notebook](https://www.kaggle.com/jenlooper/emily-dickinson-word-frequency) you can find over 500 Emily Dickinson poems previously analyzed for sentiment using Azure text analytics. Analyze this dataset using the techniques described in the lesson. Does the suggested sentiment of a poem match the more sophisticated Azure service's decision? Why or why not, in your opinion? Does anything surprise you?

## Rubric

| Criteria | Exemplary                                                                  | Adequate                                                | Needs Improvement        |
| -------- | -------------------------------------------------------------------------- | ------------------------------------------------------- | ------------------------ |
|          | A notebook is presented with a solid analysis of an author's sample output | The notebook is incomplete or does not perform analysis | No notebook is presented |
from textblob import TextBlob

# You should download the book text, clean it, and import it here
with open("pride.txt", encoding="utf8") as f:
    file_contents = f.read()

book_pride = TextBlob(file_contents)
positive_sentiment_sentences = []
negative_sentiment_sentences = []

# Collect sentences with maximally positive (1) or maximally negative (-1) polarity
for sentence in book_pride.sentences:
    if sentence.sentiment.polarity == 1:
        positive_sentiment_sentences.append(sentence)
    if sentence.sentiment.polarity == -1:
        negative_sentiment_sentences.append(sentence)

print("The " + str(len(positive_sentiment_sentences)) + " most positive sentences:")
for sentence in positive_sentiment_sentences:
    print("+ " + str(sentence.replace("\n", "").replace("  ", " ")))

print("The " + str(len(negative_sentiment_sentences)) + " most negative sentences:")
for sentence in negative_sentiment_sentences:
    print("- " + str(sentence.replace("\n", "").replace("  ", " ")))
# [Lesson Topic]

Add a sketchnote if possible/appropriate

![Embed a video here if available](video-url)

## [Pre-lecture quiz](link-to-quiz-app)

Describe what we will learn

### Introduction

Describe what will be covered

> Notes

### Prerequisite

What steps should have been covered before this lesson?

### Preparation

Preparatory steps to start this lesson

---

[Step through content in blocks]

## [Topic 1]

### Task:

Work together to progressively enhance your codebase to build the project with shared code:

```html
code blocks
```

✅ Knowledge Check - use this moment to stretch students' knowledge with open questions

## [Topic 2]

## [Topic 3]

## 🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

Optional: add a screenshot of the completed lesson's UI if appropriate

## [Post-lecture quiz](link-to-quiz-app)

## Review & Self Study

**Assignment**: [Assignment Name](assignment.md)
@ -0,0 +1,9 @@
# [Assignment Name]

## Instructions

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
|          |           |          |                   |

@ -0,0 +1,55 @@
# [Lesson Topic]

Add a sketchnote if possible/appropriate

![Embed a video here if available](video-url)

## [Pre-lecture quiz](link-to-quiz-app)

Describe what we will learn

### Introduction

Describe what will be covered

> Notes

### Prerequisite

What steps should have been covered before this lesson?

### Preparation

Preparatory steps to start this lesson

---

[Step through content in blocks]

## [Topic 1]

### Task:

Work together to progressively enhance your codebase to build the project with shared code:

```html
code blocks
```

✅ Knowledge Check - use this moment to stretch students' knowledge with open questions

## [Topic 2]

## [Topic 3]

## 🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

Optional: add a screenshot of the completed lesson's UI if appropriate

## [Post-lecture quiz](link-to-quiz-app)

## Review & Self Study

**Assignment**: [Assignment Name](assignment.md)

@ -0,0 +1,9 @@
# [Assignment Name]

## Instructions

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
|          |           |          |                   |

@ -0,0 +1,20 @@
# Getting Started with Natural Language Processing

In this section of the curriculum, you will be introduced to one of the most widespread uses of machine learning: Natural Language Processing (NLP). Derived from Computational Linguistics, this category of Artificial Intelligence is the bridge between humans and machines via voice or textual communication.

In these lessons, we'll learn the basics of NLP by building small conversational bots to discover how Machine Learning helps make these conversations ever 'smarter'. You'll travel back in time, chatting with Elizabeth Bennet and Mr. Darcy from Jane Austen's classic novel **Pride and Prejudice**, published in 1813. Then, you'll further your knowledge by learning about sentiment analysis via hotel reviews in Europe.

![Pride and Prejudice book and tea](images/p&p.jpg)
> Photo by <a href="https://unsplash.com/@elaineh?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Elaine Howlin</a> on <a href="https://unsplash.com/s/photos/pride-and-prejudice?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

## Lessons

1. [Introduction to Natural Language Processing](1-Introduction-to-NLP/README.md)
2. [Common NLP Tasks and Techniques](2-Tasks/README.md)
3. [Translation and Sentiment Analysis with Machine Learning](3-Translation-Sentiment/README.md)
4. TBD
5. TBD

## Credits

These Natural Language Processing lessons were written with ☕ by [Stephen Howell](https://twitter.com/Howell_MSFT)

@ -0,0 +1,83 @@
# Machine Learning in the Real World

In this curriculum, you have learned many ways to prepare data for training and to create machine learning models. You built a series of classic Regression, Clustering, Classification, Natural Language Processing, and Time Series models. Congratulations! Now, you might be wondering what it's all for... what are the real-world applications of these models?

While much industry interest in AI has been captured by deep learning, there are still valuable applications for classical machine learning models, some of which you use today without being aware of it. In this lesson, you'll explore how ten different industries and subject-matter domains use these types of models to make their applications more performant, reliable, intelligent, and thus more valuable to users.

## [Pre-lecture quiz](link-to-quiz-app)

## Finance

One of the major consumers of classical machine learning models is the finance industry.

### Credit card fraud detection

We learned about [k-means clustering](Clustering/2-K-Means/README.md) earlier in the course, but how can it be used to solve problems related to credit card fraud?

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.680.1195&rep=rep1&type=pdf
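
One common pattern (a rough sketch, not the specific method from the paper above) is to treat transactions that sit far from every learned cluster center as candidate anomalies. Everything below is illustrative: the synthetic 'transactions', the two features, the cluster count, and the threshold rule are all assumptions for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-ins for transaction features (e.g. amount, time of day)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
fraudulent = rng.normal(loc=6.0, scale=1.0, size=(5, 2))

# Learn cluster centers from known-good transactions only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(normal)

# Score each transaction by distance to its nearest center; anything farther
# out than every training point gets flagged as a possible fraud
threshold = kmeans.transform(normal).min(axis=1).max()
test = np.vstack([normal[:5], fraudulent])
flags = (kmeans.transform(test).min(axis=1) > threshold).astype(int)
print(flags)  # the five synthetic fraudulent rows should be flagged
```

The threshold here is the naive maximum; a percentile cut-off would be more robust in practice.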

### Wealth management

## Education

### Predicting student behavior

### Preventing plagiarism

### Course recommendations

## Retail

### Personalizing the customer journey

### Inventory management

## Health Care

### Optimizing drug delivery

### Hospital re-entry management

### Disease management

## Ecology and Green Tech

### Forest management

### Motion sensing of animals

### Energy Management

This article discusses in detail how clustering and time series forecasting help predict future energy use in Ireland, based on smart metering: https://www-cdn.knime.com/sites/default/files/inline-images/knime_bigdata_energy_timeseries_whitepaper.pdf

## Insurance

### Actuarial tasks

## Consumer Electronics

### Motion sensing

## Software

### UI regression

### Document search

### Recommendation engines

## Arts, Culture, and Literature

### Fake news detection

### Classifying artifacts

## Marketing

### 'Ad words'

### Customer segmentation

✅ Knowledge Check - use this moment to stretch students' knowledge with open questions

## 🚀Challenge

Add a challenge for students to work on collaboratively in class to enhance the project

## [Post-lecture quiz](link-to-quiz-app)

## Review & Self Study

**Assignment**: [Assignment Name](assignment.md)

@ -0,0 +1 @@
# Assignment

@ -0,0 +1,11 @@
# Getting Started with

In this section of the curriculum, you will be introduced to ...

## Lessons

1. [Real-World Applications for ML](1-Applications/README.md)

## Credits

"Real-World Applications" was written with ♥️ by [Name](Twitter)

@ -0,0 +1,190 @@
# Get started with Python and Scikit-Learn for Regression models

![Logistic vs. Linear Regression Infographic](./images/logistic-linear.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)

## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/7/)

## Introduction

The lessons in this section cover types of Regression in the context of machine learning. Regression models can help determine the relationship between variables. This type of model can predict values such as length, temperature, or age, thus uncovering relationships between variables as it analyzes datapoints.

In this series of lessons, you'll discover the difference between Linear and Logistic Regression, and when you should use one or the other.

But before you do anything, make sure you have the right tools in place!

In this lesson, you will learn:

- How to configure your computer for local machine learning tasks
- How to work with Jupyter notebooks
- How to install and use Scikit-Learn
- How to build a Linear Regression model in a hands-on exercise

## Installations and Configurations

[![Using Python with Visual Studio Code](https://img.youtube.com/vi/7EXd4_ttIuw/0.jpg)](https://youtu.be/7EXd4_ttIuw "Using Python with Visual Studio Code")

> Click this image to watch a video on using Python within VS Code.

1. Ensure that [Python](https://www.python.org/downloads/) is installed on your computer. You will use Python for many data science and machine learning tasks. Most computer systems already include a Python installation. There are useful [Python Coding Packs](https://code.visualstudio.com/learn/educators/installers?WT.mc_id=academic-15963-cxa) available as well to ease the setup for some users. Some usages of Python, however, require one version of the software, whereas others require a different version. For this reason, it's useful to work within a virtual environment.

2. Make sure you have Visual Studio Code installed on your computer. Follow [these instructions](https://code.visualstudio.com/) for the basic installation. You are going to use Python in Visual Studio Code in this course, so you might want to brush up on how to [configure](https://docs.microsoft.com/learn/modules/python-install-vscode?WT.mc_id=academic-15963-cxa) VS Code for Python development.

   > Get comfortable with Python by working through this collection of [Learn modules](https://docs.microsoft.com/users/jenlooper-2911/collections/mp1pagggd5qrq7?WT.mc_id=academic-15963-cxa)

3. Install Scikit-Learn by following [these instructions](https://scikit-learn.org/stable/install.html). Since you need to ensure that you use Python 3, it's recommended that you use a virtual environment. Note that if you are installing this library on an M1 Mac, there are special instructions on the page linked above.

## Your ML Authoring Environment

You are going to use **notebooks** to develop your Python code and create machine learning models. This type of file is a common tool for data scientists, identifiable by its suffix or extension `.ipynb`.

Notebooks are an interactive environment that allows the developer to both write code and add notes and documentation around that code, which is quite helpful for experimental or research-oriented projects.

### Working with a Notebook

In this folder, you will find the file `notebook.ipynb`. If you open it in VS Code, and VS Code is properly configured, a Jupyter server will start with a Python 3+ kernel. You will find areas of the notebook that can be 'run' by pressing arrows next to code blocks, and other areas that contain text.

In your notebook, add a comment. To do this, click the 'md' icon and add a bit of markdown, like `# Welcome to your notebook`.

Next, add some Python code: type `print('hello notebook')` and click the arrow to run the code. You should see the printed statement, 'hello notebook'.

![VS Code with a notebook open](images/notebook.png)

You can interleave your code with comments to self-document the notebook.

✅ Think for a minute how different a web developer's working environment is from that of a data scientist.

## Up and Running with Scikit-Learn

Now that Python is set up in your local environment and you are comfortable with Jupyter notebooks, let's get equally comfortable with Scikit-Learn (pronounce it `sci` as in `science`). Scikit-Learn provides an [extensive API](https://scikit-learn.org/stable/modules/classes.html#api-ref) to help you perform ML tasks.

According to their [website](https://scikit-learn.org/stable/getting_started.html), "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities."

### Let's unpack some of this jargon:

> 🎓 A machine learning **model** is a mathematical model that generates predictions given data to which it has not been exposed. It builds these predictions by analyzing data and extrapolating patterns from it.

> 🎓 **[Supervised Learning](https://wikipedia.org/wiki/Supervised_learning)** works by mapping an input to an output based on example pairs. It uses **labeled** training data to build a function to make predictions. [Download a printable Zine about Supervised Learning](https://zines.jenlooper.com/zines/supervisedlearning.html). Regression, which is covered in this group of lessons, is a type of supervised learning.

> 🎓 **[Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning)** works with **unlabeled data**, looking for structure and patterns without example input-output pairs. [Download a printable Zine about Unsupervised Learning](https://zines.jenlooper.com/zines/unsupervisedlearning.html)

> 🎓 **[Model Fitting](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py)** in the context of machine learning refers to how well the model's underlying function generalizes to data it has not seen before. **Underfitting** and **overfitting** are common problems that degrade the quality of a model, as the model fits either not well enough or too well. An overfit model predicts training data too well because it has learned the data's details and noise, and so generalizes poorly; an underfit model is not accurate because it can neither accurately analyze its training data nor data it has not yet 'seen'.

![overfitting vs. correct model](images/overfitting.png)

> Infographic by [Jen Looper](https://twitter.com/jenlooper)

> 🎓 **Data Preprocessing** is the process whereby data scientists clean and convert data for use in the machine learning lifecycle.

> 🎓 **Model Selection and Evaluation** is the process whereby data scientists evaluate a model's performance, or any other relevant metric, by feeding it unseen data, and select the most appropriate model for the task at hand.

> 🎓 **Feature Variable** A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size', or 'color'.

> 🎓 **[Training and Testing](https://wikipedia.org/wiki/Training,_validation,_and_test_sets) datasets** Throughout this curriculum, you will divide a dataset into at least two parts: one larger group of data for 'training' and a smaller part for 'testing'. Sometimes you'll also find a 'validation' set. A training set is the group of examples you use to train a model. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model. A test dataset is another independent group of data, often drawn from the original data, that you use to confirm the performance of the built model.
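
As a sketch of that three-way split (the 50-sample toy data and the split ratios below are purely illustrative), two calls to Scikit-Learn's `train_test_split` produce train, validation, and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy samples with 2 features each
y = np.arange(50)                  # 50 toy targets

# First carve off a 20% test set, then split the rest 75/25 into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```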

> 🎓 **Feature Selection and Feature Extraction** How do you know which variables to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." [source](https://wikipedia.org/wiki/Feature_selection)
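
To make that distinction concrete, here is a minimal sketch using Scikit-Learn's built-in diabetes dataset, which appears later in this lesson; the choice of three features/components is arbitrary:

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(return_X_y=True)
print(X.shape)  # (442, 10)

# Feature selection: keep a subset of the original columns
X_selected = SelectKBest(f_regression, k=3).fit_transform(X, y)
print(X_selected.shape)  # (442, 3)

# Feature extraction: build new features as functions of the originals
X_extracted = PCA(n_components=3).fit_transform(X)
print(X_extracted.shape)  # (442, 3)
```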

In this course, you will use Scikit-Learn and other tools to build machine learning models to perform what we call 'traditional machine learning' tasks. We have deliberately avoided neural networks and deep learning, as they are better covered in our forthcoming 'AI for Beginners' curriculum.

Scikit-Learn makes it straightforward to build models and evaluate them for use. It is primarily focused on numeric data and contains several ready-made datasets for use as learning tools. It also includes pre-built models for students to try. Let's explore the process of loading prepackaged data and using a built-in estimator to create a first ML model with Scikit-Learn, using some basic data.

## Your First Scikit-Learn Notebook

> This tutorial was inspired by the [Linear Regression example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) on Scikit-Learn's web site.

In the `notebook.ipynb` file associated with this lesson, clear out all the cells by pressing the 'trash can' icon.

In this section, you will work with a small dataset about diabetes that is built into Scikit-Learn for learning purposes. Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic Regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.

> ✅ There are many types of Regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use Linear Regression, as you're seeking a **numeric value**. If you're interested in discovering whether a type of recipe should be considered vegan or not, you're looking for a **category assignment**, so you would use Logistic Regression. You'll learn more about Logistic Regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.

Let's get started on this task.

1. Import some libraries to help with your tasks. First, import `matplotlib`, a useful [graphing tool](https://matplotlib.org/). We will use it to create a line plot. Also import [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html), a useful library for handling numeric data in Python. Load up `datasets` and the `linear_model` from the Scikit-Learn library. Load `model_selection` for splitting data into training and test sets.

   ```python
   import matplotlib.pyplot as plt
   import numpy as np
   from sklearn import datasets, linear_model, model_selection
   ```

2. Print out a bit of the built-in [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). It includes 442 samples of data around diabetes, with 10 feature variables, some of which include:

   - age: age in years
   - bmi: body mass index
   - bp: average blood pressure
   - s1 tc: T-Cells (a type of white blood cell)

   ✅ This dataset includes the concept of 'sex' as a feature variable important to research around diabetes. Many medical datasets include this type of binary classification. Think a bit about how categorizations such as this might exclude certain parts of a population from treatments.

   Now, load up the X and y data.

   > 🎓 Remember, this is supervised learning, and we need a named 'y' target.

3. In a new cell, load the diabetes dataset as data and target (X and y, loaded as a tuple). X will be a data matrix, and y will be the regression target. Add some print commands to show the shape of the data matrix and its first element:

   > 🎓 A **tuple** is an [ordered list of elements](https://wikipedia.org/wiki/Tuple).

   ✅ Think a bit about the relationship between the data and the regression target. Linear regression predicts relationships between feature X and target variable y. Can you find the [target](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) for the diabetes dataset in the documentation? What is this dataset demonstrating, given that target?

   ```python
   X, y = datasets.load_diabetes(return_X_y=True)
   print(X.shape)
   print(X[0])
   ```

   You can see that this data has 442 items shaped in arrays of 10 elements:

   ```text
   (442, 10)
   [ 0.03807591  0.05068012  0.06169621  0.02187235 -0.0442235  -0.03482076
    -0.04340085 -0.00259226  0.01990842 -0.01764613]
   ```

4. Next, select a portion of this dataset to plot by arranging it into a new array using numpy's `newaxis` function. We are going to use Linear Regression to generate a line between values in this data, according to a pattern it determines.

   ```python
   X = X[:, np.newaxis, 2]
   ```

   ✅ At any time, print out the data to check its shape.

5. Now that you have data ready to be plotted, you can see if a machine can help determine a logical split between the numbers in this dataset. To do this, you need to split both the data (X) and the target (y) into test and training sets. Scikit-Learn has a straightforward way to do this; you can split your test data at a given point.

   ```python
   X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33)
   ```

6. Now you are ready to train your model! Load up the Linear Regression model and train it with your X and y training sets:

   ✅ `model.fit` is a command you'll see in many ML libraries, such as TensorFlow.

   ```python
   model = linear_model.LinearRegression()
   model.fit(X_train, y_train)
   ```

7. Then, create a prediction using test data. This will be used to draw the line between the data groups.

   ```python
   y_pred = model.predict(X_test)
   ```

8. Now it's time to show the data in a plot. Matplotlib is a very useful tool for this task. Create a scatterplot of all the X and y test data, and use the prediction to draw a line in the most appropriate place, between the model's data groupings.

   ```python
   plt.scatter(X_test, y_test, color='black')
   plt.plot(X_test, y_pred, color='blue', linewidth=3)
   plt.show()
   ```

   ![a scatterplot showing datapoints around diabetes](./images/scatterplot.png)

✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relation to the plot's y axis? Try to put into words the practical use of this model.

Congratulations, you just built your first Linear Regression model, created a prediction with it, and displayed it in a plot!
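
If you'd like to go one step further and quantify how well the line fits, this isn't part of the lesson's steps, but here is a sketch using Scikit-Learn's `metrics` module to score the predictions against the held-out test targets (the `random_state` value simply pins the split so the numbers are reproducible):

```python
import numpy as np
from sklearn import datasets, linear_model, metrics, model_selection

# Rebuild the lesson's model end-to-end
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(metrics.mean_squared_error(y_test, y_pred))  # average squared error, in target units²
print(metrics.r2_score(y_test, y_pred))            # 1.0 would be a perfect fit
```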

---
## 🚀Challenge

Plot a different variable from this dataset. Hint: edit this line: `X = X[:, np.newaxis, 2]`. Given this dataset's target, what are you able to discover about the progression of diabetes as a disease?

## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/8/)

## Review & Self Study

In this tutorial, you worked with simple linear regression, rather than multiple or multivariate linear regression. Read a little about the differences between these methods, or take a look at [this video](https://www.coursera.org/lecture/quantifying-relationships-regression-models/linear-vs-nonlinear-categorical-variables-ai2Ef).

Read more about the concept of Regression and think about what kinds of questions can be answered by this technique. Take this [tutorial](https://docs.microsoft.com/learn/modules/train-evaluate-regression-models?WT.mc_id=academic-15963-cxa) to deepen your understanding.

**Assignment**: [A different dataset](assignment.md)

@ -0,0 +1,13 @@
# Regression with Scikit-Learn

## Instructions

Take a look at the [Linnerud dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud) in Scikit-Learn. This dataset has multiple [targets](https://scikit-learn.org/stable/datasets/toy_dataset.html#linnerrud-dataset): 'It consists of three exercise (data) and three physiological (target) variables collected from twenty middle-aged men in a fitness club'.

In your own words, describe how to create a Regression model that would plot the relationship between the waistline and how many sit-ups are accomplished. Do the same for the other datapoints in this dataset.

## Rubric

| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| Submit a descriptive paragraph | Well-written paragraph is submitted | A few sentences are submitted | No description is supplied |

@ -0,0 +1,152 @@
# Build a Regression Model using Scikit-Learn: Prepare and Visualize Data

> ![Data Visualization Infographic](./images/data-visualization.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)

## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/9/)

### Introduction

Now that you are set up with the tools you need to start tackling machine learning model-building with Scikit-Learn, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potential of your dataset.

In this lesson, you will learn:

- How to prepare your data for model-building
- How to use Matplotlib for data visualization

### Asking the Right Question

The question you need answered will determine what type of ML algorithms you will leverage. For example, do you need to determine the differences between cars and trucks as they cruise down a highway via a video feed? You will need some kind of highly performant classification model to make that differentiation. It will need to be able to perform object detection, probably by showing bounding boxes around detected cars and trucks.

What if you are trying to correlate two points of data, like age and height? You can use a linear regression model, as shown in the previous lesson, to draw the classic straight line through a scatterplot of points to show how, with age, height tends to increase. Thus you can predict, for a given group of people, their height given their age.

But it's not very common to be gifted a dataset that is completely ready to use to create an ML model. In this lesson, you will learn how to prepare a raw dataset using standard Python libraries. You will also learn various techniques to visualize the data.

### Preparation

In this folder, you will find a .csv file in the root `data` folder called [US-pumpkins.csv](../data/US-pumpkins.csv), which includes 1757 lines of data about the pumpkin market, sorted into groupings by city. This is raw data extracted from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture.

This data is in the public domain. It can be downloaded in many separate files, per city, from the USDA web site. To avoid too many separate files, we have concatenated all the city data into one spreadsheet. Take a look at this file.

## The Pumpkin data

What do you notice about this data? First, you see that it is a mix of text and numeric data. There are also dates. Second, you see that there's a considerable amount of missing and mixed data. To build a good model, you will need to handle that.

What question can you ask of this data using a Regression technique? What about "Predict the price of a pumpkin for sale during a given month"? Looking again at the data, there are some changes you need to make to create the data structure necessary for the task.

### Analyze the Pumpkin Data

Let's use [Pandas](https://pandas.pydata.org/) (the name stands for `Python Data Analysis`), a tool very useful for shaping data, to analyze and prepare this pumpkin data. First, check for missing dates, and then convert the dates to a month format (these are US dates, so the format is currently `MM/DD/YYYY`). Extract the month to a new column.

Open the `notebook.ipynb` file in VS Code and import the spreadsheet into a new Pandas dataframe. Use the `head()` function to view the first five rows.

```python
import pandas as pd
pumpkins = pd.read_csv('../../data/US-pumpkins.csv')
pumpkins.head()
```

✅ What function would you use to view the last five rows?

Check if there is missing data in the current dataframe:

```python
pumpkins.isnull().sum()
```

There is missing data, but maybe it won't matter for the task at hand.

To make your dataframe easier to work with, drop several of its columns, keeping only the ones you need:

```python
new_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
```

Second, think about how to determine the average price of a pumpkin in a given month. What columns would you pick for this task? Hint: you'll need 3 columns.

Solution: take the average of the `Low Price` and `High Price` columns to populate a new `Price` column, and convert the `Date` column to show only the month. Fortunately, according to the check above, there is no missing data for dates or prices.

```python
price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2

month = pd.DatetimeIndex(pumpkins['Date']).month
```

✅ Feel free to print any data you'd like to check: `print(month)`, for example.

Now, copy your converted data into a fresh Pandas dataframe:

```python
new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'], 'High Price': pumpkins['High Price'], 'Price': price})
```
|
||||
Printing out your dataframe will show you a clean, tidy dataset on which you can build your new regression model.
|
||||
|
||||
But wait! There's something odd here. If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in '1 1/9 bushel' measures, and some in '1/2 bushel' measures, some per pumpkin, some per pound, and some in big boxes with varying widths.
|
||||
|
||||
Digging into the original data, it's interesting that anything with `Unit of Sale` equal to 'EACH' or 'PER BIN' also has a `Package` type of per inch, per bin, or 'each'. Pumpkins seem to be very hard to weigh consistently, so let's standardize by selecting only pumpkins with the string 'bushel' in their `Package` column. Add a filter at the top of the file, under the initial .csv import:
|
||||
|
||||
```python
pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]
```
|
||||
If you print the data now, you can see that you are only getting the 415 or so rows of data containing pumpkins sold by the bushel. But wait! There's one more thing to do. Did you notice that the bushel amount varies per row? You need to normalize the pricing to show the price per bushel, so do some math to standardize it. Add these lines after the block creating the `new_pumpkins` dataframe:
|
||||
|
||||
```python
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9)

new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2)
```
|
||||
|
||||
✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!
|
||||
|
||||
Now, you can analyze the pricing per unit based on their bushel measurement. If you print out the data one more time, you can see how it's standardized.
|
||||
|
||||
✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.
|
||||
## Visualization Strategies
|
||||
|
||||
Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover. Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.
|
||||
|
||||
One data visualization library that works well in Jupyter notebooks is [Matplotlib](https://matplotlib.org/) (which you also saw in the previous lesson).
|
||||
|
||||
> Get more experience with data visualization in [these tutorials](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-15963-cxa).
|
||||
## Experiment with Matplotlib
|
||||
|
||||
Try to create some basic plots to display the new dataframe you just created. What would a basic line plot show?
|
||||
|
||||
Import Matplotlib at the top of the file, under the Pandas import:
|
||||
|
||||
```python
import matplotlib.pyplot as plt
```
|
||||
|
||||
Rerun the entire notebook to refresh. Then, at the bottom of the notebook, add a cell to plot the data as a scatterplot:
|
||||
|
||||
```python
price = new_pumpkins.Price
month = new_pumpkins.Month
plt.scatter(price, month)
plt.show()
```
|
||||
![A scatterplot showing price to month relationship](./images/scatterplot.png)
|
||||
|
||||
Is this a useful plot? Does anything about it surprise you?
|
||||
|
||||
It's not particularly useful, as all it does is display your data as a spread of points in a given month. To get charts to display useful data, you usually need to group the data somehow. Let's try creating a plot where the x axis shows the months and the y axis demonstrates the distribution of the data.
|
||||
|
||||
Add a cell to create a grouped bar chart:
|
||||
|
||||
```python
new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Pumpkin Price")
```
|
||||
|
||||
![A bar chart showing price to month relationship](./images/barchart.png)
|
||||
|
||||
This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
|
||||
|
||||
---
|
||||
## 🚀Challenge
|
||||
|
||||
Explore the different types of visualization that matplotlib offers. Which types are most appropriate for regression problems?
|
||||
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/10/)
|
||||
|
||||
## Review & Self Study
|
||||
|
||||
Take a look at the many ways to visualize data. Make a list of the various libraries available and note which are best for given types of tasks, for example 2D visualizations vs. 3D visualizations. What do you discover?
|
||||
|
||||
**Assignment**: [Exploring visualization](assignment.md)
|
|
||||
# Exploring Visualizations
|
||||
|
||||
There are several different libraries that are available for data visualization. Create some visualizations using the Pumpkin data in this lesson with matplotlib and seaborn in a sample notebook. Which libraries are easier to work with?
|
||||
## Rubric
|
||||
|
||||
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A notebook is submitted with two explorations/visualizations | A notebook is submitted with one exploration/visualization | A notebook is not submitted |
|
|
||||
{
    "metadata": {
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8.3-final"
        },
        "orig_nbformat": 2,
        "kernelspec": {
            "name": "python3",
            "display_name": "Python 3",
            "language": "python"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2,
    "cells": [
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": []
        }
    ]
}
|
|
||||
# Build a Regression Model using Scikit-Learn: Regression Two Ways
|
||||
|
||||
![Linear vs Polynomial Regression Infographic](./images/linear-polynomial.png)
|
||||
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
|
||||
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/11/)
|
||||
### Introduction
|
||||
|
||||
So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this unit. You have also visualized it using Matplotlib. Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: basic linear regression and polynomial regression, along with some of the math underlying these techniques.
|
||||
|
||||
> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, callouts, diagrams, and other learning tools to aid in comprehension.
|
||||
### Prerequisite
|
||||
|
||||
You should be familiar by now with the structure of the pumpkin data that we are examining. You can find it preloaded and pre-cleaned in this lesson's notebook.ipynb files, with the pumpkin price displayed per bushel in a new dataframe. Make sure you can run these notebooks in kernels in VS Code.
|
||||
### Preparation
|
||||
|
||||
As a reminder, you are loading this data so as to ask questions of it. When is the best time to buy pumpkins? What price can I expect of a case of miniature pumpkins? Should I buy them in half-bushel baskets or by the 1 1/9 bushel box? Let's keep digging into this data.
|
||||
|
||||
In the previous lesson, you created a Pandas dataframe and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 datapoints and only for the fall months.
|
||||
|
||||
Take a look at the data that we preloaded in this lesson's accompanying notebook. The data is preloaded and an initial scatterplot is charted to show month data. Maybe we can get a little more detail about the nature of the data by cleaning it more.
|
||||
## A Linear Regression Line
|
||||
|
||||
As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a line to show the relationship between variables and make accurate predictions on where a new datapoint would fall in relationship to that line.
|
||||
|
||||
> **🧮 Show me the math**
|
||||
>
|
||||
> This line has an equation: `Y = a + bX`. It is typical of **Least-Squares Regression** to draw this type of line.
|
||||
>
|
||||
> `X` is the 'explanatory variable'. `Y` is the 'dependent variable'. The slope of the line is `b` and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.
|
||||
>
|
||||
> In other words, and referring to our pumpkin data's original question, "predict the price of a pumpkin per bushel by month": `X` would refer to the month of sale and `Y` would refer to the price. The math that calculates the line must demonstrate the slope of the line, which is also dependent on the intercept, or where `Y` is situated when `X = 0`.
|
||||
>
|
||||
> You can observe the method of calculation for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) web site.
|
||||
>
|
||||
> A common method of regression is **Least-Squares Regression**, which means that the distances from all the datapoints to the regression line are squared and then added up. Ideally, that final sum is as small as possible, because we want a low number of errors, or `least-squares`. We do this because we want to model a line that has the least cumulative distance from all of our datapoints. We also square the terms before adding them because we are concerned with magnitude rather than direction.
|
||||
>
|
||||
> One more term to understand is the **Correlation Coefficient** between given X and Y variables. For a scatterplot, you can quickly visualize this coefficient. A plot with datapoints scattered in a neat line has high correlation, but a plot with datapoints scattered everywhere between X and Y has low correlation.
|
||||
>
|
||||
> A good linear regression model will be one that has a high (nearer to 1 than 0) Correlation Coefficient using the Least-Squares Regression method with a line of regression.
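The least-squares formulas above can also be computed directly with NumPy. This is a minimal sketch on toy month/price values (made-up data, not the real pumpkin dataset):

```python
import numpy as np

# Toy data: X is month of sale, y is price per bushel (made-up values)
X = np.array([9, 9, 10, 10, 11, 11])
y = np.array([15.0, 16.0, 13.5, 14.0, 12.0, 12.5])

# Least-squares estimates: b = cov(X, y) / var(X), a = mean(y) - b * mean(X)
b = ((X - X.mean()) * (y - y.mean())).sum() / ((X - X.mean()) ** 2).sum()
a = y.mean() - b * X.mean()
print(f"Y = {a:.2f} + {b:.2f}X")
```

The same values come out of `np.polyfit(X, y, 1)`, which solves the identical least-squares problem.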
|
||||
|
||||
✅ Run the notebook accompanying this lesson and look at the City to Price scatterplot. Does the data associating City to Price for pumpkin sales seem to have high or low correlation, according to your visual interpretation of the scatterplot?
|
||||
## Create a Linear Regression Model correlating Pumpkin Datapoints
|
||||
|
||||
Now that you have an understanding of the math behind this exercise, create a Regression model to see if you can predict which package of pumpkins will have the best pumpkin prices. Someone buying pumpkins for a holiday pumpkin patch might want this information to be able to optimize their purchases of pumpkin packages for the patch.
|
||||
|
||||
Since you'll use Scikit-Learn, there's no reason to do this by hand (although you could!). In the main data-processing block of your lesson notebook, add a library from Scikit-Learn to automatically convert all string data to numbers:
|
||||
|
||||
```python
from sklearn.preprocessing import LabelEncoder

new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:, 0:-1].apply(LabelEncoder().fit_transform)
```
|
||||
|
||||
If you look at the new_pumpkins dataframe now, you see that all the strings are now numeric. This makes it harder for you to read but much more intelligible for Scikit-Learn!
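To see concretely what `LabelEncoder` does, here is a small standalone sketch with hypothetical package strings:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Package strings
packages = ['1 1/9 bushel cartons', '1/2 bushel cartons', '1 1/9 bushel cartons']

encoder = LabelEncoder()
codes = encoder.fit_transform(packages)   # each distinct string gets an integer code
print(codes)
print(list(encoder.classes_))             # the mapping, in sorted order
```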
|
||||
|
||||
Now you can make more educated decisions (not just based on eyeballing a scatterplot) about the data that is best suited to regression.
|
||||
|
||||
Try to find a good correlation between two columns of your data to potentially build a good predictive model. As it turns out, there's only a weak correlation between City and Price:
|
||||
|
||||
```python
print(new_pumpkins['City'].corr(new_pumpkins['Price']))
# 0.32363971816089226
```
|
||||
|
||||
However, there's a slightly better correlation between Package and Price. That makes sense, right? Normally, the bigger the produce box, the higher the price.
|
||||
|
||||
```python
print(new_pumpkins['Package'].corr(new_pumpkins['Price']))
# 0.6061712937226021
```
|
||||
|
||||
A good question to ask of this data is: 'What price can I expect of a given pumpkin package?'
|
||||
|
||||
Let's build this regression model.
|
||||
## Building A Linear Model
|
||||
|
||||
Before building your model, do one more tidy-up of your data. Drop any null data and check once more what the data looks like.
|
||||
|
||||
```python
new_pumpkins.dropna(inplace=True)
new_pumpkins.info()
```
|
||||
|
||||
Then, create a new dataframe from this minimal set and print it out:
|
||||
|
||||
```python
new_columns = ['Package', 'Price']
lin_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')

lin_pumpkins
```
|
||||
|
||||
Now you can assign your X and y coordinate data:
|
||||
|
||||
```python
X = lin_pumpkins.values[:, :1]
y = lin_pumpkins.values[:, 1:2]
```
|
||||
> What's going on here? You're using [Python slice notation](https://stackoverflow.com/questions/509211/understanding-slice-notation/509295#509295) to create arrays to populate `X` and `y`.
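As a tiny illustration of what that slicing produces (a toy array, not the pumpkin data):

```python
import numpy as np

a = np.array([[1, 10], [2, 20], [3, 30]])

print(a[:, :1])   # first column, kept 2-D: shape (3, 1)
print(a[:, 1:2])  # second column, also kept 2-D
print(a[:, 1])    # second column, flattened to 1-D: shape (3,)
```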
|
||||
|
||||
Next, start the regression model-building routines:
|
||||
|
||||
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

pred = lin_reg.predict(X_test)

accuracy_score = lin_reg.score(X_train, y_train)
print('Model Accuracy: ', accuracy_score)
```
|
||||
|
||||
Because the correlation isn't particularly strong, the model produced isn't terribly accurate. (Note that the `score` method of a linear model returns R², the coefficient of determination, on the data you pass it.)
|
||||
|
||||
```
Model Accuracy:  0.3315342327998987
```
|
||||
|
||||
You can visualize the line that's drawn in the process:
|
||||
|
||||
```python
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, pred, color='blue', linewidth=3)

plt.xlabel('Package')
plt.ylabel('Price')

plt.show()
```
|
||||
![A scatterplot showing package to price relationship](./images/linear.png)
|
||||
|
||||
And you can test the model against a hypothetical variety:
|
||||
|
||||
```python
import numpy as np

lin_reg.predict(np.array([[2.75]]))
```
|
||||
The returned price for this hypothetical package is:
|
||||
|
||||
```
array([[33.15655975]])
```
|
||||
|
||||
That number makes sense, if the logic of the regression line holds true.
|
||||
|
||||
Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!
|
||||
## Polynomial Regression
|
||||
|
||||
Another type of Linear Regression is Polynomial Regression. While sometimes there's a linear relationship between variables - the bigger the pumpkin in volume, the higher the price - sometimes these relationships can't be plotted as a plane or straight line.
|
||||
|
||||
✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use Polynomial Regression
|
||||
|
||||
Take another look at the relationship between Variety and Price in the previous plot. Does this scatterplot seem like it should necessarily be analyzed by a straight line? Perhaps not. In this case, you can try Polynomial Regression.
|
||||
|
||||
✅ Polynomials are mathematical expressions that might consist of one or more variables and coefficients
|
||||
|
||||
Polynomial regression creates a curved line to better fit nonlinear data. Let's recreate a dataframe populated with a segment of the original pumpkin data:
|
||||
|
||||
```python
new_columns = ['Variety', 'Package', 'City', 'Month', 'Price']
poly_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')

poly_pumpkins
```
|
||||
|
||||
A good way to visualize the correlations between data in dataframes is to display it in a 'coolwarm' chart:
|
||||
|
||||
```python
corr = poly_pumpkins.corr()
corr.style.background_gradient(cmap='coolwarm')
```
|
||||
|
||||
![A heatmap showing data correlation](./images/heatmap.png)
|
||||
|
||||
Looking at this chart, you can visualize the good correlation between Package and Price. So you should be able to create a somewhat better model than the last one.
|
||||
|
||||
Build out the X and y columns:
|
||||
|
||||
```python
X = poly_pumpkins.iloc[:, 3:4].values
y = poly_pumpkins.iloc[:, 4:5].values
```
|
||||
|
||||
Scikit-Learn includes a helpful API for building polynomial regression models - the `make_pipeline` [API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=pipeline#sklearn.pipeline.make_pipeline). A 'pipeline' is a chain of estimators. In this case, the pipeline includes polynomial features, that is, powers of the input variable, which allow the model to fit a nonlinear path.
|
||||
|
||||
```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(PolynomialFeatures(4), LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline.fit(np.array(X_train), y_train)

y_pred = pipeline.predict(X_test)
```
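To see what `PolynomialFeatures(4)` feeds to the linear model inside the pipeline, here's a small standalone sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two sample inputs with a single feature each
x = np.array([[2.0], [3.0]])

poly = PolynomialFeatures(4)
expanded = poly.fit_transform(x)
print(expanded)   # each row becomes [1, x, x^2, x^3, x^4]
```

The linear model then fits one coefficient per generated column, which is why the resulting curve can bend.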
|
||||
|
||||
At this point, you need to create a new dataframe with sorted data so that the pipeline can create a sequence:
|
||||
|
||||
```python
df = pd.DataFrame({'x': X_test[:, 0], 'y': y_pred[:, 0]})
df.sort_values(by='x', inplace=True)
points = pd.DataFrame(df).to_numpy()

plt.plot(points[:, 0], points[:, 1], color="blue", linewidth=3)
plt.xlabel('Package')
plt.ylabel('Price')
plt.scatter(X, y, color="black")
plt.show()
```
|
||||
|
||||
![A polynomial plot showing package to price relationship](./images/polynomial.png)
|
||||
|
||||
You can see a curved line that fits your data better. Let's check the model's accuracy:
|
||||
|
||||
```python
accuracy_score = pipeline.score(X_train, y_train)
print('Model Accuracy: ', accuracy_score)
```
|
||||
And voila!
|
||||
```
Model Accuracy:  0.8537946517073784
```
|
||||
That's better! Try to predict a price:
|
||||
|
||||
```python
pipeline.predict(np.array([[2.75]]))
```
|
||||
You are given this prediction:
|
||||
|
||||
```
array([[46.34509342]])
```
|
||||
It does make sense! And, if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!
|
||||
|
||||
🏆 Well done! You created two Regression models in one lesson. In the final section on Regression, you will learn about Logistic Regression to determine categories.
|
||||
|
||||
---
|
||||
## 🚀Challenge
|
||||
|
||||
Test several different variables in this notebook to see how correlation corresponds to model accuracy.
|
||||
|
||||
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/12/)
|
||||
|
||||
## Review & Self Study
|
||||
|
||||
In this lesson we learned about Linear Regression. There are other important types of Regression. Read about Stepwise, Ridge, Lasso, and Elasticnet techniques. A good course to study to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).
|
||||
|
||||
**Assignment**: [Build a Model](assignment.md)
|
|
||||
# Create a Regression Model
|
||||
|
||||
## Instructions
|
||||
|
||||
In this lesson you were shown how to build a model using both Linear and Polynomial Regression. Using this knowledge, find a dataset or use one of Scikit-Learn's built-in sets to build a fresh model. Explain in your notebook why you chose the technique you did, and demonstrate your model's accuracy. If it is not accurate, explain why.
|
||||
|
||||
## Rubric
|
||||
|
||||
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------ | -------------------------- | ------------------------------- |
| | presents a complete notebook with a well-documented solution | the solution is incomplete | the solution is flawed or buggy |
|
|
||||
# Logistic Regression to Predict Categories
|
||||
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/13/)
|
||||
|
||||
### Introduction
|
||||
|
||||
In this final lesson on Regression, one of the basic 'classic' ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?
|
||||
|
||||
In this lesson, you will learn:
|
||||
- A new library for data visualization
|
||||
- Techniques for Logistic Regression
|
||||
|
||||
Deepen your understanding of working with this type of Regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
|
||||
## Prerequisite
|
||||
|
||||
Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one small categorical column we can work with: Color. Let's build a Logistic Regression model to predict, given some variables, what color a given pumpkin is likely to be (orange 🎃 or white 👻).
|
||||
|
||||
For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.
|
||||
|
||||
> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!
|
||||
|
||||
## About Logistic Regression
|
||||
|
||||
Logistic Regression differs from Linear Regression, which you learned about previously, in a few important ways.
|
||||
### Binary Classification
|
||||
|
||||
Logistic Regression does not offer the same features as Linear Regression. The former offers a prediction about a binary category ("orange or not orange") whereas the latter is capable of predicting continuous values, for example, given the origin of a pumpkin and the time of harvest, how much its price will rise.
|
||||
|
||||
![Pumpkin Classification Model](./images/pumpkin-classifier.png)
|
||||
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
|
||||
### Other Classifications
|
||||
|
||||
There are other types of Logistic Regression, including Multinomial and Ordinal. Multinomial involves more than two categories - "Orange, White, and Striped". Ordinal involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins, which are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).
|
||||
|
||||
![Multinomial vs Ordinal Regression](./images/multinomial-ordinal.png)
|
||||
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
|
||||
|
||||
### It's Still Linear
|
||||
|
||||
Even though this type of Regression is all about category predictions, it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
|
||||
|
||||
### Variables DO NOT have to correlate
|
||||
|
||||
Remember how Linear Regression worked better with more correlated variables? Logistic Regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.
|
||||
### You Need a Lot of Clean Data
|
||||
|
||||
Logistic Regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.
|
||||
|
||||
✅ Think about the types of data that would lend themselves well to Logistic Regression
|
||||
|
||||
## Tidy the Data
|
||||
|
||||
First, clean the data a bit, dropping null values and selecting only some of the columns:
|
||||
|
||||
```python
from sklearn.preprocessing import LabelEncoder

new_columns = ['Color', 'Origin', 'Item Size', 'Variety', 'City Name', 'Package']

new_pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)

new_pumpkins.dropna(inplace=True)

new_pumpkins = new_pumpkins.apply(LabelEncoder().fit_transform)
```
|
||||
|
||||
You can always take a peek at your new dataframe:
|
||||
|
||||
```python
new_pumpkins.info()
```
|
||||
### Visualization
|
||||
|
||||
By now you have loaded up the [starter notebook](./notebook.ipynb) with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including Color. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on the Matplotlib we used earlier. Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for each point in a side-by-side grid.
|
||||
|
||||
```python
import seaborn as sns

g = sns.PairGrid(new_pumpkins)
g.map(sns.scatterplot)
```
|
||||
|
||||
![A grid of visualized data](images/grid.png)
|
||||
|
||||
By observing data side-by-side, you can see how the Color data relates to the other columns.
|
||||
|
||||
✅ Given this scatterplot grid, what are some interesting explorations you can envision?
|
||||
|
||||
Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables. You can visualize variables side-by-side with Seaborn plots. Try a 'swarm' plot to show the distribution of values:
|
||||
|
||||
```python
sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
```
|
||||
|
||||
![A swarm of visualized data](images/swarm.png)
|
||||
|
||||
A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'.
|
||||
|
||||
```python
sns.catplot(x="Color", y="Item Size",
            kind="violin", data=new_pumpkins)
```
|
||||
![a violin type chart](images/violin.png)
|
||||
|
||||
✅ Try creating this plot, and other Seaborn plots, using other variables.
|
||||
|
||||
Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore Logistic Regression to determine a given pumpkin's likely color.
|
||||
|
||||
> infographic here (an image of logistic regression's sigmoid flow, like this: https://wikipedia.org/wiki/Logistic_regression#/media/File:Exam_pass_logistic_curve.jpeg)
|
||||
|
||||
> **🧮 Show Me The Math**
|
||||
>
|
||||
> Remember how Linear Regression often used ordinary least squares to arrive at a value? Logistic Regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
|
||||
>
|
||||
> ![logistic function](images/sigmoid.png)
|
||||
>
|
||||
> where the sigmoid's midpoint is at x = 0, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class '1' of the binary choice. If not, it will be classified as '0'.
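A minimal sketch of that logistic function in code, with `L`, `k`, and the midpoint `x0` as explicit parameters:

```python
import numpy as np

def sigmoid(x, L=1.0, k=1.0, x0=0.0):
    """Logistic function: maps any real x into the range (0, L)."""
    return L / (1 + np.exp(-k * (x - x0)))

print(sigmoid(0.0))                          # the midpoint: 0.5
print(sigmoid(np.array([-6.0, 0.0, 6.0])))   # far below x0 approaches 0, far above approaches L
```

A classifier then thresholds the output, e.g. `int(sigmoid(x) > 0.5)` assigns class '1' above 0.5 and '0' otherwise.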
|
||||
|
||||
## Build your model
|
||||
|
||||
Building a model to make this binary classification is surprisingly straightforward in Scikit-Learn.
|
||||
|
||||
Select the variables you want to use in your classification model and split the training and test sets:
|
||||
|
||||
```python
from sklearn.model_selection import train_test_split

Selected_features = ['Origin', 'Item Size', 'Variety', 'City Name', 'Package']

X = new_pumpkins[Selected_features]
y = new_pumpkins['Color']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
|
||||
|
||||
Now you can train your model and print out its result:
|
||||
|
||||
```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))
```
|
||||
|
||||
Take a look at your model's scoreboard. It's not too bad, considering you have only about 1000 rows of data:
|
||||
|
||||
```
              precision    recall  f1-score   support

           0       0.85      0.95      0.90       166
           1       0.38      0.15      0.22        33

    accuracy                           0.82       199
   macro avg       0.62      0.55      0.56       199
weighted avg       0.77      0.82      0.78       199

Predicted labels:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 1 0 0 1 0 0 0 1 0]
```

## Better comprehension via a confusion matrix

While you can get a scoreboard report of [terms](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report) by printing out the items above, you may find it easier to understand how your model is performing by using a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix).

> 🎓 A '[confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)' (or 'error matrix') is a table that expresses your model's true vs. false positives and negatives, thus gauging the accuracy of predictions.

```python
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
```

Take a look at your model's confusion matrix:

```
array([[162,   4],
       [ 33,   0]])
```

What's going on here? Let's say our model is asked to classify items between two binary categories, category 'pumpkin' and category 'not-a-pumpkin'.

- If your model predicts something as a pumpkin and it belongs to category 'pumpkin' in reality, we call it a true positive, shown by the top left number.
- If your model predicts something as not a pumpkin and it belongs to category 'pumpkin' in reality, we call it a false negative, shown by the top right number.
- If your model predicts something as a pumpkin and it belongs to category 'not-a-pumpkin' in reality, we call it a false positive, shown by the bottom left number.
- If your model predicts something as not a pumpkin and it belongs to category 'not-a-pumpkin' in reality, we call it a true negative, shown by the bottom right number.

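Note that which cell counts as a 'true positive' depends on which class you treat as positive. Scikit-Learn itself lays the matrix out with the true labels down the rows and the sorted class labels (0 first) across the columns, treating class 1 as positive, so for a binary 0/1 target the four cells can be unpacked directly. A small standalone example:

```python
from sklearn.metrics import confusion_matrix

# Toy 0/1 labels: scikit-learn sorts the classes, so rows/columns are ordered [0, 1].
y_true = [0, 1, 0, 1]
y_pred = [0, 1, 1, 1]

# For a binary target, ravel() unpacks the cells in this fixed order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 1 1 0 2
```
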

![Confusion Matrix](images/confusion-matrix.png)

> Infographic by [Jen Looper](https://twitter.com/jenlooper)

As you might have guessed, it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.

✅ Q: According to the confusion matrix, how did the model do? A: Not too bad; there are a good number of true positives, but also a fair number of misclassifications.

Let's revisit the terms we saw earlier with the help of the confusion matrix's mapping of TP/TN and FP/FN:

🎓 Precision: TP/(TP + FP) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled)

🎓 Recall: TP/(TP + FN) The fraction of relevant instances that were retrieved, whether well-labeled or not

🎓 f1-score: (2 * precision * recall)/(precision + recall) The harmonic mean of precision and recall, with best being 1 and worst being 0

🎓 Support: The number of occurrences of each label retrieved

🎓 Accuracy: (TP + TN)/(TP + TN + FP + FN) The proportion of labels predicted accurately for a sample.

🎓 Macro Avg: The calculation of the unweighted mean metrics for each label, not taking label imbalance into account.

🎓 Weighted Avg: The calculation of the mean metrics for each label, taking label imbalance into account by weighting them by their support (the number of true instances for each label).

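These definitions are easy to verify by hand. The sketch below (standalone toy data, not the pumpkin set) computes precision, recall, and f1 from the raw TP/FP/FN counts and prints them alongside Scikit-Learn's own functions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 2
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

precision = tp / (tp + fp)                            # 2/3
recall = tp / (tp + fn)                               # 2/3
f1 = (2 * precision * recall) / (precision + recall)  # 2/3

# Scikit-Learn's helpers agree with the hand computation.
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```
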
✅ Can you think of which metric you should watch if you want your model to reduce the number of false negatives?

## Visualize the ROC Curve of this Model
This is not a bad model; its accuracy is in the 80% range, so ideally you could use it to predict the color of a pumpkin given a set of variables.

Let's do one more visualization to see the so-called 'ROC' score:

```python
import seaborn as sns
from sklearn.metrics import roc_curve, roc_auc_score

y_scores = model.predict_proba(X_test)
# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
sns.lineplot(x=[0, 1], y=[0, 1])
sns.lineplot(x=fpr, y=tpr)
```

Using Seaborn again, plot the model's [Receiver Operating Characteristic](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html?highlight=roc) or ROC. ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. "ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis." Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly:

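To see concretely what `roc_curve` returns, here is a tiny standalone example with toy scores (not the pumpkin model): as the threshold sweeps from high to low, the false positive rate starts at 0 and the true positive rate ends at 1:

```python
from sklearn.metrics import roc_curve

# Toy ground truth and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr)  # false positive rate at each threshold, starting at 0.0
print(tpr)  # true positive rate at each threshold, ending at 1.0
```
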
![ROC](./images/ROC.png)

Finally, use Scikit-Learn's [`roc_auc_score` API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html?highlight=roc_auc#sklearn.metrics.roc_auc_score) to compute the actual 'Area Under the Curve' (AUC):
```python
auc = roc_auc_score(y_test, y_scores[:,1])
print(auc)
```

The result is `0.6976998904709748`. Given that the AUC ranges from 0 to 1, you want a high score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is _pretty good_.

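A useful way to read that number: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. This standalone sketch (toy data, not the pumpkin model) checks that equivalence by brute force over every positive/negative pair:

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Compare every (positive, negative) score pair; ties count as half a win.
pos = [s for t, s in zip(y_true, y_scores) if t == 1]
neg = [s for t, s in zip(y_true, y_scores) if t == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc)                     # 0.75
print(roc_auc_score(y_true, y_scores))  # 0.75
```
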
In future lessons on classification, you will learn how to iterate to improve your model's scores. But for now, congratulations! You've completed these regression lessons!

---
## 🚀Challenge
There's a lot more to unpack regarding Logistic Regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try [Kaggle](https://kaggle.com) for interesting datasets.

## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/14/)
## Review & Self Study
Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for Logistic Regression. Think about tasks that are better suited for one or the other of the regression types we have studied up to this point. What would work best?

**Assignment**: [Retrying this Regression](assignment.md)