pull/34/head
Jen Looper 3 years ago
parent 00f46c260e
commit 6c159a3f78

@ -4,11 +4,16 @@
> While you're studying Machine Learning with Clustering, enjoy some Nigerian Dance Hall tracks - this is a highly rated song from 2014 by PSquare.
## [Pre-lecture quiz](link-to-quiz-app)
### Introduction
Clustering is a type of unsupervised learning that presumes that a dataset is unlabeled. It uses various algorithms to sort through unlabeled data and provide groupings according to patterns it discerns in the data. Clustering is very useful for data exploration. Let's see if it can help discover trends and patterns in the way Nigerian audiences consume music.
✅ Take a minute to think about the uses of clustering. In real life, clustering happens whenever you have a pile of laundry and need to sort out your family members' clothes 🧦👕👖🩲. In data science, clustering happens when trying to analyze a user's preferences, or determine the characteristics of any unlabeled dataset. Clustering, in a way, helps make sense of chaos.
### Introduction
In real life, clustering can be used to determine things like market segmentation, for example working out which age groups buy which items. Another use would be anomaly detection, perhaps to detect fraud in a dataset of credit card transactions. Or you might use clustering to identify tumors in a batch of medical scans. Alternatively, you could use it to group search results - by shopping links, images, or reviews, for example. Clustering is useful when you have a large dataset that you want to reduce and on which you want to perform more granular analysis, so the technique can be used to learn about data before other models are constructed.
> ✅ Once your data is organized in clusters, you assign each data point a cluster ID. This technique can be useful when preserving a dataset's privacy: you can refer to a data point by its cluster ID rather than by more revealing identifiable data. Can you think of other reasons why you'd refer to a cluster ID rather than other elements of the cluster to identify it?
## Getting started with clustering
[Scikit-Learn offers a large array](https://scikit-learn.org/stable/modules/clustering.html) of methods to perform clustering. The type you choose will depend on your use case. According to the documentation, each method has various benefits. Here is a simplified table of the methods supported by Scikit-Learn and their appropriate use cases:
@ -19,35 +24,134 @@ Clustering is a type of unsupervised learning that presumes that a dataset is un
| Mean-shift | many, uneven clusters, inductive |
| Spectral clustering | few, even clusters, transductive |
| Ward hierarchical clustering | many, constrained clusters, transductive |
| Agglomerative clustering | many, constrained, non Euclidan distances, transductive |
| Agglomerative clustering | many, constrained, non Euclidean distances, transductive |
| DBSCAN | non-flat geometry, uneven clusters, transductive |
| OPTICS | non-flat geometry, uneven clusters with variable density, transductive |
| Gaussian mixtures | flat geometry, inductive |
| BIRCH | large dataset with outliers, inductive |
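Whichever method you pick, the Scikit-Learn pattern is similar. As a minimal sketch (toy points and a hand-picked number of clusters, purely for illustration, not the lesson's dataset), K-Means assigns each point a cluster id:
```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2D points that fall into two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Ask for 2 clusters and get a cluster id for each point
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
print(labels)                    # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)   # the two centroids
```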
> 🎓 Let's unpack some vocabulary:
> 🎓 How we create clusters has a lot to do with how we gather up the data points into groups. Let's unpack some vocabulary:
>
> - 'transductive' vs. 'inductive'
> - 'non-flat' vs. 'flat' geometry
> - 'distances'
> - 'constrained'
> - 'density'
> 🎓 ['Transductive' vs. 'inductive'](https://wikipedia.org/wiki/Transduction_(machine_learning))
> Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules which are only then applied to test cases.
>
> 🎓 ['Non-flat' vs. 'flat' geometry](https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering)
> Derived from mathematical terminology, non-flat vs. flat geometry refers to the measure of distances between points by either 'flat' ([Euclidean](https://wikipedia.org/wiki/Euclidean_geometry)) or 'non-flat' (non-Euclidean) geometrical methods.
>
> 🎓 ['Distances'](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)
> Clusters are defined by their distance matrix, i.e. the distances between points. This distance can be measured in a few ways. Euclidean clusters are defined by the average of the point values, and contain a 'centroid' or center point. Distances are thus measured by the distance to that centroid. Non-Euclidean distances refer to 'clustroids', the point closest to other points. Clustroids in turn can be defined in various ways (see the short sketch after this vocabulary list).
>
> 🎓 ['Constrained'](https://wikipedia.org/wiki/Constrained_clustering)
> Constrained Clustering introduces 'semi-supervised' learning into this unsupervised method. The relationships between points are flagged as 'cannot-link' or 'must-link', so some rules are forced on the dataset.
>
> 🎓 'Density'
> Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, or 'crowded', and thus this data needs to be analyzed with the appropriate clustering method. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) demonstrates the difference between using K-Means clustering vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster density.
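To make the centroid vs. clustroid distinction above concrete, here is a tiny sketch with made-up points (plain NumPy, not part of the lesson's notebook): the centroid is the coordinate-wise mean, while the clustroid is the actual data point with the smallest total distance to all the others:
```python
import numpy as np

# Five toy 2D points
points = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [4, 4]])

# Centroid: the mean position (may not be an actual data point)
centroid = points.mean(axis=0)

# Clustroid: the existing point with the smallest summed distance to all others
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
clustroid = points[pairwise.sum(axis=1).argmin()]

print(centroid)   # [1.2 1.2]
print(clustroid)  # [1 1]
```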
### Preparation
Open the notebook.ipynb file in this folder and append the song data
Clustering is heavily dependent on visualization, so let's get started.
Open the notebook.ipynb file in this folder and append the song data .csv file. Load up a dataframe with some data about the songs. Get ready to explore this data by importing the libraries and dumping out the data:
```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the song data and peek at the first few rows
df = pd.read_csv("../data/nigerian-songs.csv")
df.head()
```
Check the first few lines of data:
Get some information about the dataframe:
```python
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              530 non-null    object 
 1   album             530 non-null    object 
 2   artist            530 non-null    object 
 3   artist_top_genre  530 non-null    object 
 4   release_date      530 non-null    int64  
 5   length            530 non-null    int64  
 6   popularity        530 non-null    int64  
 7   danceability      530 non-null    float64
 8   acousticness      530 non-null    float64
 9   energy            530 non-null    float64
 10  instrumentalness  530 non-null    float64
 11  liveness          530 non-null    float64
 12  loudness          530 non-null    float64
 13  speechiness       530 non-null    float64
 14  tempo             530 non-null    float64
 15  time_signature    530 non-null    int64  
dtypes: float64(8), int64(4), object(4)
memory usage: 66.4+ KB
```
It's useful that this data is mostly numeric, so it's almost ready for clustering.
Check for null values:
```python
df.isnull().sum()
```
Looking good:
```
name                0
album               0
artist              0
artist_top_genre    0
release_date        0
length              0
popularity          0
danceability        0
acousticness        0
energy              0
instrumentalness    0
liveness            0
loudness            0
speechiness         0
tempo               0
time_signature      0
dtype: int64
```
Describe the data:
```python
df.describe()
```
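`describe()` shows the range and central tendency of each numeric column. As an optional sketch (same `df`; the column choice here is just an example), a boxplot of a single feature makes its spread and any outliers easy to spot:
```python
# Boxplot of one numeric feature to eyeball its spread and outliers
plt.figure(figsize=(10, 4))
sns.boxplot(x=df['popularity'])
plt.title('Popularity', color='blue')
```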
## Visualize the data
Now, find out the most popular genre using a barplot:
```python
# Count songs per genre and plot the five most common genres
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10, 7))
sns.barplot(x=top[:5].index, y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres', color='blue')
```
![most popular](images/popular.png)
✅ If you'd like to see more top values, change this `[:5]` to a bigger value, or remove it to see all. It's interesting that one of the top genres is called 'Missing'!
Explore the data further by filtering out the 'Missing' entries and re-checking the most popular genres, as sketched below.
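Here is a minimal sketch of that step, assuming the same `df` and the 'Missing' label seen above (the lesson's notebook may handle this differently):
```python
# Drop rows whose genre is the placeholder 'Missing', then re-plot the top genres
df = df[df['artist_top_genre'] != 'Missing']
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10, 7))
sns.barplot(x=top[:5].index, y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres', color='blue')
```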
---
## 🚀Challenge
Add a challenge for students to work on collaboratively in class to enhance the project
Optional: add a screenshot of the completed lesson's UI if appropriate
## [Post-lecture quiz](link-to-quiz-app)
## Review & Self Study
Take a look at Stanford's K-Means Simulator [here](https://stanford.edu/class/engr108/visualizations/kmeans/kmeans.html). You can use this tool to visualize sample data points and determine their centroids. With fresh data, click 'update' to see how long it takes the algorithm to converge. You can edit the data's randomness, the number of clusters, and the number of centroids. Does this help you get an idea of how the data can be grouped?
**Assignment**: [Assignment Name](assignment.md)

@ -58,9 +58,9 @@ According to their [website](https://scikit-learn.org/stable/getting_started.htm
> 🎓 A machine learning **model** is a mathematical model that generates predictions given data to which it has not been exposed. It builds these predictions by analyzing data and extrapolating patterns.
> 🎓 **[Supervised Learning](https://en.wikipedia.org/wiki/Supervised_learning)** works by mapping an input to an output based on example pairs. It uses **labeled** training data to build a function to make predictions. [Download a printable Zine about Supervised Learning](https://zines.jenlooper.com/zines/supervisedlearning.html). Regression, which is covered in this group of lessons, is a type of supervised learning.
> 🎓 **[Supervised Learning](https://wikipedia.org/wiki/Supervised_learning)** works by mapping an input to an output based on example pairs. It uses **labeled** training data to build a function to make predictions. [Download a printable Zine about Supervised Learning](https://zines.jenlooper.com/zines/supervisedlearning.html). Regression, which is covered in this group of lessons, is a type of supervised learning.
> 🎓 **[Unsupervised Learning](https://en.wikipedia.org/wiki/Unsupervised_learning)** works similarly but it maps pairs using **unlabeled data**. [Download a printable Zine about Unsupervised Learning](https://zines.jenlooper.com/zines/unsupervisedlearning.html)
> 🎓 **[Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning)** works similarly but it maps pairs using **unlabeled data**. [Download a printable Zine about Unsupervised Learning](https://zines.jenlooper.com/zines/unsupervisedlearning.html)
> 🎓 **[Model Fitting](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py)** in the context of machine learning refers to the accuracy of the model's underlying function as it attempts to analyze data with which it is not familiar. **Underfitting** and **overfitting** are common problems that degrade the quality of the model as the model fits either not well enough or too well. This causes the model to make predictions either too closely aligned or too loosely aligned with its training data. An overfit model predicts training data too well because it has learned the data's details and noise too well. An underfit model is not accurate as it can neither accurately analyze its training data nor data it has not yet 'seen'.
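As a rough illustration of those two failure modes (made-up data and degrees, loosely following the Scikit-Learn example linked above, not this lesson's notebook), fitting polynomials of very different complexity to the same noisy samples shows an underfit model scoring poorly everywhere and an overfit model scoring well on training data but worse on test data:
```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy samples of a smooth underlying curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.cos(1.5 * np.pi * X).ravel() + rng.normal(0, 0.1, 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```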
@ -72,9 +72,9 @@ TODO: Infographic to show underfitting/overfitting like this https://miro.medium
> 🎓 **Feature Variable** A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size', or 'color'.
> 🎓 **[Training and Testing](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) datasets** Throughout this curriculum, you will divide up a dataset into at least two parts, one large group of data for 'training' and a smaller part for 'testing'. Sometimes you'll also find a 'validation' set. A training set is the group of examples you use to train a model. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model. A test dataset is another independent group of data, often gathered from the original data, that you use to confirm the performance of the built model.
> 🎓 **[Training and Testing](https://wikipedia.org/wiki/Training,_validation,_and_test_sets) datasets** Throughout this curriculum, you will divide up a dataset into at least two parts, one large group of data for 'training' and a smaller part for 'testing'. Sometimes you'll also find a 'validation' set. A training set is the group of examples you use to train a model. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model. A test dataset is another independent group of data, often gathered from the original data, that you use to confirm the performance of the built model.
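A minimal sketch of such a split with Scikit-Learn (toy arrays standing in for your real features and targets; the 80/20 proportion is just an example):
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 examples, 2 features each, plus one target per example
X, y = np.arange(20).reshape(10, 2), np.arange(10)

# Hold out 20% of the rows as a test set; a validation set could be
# carved out of the training portion in the same way.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```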
> 🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." [source](https://en.wikipedia.org/wiki/Feature_selection)
> 🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." [source](https://wikipedia.org/wiki/Feature_selection)
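To make that distinction concrete, here is a small sketch (the iris dataset and the choice of two features/components are just for illustration): feature selection keeps a subset of the original columns, while feature extraction builds brand-new ones:
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 original columns most related to y
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: derive 2 new columns (principal components) from all 4
X_extracted = PCA(n_components=2).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (150, 4) (150, 2) (150, 2)
```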
In this course, you will use Scikit-Learn and other tools to build machine learning models to perform what we call 'traditional machine learning' tasks. We have deliberately avoided neural networks and deep learning, as they are better covered in our forthcoming 'AI for Beginners' curriculum.
@ -114,7 +114,7 @@ Now, load up the X and y data.
3. In a new cell, load the diabetes dataset as data and target (X and y, loaded as a tuple). X will be a data matrix, and y will be the regression target. Add some print commands to show the shape of the data matrix and its first element:
> 🎓 A **tuple** is an [ordered list of elements](https://en.wikipedia.org/wiki/Tuple).
> 🎓 A **tuple** is an [ordered list of elements](https://wikipedia.org/wiki/Tuple).
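A sketch of that step might look like the following (variable names are illustrative; the lesson's notebook may differ slightly):
```python
from sklearn import datasets

# X is the data matrix, y is the regression target, returned together as a tuple
X, y = datasets.load_diabetes(return_X_y=True)

print(X.shape)  # shape of the data matrix
print(X[0])     # first element of the data matrix
```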
✅ Think a bit about the relationship between the data and the regression target. Linear regression predicts relationships between feature X and target variable y. Can you find the [target](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) for the diabetes dataset in the documentation? What is this dataset demonstrating, given that target?

@ -105,11 +105,11 @@ sns.catplot(x="Color", y="Item Size",
Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore Logistic Regression to determine a given pumpkin's likely color.
> infographic here (an image of logistic regression's sigmoid flow, like this: https://en.wikipedia.org/wiki/Logistic_regression#/media/File:Exam_pass_logistic_curve.jpeg)
> infographic here (an image of logistic regression's sigmoid flow, like this: https://wikipedia.org/wiki/Logistic_regression#/media/File:Exam_pass_logistic_curve.jpeg)
> **🧮 Show Me The Math**
>
> Remember how Linear Regression often used ordinary least squares to arrive at a value? Logistic Regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://en.wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like thus:
> Remember how Linear Regression often used ordinary least squares to arrive at a value? Logistic Regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
>
> ![logistic function](images/sigmoid.png)
>
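In code, a sigmoid that squashes any value into the 0 to 1 range can be sketched like this (plain NumPy, illustrative inputs):
```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Large negative inputs approach 0, zero maps to 0.5, large positive inputs approach 1
print(sigmoid(np.array([-6, -1, 0, 1, 6])))
```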

@ -21,7 +21,7 @@ Before starting, however, it's useful to understand what's going on behind the s
When encountering the term 'time series' you need to understand its use in several different contexts.
### Time Series
In mathematics, "a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time." An example of a time series is the daily closing value of the [Dow Jones Industrial Average](https://en.wikipedia.org/wiki/Time_series). The use of time series plots and statistical modeling is frequently encountered in signal processing, weather forecasting, earthquake prediction, and other fields where events occur and data points can be plotted over time.
In mathematics, "a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time." An example of a time series is the daily closing value of the [Dow Jones Industrial Average](https://wikipedia.org/wiki/Time_series). The use of time series plots and statistical modeling is frequently encountered in signal processing, weather forecasting, earthquake prediction, and other fields where events occur and data points can be plotted over time.
### Time Series Analysis
Time Series Analysis is the analysis of the above-mentioned time series data. Time series data can take distinct forms, including 'interrupted time series', which detects patterns in a time series' evolution before and after an interrupting event. The type of analysis needed for the time series depends on the nature of the data. Time series data itself can take the form of a series of numbers or characters.

@ -5,22 +5,22 @@
> A brief introduction to ARIMA models. The example is done in R, but the concepts are universal.
## [Pre-lecture quiz](link-to-quiz-app)
In the previous lesson, you learned a bit about Time Series Forecasting and loaded a dataset showing the fluctuations of electrical load over a time period. In this lesson, you will discover a specific way to build models with [ARIMA: *A*uto*R*egressive *I*ntegrated *M*oving *A*verage](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average). ARIMA models are particularly suited to fit data that shows [non-stationarity](https://en.wikipedia.org/wiki/Stationary_process).
In the previous lesson, you learned a bit about Time Series Forecasting and loaded a dataset showing the fluctuations of electrical load over a time period. In this lesson, you will discover a specific way to build models with [ARIMA: *A*uto*R*egressive *I*ntegrated *M*oving *A*verage](https://wikipedia.org/wiki/Autoregressive_integrated_moving_average). ARIMA models are particularly suited to fit data that shows [non-stationarity](https://wikipedia.org/wiki/Stationary_process).
> 🎓 Stationarity, from a statistical context, refers to data whose distribution does not change when shifted in time. Non-stationary data, then, shows fluctuations due to trends that must be transformed to be analyzed. Seasonality, for example, can introduce fluctuations in data and can be eliminated by a process of 'seasonal-differencing'.
> 🎓 [Differencing](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average#Differencing) data, again from a statistical context, refers to the process of transforming non-stationary data to make it stationary by removing its non-constant trend. "Differencing removes the changes in the level of a time series, eliminating trend and seasonality and consequently stabilizing the mean of the time series."[Paper by Shixiong et al](https://arxiv.org/abs/1904.07632)
> 🎓 [Differencing](https://wikipedia.org/wiki/Autoregressive_integrated_moving_average#Differencing) data, again from a statistical context, refers to the process of transforming non-stationary data to make it stationary by removing its non-constant trend. "Differencing removes the changes in the level of a time series, eliminating trend and seasonality and consequently stabilizing the mean of the time series." [Paper by Shixiong et al](https://arxiv.org/abs/1904.07632)
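A minimal sketch of differencing with pandas (a toy series with a steady upward trend; one round of `.diff()` removes that trend):
```python
import pandas as pd

# A non-stationary series: the level rises steadily over time
s = pd.Series([10, 12, 14, 16, 18, 20])

# First-order differencing: each value minus the previous one
print(s.diff())  # NaN, then a constant 2.0 - the trend is gone
```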
Let's unpack the parts of ARIMA to better understand how it helps us model time series and make predictions against them.
## AR - for AutoRegressive
Autoregressive models, as the name implies, look 'back' in time to analyze previous values in your data and make assumptions about them. These previous values are called 'lags'. An example would be data that shows monthly sales of pencils. Each month's sales total would be considered an 'evolving variable' in the dataset. This model is built as the "evolving variable of interest is regressed on its own lagged (i.e., prior) values." [wikipedia](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average)
Autoregressive models, as the name implies, look 'back' in time to analyze previous values in your data and make assumptions about them. These previous values are called 'lags'. An example would be data that shows monthly sales of pencils. Each month's sales total would be considered an 'evolving variable' in the dataset. This model is built as the "evolving variable of interest is regressed on its own lagged (i.e., prior) values." [wikipedia](https://wikipedia.org/wiki/Autoregressive_integrated_moving_average)
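To see what a 'lag' looks like in code, here is a small sketch (made-up monthly pencil sales) using pandas `shift` to line each month up with its prior values:
```python
import pandas as pd

sales = pd.Series([120, 135, 128, 150, 160], name="pencil_sales")

lags = pd.DataFrame({
    "sales": sales,
    "lag_1": sales.shift(1),  # value from the previous month
    "lag_2": sales.shift(2),  # value from two months back
})
print(lags)
```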
## I - for Integrated
As opposed to the similar 'ARMA' models, the 'I' in ARIMA refers to its *[integrated](https://en.wikipedia.org/wiki/Order_of_integration)* aspect. The data is 'integrated' when differencing steps are applied so as to eliminate non-stationarity.
As opposed to the similar 'ARMA' models, the 'I' in ARIMA refers to its *[integrated](https://wikipedia.org/wiki/Order_of_integration)* aspect. The data is 'integrated' when differencing steps are applied so as to eliminate non-stationarity.
## MA - for Moving Average
The [moving-average](https://en.wikipedia.org/wiki/Moving-average_model) aspect of this model refers to the output variable that is determined by observing the current and past values of lags.
The [moving-average](https://wikipedia.org/wiki/Moving-average_model) aspect of this model refers to the output variable that is determined by observing the current and past values of lags.
Bottom line: ARIMA is used to make a model fit the special form of time series data as closely as possible.
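Putting the three parts together, a bare-bones sketch with statsmodels might look like this (the toy series and the `order=(1, 1, 1)` values are purely illustrative; the lesson's notebook may use a different setup):
```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy monthly series standing in for the energy-load data used in this lesson
series = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
    index=pd.date_range("2020-01-01", periods=10, freq="MS"),
)

# order=(p, d, q): 1 autoregressive lag, 1 round of differencing, 1 moving-average term
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=3))  # forecast the next three points
```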
### Preparation
@ -290,7 +290,7 @@ Check the accuracy of your model by testing its mean absolute percentage error (
>
> ![MAPE](images/mape.png)
>
> [MAPE](https://www.linkedin.com/pulse/what-mape-mad-msd-time-series-allameh-statistics/) is used to show prediction accuracy as a ratio defined by the above formula. The difference between actual<sub>t</sub> and predicted<sub>t</sub> is divided by the actual<sub>t</sub>. "The absolute value in this calculation is summed for every forecasted point in time and divided by the number of fitted points n." [wikipedia](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error)
> [MAPE](https://www.linkedin.com/pulse/what-mape-mad-msd-time-series-allameh-statistics/) is used to show prediction accuracy as a ratio defined by the above formula. The difference between actual<sub>t</sub> and predicted<sub>t</sub> is divided by the actual<sub>t</sub>. "The absolute value in this calculation is summed for every forecasted point in time and divided by the number of fitted points n." [wikipedia](https://wikipedia.org/wiki/Mean_absolute_percentage_error)
If this equation is expressed in code:
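A hedged sketch of that calculation with NumPy (array names and values are illustrative):
```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, expressed as a percentage."""
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

print(mape([10, 12, 14], [11, 12, 13]))  # roughly 5.7
```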
