## Introduction
[K-Means Clustering](https://wikipedia.org/wiki/K-means_clustering) is a method that originated in signal processing. It partitions a set of observations into 'k' clusters, assigning each observation to the cluster with the nearest 'mean', that is, the center point of the cluster.
The clusters can be visualized as [Voronoi diagrams](https://wikipedia.org/wiki/Voronoi_diagram), in which each region consists of a point (or 'seed') and the area closest to it.
![voronoi diagram](images/voronoi.png)
@ -25,17 +28,21 @@ Terms you will learn about:
K-Means clustering [executes in three steps](https://scikit-learn.org/stable/modules/clustering.html#k-means):
1. The algorithm selects k center points by sampling from the dataset. After this, it loops:
    1. It assigns each sample to the nearest centroid.
    2. It creates new centroids by taking the mean value of all of the samples assigned to the previous centroids.
    3. Then, it calculates the difference between the new and old centroids and repeats until the centroids have stabilized.

One drawback of K-Means is that you must choose 'k', the number of centroids, up front. Fortunately, the 'elbow method' helps to estimate a good starting value for 'k'. You'll try it in a minute.
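
The loop described above can be sketched in plain Python. This is a minimal illustration, not Scikit-learn's implementation: the function name `kmeans`, its parameters (`max_iter`, `tol`, `seed`), and the sample points are all assumptions made for this example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Hypothetical helper: returns (centroids, labels) for a list of points."""
    rng = random.Random(seed)
    # Select k starting centroids by sampling from the dataset.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign each sample to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Create new centroids from the mean of the assigned samples.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Measure how far the centroids moved; stop once they are stable.
        shift = max(sq_dist(n, o) for n, o in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Illustrative data: two small groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are sampled at random, a different `seed` can lead to different cluster numbering, which is one reason the lesson later sets `random_state`.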
## Prerequisite
You will work in this lesson's _notebook.ipynb_ file that includes the data import and preliminary cleaning you did in the last lesson.
## Exercise - preparation
Start by taking another look at the songs data.
1. Create a boxplot, calling `boxplot()` for each column:
```python
plt.figure(figsize=(20,20), dpi=200)
@ -76,9 +83,14 @@ sns.boxplot(x = 'length', data = df)
plt.subplot(4,3,12)
sns.boxplot(x = 'release_date', data = df)
```
This data is a little noisy: by observing each column as a boxplot, you can see outliers.
![outliers](images/boxplots.png)
You could go through the dataset and remove these outliers, but that would make the data pretty minimal.
1. For now, choose which columns you will use for your clustering exercise. Pick ones with similar ranges and encode the `artist_top_genre` column as numeric data:
Look for a silhouette score closer to 1. This score varies from -1 to 1, and if the score is 1, the cluster is dense and well-separated from other clusters. A value near 0 represents overlapping clusters with samples very close to the decision boundary of the neighboring clusters.[source](https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam).
Our score is **.53**, so roughly midway between 0 and 1. This indicates that our data is not particularly well-suited to this type of clustering, but let's continue.
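
To make the score less abstract, here is a plain-Python sketch of how a silhouette score can be computed. It is an illustration only, not Scikit-learn's code: the function name and the sample points are assumptions, it uses Euclidean distance, and it expects at least two clusters.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (illustrative sketch)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Illustrative data: two tight pairs of points that are far apart.
sample = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(sample, [0, 0, 1, 1])  # neighbors share a cluster
bad = silhouette_score(sample, [0, 1, 0, 1])   # neighbors are split up
print(round(good, 2), round(bad, 2))
```

The sensible grouping scores close to 1, while the grouping that splits neighbors apart scores below 0.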
### Exercise - build a model
1. Import `KMeans` and start the clustering process.
```python
from sklearn.cluster import KMeans
@ -137,6 +152,9 @@ for i in range(1, 11):
    wcss.append(kmeans.inertia_)
```
There are a few parts here that warrant explaining.
> 🎓 range: These are the iterations of the clustering process. `range(1, 11)` runs the loop 10 times, with `i` going from 1 to 10.
> 🎓 random_state: "Determines random number generation for centroid initialization."[source](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)
@ -146,9 +164,12 @@ for i in range(1, 11):
> 🎓 Inertia: K-Means algorithms attempt to choose centroids to minimize 'inertia', "a measure of how internally coherent clusters are."[source](https://scikit-learn.org/stable/modules/clustering.html). The value is appended to the wcss variable on each iteration.
> 🎓 k-means++: In [Scikit-learn](https://scikit-learn.org/stable/modules/clustering.html#k-means) you can use the 'k-means++' optimization, which "initializes the centroids to be (generally) distant from each other, leading to probably better results than random initialization."
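
As a rough illustration of inertia, the sketch below sums each sample's squared distance to its closest centroid (the within-cluster sum of squares that `wcss` collects). The function name, points, and centroids are made up for this example and are not taken from the lesson's notebook.

```python
def inertia(points, centroids):
    """Sum of squared distances from each sample to its closest centroid."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, centroid)) for centroid in centroids)
        for p in points
    )

# Illustrative data: two pairs of points, 10 units apart.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = inertia(pts, [(5.0, 1.0)])                # a single central centroid
two_clusters = inertia(pts, [(0.0, 1.0), (10.0, 1.0)])  # one centroid per pair
print(one_cluster, two_clusters)  # 104.0 4.0
```

Inertia drops sharply once each group has its own centroid, which is the pattern the elbow method looks for.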
### Elbow method
Previously, you surmised that, because you have targeted 3 song genres, you should choose 3 clusters. But is that the case?
1. Use the 'elbow method' to make sure.
```python
plt.figure(figsize=(10,5))
@ -162,9 +183,10 @@ plt.show()
Use the `wcss` variable that you built in the previous step to create a chart showing where the 'bend' in the elbow is, which indicates the optimum number of clusters. Maybe it **is** 3!
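
Reading the bend off a chart is a judgment call. One simple heuristic, shown here only as a sketch, is to pick the k where the decrease in WCSS slows down the most (the largest second difference). The helper `elbow_k` and the numbers in `example_wcss` are hypothetical; the list is assumed to start at k = 1, matching `range(1, 11)`.

```python
def elbow_k(wcss):
    """Return the k (starting at 1) where the WCSS curve bends most sharply."""
    # Second difference: how much the drop in WCSS slows down at each k.
    bends = [wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
             for i in range(1, len(wcss) - 1)]
    return bends.index(max(bends)) + 2  # bends[0] corresponds to k = 2

# Hypothetical WCSS values for k = 1..6, not results from the songs data.
example_wcss = [1000, 600, 200, 170, 150, 140]
best_k = elbow_k(example_wcss)
print(best_k)  # 3
```

Treat the result as a starting point and compare it with the chart, since noisy curves can have more than one plausible bend.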
![elbow method](images/elbow.png)
## Exercise - display the clusters
1. Try the process again, this time setting three clusters, and display the clusters as a scatterplot:
```python
from sklearn.cluster import KMeans
@ -177,7 +199,7 @@ plt.ylabel('danceability')
plt.show()
```
1. Check the model's accuracy:
```python
labels = kmeans.labels_
@ -188,6 +210,7 @@ print("Result: %d out of %d samples were correctly labeled." % (correct_labels,
This model's accuracy is not very good, and the shape of the clusters gives you a hint why.
![clusters](images/clusters.png)
@ -198,6 +221,7 @@ In Scikit-learn's documentation, you can see that a model like this one, with cl
![problem models](images/problems.png)
> Infographic from Scikit-learn
## Variance
Variance is defined as "the average of the squared differences from the Mean." [source](https://www.mathsisfun.com/data/standard-deviation.html) In the context of this clustering problem, it means that the numbers in our dataset tend to diverge a bit too much from the mean.
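
The definition can be worked through by hand on a small list of numbers. The values below are illustrative and are not taken from the songs data.

```python
from statistics import pvariance

# Illustrative values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)                      # 40 / 8 = 5.0
squared_diffs = [(v - mean) ** 2 for v in values]     # 9, 1, 1, 1, 0, 0, 4, 16
variance = sum(squared_diffs) / len(values)           # 32 / 8 = 4.0

print(mean, variance)  # 5.0 4.0

# The standard library's population variance gives the same result.
assert variance == pvariance(values)
```

A column whose values sit far from its mean produces large squared differences, and therefore a large variance.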
@ -207,6 +231,7 @@ Variance is defined as "the average of the squared differences from the Mean."[s
> Try this '[variance calculator](https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php)' to understand the concept a bit more.
---
## 🚀Challenge
Spend some time with this notebook, tweaking parameters. Can you improve the accuracy of the model by cleaning the data more (removing outliers, for example)? You can use weights to give more weight to given data samples. What else can you do to create better clusters?
@ -214,6 +239,7 @@ Spend some time with this notebook, tweaking parameters. Can you improve the acc
Hint: Try to scale your data. There's commented code in the notebook that adds standard scaling to make the data columns resemble each other more closely in terms of range. You'll find that while the silhouette score goes down, the 'kink' in the elbow graph smooths out. This is because leaving the data unscaled allows data with more variance to carry more weight. Read a bit more on this problem [here](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226).
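
To see what standard scaling does, here is a plain-Python sketch that rescales each column to mean 0 and standard deviation 1. The helper `standardize` and the two sample columns are assumptions for illustration (the notebook's own scaling code is not shown here), and the sketch uses the population standard deviation.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

# Hypothetical columns on very different scales.
length = [180000.0, 200000.0, 220000.0, 240000.0]
danceability = [0.5, 0.6, 0.7, 0.8]

scaled_length = standardize(length)
scaled_danceability = standardize(danceability)

print([round(x, 2) for x in scaled_length])
print([round(x, 2) for x in scaled_danceability])
```

Before scaling, a distance between two songs would be decided almost entirely by `length`. After scaling, both columns span the same range and contribute comparably.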
Take a look at Stanford's K-Means Simulator [here](https://stanford.edu/class/engr108/visualizations/kmeans/kmeans.html). You can use this tool to visualize sample data points and determine their centroids. With fresh data, click 'update' to see how long it takes to find convergence. You can edit the data's randomness, number of clusters, and number of centroids. Does this help you get an idea of how the data can be grouped?