K-Means Clustering
Pre-lecture quiz
In this lesson, you'll learn how to create clusters using Scikit-learn and the Nigerian music dataset you imported earlier. We'll cover the basics of K-Means for clustering. Remember, as you learned in the previous lesson, there are many ways to work with clusters, and the method you choose depends on your data. We'll try K-Means since it's the most common clustering technique. Let's dive in!
Key terms you'll learn:
- Silhouette scoring
- Elbow method
- Inertia
- Variance
Introduction
K-Means clustering is a technique that originated in the field of signal processing. It divides data into 'k' clusters based on a series of observations: each data point is assigned to the cluster whose 'mean', or center point, is nearest to it.
The clusters can be visualized as Voronoi diagrams, which consist of a point (or 'seed') and its corresponding region.
Infographic by Jen Looper
The K-Means clustering process follows a three-step procedure:
- The algorithm selects k center points by sampling from the dataset. Then it loops:
  - Assigns each sample to the nearest centroid.
  - Creates new centroids by calculating the mean value of all samples assigned to the previous centroids.
  - Calculates the difference between the new and old centroids, repeating until the centroids stabilize.
One limitation of K-Means is that you need to define 'k,' the number of centroids. Luckily, the 'elbow method' can help estimate a good starting value for 'k.' You'll try it shortly.
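The looping procedure above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not Scikit-learn's (which adds smarter initialization and other optimizations):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Toy K-Means following the steps above: sample k centroids,
    then alternate assignment and centroid updates until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: sample k initial centroids from the dataset
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned samples
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids stabilize
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Running this on two well-separated groups of points recovers the groups regardless of which samples seed the centroids.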
Prerequisite
You'll work in this lesson's notebook.ipynb file, which includes the data import and preliminary cleaning you completed in the previous lesson.
Exercise - Preparation
Start by revisiting the songs dataset.
- Create a boxplot by calling `boxplot()` for each column:

  ```python
  plt.figure(figsize=(20,20), dpi=200)

  plt.subplot(4,3,1)
  sns.boxplot(x = 'popularity', data = df)
  plt.subplot(4,3,2)
  sns.boxplot(x = 'acousticness', data = df)
  plt.subplot(4,3,3)
  sns.boxplot(x = 'energy', data = df)
  plt.subplot(4,3,4)
  sns.boxplot(x = 'instrumentalness', data = df)
  plt.subplot(4,3,5)
  sns.boxplot(x = 'liveness', data = df)
  plt.subplot(4,3,6)
  sns.boxplot(x = 'loudness', data = df)
  plt.subplot(4,3,7)
  sns.boxplot(x = 'speechiness', data = df)
  plt.subplot(4,3,8)
  sns.boxplot(x = 'tempo', data = df)
  plt.subplot(4,3,9)
  sns.boxplot(x = 'time_signature', data = df)
  plt.subplot(4,3,10)
  sns.boxplot(x = 'danceability', data = df)
  plt.subplot(4,3,11)
  sns.boxplot(x = 'length', data = df)
  plt.subplot(4,3,12)
  sns.boxplot(x = 'release_date', data = df)
  ```
This data is a bit noisy: by observing each column as a boxplot, you can spot outliers.
You could go through the dataset and remove these outliers, but that would leave you with very little data.
- For now, decide which columns to use for your clustering exercise. Choose ones with similar ranges and encode the `artist_top_genre` column as numeric data:

  ```python
  from sklearn.preprocessing import LabelEncoder
  le = LabelEncoder()

  X = df.loc[:, ('artist_top_genre','popularity','danceability','acousticness','loudness','energy')]

  y = df['artist_top_genre']

  X['artist_top_genre'] = le.fit_transform(X['artist_top_genre'])

  y = le.transform(y)
  ```
- Next, determine how many clusters to target. You know there are 3 song genres in the dataset, so let's try 3:

  ```python
  from sklearn.cluster import KMeans
  nclusters = 3
  seed = 0

  km = KMeans(n_clusters=nclusters, random_state=seed)
  km.fit(X)

  # Predict the cluster for each data point
  y_cluster_kmeans = km.predict(X)
  y_cluster_kmeans
  ```
You'll see an array printed out with predicted clusters (0, 1, or 2) for each row in the dataframe.
- Use this array to calculate a 'silhouette score':

  ```python
  from sklearn import metrics
  score = metrics.silhouette_score(X, y_cluster_kmeans)
  score
  ```
Silhouette Score
Aim for a silhouette score closer to 1. This score ranges from -1 to 1. A score of 1 indicates that the cluster is dense and well-separated from other clusters. A value near 0 suggests overlapping clusters with samples close to the decision boundary of neighboring clusters. (Source)
Our score is 0.53, which is moderate. This suggests that our data isn't particularly well-suited for this type of clustering, but let's proceed.
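To build intuition for the score's range, you can compare silhouette scores on synthetic data. The sketch below (using Scikit-learn's `make_blobs`; the exact scores depend on the random seed) shows that well-separated clusters score near 1 while overlapping clusters score much lower:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated blobs: tight clusters, far apart
X_far, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
labels_far = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_far)

# Heavily overlapping blobs: same centers logic, much larger spread
X_near, _ = make_blobs(n_samples=300, centers=3, cluster_std=5.0, random_state=42)
labels_near = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_near)

score_separated = silhouette_score(X_far, labels_far)      # close to 1
score_overlapping = silhouette_score(X_near, labels_near)  # much lower
print(score_separated, score_overlapping)
```

Our 0.53 sits between these extremes, which is why it reads as "moderate".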
Exercise - Build a Model
- Import `KMeans` and begin the clustering process:

  ```python
  from sklearn.cluster import KMeans
  wcss = []

  for i in range(1, 11):
      kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
      kmeans.fit(X)
      wcss.append(kmeans.inertia_)
  ```
Here's an explanation of some key parts:
🎓 range: These are the iterations of the clustering process.
🎓 random_state: "Determines random number generation for centroid initialization." Source
🎓 WCSS: "Within-cluster sums of squares" measures the squared average distance of all points within a cluster to the cluster centroid. Source
🎓 Inertia: K-Means algorithms aim to choose centroids that minimize 'inertia,' "a measure of how internally coherent clusters are." Source. The value is appended to the wcss variable during each iteration.
🎓 k-means++: In Scikit-learn, you can use the 'k-means++' optimization, which "initializes the centroids to be (generally) distant from each other, leading to likely better results than random initialization."
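A key fact behind the elbow method: inertia (WCSS) never meaningfully increases as k grows, because more centroids can only bring points closer to their nearest center. That is why you look for the bend rather than the minimum. A quick sketch on synthetic blob data (the blobs here are stand-in data, not the songs dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data: 3 synthetic blobs
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# Inertia for k = 1..5; it keeps shrinking even past the "true" k of 3
inertias = [
    KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10).fit(X_demo).inertia_
    for i in range(1, 6)
]
print(inertias)
```

The drop is steep up to the true number of clusters and shallow afterwards, producing the elbow shape.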
Elbow Method
Earlier, you assumed that 3 clusters would be appropriate because of the 3 song genres. But is that correct?
- Use the 'elbow method' to confirm. Use the `wcss` variable you built earlier to create a chart showing the 'bend' in the elbow, which indicates the optimal number of clusters. Perhaps it is 3!

  ```python
  plt.figure(figsize=(10,5))
  sns.lineplot(x=range(1, 11), y=wcss, marker='o', color='red')
  plt.title('Elbow')
  plt.xlabel('Number of clusters')
  plt.ylabel('WCSS')
  plt.show()
  ```
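Eyeballing the bend works, but you can also pick it programmatically. One common heuristic (a sketch, not the only approach) selects the k whose WCSS point lies farthest from the straight line joining the first and last points of the curve:

```python
import numpy as np

def elbow_point(wcss_values):
    """Return the k at the 'bend': the point farthest from the line
    joining the first and last (k, WCSS) points.
    Assumes wcss_values[i] corresponds to k = i + 1."""
    n = len(wcss_values)
    pts = np.column_stack([np.arange(1, n + 1), wcss_values]).astype(float)
    start, end = pts[0], pts[-1]
    line = end - start
    line /= np.linalg.norm(line)
    # Perpendicular distance of each point from the start-end line
    vecs = pts - start
    proj = np.outer(vecs @ line, line)
    dists = np.linalg.norm(vecs - proj, axis=1)
    return int(pts[dists.argmax(), 0])
```

Applied to your `wcss` list, this gives a single suggested k to compare with your visual read of the chart.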
Exercise - Display the Clusters
- Repeat the process, this time setting three clusters, and display the clusters as a scatterplot:

  ```python
  from sklearn.cluster import KMeans
  kmeans = KMeans(n_clusters = 3)
  kmeans.fit(X)
  labels = kmeans.predict(X)
  plt.scatter(df['popularity'], df['danceability'], c = labels)
  plt.xlabel('popularity')
  plt.ylabel('danceability')
  plt.show()
  ```
- Check the model's accuracy:

  ```python
  labels = kmeans.labels_

  correct_labels = sum(y == labels)

  print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))

  print('Accuracy score: {0:0.2f}'.format(correct_labels/float(y.size)))
  ```
The model's accuracy isn't great, and the shape of the clusters gives you a clue as to why.
The data is too imbalanced, poorly correlated, and has too much variance between column values to cluster effectively. In fact, the clusters that form are likely heavily influenced or skewed by the three genre categories we defined earlier. This was a learning experience!
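One caveat worth knowing: K-Means cluster numbers are arbitrary (cluster 0 need not correspond to genre 0), so comparing `y == labels` directly can understate how well the clusters match the genres. A simple sketch maps each cluster to the most common true label inside it before scoring (for an optimal one-to-one matching you could instead use SciPy's `linear_sum_assignment`):

```python
import numpy as np

def remap_cluster_labels(true_labels, cluster_labels):
    """Relabel each cluster with the majority true label among its members,
    so cluster ids line up with class ids before computing accuracy."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    mapped = np.empty_like(cluster_labels)
    for c in np.unique(cluster_labels):
        mask = cluster_labels == c
        # Majority vote of true labels within cluster c
        values, counts = np.unique(true_labels[mask], return_counts=True)
        mapped[mask] = values[counts.argmax()]
    return mapped
```

Even after remapping, accuracy on this dataset stays modest, which supports the variance diagnosis below.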
According to Scikit-learn's documentation, a model like this one, with poorly defined clusters, has a 'variance' problem:
Infographic from Scikit-learn
Variance
Variance is defined as "the average of the squared differences from the Mean" (Source). In the context of this clustering problem, it means that the numbers in our dataset diverge too much from the mean.
✅ This is a good time to think about ways to address this issue. Should you tweak the data further? Use different columns? Try a different algorithm? Hint: Consider scaling your data to normalize it and test other columns.
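The scaling hint can be sketched with Scikit-learn's `StandardScaler`, which rescales every column to zero mean and unit standard deviation so that no single wide-ranged column (like loudness) dominates the distance calculations. The dataframe below is stand-in data mimicking the scale mismatch, not the songs dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in columns with very different scales, like loudness vs. danceability
demo = pd.DataFrame({
    'loudness': rng.normal(-10, 5, 100),     # roughly -25 to 5
    'danceability': rng.uniform(0, 1, 100),  # 0 to 1
})

scaled = StandardScaler().fit_transform(demo)
print(scaled.mean(axis=0).round(6))  # each column ~0 mean
print(scaled.std(axis=0).round(6))   # each column ~unit std
```

You would fit the scaler on your selected feature columns before calling `KMeans.fit`.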
Try this 'variance calculator' to better understand the concept.
🚀Challenge
Spend some time with this notebook, tweaking parameters. Can you improve the model's accuracy by cleaning the data further (e.g., removing outliers)? You can use weights to give more importance to certain data samples. What else can you do to create better clusters?
Hint: Try scaling your data. There's commented code in the notebook that adds standard scaling to make the data columns more similar in range. You'll find that while the silhouette score decreases, the 'kink' in the elbow graph becomes smoother. This is because leaving the data unscaled allows data with less variance to have more influence. Read more about this issue here.
Post-lecture quiz
Review & Self Study
Check out a K-Means Simulator like this one. This tool lets you visualize sample data points and determine their centroids. You can adjust the data's randomness, number of clusters, and number of centroids. Does this help you better understand how data can be grouped?
Also, review this handout on K-Means from Stanford.
Assignment
Experiment with different clustering methods
Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.