## **Nigerian Music scraped from Spotify - an analysis**

Clustering is a type of [Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning) that assumes a dataset is unlabeled or that its inputs are not paired with predefined outputs. It uses various algorithms to sift through unlabeled data and create groupings based on patterns it identifies in the data.

[**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/27/)

### **Introduction**

[Clustering](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_124) is highly useful for exploring data. Let's see if it can help uncover trends and patterns in how Nigerian audiences consume music.

> ✅ Take a moment to think about the applications of clustering. In everyday life, clustering happens when you sort a pile of laundry into family members' clothes 🧦👕👖🩲. In data science, clustering occurs when analyzing user preferences or identifying characteristics in an unlabeled dataset. Clustering, in a way, helps bring order to chaos, like organizing a sock drawer.

In a professional context, clustering can be used for tasks like market segmentation, such as identifying which age groups purchase specific items. Another application is anomaly detection, for instance, identifying fraud in a dataset of credit card transactions. It can also be used to detect tumors in medical scans.

✅ Take a moment to think about how you might have encountered clustering in real-world scenarios, such as in banking, e-commerce, or business.

> 🎓 Interestingly, cluster analysis originated in the fields of Anthropology and Psychology in the 1930s. Can you imagine how it might have been applied back then?

Alternatively, clustering can be used to group search results, for example by shopping links, images, or reviews. It is particularly useful for large datasets that need to be reduced for more detailed analysis, making it a valuable technique for understanding data before building other models.

✅ Once your data is organized into clusters, you assign it a cluster ID. This approach can be helpful for preserving a dataset's privacy, as you can refer to a data point by its cluster ID rather than by more identifiable information. Can you think of other reasons why you might use a cluster ID instead of other elements of the cluster for identification?

### Getting started with clustering

> 🎓 How we create clusters depends heavily on how we group data points. Let's break down some key terms:
>
> 🎓 ['Transductive' vs. 'inductive'](https://wikipedia.org/wiki/Transduction_(machine_learning))
>
> Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules, which are then applied to test cases.
>
> Example: Imagine you have a dataset that is only partially labeled. Some items are 'records,' some are 'CDs,' and others are blank. Your task is to label the blanks. If you use an inductive approach, you'd train a model to identify 'records' and 'CDs' and apply those labels to the unlabeled data. This approach might struggle to classify items that are actually 'cassettes.' A transductive approach, however, handles unknown data more effectively by grouping similar items together and then applying a label to the group. In this case, clusters might represent 'round musical items' and 'square musical items.'
>
> 🎓 ['Non-flat' vs. 'flat' geometry](https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering)
>
> Derived from mathematical terminology, non-flat vs. flat geometry refers to how distances between points are measured: either using 'flat' ([Euclidean](https://wikipedia.org/wiki/Euclidean_geometry)) or 'non-flat' (non-Euclidean) methods.
>
> 'Flat' refers to Euclidean geometry (often taught as 'plane' geometry), while 'non-flat' refers to non-Euclidean geometry. What does geometry have to do with machine learning? Since both fields are rooted in mathematics, there must be a common way to measure distances between points in clusters. This can be done in a 'flat' or 'non-flat' manner, depending on the nature of the data. [Euclidean distances](https://wikipedia.org/wiki/Euclidean_distance) are measured as the length of a straight line between two points. [Non-Euclidean distances](https://wikipedia.org/wiki/Non-Euclidean_geometry) are measured along a curve. If your data, when visualized, doesn't exist on a plane, you may need a specialized algorithm to handle it.

<p>
   <img src="../../images/flat-nonflat.png"
   width="600"/>
   <figcaption>Infographic by Dasani Madipalli</figcaption>
</p>
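
To make the 'flat' versus 'non-flat' distinction concrete, here is a minimal sketch in base R. The points and the `great_circle()` helper are made up for illustration and are not part of the lesson's data. The helper uses the haversine formula as just one example of a distance measured along a curve (the surface of a sphere).

```r
# 'Flat' case: straight-line (Euclidean) distance between two points on a plane
p <- c(0, 0)
q <- c(3, 4)
euclidean <- sqrt(sum((p - q)^2))
euclidean

# dist() computes the same quantity for every pair of rows
dist(rbind(p, q), method = "euclidean")

# 'Non-flat' case: hypothetical helper that measures distance along the
# surface of a sphere (haversine formula); inputs are in degrees
great_circle <- function(lat1, lon1, lat2, lon2, radius = 1) {
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  dphi <- phi2 - phi1
  dlambda <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  2 * radius * asin(sqrt(a))
}

# From the equator to the pole on a unit sphere: a quarter of a great circle
arc <- great_circle(0, 0, 90, 0)

# The straight line cutting through the sphere between the same two points
chord <- 2 * sin(arc / 2)

c(along_curve = arc, straight_line = chord)
```

The curved distance is longer than the straight line between the same two points, which is why the choice of distance measure can change which points an algorithm treats as neighbors.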

> 🎓 ['Distances'](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)
>
> Clusters are defined by their distance matrix, which measures the distances between points. This distance can be calculated in several ways. Euclidean clusters are defined by the average of the point values and have a 'centroid' or center point. Distances are measured relative to that centroid. Non-Euclidean distances refer to 'clustroids,' the point closest to other points. Clustroids can be defined in various ways.
>
> 🎓 ['Constrained'](https://wikipedia.org/wiki/Constrained_clustering)
>
> [Constrained Clustering](https://web.cs.ucdavis.edu/~davidson/Publications/ICDMTutorial.pdf) introduces 'semi-supervised' learning into this unsupervised method. Relationships between points are flagged as 'cannot-link' or 'must-link,' imposing some rules on the dataset.
>
> Example: If an algorithm is applied to a batch of unlabeled or semi-labeled data, the resulting clusters may be of poor quality. For instance, the clusters might group 'round musical items,' 'square musical items,' 'triangular items,' and 'cookies.' By adding constraints or rules ("the item must be made of plastic," "the item must produce music"), the algorithm can make better choices.
>
> 🎓 'Density'
>
> Density describes how tightly packed the points in a region are. 'Noisy' data often contains clusters of uneven density along with scattered outliers, so the distances between points in its clusters may vary and an appropriate clustering method is needed. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) explains the difference between using K-Means clustering and HDBSCAN algorithms to analyze a noisy dataset with uneven cluster density.
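
The 'centroid' and 'clustroid' terms from the 'Distances' note above can be illustrated with a small base R sketch. The four points are invented for this example. The clustroid is computed here as the point with the smallest total distance to all other points, which is only one of the several possible definitions mentioned above.

```r
# Four made-up points in one cluster; the last one is far from the rest
pts <- matrix(c(1, 1,
                2, 1,
                1, 3,
                8, 8),
              ncol = 2, byrow = TRUE)

# Centroid: the average of the point values (need not be an actual data point)
centroid <- colMeans(pts)
centroid

# Distance matrix: pairwise Euclidean distances between all points
d <- as.matrix(dist(pts))

# Clustroid (one possible definition): the existing point whose summed
# distance to every other point is smallest
clustroid_idx <- unname(which.min(rowSums(d)))
clustroid <- pts[clustroid_idx, ]
clustroid
```

Here the centroid is pulled toward the distant point and does not coincide with any observation, while the clustroid is always one of the observed points.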

Deepen your understanding of clustering techniques in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-cluster-models?WT.mc_id=academic-77952-leestott)

### **Clustering algorithms**

There are over 100 clustering algorithms, and their application depends on the nature of the data. Let's explore some of the major ones:

-   **Hierarchical clustering**. Objects are classified based on their proximity to nearby objects rather than distant ones. Clusters are formed by repeatedly combining two clusters.

<p>
   <img src="../../images/hierarchical.png"
   width="600"/>
   <figcaption>Infographic by Dasani Madipalli</figcaption>
</p>

-   **Centroid clustering**. This popular approach requires selecting 'k,' the number of clusters to form. The algorithm then determines the center point of each cluster and gathers data around it. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a widely used version of centroid clustering that divides a dataset into K pre-defined groups. Each point is assigned to the cluster with the nearest mean, hence the name, and the algorithm works to minimize the squared distance between each point and its cluster's center.

<p>
   <img src="../../images/centroid.png"
   width="600"/>
   <figcaption>Infographic by Dasani Madipalli</figcaption>
</p>

-   **Distribution-based clustering**. Based on statistical modeling, this method assigns data points to clusters based on the probability of their belonging to a cluster. Gaussian mixture methods fall under this category.

-   **Density-based clustering**. Data points are grouped into clusters based on their density or proximity to one another. Points far from the group are considered outliers or noise. DBSCAN, Mean-shift, and OPTICS are examples of this type of clustering.

-   **Grid-based clustering**. For multi-dimensional datasets, a grid is created, and the data is divided among the grid's cells, forming clusters.
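
As a rough illustration of two of the families listed above, the sketch below runs hierarchical clustering (`hclust()`) and centroid clustering (`kmeans()`) from base R's `stats` package on simulated data. The two well-separated blobs are an assumption chosen to make the result easy to read; real data such as the songs in this lesson is rarely this tidy.

```r
set.seed(2024)

# Simulated data: two well-separated blobs of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# Hierarchical clustering: repeatedly merge the two closest clusters,
# then cut the resulting tree into 2 groups
hc <- hclust(dist(toy), method = "average")
hc_labels <- cutree(hc, k = 2)

# Centroid clustering: k-means with k = 2 and several random restarts
km <- kmeans(toy, centers = 2, nstart = 10)

# Cross-tabulate the two sets of cluster labels
table(hierarchical = hc_labels, kmeans = km$cluster)
```

On data this clean both methods recover the same two groups, although the cluster IDs themselves are arbitrary and may be swapped between methods.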

The best way to learn about clustering is to try it yourself, which you'll do in this exercise.

We'll need some packages to complete this module. You can install them using: `install.packages(c('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork'))`

Alternatively, the script below checks whether you have the required packages and installs any missing ones for you.


In [None]:
suppressWarnings(if(!require("pacman")) install.packages("pacman"))

pacman::p_load('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork')


## Exercise - cluster your data

Clustering is a technique that benefits significantly from good visualization, so let's begin by visualizing our music data. This exercise will help us determine which clustering method would be most effective for the characteristics of this data.

Let's dive right in by importing the data.


In [None]:
# Load the core tidyverse and make it available in your current R session
library(tidyverse)

# Import the data into a tibble
df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv")

# View the first 5 rows of the data set
df %>% 
  slice_head(n = 5)


Sometimes, we might want a bit more information about our data. We can examine the data and its structure using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function:


In [None]:
# Glimpse into the data set
df %>% 
  glimpse()


Good job!💪

We can see that `glimpse()` provides the total number of rows (observations) and columns (variables), followed by the first few entries of each variable in a row after the variable name. Additionally, the *data type* of the variable is displayed right after the variable's name within `< >`.

`DataExplorer::introduce()` can organize this information in a concise way:


In [None]:
# Describe basic information for our data
df %>% 
  introduce()

# A visual display of the same
df %>% 
  plot_intro()


Awesome! We just learned that our data doesn't have any missing values.
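
If you would like a numeric cross-check alongside the plot, base R can count missing values directly. The sketch below uses a small made-up data frame (`toy`) instead of `df` so that it runs on its own; the same two calls should also work on `df`.

```r
# Made-up data frame with a few deliberately missing values
toy <- data.frame(
  name       = c("a", "b", NA),
  popularity = c(10, NA, 30),
  energy     = c(0.5, 0.7, 0.9)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Number of rows with no missing values at all
complete_rows <- sum(complete.cases(toy))
complete_rows
```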

While we're at it, we can explore common statistics for central tendency (e.g., [mean](https://en.wikipedia.org/wiki/Arithmetic_mean) and [median](https://en.wikipedia.org/wiki/Median)) and measures of dispersion (e.g., [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)) using `summarytools::descr()`.


In [None]:
# Describe common statistics
df %>% 
  descr(stats = "common")


Let's examine the general values of the data. Keep in mind that popularity can be `0`, which indicates songs that have no ranking. We'll filter those out shortly.

> 🤔 If we're using clustering, an unsupervised method that doesn't rely on labeled data, why are we displaying this data with labels? During the data exploration phase, labels can be useful, but they aren't required for clustering algorithms to function.

### 1. Explore popular genres

Let's dive in and identify the most popular genres 🎶 by counting how often each one appears.


In [None]:
# Popular genres
top_genres <- df %>% 
  count(artist_top_genre, sort = TRUE) %>% 
  # Encode as a factor and order the levels by count
  mutate(artist_top_genre = factor(artist_top_genre) %>% fct_inorder())

# Print the top genres
top_genres


That went well! They say a picture is worth a thousand rows of a data frame (actually, nobody ever says that 😅). But you get the idea, right?
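
A quick note on why the genre column was converted to a factor with `fct_inorder()` before plotting: ggplot2 arranges a discrete axis by factor level, and `factor()` sorts levels alphabetically by default. The base R sketch below, using made-up genre names, shows the difference. `factor(x, levels = unique(x))` is used here as a base R stand-in for what `fct_inorder()` does (levels in order of first appearance).

```r
# Made-up genres, already sorted from most to least frequent
genres <- c("rock", "jazz", "blues")

# Default: levels are sorted alphabetically, losing the count order
default_levels <- levels(factor(genres))
default_levels

# Levels in order of first appearance, so the count order is kept
ordered_levels <- levels(factor(genres, levels = unique(genres)))
ordered_levels
```

With the second version, bars in a plot would appear from most to least frequent rather than from A to Z.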

One way to visualize categorical data (character or factor variables) is by using bar plots. Let's create a bar plot for the top 10 genres:


In [None]:
# Change the default gray theme
theme_set(theme_light())

# Visualize popular genres
top_genres %>%
  slice(1:10) %>% 
  ggplot(mapping = aes(x = artist_top_genre, y = n,
                       fill = artist_top_genre)) +
  geom_col(alpha = 0.8) +
  paletteer::scale_fill_paletteer_d("rcartocolor::Vivid") +
  ggtitle("Top genres") +
  theme(plot.title = element_text(hjust = 0.5),
        # Rotates the X markers (so we can read them)
    axis.text.x = element_text(angle = 90))


Now it's much easier to spot that we have `missing` genres 🧐!

> A good visualization will reveal things you didn't anticipate or spark new questions about the data - Hadley Wickham and Garrett Grolemund, [R For Data Science](https://r4ds.had.co.nz/introduction.html)

Keep in mind, when the top genre is labeled as `Missing`, it means Spotify didn't categorize it, so let's remove it.


In [None]:
# Visualize popular genres
top_genres %>%
  filter(artist_top_genre != "Missing") %>% 
  slice(1:10) %>% 
  ggplot(mapping = aes(x = artist_top_genre, y = n,
                       fill = artist_top_genre)) +
  geom_col(alpha = 0.8) +
  paletteer::scale_fill_paletteer_d("rcartocolor::Vivid") +
  ggtitle("Top genres") +
  theme(plot.title = element_text(hjust = 0.5),
        # Rotates the X markers (so we can read them)
    axis.text.x = element_text(angle = 90))


From the initial data exploration, we observe that the top three genres dominate this dataset. Let's focus on `afro dancehall`, `afropop`, and `nigerian pop`, and further filter the dataset to exclude any entries with a popularity value of 0 (indicating they were not assigned a popularity score in the dataset and can be considered irrelevant for our analysis):


In [None]:
nigerian_songs <- df %>% 
  # Concentrate on top 3 genres
  filter(artist_top_genre %in% c("afro dancehall", "afropop","nigerian pop")) %>% 
  # Remove unclassified observations
  filter(popularity != 0)



# Visualize popular genres
nigerian_songs %>%
  count(artist_top_genre) %>%
  ggplot(mapping = aes(x = artist_top_genre, y = n,
                       fill = artist_top_genre)) +
  geom_col(alpha = 0.8) +
  paletteer::scale_fill_paletteer_d("ggsci::category10_d3") +
  ggtitle("Top genres") +
  theme(plot.title = element_text(hjust = 0.5))


Let's examine whether there is any obvious linear relationship among the numerical variables in our dataset. This relationship is measured mathematically using the [correlation statistic](https://en.wikipedia.org/wiki/Correlation).

The correlation statistic is a value ranging from -1 to 1 that reflects the strength of a relationship. Values greater than 0 indicate a *positive* correlation (high values of one variable are generally associated with high values of the other), whereas values less than 0 indicate a *negative* correlation (high values of one variable are generally associated with low values of the other).
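
To see where such a value comes from, here is a small base R sketch of the Pearson correlation on made-up vectors. The hand-written `pearson()` function is a hypothetical helper added for illustration; in practice `cor()` does the same job.

```r
# Made-up data
x      <- c(1, 2, 3, 4, 5)
y_up   <- c(2, 4, 5, 4, 6)   # tends to rise with x
y_down <- rev(x)             # falls exactly as x rises

# Hypothetical helper: Pearson correlation written out by hand
pearson <- function(a, b) {
  da <- a - mean(a)
  db <- b - mean(b)
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

# Positive correlation (about 0.85); matches the built-in cor()
pearson(x, y_up)
cor(x, y_up)

# Perfect negative correlation
cor(x, y_down)
```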


In [None]:
# Narrow down to numeric variables and find correlation
corr_mat <- nigerian_songs %>% 
  select(where(is.numeric)) %>% 
  cor()

# Visualize correlation matrix
corrplot(corr_mat, order = 'AOE', col = c('white', 'black'), bg = 'gold2')  


The data doesn't show strong correlations, except between `energy` and `loudness`, which makes sense since loud music is typically quite energetic. `Popularity` is related to `release date`, which also makes sense because newer songs are likely to be more popular. There also seems to be a correlation between length and energy.
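
Reading the strongest pairs off a plot works for a handful of variables. As a sketch of a programmatic alternative, the base R code below ranks variable pairs by the absolute value of their correlation. It uses simulated data (`toy`, with an artificial link between `energy` and `loudness`) rather than the lesson's `corr_mat`, so the numbers are not those of the songs dataset.

```r
set.seed(1)
n <- 50
energy <- runif(n)

# Simulated data: loudness is built from energy, length is unrelated
toy <- data.frame(
  energy   = energy,
  loudness = energy * 10 + rnorm(n, sd = 0.5),
  length   = runif(n)
)

corr <- cor(toy)

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(corr), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[upper.tri(corr)]
)

# Strongest relationships first, regardless of sign
ranked <- ranked[order(-abs(ranked$r)), ]
ranked
```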

It will be interesting to see what insights a clustering algorithm can uncover from this data!

> 🎓 Remember, correlation does not imply causation! While we can confirm correlation, we have no evidence of causation. An [entertaining website](https://tylervigen.com/spurious-correlations) provides visuals that highlight this concept.

### 2. Explore data distribution

Let's dive deeper with some nuanced questions. Are genres significantly different in how their danceability is perceived, based on their popularity? To investigate, let's analyze the data distribution for popularity and danceability in our top three genres along a specified x and y axis using [density plots](https://www.khanacademy.org/math/ap-statistics/density-curves-normal-distribution-ap/density-curves/v/density-curves).
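
Before plotting, it may help to see what a density estimate is. A density curve is a smoothed picture of where values concentrate, scaled so the area under it is roughly 1. The base R sketch below uses `density()` on simulated scores (not the songs data) with two made-up groups.

```r
set.seed(42)

# Simulated scores drawn from two groups centered at 30 and 60
scores <- c(rnorm(100, mean = 30, sd = 5),
            rnorm(100, mean = 60, sd = 5))

# Kernel density estimate: a grid of x values with an estimated density y
kde <- density(scores)

# Area under the curve (trapezoid rule); should be close to 1
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
area

# Location of the highest peak of the curve
peak <- kde$x[which.max(kde$y)]
peak
```

The `geom_density()` and `geom_density_2d()` layers used in this lesson draw the same kind of estimate, in one and two dimensions respectively.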


In [None]:
# Perform 2D kernel density estimation
density_estimate_2d <- nigerian_songs %>% 
  ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre)) +
  geom_density_2d(bins = 5, size = 1) +
  paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry") +
  xlim(-20, 80) +
  ylim(0, 1.2)

# Density plot based on the popularity
density_estimate_pop <- nigerian_songs %>% 
  ggplot(mapping = aes(x = popularity, fill = artist_top_genre, color = artist_top_genre)) +
  geom_density(size = 1, alpha = 0.5) +
  paletteer::scale_fill_paletteer_d("RSkittleBrewer::wildberry") +
  paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry") +
  theme(legend.position = "none")

# Density plot based on the danceability
density_estimate_dance <- nigerian_songs %>% 
  ggplot(mapping = aes(x = danceability, fill = artist_top_genre, color = artist_top_genre)) +
  geom_density(size = 1, alpha = 0.5) +
  paletteer::scale_fill_paletteer_d("RSkittleBrewer::wildberry") +
  paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry")


# Patch everything together
library(patchwork)
density_estimate_2d / (density_estimate_pop + density_estimate_dance)


We observe concentric circles aligning, regardless of the genre. Could it be that Nigerian preferences converge at a specific level of danceability for this genre?

Overall, the three genres show similarities in terms of their popularity and danceability. Identifying clusters within this loosely aligned data will be tricky. Let's check if a scatter plot can provide some clarity.


In [None]:
# A scatter plot of popularity and danceability
scatter_plot <- nigerian_songs %>% 
  ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre, shape = artist_top_genre)) +
  geom_point(size = 2, alpha = 0.8) +
  paletteer::scale_color_paletteer_d("futurevisions::mars")

# Add a touch of interactivity
ggplotly(scatter_plot)


A scatterplot of the same axes reveals a similar pattern of convergence.

In general, scatterplots are useful for visualizing clusters in data, making them an essential tool for clustering tasks. In the next lesson, we will take this filtered data and apply k-means clustering to identify groups within the data that exhibit interesting overlaps.

## **🚀 Challenge**

To prepare for the next lesson, create a chart outlining various clustering algorithms that you might encounter and use in a production environment. What types of problems are these clustering methods designed to solve?

## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/28/)

## **Review & Self Study**

Before applying clustering algorithms, as we've learned, it's important to understand the characteristics of your dataset. Learn more about this topic [here](https://www.kdnuggets.com/2019/10/right-clustering-algorithm.html).

Expand your knowledge of clustering techniques:

-   [Train and Evaluate Clustering Models using Tidymodels and friends](https://rpubs.com/eR_ic/clustering)

-   Bradley Boehmke & Brandon Greenwell, [*Hands-On Machine Learning with R*](https://bradleyboehmke.github.io/HOML/).

## **Assignment**

[Explore other visualizations for clustering](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/assignment.md)

## THANK YOU TO:

[Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️

[`Dasani Madipalli`](https://twitter.com/dasani_decoded) for designing the incredible illustrations that make machine learning concepts more accessible and easier to grasp.

Happy Learning,

[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
