diff --git a/5-Clustering/1-Visualize/solution/R/lesson_14.html b/5-Clustering/1-Visualize/solution/R/lesson_14.html new file mode 100644 index 00000000..c6b791f3 --- /dev/null +++ b/5-Clustering/1-Visualize/solution/R/lesson_14.html @@ -0,0 +1,5445 @@ + + + + + + + + + + + + + +Introduction to clustering: Clean, prep and visualize your data + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + +
+

Nigerian Music scraped from Spotify - an +analysis

+

Clustering is a type of Unsupervised +Learning that presumes that a dataset is unlabelled or that its +inputs are not matched with predefined outputs. It uses various +algorithms to sort through unlabeled data and provide groupings +according to patterns it discerns in the data.

+

Pre-lecture +quiz

+
+

Introduction

+

Clustering +is very useful for data exploration. Let’s see if it can help discover +trends and patterns in the way Nigerian audiences consume music.

+
+

✅ Take a minute to think about the uses of clustering. In real life, +clustering happens whenever you have a pile of laundry and need to sort +out your family members’ clothes 🧦👕👖🩲. In data science, clustering +happens when trying to analyze a user’s preferences, or determine the +characteristics of any unlabeled dataset. Clustering, in a way, helps +make sense of chaos, like a sock drawer.

+
+

In a professional setting, clustering can be used for tasks such as +market segmentation, for example determining which age groups buy which +items. Another use would be anomaly detection, perhaps to detect +fraud in a dataset of credit card transactions. Or you might use +clustering to identify tumors in a batch of medical scans.

+

✅ Think for a minute about how you might have encountered clustering ‘in +the wild’, in a banking, e-commerce, or business setting.

+
+

🎓 Interestingly, cluster analysis originated in the fields of +Anthropology and Psychology in the 1930s. Can you imagine how it might +have been used?

+
+

Alternatively, you could use it for grouping search results - by +shopping links, images, or reviews, for example. Clustering is useful +when you have a large dataset that you want to reduce and on which you +want to perform more granular analysis, so the technique can be used to +learn about data before other models are constructed.

+

✅ Once your data is organized in clusters, you assign each point a cluster +ID. This technique can be useful when preserving a dataset’s +privacy: you can refer to a data point by its cluster ID, rather +than by more revealing identifiable data. Can you think of other reasons +why you’d refer to a cluster ID rather than other elements of the +cluster to identify it?

+
+
+

Getting started with clustering

+
+

🎓 How we create clusters has a lot to do with how we gather up the +data points into groups. Let’s unpack some vocabulary:

+

🎓 ‘Transductive’ +vs. ‘inductive’

+

Transductive inference is derived from observed training cases that +map to specific test cases. Inductive inference is derived from training +cases that map to general rules which are only then applied to test +cases.

+

An example: Imagine you have a dataset that is only partially +labelled. Some things are ‘records’, some ‘cds’, and some are blank. +Your job is to provide labels for the blanks. If you choose an inductive +approach, you’d train a model looking for ‘records’ and ‘cds’, and apply +those labels to your unlabeled data. This approach will have trouble +classifying things that are actually ‘cassettes’. A transductive +approach, on the other hand, handles this unknown data more effectively +as it works to group similar items together and then applies a label to +a group. In this case, clusters might reflect ‘round musical things’ and +‘square musical things’.

+

🎓 ‘Non-flat’ +vs. ‘flat’ geometry

+

Derived from mathematical terminology, non-flat vs. flat geometry +refers to the measure of distances between points by either ‘flat’ (Euclidean) or +‘non-flat’ (non-Euclidean) geometrical methods.

+

‘Flat’ in this context refers to Euclidean geometry (parts of which +are taught as ‘plane’ geometry), and non-flat refers to non-Euclidean +geometry. What does geometry have to do with machine learning? Well, +both fields are rooted in mathematics, and clustering needs a way to +measure distances between points. That can be done in a +‘flat’ or ‘non-flat’ way, depending on the nature of the data. Euclidean +distances are measured as the length of a line segment between two +points. Non-Euclidean +distances are measured along a curve. If your data, visualized, +does not seem to lie on a plane, you might need to use a specialized +algorithm to handle it.
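As a rough illustration of the two ways of measuring, the sketch below compares the straight-line (Euclidean) distance between two points on a unit sphere with the distance measured along the sphere's surface. The points `p` and `q` are made up for this example, and the arc-length formula is only one instance of a 'non-flat' distance; nothing in the lesson's dataset requires it.

```r
# Two hypothetical points on the unit sphere (each vector has length 1)
p <- c(1, 0, 0)
q <- c(0, 1, 0)

# 'Flat' distance: length of the straight line segment between p and q
euclidean_dist <- sqrt(sum((p - q)^2))

# 'Non-flat' distance: length of the arc along the sphere's surface.
# For unit vectors this equals the angle between them, in radians.
arc_dist <- acos(sum(p * q))

euclidean_dist  # sqrt(2), about 1.414
arc_dist        # pi / 2, about 1.571
```

The arc is longer than the chord because it follows the curved surface instead of cutting through it.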

+
+
+Infographic by Dasani Madipalli +
+
+
+

🎓 ‘Distances’

+

Clusters are defined by their distance matrix, i.e. the distances +between points. This distance can be measured in a few ways. Euclidean +clusters are defined by the average of the point values, and contain a +‘centroid’ or center point. Distances are thus measured by the distance +to that centroid. Non-Euclidean clusters are described by ‘clustroids’, the +point closest to the other points. Clustroids in turn can be defined in +various ways.
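To make the two terms concrete, here is a minimal sketch on a made-up set of five points. The clustroid is taken here to be the medoid (the existing point with the smallest total distance to the others), which is only one of the possible definitions mentioned above.

```r
# A small made-up cluster of five points in two dimensions
pts <- rbind(c(1, 1), c(2, 1), c(1, 2), c(2, 2), c(6, 6))

# Centroid: the average of the point values; it need not be an actual data point
centroid <- colMeans(pts)

# Clustroid (taken here as the medoid): the existing point whose total
# distance to all the other points is smallest
pairwise <- as.matrix(dist(pts))                 # Euclidean distance matrix
clustroid_index <- which.min(rowSums(pairwise))  # row with the smallest total
clustroid <- pts[clustroid_index, ]

centroid   # (2.4, 2.4), pulled toward the far point at (6, 6)
clustroid  # (2, 2), one of the original points
```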

+

🎓 ‘Constrained’

+

Constrained +Clustering introduces ‘semi-supervised’ learning into this +unsupervised method. The relationships between points are flagged as +‘cannot-link’ or ‘must-link’ so some rules are forced on the +dataset.

+

An example: If an algorithm is set free on a batch of unlabelled or +semi-labelled data, the clusters it produces may be of poor quality. In +the example above, the clusters might group ‘round music things’ and +‘square music things’ and ‘triangular things’ and ‘cookies’. If given +some constraints, or rules to follow (“the item must be made of +plastic”, “the item needs to be able to produce music”) this can help +‘constrain’ the algorithm to make better choices.

+

🎓 ‘Density’

+

Density describes how tightly packed, or ‘crowded’, the points in a cluster are. In ‘noisy’ data, the distances +between points in each cluster may prove, on examination, to be +more or less dense, and thus this data needs to be analyzed +with the appropriate clustering method. This +article demonstrates the difference between using K-Means clustering +vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster +density.

+
+

Deepen your understanding of clustering techniques in this Learn +module

+
+
+

Clustering algorithms

+

There are over 100 clustering algorithms, and their use depends on +the nature of the data at hand. Let’s discuss some of the major +ones:

+
    +
  • Hierarchical clustering. If an object is classified +by its proximity to a nearby object, rather than to one farther away, +clusters are formed based on their members’ distance to and from other +objects. Hierarchical clustering is characterized by repeatedly +combining two clusters.
  • +
+
+Infographic by Dasani Madipalli +
+
+
    +
  • Centroid clustering. This popular algorithm +requires the choice of ‘k’, or the number of clusters to form, after +which the algorithm determines the center point of a cluster and gathers +data around that point. K-means +clustering is a popular version of centroid clustering which +separates a data set into pre-defined K groups. The center is determined +by the nearest mean, thus the name. The squared distance from each point to its +cluster center is minimized.
+Infographic by Dasani Madipalli

  • +
  • Distribution-based clustering. Based in +statistical modeling, distribution-based clustering centers on +determining the probability that a data point belongs to a cluster, and +assigning it accordingly. Gaussian mixture methods belong to this +type.

  • +
  • Density-based clustering. Data points are +assigned to clusters based on their density, or their grouping around +each other. Data points far from the group are considered outliers or +noise. DBSCAN, Mean-shift and OPTICS belong to this type of +clustering.

  • +
  • Grid-based clustering. For multi-dimensional +datasets, a grid is created and the data is divided amongst the grid’s +cells, thereby creating clusters.

  • +
+
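As a small, hedged sketch of two of the families listed above, the code below runs hierarchical clustering and k-means on made-up data using only functions from base R's `stats` package. The data, the `average` linkage and the choice of `k = 2` are assumptions made for illustration, not recommendations for the music dataset.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Made-up data: two well-separated blobs of 20 points each
blob_a <- matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2)
blob_b <- matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
toy <- rbind(blob_a, blob_b)

# Hierarchical (agglomerative) clustering: repeatedly merge the two closest
# clusters, then cut the resulting tree into 2 groups
tree <- hclust(dist(toy), method = "average")
hier_labels <- cutree(tree, k = 2)

# Centroid clustering: k-means with k = 2 chosen up front
km <- kmeans(toy, centers = 2, nstart = 10)

table(hier_labels)  # sizes of the two hierarchical clusters
km$centers          # one center per cluster
km$tot.withinss     # total within-cluster sum of squares that k-means tries to minimize
```

On data this cleanly separated both methods recover the same two groups; on overlapping data they can disagree, which is why visualizing first is worthwhile.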

The best way to learn about clustering is to try it for yourself, so +that’s what you’ll do in this exercise.

+

We’ll need some packages to complete this module. You can +install them with: +install.packages(c('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork'))

+

Alternatively, the script below checks whether you have the packages +required to complete this module and installs them for you in case some +are missing.

+
suppressWarnings(if(!require("pacman")) install.packages("pacman"))
+
## Loading required package: pacman
+
pacman::p_load('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork')
+
## 
+## The downloaded binary packages are in
+##  /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmplRAI5s/downloaded_packages
+
## 
+## summarytools installed
+
## Warning in pacman::p_load("tidyverse", "tidymodels", "DataExplorer", "summarytools", : Failed to install/load:
+## summarytools
+
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
+
+
+
+

Exercise - cluster your data

+

Clustering as a technique is greatly aided by proper visualization, +so let’s get started by visualizing our music data. This exercise will +help us decide which clustering method is most +appropriate for the nature of this data.

+

Let’s hit the ground running by importing the data.

+
# Load the core tidyverse and make it available in your current R session
+library(tidyverse)
+
+# Import the data into a tibble
+df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv")
+
+# View the first 5 rows of the data set
+df %>% 
+  slice_head(n = 5)
+
+ +
+

Sometimes, we may want a little more information about our data. We +can have a look at the data and its structure +by using the glimpse() +function:

+
# Glimpse into the data set
+df %>% 
+  glimpse()
+
## Rows: 530
+## Columns: 16
+## $ name             <chr> "Sparky", "shuga rush", "LITT!", "Confident / Feeling…
+## $ album            <chr> "Mandy & The Jungle", "EVERYTHING YOU HEARD IS TRUE",…
+## $ artist           <chr> "Cruel Santino", "Odunsi (The Engine)", "AYLØ", "Lady…
+## $ artist_top_genre <chr> "alternative r&b", "afropop", "indie r&b", "nigerian …
+## $ release_date     <dbl> 2019, 2020, 2018, 2019, 2018, 2020, 2018, 2018, 2019,…
+## $ length           <dbl> 144000, 89488, 207758, 175135, 152049, 184800, 202648…
+## $ popularity       <dbl> 48, 30, 40, 14, 25, 26, 29, 27, 36, 30, 33, 35, 46, 2…
+## $ danceability     <dbl> 0.666, 0.710, 0.836, 0.894, 0.702, 0.803, 0.818, 0.80…
+## $ acousticness     <dbl> 0.8510, 0.0822, 0.2720, 0.7980, 0.1160, 0.1270, 0.452…
+## $ energy           <dbl> 0.420, 0.683, 0.564, 0.611, 0.833, 0.525, 0.587, 0.30…
+## $ instrumentalness <dbl> 5.34e-01, 1.69e-04, 5.37e-04, 1.87e-04, 9.10e-01, 6.6…
+## $ liveness         <dbl> 0.1100, 0.1010, 0.1100, 0.0964, 0.3480, 0.1290, 0.590…
+## $ loudness         <dbl> -6.699, -5.640, -7.127, -4.961, -6.044, -10.034, -9.8…
+## $ speechiness      <dbl> 0.0829, 0.3600, 0.0424, 0.1130, 0.0447, 0.1970, 0.199…
+## $ tempo            <dbl> 133.015, 129.993, 130.005, 111.087, 105.115, 100.103,…
+## $ time_signature   <dbl> 5, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4,…
+

Good job!💪

+

We can observe that glimpse() gives you the total +number of rows (observations) and columns (variables), followed by the first +few entries of each variable in a row after the variable name. In +addition, the data type of the variable is given immediately +after each variable’s name inside < >.

+

DataExplorer::introduce() can summarize this information +neatly:

+
# Describe basic information for our data
+df %>% 
+  introduce()
+
+ +
+
# A visual display of the same
+df %>% 
+  plot_intro()
+

+

Awesome! We have just learnt that our data has no missing values.
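If you would like to confirm the absence of missing values without a plot, a base R check along the following lines is one option. The data frame `toy_songs` is made up for this sketch and deliberately contains one `NA`, so that the check has something to find.

```r
# A tiny made-up data frame with one missing value in `tempo`
toy_songs <- data.frame(
  popularity   = c(48, 30, 0),
  danceability = c(0.666, 0.710, 0.836),
  tempo        = c(133.0, NA, 130.0)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy_songs))
missing_per_column

# TRUE only when the whole data frame has no missing values
no_missing <- !any(is.na(toy_songs))
no_missing
```

Running `colSums(is.na(df))` on the lesson's data should return zero for every column if the data really has no missing values.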

+

While we are at it, we can explore common central tendency statistics +(e.g. mean +and median) and +measures of dispersion (e.g. standard +deviation) using summarytools::descr():

+
# Describe common statistics
+df %>% descr(stats = "common")
+
+
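For a sense of what these statistics measure, the same quantities can be computed by hand in base R. The `popularity` vector below is made up for this sketch and is not taken from the dataset.

```r
# Made-up popularity scores, including one unranked song (0)
popularity <- c(48, 30, 40, 14, 25, 0)

# Central tendency
mean(popularity)    # arithmetic mean
median(popularity)  # middle value of the sorted scores

# Dispersion
sd(popularity)      # sample standard deviation (divides by n - 1)
range(popularity)   # smallest and largest values
```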

Let’s look at the general values of the data. Note that popularity +can be 0, which indicates songs that have no ranking. We’ll +remove those shortly.

+
+

🤔 If we are working with clustering, an unsupervised method that +does not require labeled data, why are we showing this data with labels? +In the data exploration phase, they come in handy, but they are not +necessary for the clustering algorithms to work.

+
+ +
+

2. Explore data distribution

+

Let’s ask some more subtle questions. Are the genres significantly +different in the perception of their danceability, based on their +popularity? Let’s examine the data distribution of our top three genres for +popularity and danceability along a given x and y axis using density +plots.
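The plotting code in this section works on a data frame called `nigerian_songs`, whose creation is not shown here. A minimal sketch of one way to build such a subset follows, written in base R. The helper `keep_top_genres()`, the toy data and the treatment of a `"Missing"` genre label are all assumptions made for illustration, and the lesson's own filtering step may differ.

```r
# Hypothetical helper (not part of the original lesson): keep the `n` most
# common genres and drop songs with no popularity ranking
keep_top_genres <- function(songs, n = 3) {
  # Ignore rows whose genre is recorded as "Missing" (an assumed placeholder)
  known <- songs[songs$artist_top_genre != "Missing", ]
  # Count songs per genre and take the names of the n largest counts
  genre_counts <- sort(table(known$artist_top_genre), decreasing = TRUE)
  top_genres <- head(names(genre_counts), n)
  # Keep only those genres, and only songs with popularity above 0
  known[known$artist_top_genre %in% top_genres & known$popularity > 0, ]
}

# Made-up stand-in for the lesson's `df`
toy_songs <- data.frame(
  artist_top_genre = c("afropop", "afropop", "afropop", "nigerian pop",
                       "nigerian pop", "afro dancehall", "afro dancehall",
                       "indie r&b", "Missing"),
  popularity   = c(48, 0, 30, 40, 14, 25, 0, 36, 20),
  danceability = c(0.67, 0.71, 0.84, 0.89, 0.70, 0.80, 0.82, 0.75, 0.60),
  stringsAsFactors = FALSE
)

toy_filtered <- keep_top_genres(toy_songs)
unique(toy_filtered$artist_top_genre)  # the three most common genres
nrow(toy_filtered)                     # rows left after dropping popularity 0
```

Applied to the lesson's data, `nigerian_songs <- keep_top_genres(df)` would produce a data frame of the shape the plotting code expects.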

+
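# Assumes `nigerian_songs` holds `df` filtered to the top three genres, with zero-popularity songs removed
+# Perform 2D kernel density estimation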
+density_estimate_2d <- nigerian_songs %>% 
+  ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre)) +
+  geom_density_2d(bins = 5, size = 1) +
+  paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry") +
+  xlim(-20, 80) +
+  ylim(0, 1.2)
+
+# Density plot based on the popularity
+density_estimate_pop <- nigerian_songs %>% 
+  ggplot(mapping = aes(x = popularity, fill = artist_top_genre, color = artist_top_genre)) +
+  geom_density(size = 1, alpha = 0.5) +
+  paletteer::scale_fill_paletteer_d("RSkittleBrewer::wildberry") +
+  paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry") +
+  theme(legend.position = "none")
+
+# Density plot based on the danceability
+density_estimate_dance <- nigerian_songs %>% 
+  ggplot(mapping = aes(x = danceability, fill = artist_top_genre, color = artist_top_genre)) +
+  geom_density(size = 1, alpha = 0.5) +
+  paletteer::scale_fill_paletteer_d("RSkittleBrewer::wildberry") +
+  paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry")
+
+
+# Patch everything together
+library(patchwork)
+density_estimate_2d / (density_estimate_pop + density_estimate_dance)
+

+

We see that there are concentric circles that line up, regardless of +genre. Could it be that Nigerian tastes converge at a certain level of +danceability for this genre?

+

In general, the three genres align in terms of their popularity and +danceability. Determining clusters in this loosely-aligned data will be +a challenge. Let’s see whether a scatter plot can support this.

+
# A scatter plot of popularity and danceability
+scatter_plot <- nigerian_songs %>% 
+  ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre, shape = artist_top_genre)) +
+  geom_point(size = 2, alpha = 0.8) +
+  paletteer::scale_color_paletteer_d("futurevisions::mars")
+
+# Add a touch of interactivity
+ggplotly(scatter_plot)
+
+ +

A scatterplot of the same axes shows a similar pattern of +convergence.

+

In general, for clustering, you can use scatterplots to show clusters +of data, so mastering this type of visualization is very useful. In the +next lesson, we will take this filtered data and use k-means clustering +to discover groups in this data that seem to overlap in interesting +ways.

+
+
+
+

🚀 Challenge

+

In preparation for the next lesson, make a chart about the various +clustering algorithms you might discover and use in a production +environment. What kinds of problems is the clustering trying to +address?

+
+
+

Post-lecture +quiz

+
+
+

Review & Self Study

+

Before you apply clustering algorithms, as we have learned, it’s a +good idea to understand the nature of your dataset. Read more on this +topic here

+

Deepen your understanding of clustering techniques:

+ +
+ +
+

THANK YOU TO:

+

Jen Looper for +creating the original Python version of this module ♥️

+

Dasani Madipalli +for creating the amazing illustrations that make machine learning +concepts more interpretable and easier to understand.

+

Happy Learning,

+

Eric, Gold Microsoft Learn +Student Ambassador.

+
+ +

+ + +
+
+ +
+ + + + + + + + + + + + + + + + +