Merge pull request #45 from softchris/clustering

editorial
pull/47/head
Jen Looper 4 years ago committed by GitHub
commit f079a4c1d3
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -75,46 +75,39 @@ Deepen your understanding of clustering techniques in this [Learn module](https:
>
> Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, or 'crowded' and thus this data needs to be analyzed with the appropriate clustering method. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) demonstrates the difference between using K-Means clustering vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster density.
### Clustering algorithms
## Clustering algorithms
There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones:
**Hierarchical clustering**
If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Scikit-learn's agglomerative clustering is hierarchical.
- **Hierarchical clustering**. If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Scikit-learn's agglomerative clustering is hierarchical.
![Hierarchical clustering Infographic](./images/hierarchical.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
**Centroid clustering**
This popular algorithm requires the choice of 'k', or the number of clusters to form, after which the algorithm determines the center point of a cluster and gathers data around that point. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a popular version of centroid clustering. The center is determined by the nearest mean, thus the name. The squared distance from the cluster is minimized.
- **Centroid clustering**. This popular algorithm requires the choice of 'k', or the number of clusters to form, after which the algorithm determines the center point of a cluster and gathers data around that point. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a popular version of centroid clustering. The center is determined by the nearest mean, thus the name. The squared distance from the cluster is minimized.
![Centroid clustering Infographic](./images/centroid.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
**Distribution-based clustering**
Based in statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian mixture methods belong to this type.
**Density-based clustering**
- **Distribution-based clustering**. Based in statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian mixture methods belong to this type.
Data points are assigned to clusters based on their density, or their grouping around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering.
- **Density-based clustering**. Data points are assigned to clusters based on their density, or their grouping around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering.
**Grid-based clustering**
- **Grid-based clustering**. For multi-dimensional datasets, a grid is created and the data is divided amongst the grid's cells, thereby creating clusters.
For multi-dimensional datasets, a grid is created and the data is divided amongst the grid's cells, thereby creating clusters.
### Preparing the data
## Exercise - cluster your data
Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which of the methods of clustering we should most effectively use for the nature of this data.
Open the notebook.ipynb file in this folder. Import the Seaborn package for good data visualization.
1. Open the _notebook.ipynb_ file in this folder.
1. Import the `Seaborn` package for good data visualization.
```python
pip install seaborn
```
Append the song data .csv file. Load up a dataframe with some data about the songs. Get ready to explore this data by importing the libraries and dumping out the data:
1. Append the song data from _nigerian-songs.csv_. Load up a dataframe with some data about the songs. Get ready to explore this data by importing the libraries and dumping out the data:
```python
import matplotlib.pyplot as plt
@ -134,13 +127,15 @@ Check the first few lines of data:
| 3 | Confident / Feeling Cool | Enjoy Your Life | Lady Donli | nigerian pop | 2019 | 175135 | 14 | 0.894 | 0.798 | 0.611 | 0.000187 | 0.0964 | -4.961 | 0.113 | 111.087 | 4 |
| 4 | wanted you | rare. | Odunsi (The Engine) | afropop | 2018 | 152049 | 25 | 0.702 | 0.116 | 0.833 | 0.91 | 0.348 | -6.044 | 0.0447 | 105.115 | 4 |
Get some information about the dataframe:
1. Get some information about the dataframe, calling `info()`:
```python
df.info()
```
```
The output looking like so:
```output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 16 columns):
@ -166,7 +161,7 @@ dtypes: float64(8), int64(4), object(4)
memory usage: 66.4+ KB
```
Double-check for null values:
1. Double-check for null values, by calling `isnull()` and verifying the sum being 0:
```python
df.isnull().sum()
@ -174,7 +169,7 @@ df.isnull().sum()
Looking good:
```
```output
name 0
album 0
artist 0
@ -194,7 +189,7 @@ time_signature 0
dtype: int64
```
Describe the data:
1. Describe the data:
```python
df.describe()
@ -215,7 +210,7 @@ df.describe()
Look at the general values of the data. Note that popularity can be '0', which show songs that have no ranking. Let's remove those shortly.
Use a barplot to find out the most popular genres:
1. Use a barplot to find out the most popular genres:
```python
import seaborn as sns
@ -226,11 +221,14 @@ sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
![most popular](./images/popular.png)
✅ If you'd like to see more top values, change the top `[:5]` to a bigger value, or remove it to see all.
Note, when the top genre is described as 'Missing', that means that Spotify did not classify it, so let's get rid of it:
Note, when the top genre is described as 'Missing', that means that Spotify did not classify it, so let's get rid of it.
1. Get rid of missing data by filtering it out
```python
df = df[df['artist_top_genre'] != 'Missing']
@ -240,11 +238,12 @@ sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
Now recheck the genres:
![most popular](images/all-genres.png)
By far, the top three genres dominate this dataset, so let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, also filtering the dataset to remove anything with a 0 popularity value (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes):
1. By far, the top three genres dominate this dataset. Let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, additionally filter the dataset to remove anything with a 0 popularity value (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes):
```python
df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
@ -256,16 +255,17 @@ plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
Do a quick test to see if the data correlates in any particularly strong way:
1. Do a quick test to see if the data correlates in any particularly strong way:
```python
corrmat = df.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
sns.heatmap(corrmat, vmax=.8, square=True)
```
![correlations](images/correlation.png)
The only strong correlation is between energy and loudness, which is not too surprising, given that loud music is usually pretty energetic. Otherwise, the correlations are relatively weak. It will be interesting to see what a clustering algorithm can make of this data.
The only strong correlation is between `energy` and `loudness`, which is not too surprising, given that loud music is usually pretty energetic. Otherwise, the correlations are relatively weak. It will be interesting to see what a clustering algorithm can make of this data.
> 🎓 Note that correlation does not imply causation! We have proof of correlation but no proof of causation. An [amusing web site](https://tylervigen.com/spurious-correlations) has some visuals that emphasize this point.
@ -273,9 +273,11 @@ Is there any convergence in this dataset around a song's perceived popularity an
✅ Try different datapoints (energy, loudness, speechiness) and more or different musical genres. What can you discover? Take a look at the `df.describe()` table to see the general spread of the data points.
### Data distribution
### Exercise - data distribution
Are these three genres significantly different in the perception of their danceability, based on their popularity? Examine our top three genres data distribution for popularity and danceability along a given x and y axis.
Are these three genres significantly different in the perception of their danceability, based on their popularity?
1. Examine our top three genres data distribution for popularity and danceability along a given x and y axis.
```python
sns.set_theme(style="ticks")
@ -295,7 +297,7 @@ In general, the three genres align loosely in terms of their popularity and danc
![distribution](images/distribution.png)
A scatterplot of the same axes shows a similar pattern of convergence:
1. Create a scatter plot:
```python
sns.FacetGrid(df, hue="artist_top_genre", size=5) \
@ -303,14 +305,18 @@ sns.FacetGrid(df, hue="artist_top_genre", size=5) \
.add_legend()
```
A scatterplot of the same axes shows a similar pattern of convergence
![Facetgrid](images/facetgrid.png)
In general, for clustering, you can use scatterplots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in this data that see to overlap in interesting ways.
---
## 🚀Challenge
In preparation for the next lesson, make a chart about the various clustering algorithms you might discover and use in a production environment. What kinds of problems is the clustering trying to address?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/28/)
## Review & Self Study

@ -1,4 +1,7 @@
# Clustering models for machine learning
Clustering is a machine learning task where it looks to find objects that resemble one another and group these into groups called clusters. What differs clustering from other approaches in machine learning, is that things happen automatically, in fact, it's fair to say it's the opposite of supervised learning.
## Regional topic: clustering models for a Nigerian audience's musical taste 🎧
Nigeria's diverse audience has diverse musical tastes. Using data scraped from Spotify (inspired by [this article](https://towardsdatascience.com/country-wise-visual-analysis-of-music-taste-using-spotify-api-seaborn-in-python-77f5b749b421), let's look at some music popular in Nigeria. This dataset includes data about various songs' 'danceability' score, 'acousticness', loudness, 'speechiness', popularity and energy. It will be interesting to discover patterns in this data!
@ -7,7 +10,6 @@ Nigeria's diverse audience has diverse musical tastes. Using data scraped from S
Photo by <a href="https://unsplash.com/@marcelalaskoski?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marcela Laskoski</a> on <a href="https://unsplash.com/s/photos/nigerian-music?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In this series of lessons, you will discover new ways to analyze data using clustering techniques. Clustering is particularly useful when your dataset lacks labels. If it does have labels, then classification techniques such as those you learned in previous lessons might be more useful. But in cases where you are looking to group unlabelled data, clustering is a great way to discover patterns.
> There are useful low-code tools that can help you learn about working with clustering models. Try [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-clustering-model-azure-machine-learning-designer/?WT.mc_id=academic-15963-cxa)

Loading…
Cancel
Save