pull/45/head
softchris 3 years ago
parent 3fde0654bc
commit 7de1a1fc0b

@ -75,242 +75,248 @@ Deepen your understanding of clustering techniques in this [Learn module](https:
> >
> Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, or 'crowded' and thus this data needs to be analyzed with the appropriate clustering method. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) demonstrates the difference between using K-Means clustering vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster density. > Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, or 'crowded' and thus this data needs to be analyzed with the appropriate clustering method. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) demonstrates the difference between using K-Means clustering vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster density.
### Clustering algorithms ## Clustering algorithms
There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones: There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones:
**Hierarchical clustering** - **Hierarchical clustering**. If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Scikit-learn's agglomerative clustering is hierarchical.
If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Scikit-learn's agglomerative clustering is hierarchical. ![Hierarchical clustering Infographic](./images/hierarchical.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
![Hierarchical clustering Infographic](./images/hierarchical.png) - **Centroid clustering**. This popular algorithm requires the choice of 'k', or the number of clusters to form, after which the algorithm determines the center point of a cluster and gathers data around that point. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a popular version of centroid clustering. The center is determined by the nearest mean, thus the name. The squared distance from the cluster is minimized.
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
**Centroid clustering**
This popular algorithm requires the choice of 'k', or the number of clusters to form, after which the algorithm determines the center point of a cluster and gathers data around that point. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a popular version of centroid clustering. The center is determined by the nearest mean, thus the name. The squared distance from the cluster is minimized.
![Centroid clustering Infographic](./images/centroid.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
**Distribution-based clustering**
Based in statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian mixture methods belong to this type. ![Centroid clustering Infographic](./images/centroid.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
**Density-based clustering** - **Distribution-based clustering**. Based in statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian mixture methods belong to this type.
Data points are assigned to clusters based on their density, or their grouping around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering. - **Density-based clustering**. Data points are assigned to clusters based on their density, or their grouping around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering.
**Grid-based clustering** - **Grid-based clustering**. For multi-dimensional datasets, a grid is created and the data is divided amongst the grid's cells, thereby creating clusters.
For multi-dimensional datasets, a grid is created and the data is divided amongst the grid's cells, thereby creating clusters. ## Exercise - cluster your data
### Preparing the data
Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which of the methods of clustering we should most effectively use for the nature of this data. Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which of the methods of clustering we should most effectively use for the nature of this data.
Open the notebook.ipynb file in this folder. Import the Seaborn package for good data visualization. 1. Open the _notebook.ipynb_ file in this folder.
```python 1. Import the `Seaborn` package for good data visualization.
pip install seaborn
``` ```python
pip install seaborn
Append the song data .csv file. Load up a dataframe with some data about the songs. Get ready to explore this data by importing the libraries and dumping out the data: ```
```python 1. Append the song data from _nigerian-songs.csv_. Load up a dataframe with some data about the songs. Get ready to explore this data by importing the libraries and dumping out the data:
import matplotlib.pyplot as plt
import pandas as pd ```python
import matplotlib.pyplot as plt
df = pd.read_csv("../data/nigerian-songs.csv") import pandas as pd
df.head()
``` df = pd.read_csv("../data/nigerian-songs.csv")
df.head()
Check the first few lines of data: ```
| | name | album | artist | artist_top_genre | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature | Check the first few lines of data:
| --- | ------------------------ | ---------------------------- | ------------------- | ---------------- | ------------ | ------ | ---------- | ------------ | ------------ | ------ | ---------------- | -------- | -------- | ----------- | ------- | -------------- |
| 0 | Sparky | Mandy & The Jungle | Cruel Santino | alternative r&b | 2019 | 144000 | 48 | 0.666 | 0.851 | 0.42 | 0.534 | 0.11 | -6.699 | 0.0829 | 133.015 | 5 | | | name | album | artist | artist_top_genre | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
| 1 | shuga rush | EVERYTHING YOU HEARD IS TRUE | Odunsi (The Engine) | afropop | 2020 | 89488 | 30 | 0.71 | 0.0822 | 0.683 | 0.000169 | 0.101 | -5.64 | 0.36 | 129.993 | 3 | | --- | ------------------------ | ---------------------------- | ------------------- | ---------------- | ------------ | ------ | ---------- | ------------ | ------------ | ------ | ---------------- | -------- | -------- | ----------- | ------- | -------------- |
| 2 | LITT! | LITT! | AYLØ | indie r&b | 2018 | 207758 | 40 | 0.836 | 0.272 | 0.564 | 0.000537 | 0.11 | -7.127 | 0.0424 | 130.005 | 4 | | 0 | Sparky | Mandy & The Jungle | Cruel Santino | alternative r&b | 2019 | 144000 | 48 | 0.666 | 0.851 | 0.42 | 0.534 | 0.11 | -6.699 | 0.0829 | 133.015 | 5 |
| 3 | Confident / Feeling Cool | Enjoy Your Life | Lady Donli | nigerian pop | 2019 | 175135 | 14 | 0.894 | 0.798 | 0.611 | 0.000187 | 0.0964 | -4.961 | 0.113 | 111.087 | 4 | | 1 | shuga rush | EVERYTHING YOU HEARD IS TRUE | Odunsi (The Engine) | afropop | 2020 | 89488 | 30 | 0.71 | 0.0822 | 0.683 | 0.000169 | 0.101 | -5.64 | 0.36 | 129.993 | 3 |
| 4 | wanted you | rare. | Odunsi (The Engine) | afropop | 2018 | 152049 | 25 | 0.702 | 0.116 | 0.833 | 0.91 | 0.348 | -6.044 | 0.0447 | 105.115 | 4 | | 2 | LITT! | LITT! | AYLØ | indie r&b | 2018 | 207758 | 40 | 0.836 | 0.272 | 0.564 | 0.000537 | 0.11 | -7.127 | 0.0424 | 130.005 | 4 |
| 3 | Confident / Feeling Cool | Enjoy Your Life | Lady Donli | nigerian pop | 2019 | 175135 | 14 | 0.894 | 0.798 | 0.611 | 0.000187 | 0.0964 | -4.961 | 0.113 | 111.087 | 4 |
Get some information about the dataframe: | 4 | wanted you | rare. | Odunsi (The Engine) | afropop | 2018 | 152049 | 25 | 0.702 | 0.116 | 0.833 | 0.91 | 0.348 | -6.044 | 0.0447 | 105.115 | 4 |
```python 1. Get some information about the dataframe, calling `info()`:
df.info()
``` ```python
df.info()
``` ```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529 The output looking like so:
Data columns (total 16 columns):
# Column Non-Null Count Dtype ```output
--- ------ -------------- ----- <class 'pandas.core.frame.DataFrame'>
0 name 530 non-null object RangeIndex: 530 entries, 0 to 529
1 album 530 non-null object Data columns (total 16 columns):
2 artist 530 non-null object # Column Non-Null Count Dtype
3 artist_top_genre 530 non-null object --- ------ -------------- -----
4 release_date 530 non-null int64 0 name 530 non-null object
5 length 530 non-null int64 1 album 530 non-null object
6 popularity 530 non-null int64 2 artist 530 non-null object
7 danceability 530 non-null float64 3 artist_top_genre 530 non-null object
8 acousticness 530 non-null float64 4 release_date 530 non-null int64
9 energy 530 non-null float64 5 length 530 non-null int64
10 instrumentalness 530 non-null float64 6 popularity 530 non-null int64
11 liveness 530 non-null float64 7 danceability 530 non-null float64
12 loudness 530 non-null float64 8 acousticness 530 non-null float64
13 speechiness 530 non-null float64 9 energy 530 non-null float64
14 tempo 530 non-null float64 10 instrumentalness 530 non-null float64
15 time_signature 530 non-null int64 11 liveness 530 non-null float64
dtypes: float64(8), int64(4), object(4) 12 loudness 530 non-null float64
memory usage: 66.4+ KB 13 speechiness 530 non-null float64
``` 14 tempo 530 non-null float64
15 time_signature 530 non-null int64
Double-check for null values: dtypes: float64(8), int64(4), object(4)
memory usage: 66.4+ KB
```python ```
df.isnull().sum()
``` 1. Double-check for null values, by calling `isnull()` and verifying the sum being 0:
Looking good: ```python
df.isnull().sum()
``` ```
name 0
album 0 Looking good:
artist 0
artist_top_genre 0 ```output
release_date 0 name 0
length 0 album 0
popularity 0 artist 0
danceability 0 artist_top_genre 0
acousticness 0 release_date 0
energy 0 length 0
instrumentalness 0 popularity 0
liveness 0 danceability 0
loudness 0 acousticness 0
speechiness 0 energy 0
tempo 0 instrumentalness 0
time_signature 0 liveness 0
dtype: int64 loudness 0
``` speechiness 0
tempo 0
Describe the data: time_signature 0
dtype: int64
```python ```
df.describe()
``` 1. Describe the data:
| | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature | ```python
| ----- | ------------ | ----------- | ---------- | ------------ | ------------ | -------- | ---------------- | -------- | --------- | ----------- | ---------- | -------------- | df.describe()
| count | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | ```
| mean | 2015.390566 | 222298.1698 | 17.507547 | 0.741619 | 0.265412 | 0.760623 | 0.016305 | 0.147308 | -4.953011 | 0.130748 | 116.487864 | 3.986792 |
| std | 3.131688 | 39696.82226 | 18.992212 | 0.117522 | 0.208342 | 0.148533 | 0.090321 | 0.123588 | 2.464186 | 0.092939 | 23.518601 | 0.333701 | | | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
| min | 1998 | 89488 | 0 | 0.255 | 0.000665 | 0.111 | 0 | 0.0283 | -19.362 | 0.0278 | 61.695 | 3 | | ----- | ------------ | ----------- | ---------- | ------------ | ------------ | -------- | ---------------- | -------- | --------- | ----------- | ---------- | -------------- |
| 25% | 2014 | 199305 | 0 | 0.681 | 0.089525 | 0.669 | 0 | 0.07565 | -6.29875 | 0.0591 | 102.96125 | 4 | | count | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 530 |
| 50% | 2016 | 218509 | 13 | 0.761 | 0.2205 | 0.7845 | 0.000004 | 0.1035 | -4.5585 | 0.09795 | 112.7145 | 4 | | mean | 2015.390566 | 222298.1698 | 17.507547 | 0.741619 | 0.265412 | 0.760623 | 0.016305 | 0.147308 | -4.953011 | 0.130748 | 116.487864 | 3.986792 |
| 75% | 2017 | 242098.5 | 31 | 0.8295 | 0.403 | 0.87575 | 0.000234 | 0.164 | -3.331 | 0.177 | 125.03925 | 4 | | std | 3.131688 | 39696.82226 | 18.992212 | 0.117522 | 0.208342 | 0.148533 | 0.090321 | 0.123588 | 2.464186 | 0.092939 | 23.518601 | 0.333701 |
| max | 2020 | 511738 | 73 | 0.966 | 0.954 | 0.995 | 0.91 | 0.811 | 0.582 | 0.514 | 206.007 | 5 | | min | 1998 | 89488 | 0 | 0.255 | 0.000665 | 0.111 | 0 | 0.0283 | -19.362 | 0.0278 | 61.695 | 3 |
| 25% | 2014 | 199305 | 0 | 0.681 | 0.089525 | 0.669 | 0 | 0.07565 | -6.29875 | 0.0591 | 102.96125 | 4 |
| 50% | 2016 | 218509 | 13 | 0.761 | 0.2205 | 0.7845 | 0.000004 | 0.1035 | -4.5585 | 0.09795 | 112.7145 | 4 |
| 75% | 2017 | 242098.5 | 31 | 0.8295 | 0.403 | 0.87575 | 0.000234 | 0.164 | -3.331 | 0.177 | 125.03925 | 4 |
| max | 2020 | 511738 | 73 | 0.966 | 0.954 | 0.995 | 0.91 | 0.811 | 0.582 | 0.514 | 206.007 | 5 |
> 🤔 If we are working with clustering, an unsupervised method that does not require labeled data, why are we showing this data with labels? In the data exploration phase, they come in handy, but they are not necessary for the clustering algorithms to work. You could just as well remove the column headers and refer to the data by column number. > 🤔 If we are working with clustering, an unsupervised method that does not require labeled data, why are we showing this data with labels? In the data exploration phase, they come in handy, but they are not necessary for the clustering algorithms to work. You could just as well remove the column headers and refer to the data by column number.
Look at the general values of the data. Note that popularity can be '0', which show songs that have no ranking. Let's remove those shortly. Look at the general values of the data. Note that popularity can be '0', which show songs that have no ranking. Let's remove those shortly.
Use a barplot to find out the most popular genres: 1. Use a barplot to find out the most popular genres:
```python ```python
import seaborn as sns import seaborn as sns
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
top = df['artist_top_genre'].value_counts() ![most popular](./images/popular.png)
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
![most popular](./images/popular.png)
✅ If you'd like to see more top values, change the top `[:5]` to a bigger value, or remove it to see all. ✅ If you'd like to see more top values, change the top `[:5]` to a bigger value, or remove it to see all.
Note, when the top genre is described as 'Missing', that means that Spotify did not classify it, so let's get rid of it: Note, when the top genre is described as 'Missing', that means that Spotify did not classify it, so let's get rid of it.
```python 1. Get rid of missing data by filtering it out
df = df[df['artist_top_genre'] != 'Missing']
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
Now recheck the genres:
![most popular](images/all-genres.png) ```python
df = df[df['artist_top_genre'] != 'Missing']
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
By far, the top three genres dominate this dataset, so let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, also filtering the dataset to remove anything with a 0 popularity value (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes): Now recheck the genres:
```python ![most popular](images/all-genres.png)
df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
df = df[(df['popularity'] > 0)]
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
Do a quick test to see if the data correlates in any particularly strong way: 1. By far, the top three genres dominate this dataset. Let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, additionally filter the dataset to remove anything with a 0 popularity value (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes):
```python ```python
corrmat = df.corr() df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
f, ax = plt.subplots(figsize=(12, 9)) df = df[(df['popularity'] > 0)]
sns.heatmap(corrmat, vmax=.8, square=True); top = df['artist_top_genre'].value_counts()
``` plt.figure(figsize=(10,7))
![correlations](images/correlation.png) sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
The only strong correlation is between energy and loudness, which is not too surprising, given that loud music is usually pretty energetic. Otherwise, the correlations are relatively weak. It will be interesting to see what a clustering algorithm can make of this data. 1. Do a quick test to see if the data correlates in any particularly strong way:
> 🎓 Note that correlation does not imply causation! We have proof of correlation but no proof of causation. An [amusing web site](https://tylervigen.com/spurious-correlations) has some visuals that emphasize this point. ```python
corrmat = df.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
```
![correlations](images/correlation.png)
The only strong correlation is between `energy` and `loudness`, which is not too surprising, given that loud music is usually pretty energetic. Otherwise, the correlations are relatively weak. It will be interesting to see what a clustering algorithm can make of this data.
> 🎓 Note that correlation does not imply causation! We have proof of correlation but no proof of causation. An [amusing web site](https://tylervigen.com/spurious-correlations) has some visuals that emphasize this point.
Is there any convergence in this dataset around a song's perceived popularity and danceability? A FacetGrid shows that there are concentric circles that line up, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre? Is there any convergence in this dataset around a song's perceived popularity and danceability? A FacetGrid shows that there are concentric circles that line up, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre?
✅ Try different datapoints (energy, loudness, speechiness) and more or different musical genres. What can you discover? Take a look at the `df.describe()` table to see the general spread of the data points. ✅ Try different datapoints (energy, loudness, speechiness) and more or different musical genres. What can you discover? Take a look at the `df.describe()` table to see the general spread of the data points.
### Data distribution ### Exercise - data distribution
Are these three genres significantly different in the perception of their danceability, based on their popularity? Examine our top three genres data distribution for popularity and danceability along a given x and y axis. Are these three genres significantly different in the perception of their danceability, based on their popularity?
```python 1. Examine our top three genres data distribution for popularity and danceability along a given x and y axis.
sns.set_theme(style="ticks")
g = sns.jointplot( ```python
data=df, sns.set_theme(style="ticks")
x="popularity", y="danceability", hue="artist_top_genre",
kind="kde", g = sns.jointplot(
) data=df,
``` x="popularity", y="danceability", hue="artist_top_genre",
kind="kde",
)
```
You can discover concentric circles around a general point of convergence, showing the distribution of points. You can discover concentric circles around a general point of convergence, showing the distribution of points.
> 🎓 Note that this example uses a KDE (Kernel Density Estimate) graph that represents the data using a continuous probability density curve. This allows us to interpret data when working with multiple distributions. > 🎓 Note that this example uses a KDE (Kernel Density Estimate) graph that represents the data using a continuous probability density curve. This allows us to interpret data when working with multiple distributions.
In general, the three genres align loosely in terms of their popularity and danceability. Determining clusters in this loosely-aligned data will be a challenge: In general, the three genres align loosely in terms of their popularity and danceability. Determining clusters in this loosely-aligned data will be a challenge:
![distribution](images/distribution.png) ![distribution](images/distribution.png)
A scatterplot of the same axes shows a similar pattern of convergence: 1. Create a scatter plot:
```python ```python
sns.FacetGrid(df, hue="artist_top_genre", size=5) \ sns.FacetGrid(df, hue="artist_top_genre", size=5) \
.map(plt.scatter, "popularity", "danceability") \ .map(plt.scatter, "popularity", "danceability") \
.add_legend() .add_legend()
``` ```
![Facetgrid](images/facetgrid.png) A scatterplot of the same axes shows a similar pattern of convergence
![Facetgrid](images/facetgrid.png)
In general, for clustering, you can use scatterplots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in this data that see to overlap in interesting ways. In general, for clustering, you can use scatterplots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in this data that see to overlap in interesting ways.
--- ---
## 🚀Challenge ## 🚀Challenge
In preparation for the next lesson, make a chart about the various clustering algorithms you might discover and use in a production environment. What kinds of problems is the clustering trying to address? In preparation for the next lesson, make a chart about the various clustering algorithms you might discover and use in a production environment. What kinds of problems is the clustering trying to address?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/28/) ## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/28/)
## Review & Self Study ## Review & Self Study
@ -319,6 +325,6 @@ Before you apply clustering algorithms, as we have learned, it's a good idea to
[This helpful article](https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/) walks you through the different ways that various clustering algorithms behave, given different data shapes. [This helpful article](https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/) walks you through the different ways that various clustering algorithms behave, given different data shapes.
## Assignment ## Assignment
[Research other visualizations for clustering](assignment.md) [Research other visualizations for clustering](assignment.md)

@ -1,4 +1,7 @@
# Clustering models for machine learning # Clustering models for machine learning
Clustering is a machine learning task where it looks to find objects that resemble one another and group these into groups called clusters. What differs clustering from other approaches in machine learning, is that things happen automatically, in fact, it's fair to say it's the opposite of supervised learning.
## Regional topic: clustering models for a Nigerian audience's musical taste 🎧 ## Regional topic: clustering models for a Nigerian audience's musical taste 🎧
Nigeria's diverse audience has diverse musical tastes. Using data scraped from Spotify (inspired by [this article](https://towardsdatascience.com/country-wise-visual-analysis-of-music-taste-using-spotify-api-seaborn-in-python-77f5b749b421), let's look at some music popular in Nigeria. This dataset includes data about various songs' 'danceability' score, 'acousticness', loudness, 'speechiness', popularity and energy. It will be interesting to discover patterns in this data! Nigeria's diverse audience has diverse musical tastes. Using data scraped from Spotify (inspired by [this article](https://towardsdatascience.com/country-wise-visual-analysis-of-music-taste-using-spotify-api-seaborn-in-python-77f5b749b421), let's look at some music popular in Nigeria. This dataset includes data about various songs' 'danceability' score, 'acousticness', loudness, 'speechiness', popularity and energy. It will be interesting to discover patterns in this data!
@ -7,7 +10,6 @@ Nigeria's diverse audience has diverse musical tastes. Using data scraped from S
Photo by <a href="https://unsplash.com/@marcelalaskoski?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marcela Laskoski</a> on <a href="https://unsplash.com/s/photos/nigerian-music?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> Photo by <a href="https://unsplash.com/@marcelalaskoski?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marcela Laskoski</a> on <a href="https://unsplash.com/s/photos/nigerian-music?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In this series of lessons, you will discover new ways to analyze data using clustering techniques. Clustering is particularly useful when your dataset lacks labels. If it does have labels, then classification techniques such as those you learned in previous lessons might be more useful. But in cases where you are looking to group unlabelled data, clustering is a great way to discover patterns. In this series of lessons, you will discover new ways to analyze data using clustering techniques. Clustering is particularly useful when your dataset lacks labels. If it does have labels, then classification techniques such as those you learned in previous lessons might be more useful. But in cases where you are looking to group unlabelled data, clustering is a great way to discover patterns.
> There are useful low-code tools that can help you learn about working with clustering models. Try [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-clustering-model-azure-machine-learning-designer/?WT.mc_id=academic-15963-cxa) > There are useful low-code tools that can help you learn about working with clustering models. Try [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-clustering-model-azure-machine-learning-designer/?WT.mc_id=academic-15963-cxa)

Loading…
Cancel
Save