>
> Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, or 'crowded' and thus this data needs to be analyzed with the appropriate clustering method. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) demonstrates the difference between using K-Means clustering vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster density.
## Clustering algorithms
There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones:
- **Hierarchical clustering**. If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Scikit-learn's agglomerative clustering is hierarchical.
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
- **Centroid clustering**. This popular algorithm requires the choice of 'k', or the number of clusters to form, after which the algorithm determines the center point of a cluster and gathers data around that point. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a popular version of centroid clustering. The center is determined by the nearest mean, thus the name. The squared distance from the cluster is minimized.
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
- **Distribution-based clustering**. Based in statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian mixture methods belong to this type.
- **Density-based clustering**. Data points are assigned to clusters based on their density, or their grouping around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering.

- **Grid-based clustering**. For multi-dimensional datasets, a grid is created and the data is divided amongst the grid's cells, thereby creating clusters.
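To make these families a little more concrete, here is a minimal, illustrative sketch of how several of them appear in scikit-learn. The toy dataset and parameter values are assumptions for demonstration only; they are not part of this lesson's exercise.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Toy data, just to show the API shape of each clustering family
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

hier = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # hierarchical
cent = KMeans(n_clusters=3, n_init=10).fit_predict(X)        # centroid (k-means)
dist = GaussianMixture(n_components=3).fit_predict(X)        # distribution-based
dens = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)         # density-based
```

Each call returns an array of cluster labels, one per data point; grid-based clustering is less common in scikit-learn and is omitted here.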
## Exercise - cluster your data
Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which of the methods of clustering we should most effectively use for the nature of this data.
### Preparing the data
1. Open the _notebook.ipynb_ file in this folder.
1. Import the `Seaborn` package for good data visualization.
```python
# Install Seaborn from within the notebook, if it's not already available
!pip install seaborn
```
1. Append the song data from _nigerian-songs.csv_. Load up a dataframe with some data about the songs. Get ready to explore this data by importing the libraries and dumping out the data:
```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("../data/nigerian-songs.csv")
df.head()
```
Check the first few lines of data:
| | name | album | artist | artist_top_genre | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
> 🤔 If we are working with clustering, an unsupervised method that does not require labeled data, why are we showing this data with labels? In the data exploration phase, they come in handy, but they are not necessary for the clustering algorithms to work. You could just as well remove the column headers and refer to the data by column number.
Look at the general values of the data. Note that popularity can be '0', which shows songs that have no ranking. Let's remove those shortly.
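As a quick, illustrative check (this snippet is an assumption, not part of the lesson's notebook), you could count how many songs fall into that unranked group:

```python
# Count songs with a popularity of 0, i.e. no ranking
print((df['popularity'] == 0).sum())
```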
1. Use a barplot to find out the most popular genres:
```python
import seaborn as sns

top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
![most popular](./images/popular.png)
✅ If you'd like to see more top values, change the top `[:5]` to a bigger value, or remove it to see all.
Note, when the top genre is described as 'Missing', that means that Spotify did not classify it, so let's get rid of it.
1. Get rid of missing data by filtering it out:
```python
df = df[df['artist_top_genre'] != 'Missing']
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
```
Now recheck the genres:
![most popular](images/all-genres.png)
1. By far, the top three genres dominate this dataset. Let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, additionally filtering the dataset to remove anything with a 0 popularity value (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes):
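The exact filtering code isn't reproduced in this section, so the snippet below is a minimal sketch under the assumption that the column names match those used above:

```python
# Keep only the three dominant genres and drop songs with no popularity ranking
df = df[df['artist_top_genre'].isin(['afro dancehall', 'afropop', 'nigerian pop'])]
df = df[df['popularity'] > 0]

# Re-plot the genre counts for the filtered data
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index, y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres', color='blue')
```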
1. Do a quick test to see if the data correlates in any particularly strong way:
```python
corrmat = df.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
```

![correlations](images/correlation.png)

The only strong correlation is between `energy` and `loudness`, which is not too surprising, given that loud music is usually pretty energetic. Otherwise, the correlations are relatively weak. It will be interesting to see what a clustering algorithm can make of this data.

> 🎓 Note that correlation does not imply causation! We have proof of correlation but no proof of causation. An [amusing web site](https://tylervigen.com/spurious-correlations) has some visuals that emphasize this point.
Is there any convergence in this dataset around a song's perceived popularity and danceability? A FacetGrid shows that there are concentric circles that line up, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre?
✅ Try different datapoints (energy, loudness, speechiness) and more or different musical genres. What can you discover? Take a look at the `df.describe()` table to see the general spread of the data points.
### Data distribution
Are these three genres significantly different in the perception of their danceability, based on their popularity?

1. Examine our top three genres' data distribution for popularity and danceability along a given x and y axis.
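One way to sketch this is a Seaborn KDE joint plot of `popularity` against `danceability`, colored by genre. Treat this as a minimal, assumed example rather than the exact call used to produce the figures below:

```python
# KDE joint plot of popularity vs. danceability, colored by top genre
sns.set_theme(style="ticks")

sns.jointplot(
    data=df,
    x="popularity", y="danceability", hue="artist_top_genre",
    kind="kde",
)
```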
You can discover concentric circles around a general point of convergence, showing the distribution of points.
> 🎓 Note that this example uses a KDE (Kernel Density Estimate) graph that represents the data using a continuous probability density curve. This allows us to interpret data when working with multiple distributions.
In general, the three genres align loosely in terms of their popularity and danceability. Determining clusters in this loosely-aligned data will be a challenge:
![distribution](images/distribution.png)
A scatterplot of the same axes shows a similar pattern of convergence:
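A minimal way to draw it, assuming the same dataframe and Seaborn import as above (the exact call behind the figure isn't shown in this section):

```python
# Scatterplot of popularity vs. danceability, colored by top genre
sns.relplot(
    data=df,
    x="popularity", y="danceability", hue="artist_top_genre",
    kind="scatter", height=5,
)
```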
![Facetgrid](images/facetgrid.png)
In general, for clustering, you can use scatterplots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in this data that seem to overlap in interesting ways.
---
## 🚀Challenge
In preparation for the next lesson, make a chart about the various clustering algorithms you might discover and use in a production environment. What kinds of problems is the clustering trying to address?
Clustering is a machine learning task that seeks out objects that resemble one another and groups them into sets called clusters. What distinguishes clustering from other approaches in machine learning is that it happens automatically; in fact, it's fair to say it's the opposite of supervised learning.
## Regional topic: clustering models for a Nigerian audience's musical taste 🎧
Nigeria's diverse audience has diverse musical tastes. Using data scraped from Spotify (inspired by [this article](https://towardsdatascience.com/country-wise-visual-analysis-of-music-taste-using-spotify-api-seaborn-in-python-77f5b749b421)), let's look at some music popular in Nigeria. This dataset includes data about various songs' 'danceability' score, 'acousticness', 'loudness', 'speechiness', 'popularity' and 'energy'. It will be interesting to discover patterns in this data!
Photo by <a href="https://unsplash.com/@marcelalaskoski?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marcela Laskoski</a> on <a href="https://unsplash.com/s/photos/nigerian-music?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
In this series of lessons, you will discover new ways to analyze data using clustering techniques. Clustering is particularly useful when your dataset lacks labels. If it does have labels, then classification techniques such as those you learned in previous lessons might be more useful. But in cases where you are looking to group unlabelled data, clustering is a great way to discover patterns.
> There are useful low-code tools that can help you learn about working with clustering models. Try [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-clustering-model-azure-machine-learning-designer/?WT.mc_id=academic-15963-cxa)