{ "cells": [ { "cell_type": "markdown", "source": [ "## **Nigerian Music scraped from Spotify - an analysis**\n", "\n", "Clustering is a type of [Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning) that assumes a dataset is unlabeled, that is, its inputs are not paired with predefined outputs. It uses various algorithms to sift through unlabeled data and create groupings based on patterns it identifies in the data.\n", "\n", "[**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/27/)\n", "\n", "### **Introduction**\n", "\n", "[Clustering](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_124) is highly useful for exploring data. Let's see if it can help uncover trends and patterns in how Nigerian audiences consume music.\n", "\n", "> Take a moment to think about the applications of clustering. In everyday life, clustering happens when you sort a pile of laundry into family members' clothes. In data science, clustering happens when you analyze user preferences or identify characteristics in an unlabeled dataset. Clustering, in a way, helps bring order to chaos, like organizing a sock drawer.\n", "\n", "In a professional context, clustering can be used for tasks like market segmentation, such as identifying which age groups purchase specific items. Another application is anomaly detection, for instance, identifying fraud in a dataset of credit card transactions. It can also be used to detect tumors in medical scans.\n", "\n", "Take a moment to think about how you might have encountered clustering in real-world scenarios, such as in banking, e-commerce, or business.\n", "\n", "> Interestingly, cluster analysis originated in the fields of Anthropology and Psychology in the 1930s. Can you imagine how it might have been applied back then?\n", "\n", "Alternatively, clustering can be used to group search results, for example, by shopping links, images, or reviews. 
It is particularly useful for large datasets that need to be reduced for more detailed analysis, making it a valuable technique for understanding data before building other models.\n", "\n", "Once your data is organized into clusters, you assign it a cluster ID. This approach can be helpful for preserving a dataset's privacy, as you can refer to a data point by its cluster ID rather than by more identifiable information. Can you think of other reasons why you might use a cluster ID instead of other elements of the cluster for identification?\n", "\n", "### Getting started with clustering\n", "\n", "> How we create clusters depends heavily on how we group data points. Let's break down some key terms:\n", ">\n", "> ['Transductive' vs. 'inductive'](https://wikipedia.org/wiki/Transduction_(machine_learning))\n", ">\n", "> Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules, which are then applied to test cases.\n", ">\n", "> Example: Imagine you have a dataset that is only partially labeled. Some items are 'records,' some are 'CDs,' and others are blank. Your task is to label the blanks. If you use an inductive approach, you'd train a model to identify 'records' and 'CDs' and apply those labels to the unlabeled data. This approach might struggle to classify items that are actually 'cassettes.' A transductive approach, however, handles unknown data more effectively by grouping similar items together and then applying a label to the group. In this case, clusters might represent 'round musical items' and 'square musical items.'\n", ">\n", "> ['Non-flat' vs. 'flat' geometry](https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering)\n", ">\n", "> Derived from mathematical terminology, non-flat vs. 
flat geometry refers to how distances between points are measured, either using 'flat' ([Euclidean](https://wikipedia.org/wiki/Euclidean_geometry)) or 'non-flat' (non-Euclidean) methods.\n", ">\n", "> 'Flat' refers to Euclidean geometry (often taught as 'plane' geometry), while 'non-flat' refers to non-Euclidean geometry. What does geometry have to do with machine learning? Since both fields are rooted in mathematics, there must be a common way to measure distances between points in clusters. This can be done in a 'flat' or 'non-flat' manner, depending on the nature of the data. [Euclidean distances](https://wikipedia.org/wiki/Euclidean_distance) are measured as the length of a straight line between two points. [Non-Euclidean distances](https://wikipedia.org/wiki/Non-Euclidean_geometry) are measured along a curve. If your data, when visualized, doesn't exist on a plane, you may need a specialized algorithm to handle it.\n", "\n", "
\n",
" \n",
"
\n",
" \n",
"
\n",
" \n",
"