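{ "cells": [ { "cell_type": "markdown", "source": [ "## **Nigerian music scraped from Spotify: an analysis**\n", "\n", "Clustering is a type of [unsupervised learning](https://wikipedia.org/wiki/Unsupervised_learning) that presumes a dataset is unlabelled or that its inputs are not matched with predefined outputs. It uses various algorithms to sort through unlabelled data and provide groupings according to the patterns it discerns in the data.\n", "\n", "[**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/27/)\n", "\n", "### **Introduction**\n", "\n", "[Clustering](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_124) is very useful for data exploration. Let's see if it can help discover trends and patterns in the way Nigerian audiences consume music.\n", "\n", "> ✅ Take a minute to think about the uses of clustering. In real life, clustering happens whenever you have a pile of laundry and need to sort out your family members' clothes 🧦👕👖🩲. In data science, clustering happens when analyzing a user's preferences or determining the characteristics of any unlabelled dataset. Clustering, in a way, helps make sense of chaos, like tidying a sock drawer.\n", "\n", "In a professional setting, clustering can be used for market segmentation, for example determining which age groups buy which items. Another use is anomaly detection, such as detecting fraud in a dataset of credit card transactions. Or you might use clustering to identify tumors in a batch of medical scans.\n", "\n", "✅ Take a minute to think about where you might have encountered clustering in a banking, e-commerce, or business setting.\n", "\n", "> 🎓 Interestingly, cluster analysis originated in the fields of anthropology and psychology in the 1930s. Can you imagine how it might have been used?\n", "\n", "Alternatively, you could use it to group search results, for example by shopping links, images, or reviews. Clustering is useful when you have a large dataset that you want to reduce and analyze at a finer level, so it can help you learn about the data before other models are built.\n", "\n", "✅ Once your data is organized in clusters, you can assign it a cluster ID. This technique can be useful for preserving a dataset's privacy; you can refer to a data point by its cluster ID rather than by more revealing, identifiable data. Can you think of other reasons to refer to a data point by a cluster ID rather than by other elements of the cluster?\n", "\n", "### Getting started with clustering\n", "\n", "> 🎓 How we create clusters has a lot to do with how we gather the data points into groups. Let's unpack some vocabulary:\n", ">\n", "> 🎓 ['Transductive' vs. 'inductive'](https://wikipedia.org/wiki/Transduction_(machine_learning))\n", ">\n", "> Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules, which are only then applied to test cases.\n", ">\n", "> An example: imagine you have a dataset that is only partially labelled. Some items are 'records', some are 'CDs', and some are blank. Your job is to provide labels for the blanks. If you choose an inductive approach, you would train a model looking for 'records' and 'CDs', and apply those labels to your unlabelled data. This approach may have trouble classifying things that are actually 'cassettes'. A transductive approach, on the other hand, handles this unknown data more effectively, because it groups similar items together and then applies a label to the whole group. In this case, clusters might reflect 'round musical things' and 'square musical things'.\n", ">\n", "> 🎓 ['Non-flat' vs. 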
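'flat' geometry](https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering)\n", ">\n", "> Derived from mathematical terminology, non-flat vs. flat geometry refers to measuring the distance between points by either 'flat' ([Euclidean](https://wikipedia.org/wiki/Euclidean_geometry)) or 'non-flat' (non-Euclidean) geometrical methods.\n", ">\n", "> 'Flat' in this context refers to Euclidean geometry (parts of which are known as 'plane' geometry), and 'non-flat' refers to non-Euclidean geometry. What does geometry have to do with machine learning? Both fields are rooted in mathematics, so there must be a common way to measure distances between points in clusters, and that can be done in a 'flat' or 'non-flat' way, depending on the nature of the data. [Euclidean distances](https://wikipedia.org/wiki/Euclidean_distance) are measured as the length of a line segment between two points. [Non-Euclidean distances](https://wikipedia.org/wiki/Non-Euclidean_geometry) are measured along a curve. If your data, when visualized, does not seem to lie on a plane, you might need a specialized algorithm to handle it.\n", "\n", "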
Infographic by Dasani Madipalli
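\n", "\n", "> 🎓 ['Distances'](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)\n", ">\n", "> Clusters are defined by their distance matrix, e.g. the distances between points. This distance can be measured in a few ways. Euclidean clusters are defined by the average of the point values, and contain a 'centroid' or center point. Distances are thus measured by the distance to that centroid. Non-Euclidean distances refer to 'clustroids', the point closest to other points. Clustroids in turn can be defined in various ways.\n", ">\n", "> 🎓 ['Constrained'](https://wikipedia.org/wiki/Constrained_clustering)\n", ">\n", "> [Constrained clustering](https://web.cs.ucdavis.edu/~davidson/Publications/ICDMTutorial.pdf) introduces 'semi-supervised' learning into this unsupervised method. The relationships between points are flagged as 'cannot link' or 'must link', so some rules are imposed on the dataset.\n", ">\n", "> An example: if an algorithm is set free on a batch of unlabelled or semi-labelled data, the clusters it produces may be of poor quality. In the example above, the clusters might group 'round musical things', 'square musical things', 'triangular things', and 'cookies'. Given some constraints or rules to follow ('the item must be made of plastic', 'the item needs to be able to produce music'), the algorithm can be 'constrained' to make better choices.\n", ">\n", "> 🎓 'Density'\n", ">\n", "> Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, so this data needs to be analyzed with an appropriate clustering method. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) demonstrates the difference between using K-Means clustering and HDBSCAN algorithms to explore a noisy dataset with uneven cluster density.\n", "\n", "Deepen your understanding of clustering techniques in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-cluster-models?WT.mc_id=academic-77952-leestott).\n", "\n", "### **Clustering algorithms**\n", "\n", "There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones:\n", "\n", "- **Hierarchical clustering**. If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Hierarchical clustering is characterized by repeatedly combining two clusters.\n", "\n", "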
Infographic by Dasani Madipalli
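\n", "\n", "- **Centroid clustering**. This popular algorithm requires choosing 'k', the number of clusters to form, after which the algorithm determines the center point of each cluster and gathers data around that point. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a popular version of centroid clustering that separates a dataset into K predefined groups. The center is determined by the nearest mean, hence the name. The squared distance from the cluster is minimized.\n", "\n", "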
Infographic by Dasani Madipalli
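\n", "\n", "- **Distribution-based clustering**. Based on statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian mixture methods belong to this type.\n", "\n", "- **Density-based clustering**. Data points are assigned to clusters based on their density, or their grouping around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering.\n", "\n", "- **Grid-based clustering**. For multi-dimensional datasets, a grid is created and the data is divided among the grid's cells, thereby creating clusters.\n", "\n", "The best way to learn about clustering is to try it for yourself, so that's what you'll do in this exercise.\n", "\n", "We'll need some packages to complete this module. You can install them with: `install.packages(c('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork'))`\n", "\n", "Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n", "\r\n", "pacman::p_load('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork')\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Exercise - cluster your data\n", "\n", "Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which clustering method suits the nature of this data best.\n", "\n", "Let's hit the ground running by importing the data.\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Load the core tidyverse and make it available in your current R session\r\n", "library(tidyverse)\r\n", "\r\n", "# Import the data into a tibble\r\n", "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv\")\r\n", "\r\n", "# View the first 5 rows of the data set\r\n", "df %>% \r\n", " slice_head(n = 5)\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Sometimes we may want a little more information about our data. We can look at the `data` and `its structure` by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function:\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Glimpse into the data set\r\n", "df %>% \r\n", " glimpse()\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Good job! 💪\n", "\n", "We can see that `glimpse()` gives the total number of rows (observations) and columns (variables), followed by the first few entries of each variable in a row after the variable name. In addition, the *data type* of each variable is given immediately after its name, inside `< >`.\n", "\n", "`DataExplorer::introduce()` can summarize this information neatly:\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Describe basic information for our data\r\n", "df %>% \r\n", " introduce()\r\n", "\r\n", "# A visual display of the same\r\n", "df %>% \r\n", " plot_intro()\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Awesome! We have just learned that our data has no missing values.\n", "\n", "While we're at it, we can use `summarytools::descr()` to explore common central tendency statistics (e.g. [mean](https://en.wikipedia.org/wiki/Arithmetic_mean) and [median](https://en.wikipedia.org/wiki/Median)) and measures of dispersion (e.g. [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)).\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Describe common statistics\r\n", "df %>% \r\n", " descr(stats = \"common\")\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Let's look at the general values of the data. Note that popularity can be `0`, which indicates songs that have no ranking. We'll remove those shortly.\n", "\n", "> 🤔 If we are working with clustering, an unsupervised method that does not require labelled data, why are we showing this data with labels? The labels come in handy during the data exploration phase, but the clustering algorithm does not need them to work.\n", "\n", "### 1. 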
Explore popular genres\n", "\n", "Let's go ahead and find the most popular genres 🎶 by counting the number of times each one appears.\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Popular genres\r\n", "top_genres <- df %>% \r\n", " count(artist_top_genre, sort = TRUE) %>% \r\n", "# Encode to categorical and reorder according to count\r\n", " mutate(artist_top_genre = factor(artist_top_genre) %>% fct_inorder())\r\n", "\r\n", "# Print the top genres\r\n", "top_genres\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "That went well! They say a picture is worth a thousand rows of a data frame (actually, nobody ever says that 😅). But you get the gist of it, right?\n", "\n", "One way to visualize categorical data (character or factor variables) is with a bar plot. Let's make a bar plot of the top 10 genres:\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Change the default gray theme\r\n", "theme_set(theme_light())\r\n", "\r\n", "# Visualize popular genres\r\n", "top_genres %>%\r\n", " slice(1:10) %>% \r\n", " ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n", " fill = artist_top_genre)) +\r\n", " geom_col(alpha = 0.8) +\r\n", " paletteer::scale_fill_paletteer_d(\"rcartocolor::Vivid\") +\r\n", " ggtitle(\"Top genres\") +\r\n", " theme(plot.title = element_text(hjust = 0.5),\r\n", " # Rotates the X markers (so we can read them)\r\n", " axis.text.x = element_text(angle = 90))\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Now it's much easier to spot that we have `Missing` genres 🧐!\n", "\n", "> A good visualisation will show you things that you did not expect, or raise new questions about the data. - Hadley Wickham and Garrett Grolemund, [R For Data Science](https://r4ds.had.co.nz/introduction.html)\n", "\n", "Note that when the top genre is described as `Missing`, it means that Spotify did not classify it, so let's get rid of it.\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Visualize popular genres\r\n", "top_genres %>%\r\n", " filter(artist_top_genre != \"Missing\") %>% \r\n", " slice(1:10) %>% \r\n", " ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n", " fill = artist_top_genre)) +\r\n", " geom_col(alpha = 0.8) +\r\n", " paletteer::scale_fill_paletteer_d(\"rcartocolor::Vivid\") +\r\n", " ggtitle(\"Top genres\") 
+\r\n", " theme(plot.title = element_text(hjust = 0.5),\r\n", " # Rotates the X markers (so we can read them)\r\n", " axis.text.x = element_text(angle = 90))\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "From this preliminary data exploration, we learn that the top three genres dominate this dataset. Let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, and additionally filter the dataset to remove anything with a popularity value of 0 (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes):\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "nigerian_songs <- df %>% \r\n", " # Concentrate on top 3 genres\r\n", " filter(artist_top_genre %in% c(\"afro dancehall\", \"afropop\", \"nigerian pop\")) %>% \r\n", " # Remove unclassified observations\r\n", " filter(popularity != 0)\r\n", "\r\n", "\r\n", "\r\n", "# Visualize popular genres\r\n", "nigerian_songs %>%\r\n", " count(artist_top_genre) %>%\r\n", " ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n", " fill = artist_top_genre)) +\r\n", " geom_col(alpha = 0.8) +\r\n", " paletteer::scale_fill_paletteer_d(\"ggsci::category10_d3\") +\r\n", " ggtitle(\"Top genres\") +\r\n", " theme(plot.title = element_text(hjust = 0.5))\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Let's find out whether there is any apparent linear relationship among the numerical variables in our dataset. This relationship is quantified mathematically by the [correlation statistic](https://en.wikipedia.org/wiki/Correlation).\n", "\n", "The correlation statistic is a value between -1 and 1 that indicates the strength of a relationship. Values above 0 indicate a *positive* correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a *negative* correlation (high values of one variable tend to coincide with low values of the other).\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Narrow down to numeric variables and find correlation\r\n", "corr_mat <- nigerian_songs %>% \r\n", " select(where(is.numeric)) %>% \r\n", " cor()\r\n", "\r\n", "# Visualize correlation matrix\r\n", "corrplot(corr_mat, order = 'AOE', col = c('white', 'black'), bg = 'gold2') \r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "The data is not strongly correlated, except between `energy` and `loudness`, which makes sense, given that loud music is usually pretty energetic. `Popularity` and `release date` 
also correspond to some extent, which makes sense too, as more recent songs are probably more popular. Length and energy seem to be somewhat correlated as well.\n", "\n", "It will be interesting to see what a clustering algorithm can make of this data!\n", "\n", "> 🎓 Note that correlation does not imply causation! We have proof of correlation but no proof of causation. An [amusing web site](https://tylervigen.com/spurious-correlations) has some visuals that emphasize this point.\n", "\n", "### 2. Explore data distribution\n", "\n", "Let's ask some more subtle questions. Are the genres significantly different in the perception of their danceability, based on their popularity? Let's examine our top three genres' data distribution for popularity and danceability along a given x and y axis using [density plots](https://www.khanacademy.org/math/ap-statistics/density-curves-normal-distribution-ap/density-curves/v/density-curves).\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Perform 2D kernel density estimation\r\n", "density_estimate_2d <- nigerian_songs %>% \r\n", " ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre)) +\r\n", " geom_density_2d(bins = 5, size = 1) +\r\n", " paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n", " xlim(-20, 80) +\r\n", " ylim(0, 1.2)\r\n", "\r\n", "# Density plot based on the popularity\r\n", "density_estimate_pop <- nigerian_songs %>% \r\n", " ggplot(mapping = aes(x = popularity, fill = artist_top_genre, color = artist_top_genre)) +\r\n", " geom_density(size = 1, alpha = 0.5) +\r\n", " paletteer::scale_fill_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n", " paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n", " theme(legend.position = \"none\")\r\n", "\r\n", "# Density plot based on the danceability\r\n", "density_estimate_dance <- nigerian_songs %>% \r\n", " ggplot(mapping = aes(x = danceability, fill = artist_top_genre, color = artist_top_genre)) +\r\n", " geom_density(size = 1, alpha = 0.5) +\r\n", " paletteer::scale_fill_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n", " paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\")\r\n", "\r\n", "\r\n", "# Patch everything together\r\n", "library(patchwork)\r\n", "density_estimate_2d / (density_estimate_pop + density_estimate_dance)\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ 
"We see that there are concentric circles that line up, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre?\n", "\n", "In general, the three genres align in terms of their popularity and danceability. Determining clusters in this loosely aligned data will be a challenge. Let's see whether a scatter plot can support this.\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# A scatter plot of popularity and danceability\r\n", "scatter_plot <- nigerian_songs %>% \r\n", " ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre, shape = artist_top_genre)) +\r\n", " geom_point(size = 2, alpha = 0.8) +\r\n", " paletteer::scale_color_paletteer_d(\"futurevisions::mars\")\r\n", "\r\n", "# Add a touch of interactivity\r\n", "ggplotly(scatter_plot)\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "A scatter plot of the same axes shows a similar pattern of convergence.\n", "\n", "In general, for clustering, you can use scatter plots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in this data that seem to overlap in interesting ways.\n", "\n", "## **🚀 Challenge**\n", "\n", "In preparation for the next lesson, make a chart of the various clustering algorithms you might discover and use in a production environment. What kinds of problems is clustering trying to address?\n", "\n", "## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/28/)\n", "\n", "## **Review & Self Study**\n", "\n", "Before you apply clustering algorithms, as we have learned, it's a good idea to understand the nature of your dataset. Read more on this topic [here](https://www.kdnuggets.com/2019/10/right-clustering-algorithm.html).\n", "\n", "Deepen your understanding of clustering techniques:\n", "\n", "- [Train and Evaluate Clustering Models using Tidymodels and friends](https://rpubs.com/eR_ic/clustering)\n", "\n", "- Bradley Boehmke and Brandon Greenwell, [*Hands-On Machine Learning with R*](https://bradleyboehmke.github.io/HOML/).\n", "\n", "## **Assignment**\n", "\n", "[Research other visualizations for clustering](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/assignment.md)\n", "\n", "## Thank you to:\n", "\n", "[Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n", "\n", "[`Dasani Madipalli`](https://twitter.com/dasani_decoded) for creating the amazing illustrations that make machine learning concepts easier to understand and interpret.\n", "\n", "Happy Learning,\n", "\n", "[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n" ], "metadata": {} }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op 
Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The document in its original language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n" ] } ], "metadata": { "anaconda-cloud": "", "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.1" }, "coopTranslator": { "original_hash": "99c36449cad3708a435f6798cfa39972", "translation_date": "2025-09-03T20:07:40+00:00", "source_file": "5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb", "language_code": "zh" } }, "nbformat": 4, "nbformat_minor": 1 }