{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"colab": {
"name": "lesson_14.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"coopTranslator": {
"original_hash": "ad65fb4aad0a156b42216e4929f490fc",
"translation_date": "2025-09-03T20:17:16+00:00",
"source_file": "5-Clustering/2-K-Means/solution/R/lesson_15-R.ipynb",
"language_code": "zh"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "GULATlQXLXyR"
},
"source": [
"## 使用 R 和 Tidy 数据原则探索 K-Means 聚类\n",
"\n",
"### [**课前测验**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/29/)\n",
"\n",
"在本课中,您将学习如何使用 Tidymodels 包以及 R 生态系统中的其他包(我们称它们为朋友 🧑‍🤝‍🧑)创建聚类,并使用您之前导入的尼日利亚音乐数据集。我们将介绍 K-Means 聚类的基础知识。请记住,正如您在之前的课程中学到的那样,有许多方法可以处理聚类,您使用的方法取决于您的数据。我们将尝试 K-Means因为它是最常见的聚类技术。让我们开始吧\n",
"\n",
"您将学习的术语:\n",
"\n",
"- Silhouette评分\n",
"\n",
"- 肘部法则\n",
"\n",
"- 惯性\n",
"\n",
"- 方差\n",
"\n",
"### **简介**\n",
"\n",
"[K-Means 聚类](https://wikipedia.org/wiki/K-means_clustering) 是一种源自信号处理领域的方法。它用于根据特征的相似性将数据分成 `k 个聚类`。\n",
"\n",
"这些聚类可以通过 [Voronoi 图](https://wikipedia.org/wiki/Voronoi_diagram) 可视化,其中包括一个点(或“种子”)及其对应的区域。\n",
"\n",
"<p >\n",
" <img src=\"../../images/voronoi.png\"\n",
" width=\"500\"/>\n",
" <figcaption>Jen Looper 制作的信息图</figcaption>\n",
"\n",
"K-Means 聚类的步骤如下:\n",
"\n",
"1. 数据科学家首先指定要创建的聚类数量。\n",
"\n",
"2. 接下来,算法从数据集中随机选择 K 个观测值作为聚类的初始中心(即质心)。\n",
"\n",
"3. 然后,将其余的观测值分配到距离最近的质心。\n",
"\n",
"4. 接下来,计算每个聚类的新均值,并将质心移动到均值位置。\n",
"\n",
"5. 现在质心已经重新计算,每个观测值再次被检查是否更接近其他聚类。所有对象再次使用更新后的聚类均值重新分配。聚类分配和质心更新步骤会迭代重复,直到聚类分配不再变化(即达到收敛)。通常,当每次新迭代导致质心的移动微乎其微且聚类变得静态时,算法会终止。\n",
"\n",
"<div>\n",
"\n",
"> 请注意,由于初始 k 个观测值的随机化,作为起始质心,每次应用该过程时可能会得到略有不同的结果。因此,大多数算法会使用多个 *随机起点* 并选择具有最低 WCSS 的迭代。因此,强烈建议始终使用多个 *nstart* 值运行 K-Means以避免 *不理想的局部最优解*。\n",
"\n",
"</div>\n",
"\n",
"以下短动画使用 Allison Horst 的 [插画](https://github.com/allisonhorst/stats-illustrations) 解释了聚类过程:\n",
"\n",
"<p >\n",
" <img src=\"../../images/kmeans.gif\"\n",
" width=\"550\"/>\n",
" <figcaption>@allison_horst 的插画</figcaption>\n",
"\n",
"聚类中一个基本问题是:如何确定将数据分成多少个聚类?使用 K-Means 的一个缺点是您需要确定 `k`,即 `质心` 的数量。幸运的是,`肘部法则` 可以帮助估算一个好的起始值。您马上就会尝试。\n",
"\n",
"### \n",
"\n",
"**前提条件**\n",
"\n",
"我们将从 [上一课](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb) 停止的地方继续,在那里我们分析了数据集,进行了大量可视化,并过滤了感兴趣的观测值。一定要查看!\n",
"\n",
"我们需要一些包来完成本模块。您可以通过以下方式安装它们:`install.packages(c('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork'))`\n",
"\n",
"或者,下面的脚本会检查您是否拥有完成本模块所需的包,并在缺少某些包时为您安装它们。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ah_tBi58LXyi"
},
"source": [
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork')\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7e--UCUTLXym"
},
"source": [
"让我们开始吧!\n",
"\n",
"## 1. 与数据共舞:缩小到最受欢迎的三个音乐类型\n",
"\n",
"这是我们上一节课所做内容的回顾。让我们来分析一些数据吧!\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Ycamx7GGLXyn"
},
"source": [
"# Load the core tidyverse and make it available in your current R session\n",
"library(tidyverse)\n",
"\n",
"# Import the data into a tibble\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv\", show_col_types = FALSE)\n",
"\n",
"# Narrow down to top 3 popular genres\n",
"nigerian_songs <- df %>% \n",
" # Concentrate on top 3 genres\n",
" filter(artist_top_genre %in% c(\"afro dancehall\", \"afropop\",\"nigerian pop\")) %>% \n",
" # Remove unclassified observations\n",
" filter(popularity != 0)\n",
"\n",
"\n",
"\n",
"# Visualize popular genres using bar plots\n",
"theme_set(theme_light())\n",
"nigerian_songs %>%\n",
" count(artist_top_genre) %>%\n",
" ggplot(mapping = aes(x = artist_top_genre, y = n,\n",
" fill = artist_top_genre)) +\n",
" geom_col(alpha = 0.8) +\n",
" paletteer::scale_fill_paletteer_d(\"ggsci::category10_d3\") +\n",
" ggtitle(\"Top genres\") +\n",
" theme(plot.title = element_text(hjust = 0.5))\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "b5h5zmkPLXyp"
},
"source": [
"🤩 这进展得很顺利!\n",
"\n",
"## 2. 更多数据探索\n",
"\n",
"这些数据有多干净?让我们使用箱线图检查异常值。我们将专注于异常值较少的数值列(尽管你也可以清理异常值)。箱线图可以显示数据的范围,并帮助选择要使用的列。注意,箱线图并不显示方差,而方差是良好可聚类数据的重要元素。请参阅[这个讨论](https://stats.stackexchange.com/questions/91536/deduce-variance-from-boxplot)以了解更多信息。\n",
"\n",
"[箱线图](https://en.wikipedia.org/wiki/Box_plot)用于以图形方式描述`数值`数据的分布,因此我们先从*选择*所有数值列以及流行音乐流派开始。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "HhNreJKLLXyq"
},
"source": [
"# Select top genre column and all other numeric columns\n",
"df_numeric <- nigerian_songs %>% \n",
" select(artist_top_genre, where(is.numeric)) \n",
"\n",
"# Display the data\n",
"df_numeric %>% \n",
" slice_head(n = 5)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "uYXrwJRaLXyq"
},
"source": [
"看看选择助手 `where` 是如何让这一切变得简单的 💁?可以在[这里](https://tidyselect.r-lib.org/)探索其他类似的函数。\n",
"\n",
"由于我们将为每个数值特征制作箱线图,并且希望避免使用循环,让我们将数据重新格式化为*更长*的格式,这样就可以利用 `facets`——每个子图分别显示数据的一个子集。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "gd5bR3f8LXys"
},
"source": [
"# Pivot data from wide to long\n",
"df_numeric_long <- df_numeric %>% \n",
" pivot_longer(!artist_top_genre, names_to = \"feature_names\", values_to = \"values\") \n",
"\n",
"# Print out data\n",
"df_numeric_long %>% \n",
" slice_head(n = 15)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "-7tE1swnLXyv"
},
"source": [
"更长了!现在是时候使用一些 `ggplots` 了!那么我们会用什么 `geom` 呢?\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "r88bIsyuLXyy"
},
"source": [
"# Make a box plot\n",
"df_numeric_long %>% \n",
" ggplot(mapping = aes(x = feature_names, y = values, fill = feature_names)) +\n",
" geom_boxplot() +\n",
" facet_wrap(~ feature_names, ncol = 4, scales = \"free\") +\n",
" theme(legend.position = \"none\")\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "EYVyKIUELXyz"
},
"source": [
"现在我们可以看到这些数据有些杂乱:通过观察每一列的箱线图,可以发现存在异常值。你可以遍历整个数据集并移除这些异常值,但这样会使数据变得非常少。\n",
"\n",
"目前,我们来选择用于聚类练习的列。我们选择范围相似的数值列。我们可以将 `artist_top_genre` 编码为数值,但暂时先舍弃它。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-wkpINyZLXy0"
},
"source": [
"# Select variables with similar ranges\n",
"df_numeric_select <- df_numeric %>% \n",
" select(popularity, danceability, acousticness, loudness, energy) \n",
"\n",
"# Normalize data\n",
"# df_numeric_select <- scale(df_numeric_select)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "D7dLzgpqLXy1"
},
"source": [
"## 3. 在 R 中计算 k-means 聚类\n",
"\n",
"我们可以使用 R 中内置的 `kmeans` 函数计算 k-means参见 `help(\"kmeans()\")`。`kmeans()` 函数的主要参数是一个包含所有数值型列的数据框。\n",
"\n",
"使用 k-means 聚类的第一步是指定最终解决方案中要生成的聚类数量k。我们知道从数据集中划分出了 3 种歌曲类型,因此我们可以尝试设置为 3\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "uC4EQ5w7LXy5"
},
"source": [
"set.seed(2056)\n",
"# Kmeans clustering for 3 clusters\n",
"kclust <- kmeans(\n",
" df_numeric_select,\n",
" # Specify the number of clusters\n",
" centers = 3,\n",
" # How many random initial configurations\n",
" nstart = 25\n",
")\n",
"\n",
"# Display clustering object\n",
"kclust\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hzfhscWrLXy-"
},
"source": [
"kmeans对象包含了许多信息这些信息在`help(\"kmeans()\")`中有详细说明。现在我们先关注几个关键点。我们可以看到数据被分成了3个簇分别包含65、110和111个样本。输出还包括了这3个簇在5个变量上的簇中心均值。\n",
"\n",
"聚类向量是每个观测值的簇分配。我们可以使用`augment`函数将簇分配添加到原始数据集中。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "0XwwpFGQLXy_"
},
"source": [
"# Add predicted cluster assignment to data set\n",
"augment(kclust, df_numeric_select) %>% \n",
" relocate(.cluster) %>% \n",
" slice_head(n = 10)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "NXIVXXACLXzA"
},
"source": [
"太好了,我们刚刚将数据集划分成了三个组。那么我们的聚类效果如何呢🤷?让我们来看看 `Silhouette score`。\n",
"\n",
"### **轮廓系数**\n",
"\n",
"[轮廓分析](https://en.wikipedia.org/wiki/Silhouette_(clustering))可以用来研究生成的聚类之间的分离距离。这个分数范围从 -1 到 1如果分数接近 1说明聚类紧密且与其他聚类分离良好。接近 0 的值表示聚类之间有重叠,样本非常接近邻近聚类的决策边界。[来源](https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam)。\n",
"\n",
"平均轮廓方法计算不同 *k* 值下观测点的平均轮廓分数。较高的平均轮廓分数表明聚类效果较好。\n",
"\n",
"使用 cluster 包中的 `silhouette` 函数可以计算平均轮廓宽度。\n",
"\n",
"> 轮廓分数可以使用任何[距离](https://en.wikipedia.org/wiki/Distance \"Distance\")度量来计算,例如我们在[上一课](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb)中讨论过的[欧几里得距离](https://en.wikipedia.org/wiki/Euclidean_distance \"Euclidean distance\")或[曼哈顿距离](https://en.wikipedia.org/wiki/Manhattan_distance \"Manhattan distance\")。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Jn0McL28LXzB"
},
"source": [
"# Load cluster package\n",
"library(cluster)\n",
"\n",
"# Compute average silhouette score\n",
"ss <- silhouette(kclust$cluster,\n",
" # Compute euclidean distance\n",
" dist = dist(df_numeric_select))\n",
"mean(ss[, 3])\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "QyQRn97nLXzC"
},
"source": [
"我们的得分是 **.549**,正好处于中间位置。这表明我们的数据并不特别适合这种类型的聚类。让我们看看是否可以通过可视化来验证这个猜测。[factoextra 包](https://rpkgs.datanovia.com/factoextra/index.html) 提供了用于可视化聚类的函数(`fviz_cluster()`)。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "7a6Km1_FLXzD"
},
"source": [
"library(factoextra)\n",
"\n",
"# Visualize clustering results\n",
"fviz_cluster(kclust, df_numeric_select)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "IBwCWt-0LXzD"
},
"source": [
"聚类之间的重叠表明我们的数据并不特别适合这种类型的聚类,但我们继续进行。\n",
"\n",
"## 4. 确定最佳聚类数\n",
"\n",
"在 K-Means 聚类中经常出现的一个基本问题是——在没有已知类别标签的情况下,如何确定将数据分成多少个聚类?\n",
"\n",
"我们可以尝试的一种方法是使用一个数据样本来`创建一系列聚类模型`,逐步增加聚类的数量(例如从 1 到 10并评估聚类指标例如 **Silhouette 分数**。\n",
"\n",
"让我们通过对不同的 *k* 值计算聚类算法,并评估 **聚类内平方和**WCSS来确定最佳的聚类数。聚类内平方和WCSS总量衡量了聚类的紧凑性我们希望它尽可能小较低的值意味着数据点更接近。\n",
"\n",
"让我们探索不同的 `k` 值(从 1 到 10对聚类结果的影响。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "hSeIiylDLXzE"
},
"source": [
"# Create a series of clustering models\n",
"kclusts <- tibble(k = 1:10) %>% \n",
" # Perform kmeans clustering for 1,2,3 ... ,10 clusters\n",
" mutate(model = map(k, ~ kmeans(df_numeric_select, centers = .x, nstart = 25)),\n",
" # Farm out clustering metrics eg WCSS\n",
" glanced = map(model, ~ glance(.x))) %>% \n",
" unnest(cols = glanced)\n",
" \n",
"\n",
"# View clustering rsulsts\n",
"kclusts\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "m7rS2U1eLXzE"
},
"source": [
"现在我们已经获得了每个聚类算法在中心 *k* 时的总簇内平方和 (tot.withinss),接下来我们使用[肘部法则](https://en.wikipedia.org/wiki/Elbow_method_(clustering))来确定最佳的聚类数量。该方法的核心是将簇内平方和 (WCSS) 作为聚类数量的函数进行绘图,并选择[曲线的肘部](https://en.wikipedia.org/wiki/Elbow_of_the_curve \"曲线的肘部\")作为要使用的聚类数量。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "o_DjHGItLXzF"
},
"source": [
"set.seed(2056)\n",
"# Use elbow method to determine optimum number of clusters\n",
"kclusts %>% \n",
" ggplot(mapping = aes(x = k, y = tot.withinss)) +\n",
" geom_line(size = 1.2, alpha = 0.8, color = \"#FF7F0EFF\") +\n",
" geom_point(size = 2, color = \"#FF7F0EFF\")\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "pLYyt5XSLXzG"
},
"source": [
"图表显示当聚类数量从一个增加到两个时WCSS显著减少即更高的*紧密度*),从两个增加到三个聚类时也有进一步明显的减少。之后,减少的幅度变得不那么显著,在图表中大约三个聚类处形成一个`肘部` 💪。这表明数据点可以合理地分为两到三个较为独立的聚类。\n",
"\n",
"现在我们可以继续提取聚类模型,其中`k = 3`\n",
"\n",
"> `pull()`: 用于提取单列\n",
">\n",
"> `pluck()`: 用于索引数据结构,例如列表\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "JP_JPKBILXzG"
},
"source": [
"# Extract k = 3 clustering\n",
"final_kmeans <- kclusts %>% \n",
" filter(k == 3) %>% \n",
" pull(model) %>% \n",
" pluck(1)\n",
"\n",
"\n",
"final_kmeans\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "l_PDTu8tLXzI"
},
"source": [
"太好了!让我们来可视化获得的聚类。想用 `plotly` 增加一些互动性吗?\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "dNcleFe-LXzJ"
},
"source": [
"# Add predicted cluster assignment to data set\n",
"results <- augment(final_kmeans, df_numeric_select) %>% \n",
" bind_cols(df_numeric %>% select(artist_top_genre)) \n",
"\n",
"# Plot cluster assignments\n",
"clust_plt <- results %>% \n",
" ggplot(mapping = aes(x = popularity, y = danceability, color = .cluster, shape = artist_top_genre)) +\n",
" geom_point(size = 2, alpha = 0.8) +\n",
" paletteer::scale_color_paletteer_d(\"ggthemes::Tableau_10\")\n",
"\n",
"ggplotly(clust_plt)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "6JUM_51VLXzK"
},
"source": [
"也许我们会预期每个聚类(用不同颜色表示)会有明显不同的类型(用不同形状表示)。\n",
"\n",
"让我们来看看模型的准确性。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "HdIMUGq7LXzL"
},
"source": [
"# Assign genres to predefined integers\n",
"label_count <- results %>% \n",
" group_by(artist_top_genre) %>% \n",
" mutate(id = cur_group_id()) %>% \n",
" ungroup() %>% \n",
" summarise(correct_labels = sum(.cluster == id))\n",
"\n",
"\n",
"# Print results \n",
"cat(\"Result:\", label_count$correct_labels, \"out of\", nrow(results), \"samples were correctly labeled.\")\n",
"\n",
"cat(\"\\nAccuracy score:\", label_count$correct_labels/nrow(results))\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "C50wvaAOLXzM"
},
"source": [
"这个模型的准确性还可以,但并不算优秀。这可能是因为数据本身不太适合使用 K-Means 聚类。这些数据过于不平衡,相关性较低,并且列值之间的差异太大,导致聚类效果不佳。实际上,形成的聚类可能会受到我们之前定义的三个类别的强烈影响或偏斜。\n",
"\n",
"尽管如此,这仍然是一个很好的学习过程!\n",
"\n",
"在 Scikit-learn 的文档中,你可以看到像这样的模型,聚类边界不太清晰,存在“方差”问题:\n",
"\n",
"<p >\n",
" <img src=\"../../images/problems.png\"\n",
" width=\"500\"/>\n",
" <figcaption>来自 Scikit-learn 的信息图</figcaption>\n",
"\n",
"\n",
"\n",
"## **方差**\n",
"\n",
"方差被定义为“与平均值的平方差的平均值” [来源](https://www.mathsisfun.com/data/standard-deviation.html)。在这个聚类问题的背景下,它指的是数据集中数值偏离平均值的程度过大。\n",
"\n",
"✅ 这是一个很好的时机来思考如何解决这个问题。稍微调整数据?使用不同的列?尝试不同的算法?提示:试试[对数据进行缩放](https://www.mygreatlearning.com/blog/learning-data-science-with-k-means-clustering/)以进行归一化,并测试其他列。\n",
"\n",
"> 试试这个‘[方差计算器](https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php)’来更好地理解这个概念。\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"## **🚀挑战**\n",
"\n",
"花些时间研究这个笔记本,调整参数。通过进一步清理数据(例如删除异常值),你能否提高模型的准确性?你可以使用权重为某些数据样本赋予更大的权重。还有什么方法可以创建更好的聚类?\n",
"\n",
"提示:试试对数据进行缩放。笔记本中有注释代码,可以添加标准化缩放,使数据列在范围上更接近。你会发现虽然轮廓分数下降了,但肘部图中的“折点”变得更加平滑。这是因为未缩放的数据允许方差较小的数据权重更大。可以在[这里](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226)阅读更多相关问题。\n",
"\n",
"## [**课后测验**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/30/)\n",
"\n",
"## **复习与自学**\n",
"\n",
"- 看看一个 K-Means 模拟器 [例如这个](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/)。你可以使用这个工具可视化样本数据点并确定其质心。你可以编辑数据的随机性、聚类数量和质心数量。这是否帮助你更好地理解数据如何分组?\n",
"\n",
"- 另外,看看斯坦福的[这份 K-Means 手册](https://stanford.edu/~cpiech/cs221/handouts/kmeans.html)。\n",
"\n",
"想尝试将你新学到的聚类技能应用到适合 K-Means 聚类的数据集上?请参考以下内容:\n",
"\n",
"- [训练和评估聚类模型](https://rpubs.com/eR_ic/clustering),使用 Tidymodels 和相关工具\n",
"\n",
"- [K-Means 聚类分析](https://uc-r.github.io/kmeans_clustering)UC 商业分析 R 编程指南\n",
"\n",
"- [使用整洁数据原则进行 K-Means 聚类](https://www.tidymodels.org/learn/statistics/k-means/)\n",
"\n",
"## **作业**\n",
"\n",
"[尝试不同的聚类方法](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/2-K-Means/assignment.md)\n",
"\n",
"## 特别感谢:\n",
"\n",
"[Jen Looper](https://www.twitter.com/jenlooper) 创建了这个模块的原始 Python 版本 ♥️\n",
"\n",
"[`Allison Horst`](https://twitter.com/allison_horst/) 创作了令人惊叹的插图,使 R 更加友好和吸引人。可以在她的[画廊](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM)中找到更多插图。\n",
"\n",
"祝学习愉快,\n",
"\n",
"[Eric](https://twitter.com/ericntay)Gold Microsoft Learn 学生大使。\n",
"\n",
"<p >\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\n",
" width=\"500\"/>\n",
" <figcaption>由 @allison_horst 创作的艺术作品</figcaption>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**免责声明** \n本文档使用AI翻译服务[Co-op Translator](https://github.com/Azure/co-op-translator)进行翻译。尽管我们努力确保翻译的准确性,但请注意,自动翻译可能包含错误或不准确之处。原始语言的文档应被视为权威来源。对于关键信息,建议使用专业人工翻译。我们不对因使用此翻译而产生的任何误解或误读承担责任。\n"
]
}
]
}