{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"colab": {
"name": "lesson_14.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"coopTranslator": {
"original_hash": "ad65fb4aad0a156b42216e4929f490fc",
"translation_date": "2025-08-29T23:38:16+00:00",
"source_file": "5-Clustering/2-K-Means/solution/R/lesson_15-R.ipynb",
"language_code": "mo"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "GULATlQXLXyR"
},
"source": [
"## 探索使用 R 和 Tidy 數據原則進行 K-Means 分群\n",
"\n",
"### [**課前測驗**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/29/)\n",
"\n",
"在本課程中,您將學習如何使用 Tidymodels 套件以及 R 生態系統中的其他套件(我們稱它們為朋友 🧑‍🤝‍🧑),以及您之前匯入的尼日利亞音樂數據集來創建分群。我們將介紹 K-Means 分群的基礎知識。請記住,正如您在之前的課程中所學,分群有許多不同的方法,您使用的方法取決於您的數據。我們將嘗試 K-Means因為它是最常見的分群技術。讓我們開始吧\n",
"\n",
"您將學習的術語:\n",
"\n",
"- Silhouette 評分\n",
"\n",
"- Elbow 方法\n",
"\n",
"- Inertia慣性\n",
"\n",
"- Variance方差\n",
"\n",
"### **簡介**\n",
"\n",
"[K-Means 分群](https://wikipedia.org/wiki/K-means_clustering) 是一種源自信號處理領域的方法。它用於根據特徵的相似性將數據分成 `k 個分群`。\n",
"\n",
"這些分群可以以 [Voronoi 圖](https://wikipedia.org/wiki/Voronoi_diagram) 的形式進行可視化,其中包括一個點(或“種子”)及其對應的區域。\n",
"\n",
"<p >\n",
" <img src=\"../../images/voronoi.png\"\n",
" width=\"500\"/>\n",
" <figcaption>Jen Looper 的信息圖</figcaption>\n",
"\n",
"K-Means 分群的步驟如下:\n",
"\n",
"1. 數據科學家首先指定希望創建的分群數量。\n",
"\n",
"2. 接下來,算法隨機選擇數據集中的 K 個觀測值作為分群的初始中心(即質心)。\n",
"\n",
"3. 然後,將其餘的每個觀測值分配到距離最近的質心。\n",
"\n",
"4. 接著,計算每個分群的新平均值,並將質心移動到該平均值。\n",
"\n",
"5. 現在中心已重新計算,所有觀測值再次被檢查是否可能更接近其他分群。使用更新的分群平均值重新分配所有對象。分群分配和質心更新步驟會反覆進行,直到分群分配不再改變(即達到收斂)。通常,算法在每次新迭代中質心移動微乎其微且分群變得穩定時終止。\n",
"\n",
"<div>\n",
"\n",
"> 請注意,由於初始 k 個觀測值的隨機化,我們每次應用該程序時可能會得到略有不同的結果。因此,大多數算法使用多次 *隨機起始*,並選擇具有最低 WCSS 的迭代。因此,強烈建議始終使用多個 *nstart* 值運行 K-Means以避免 *不理想的局部最優解*。\n",
"\n",
"</div>\n",
"\n",
"以下短動畫使用 Allison Horst 的 [插圖](https://github.com/allisonhorst/stats-illustrations) 解釋了分群過程:\n",
"\n",
"<p >\n",
" <img src=\"../../images/kmeans.gif\"\n",
" width=\"550\"/>\n",
" <figcaption>@allison_horst 的插圖</figcaption>\n",
"\n",
"分群中出現的一個基本問題是:如何知道應將數據分成多少個分群?使用 K-Means 的一個缺點是您需要確定 `k`,即 `質心` 的數量。幸運的是,`elbow 方法` 有助於估算 `k` 的良好起始值。您稍後將嘗試使用它。\n",
"\n",
"### \n",
"\n",
"**前置條件**\n",
"\n",
"我們將從 [上一課](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb) 停止的地方繼續,該課程中我們分析了數據集,進行了大量可視化,並篩選了感興趣的觀測值。一定要查看!\n",
"\n",
"我們需要一些套件來完成這個模組。您可以通過以下方式安裝它們:`install.packages(c('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork'))`\n",
"\n",
"或者,以下腳本會檢查您是否擁有完成此模組所需的套件,並在缺少某些套件時為您安裝它們。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ah_tBi58LXyi"
},
"source": [
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork')\n"
],
"execution_count": null,
"outputs": []
},
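{
"cell_type": "markdown",
"metadata": {},
"source": [
"The five K-Means steps listed in the introduction can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `toy_` names and the simulated two-blob data are hypothetical and not part of the lesson, empty clusters are not handled, and in practice you would call the built-in `kmeans()` function as the rest of this notebook does.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Toy data: two well-separated 2-D blobs (hypothetical, not the lesson's dataset)\n",
"set.seed(2056)\n",
"toy_points <- rbind(\n",
"  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),\n",
"  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2)\n",
")\n",
"\n",
"# Step 1: choose the number of clusters\n",
"toy_k <- 2\n",
"\n",
"# Step 2: pick k observations at random as the initial centroids\n",
"toy_centroids <- toy_points[sample(nrow(toy_points), toy_k), , drop = FALSE]\n",
"toy_assignment <- rep(0L, nrow(toy_points))\n",
"\n",
"repeat {\n",
"  # Step 3: squared Euclidean distance from every point to every centroid,\n",
"  # then assign each point to its nearest centroid\n",
"  toy_dists <- sapply(seq_len(toy_k), function(j) {\n",
"    rowSums(sweep(toy_points, 2, toy_centroids[j, ])^2)\n",
"  })\n",
"  toy_new_assignment <- max.col(-toy_dists, ties.method = \"first\")\n",
"\n",
"  # Step 5: stop once no observation changes cluster\n",
"  if (all(toy_new_assignment == toy_assignment)) break\n",
"  toy_assignment <- toy_new_assignment\n",
"\n",
"  # Step 4: move each centroid to the mean of its assigned points\n",
"  for (j in seq_len(toy_k)) {\n",
"    toy_centroids[j, ] <- colMeans(toy_points[toy_assignment == j, , drop = FALSE])\n",
"  }\n",
"}\n",
"\n",
"# Cluster sizes found by the sketch\n",
"table(toy_assignment)\n"
],
"execution_count": null,
"outputs": []
},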
{
"cell_type": "markdown",
"metadata": {
"id": "7e--UCUTLXym"
},
"source": [
"讓我們開始吧!\n",
"\n",
"## 1. 與數據共舞:縮小至三種最受歡迎的音樂類型\n",
"\n",
"這是我們上一課所做內容的回顧。讓我們來切分和分析一些數據吧!\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Ycamx7GGLXyn"
},
"source": [
"# Load the core tidyverse and make it available in your current R session\n",
"library(tidyverse)\n",
"\n",
"# Import the data into a tibble\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv\", show_col_types = FALSE)\n",
"\n",
"# Narrow down to top 3 popular genres\n",
"nigerian_songs <- df %>% \n",
" # Concentrate on top 3 genres\n",
" filter(artist_top_genre %in% c(\"afro dancehall\", \"afropop\",\"nigerian pop\")) %>% \n",
" # Remove unclassified observations\n",
" filter(popularity != 0)\n",
"\n",
"\n",
"\n",
"# Visualize popular genres using bar plots\n",
"theme_set(theme_light())\n",
"nigerian_songs %>%\n",
" count(artist_top_genre) %>%\n",
" ggplot(mapping = aes(x = artist_top_genre, y = n,\n",
" fill = artist_top_genre)) +\n",
" geom_col(alpha = 0.8) +\n",
" paletteer::scale_fill_paletteer_d(\"ggsci::category10_d3\") +\n",
" ggtitle(\"Top genres\") +\n",
" theme(plot.title = element_text(hjust = 0.5))\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "b5h5zmkPLXyp"
},
"source": [
"🤩 這進展得很順利!\n",
"\n",
"## 2. 更多數據探索\n",
"\n",
"這份數據有多乾淨?讓我們使用盒鬚圖檢查是否有異常值。我們將專注於異常值較少的數值型欄位(雖然你也可以清理掉異常值)。盒鬚圖可以顯示數據的範圍,並幫助選擇要使用的欄位。請注意,盒鬚圖並不顯示方差,而方差是良好可聚類數據的一個重要元素。如需進一步了解,請參閱[這篇討論](https://stats.stackexchange.com/questions/91536/deduce-variance-from-boxplot)。\n",
"\n",
"[盒鬚圖](https://en.wikipedia.org/wiki/Box_plot) 用於以圖形方式描述「數值型」數據的分佈,因此讓我們從*選擇*所有數值型欄位以及流行音樂類型開始吧。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "HhNreJKLLXyq"
},
"source": [
"# Select top genre column and all other numeric columns\n",
"df_numeric <- nigerian_songs %>% \n",
" select(artist_top_genre, where(is.numeric)) \n",
"\n",
"# Display the data\n",
"df_numeric %>% \n",
" slice_head(n = 5)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "uYXrwJRaLXyq"
},
"source": [
"看看選擇輔助工具 `where` 是如何讓這變得簡單的 💁?可以在[這裡](https://tidyselect.r-lib.org/)探索其他類似的函數。\n",
"\n",
"由於我們將為每個數值型特徵製作箱型圖,而且我們想避免使用迴圈,因此讓我們將數據重新格式化為*更長*的格式,這樣我們就可以利用 `facets`——每個子圖顯示數據的一個子集。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "gd5bR3f8LXys"
},
"source": [
"# Pivot data from wide to long\n",
"df_numeric_long <- df_numeric %>% \n",
" pivot_longer(!artist_top_genre, names_to = \"feature_names\", values_to = \"values\") \n",
"\n",
"# Print out data\n",
"df_numeric_long %>% \n",
" slice_head(n = 15)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "-7tE1swnLXyv"
},
"source": [
"現在來點更長的內容吧!現在是時候使用一些 `ggplots` 了!那麼我們會使用哪種 `geom` 呢?\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "r88bIsyuLXyy"
},
"source": [
"# Make a box plot\n",
"df_numeric_long %>% \n",
" ggplot(mapping = aes(x = feature_names, y = values, fill = feature_names)) +\n",
" geom_boxplot() +\n",
" facet_wrap(~ feature_names, ncol = 4, scales = \"free\") +\n",
" theme(legend.position = \"none\")\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "EYVyKIUELXyz"
},
"source": [
"現在我們可以看到這些數據有些雜亂:透過觀察每一列的盒狀圖,可以看到一些異常值。你可以逐一檢查數據集並移除這些異常值,但這樣可能會使數據變得過於簡化。\n",
"\n",
"目前,我們來選擇要用於聚類練習的列。讓我們挑選範圍相似的數值型列。我們可以將 `artist_top_genre` 編碼為數值型,但暫時先忽略它。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-wkpINyZLXy0"
},
"source": [
"# Select variables with similar ranges\n",
"df_numeric_select <- df_numeric %>% \n",
" select(popularity, danceability, acousticness, loudness, energy) \n",
"\n",
"# Normalize data\n",
"# df_numeric_select <- scale(df_numeric_select)\n"
],
"execution_count": null,
"outputs": []
},
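{
"cell_type": "markdown",
"metadata": {},
"source": [
"The commented-out `scale()` call in the previous cell standardizes each column by subtracting its mean and dividing by its standard deviation. A small sketch on a hypothetical toy data frame (the `toy_` names and the values are made up for illustration and are not taken from the lesson's dataset):\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Hypothetical toy values on very different scales\n",
"toy_features <- data.frame(\n",
"  popularity = c(10, 40, 70),\n",
"  loudness = c(-9.5, -6.0, -3.2)\n",
")\n",
"\n",
"# scale() centers each column to mean 0 and rescales it to standard deviation 1\n",
"toy_scaled <- scale(toy_features)\n",
"\n",
"# Column means are numerically zero and column standard deviations are one\n",
"round(colMeans(toy_scaled), 10)\n",
"apply(toy_scaled, 2, sd)\n"
],
"execution_count": null,
"outputs": []
},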
{
"cell_type": "markdown",
"metadata": {
"id": "D7dLzgpqLXy1"
},
"source": [
"## 3. 在 R 中計算 k-means 分群\n",
"\n",
"我們可以使用 R 中內建的 `kmeans` 函數來計算 k-means請參考 `help(\"kmeans()\")`。`kmeans()` 函數的主要參數是一個包含所有數值型欄位的資料框。\n",
"\n",
"使用 k-means 分群的第一步是指定最終解決方案中要生成的群組數量 (k)。我們知道從資料集中分出來的歌曲類型有 3 種,因此我們嘗試使用 3\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "uC4EQ5w7LXy5"
},
"source": [
"set.seed(2056)\n",
"# Kmeans clustering for 3 clusters\n",
"kclust <- kmeans(\n",
" df_numeric_select,\n",
" # Specify the number of clusters\n",
" centers = 3,\n",
" # How many random initial configurations\n",
" nstart = 25\n",
")\n",
"\n",
"# Display clustering object\n",
"kclust\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hzfhscWrLXy-"
},
"source": [
"kmeans 物件包含了許多資訊,這些資訊在 `help(\"kmeans()\")` 中有詳細說明。目前,我們先專注於幾個重點。我們可以看到資料被分成了 3 個群組,大小分別為 65、110 和 111。輸出中還包含了這 3 個群組在 5 個變數上的群中心(平均值)。\n",
"\n",
"群集向量是每個觀測值的群組分配。我們可以使用 `augment` 函數將群組分配新增到原始資料集中。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "0XwwpFGQLXy_"
},
"source": [
"# Add predicted cluster assignment to data set\n",
"augment(kclust, df_numeric_select) %>% \n",
" relocate(.cluster) %>% \n",
" slice_head(n = 10)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "NXIVXXACLXzA"
},
"source": [
"太好了,我們已經將數據集分成了三個群組。那麼,我們的分群效果如何呢 🤷?讓我們來看看 `Silhouette score`。\n",
"\n",
"### **Silhouette score**\n",
"\n",
"[Silhouette 分析](https://en.wikipedia.org/wiki/Silhouette_(clustering)) 可以用來研究結果群組之間的分離距離。這個分數範圍從 -1 到 1如果分數接近 1表示群組密集且與其他群組分離良好。分數接近 0 則表示群組重疊,樣本非常接近鄰近群組的決策邊界。[來源](https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam)。\n",
"\n",
"平均 Silhouette 方法計算不同 *k* 值下觀測值的平均 Silhouette 分數。高平均 Silhouette 分數表示分群效果良好。\n",
"\n",
"使用 cluster 套件中的 `silhouette` 函數來計算平均 Silhouette 寬度。\n",
"\n",
"> Silhouette 可以使用任何[距離](https://en.wikipedia.org/wiki/Distance \"Distance\")度量來計算,例如我們在[上一課](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb)中討論過的[歐幾里得距離](https://en.wikipedia.org/wiki/Euclidean_distance \"Euclidean distance\")或[曼哈頓距離](https://en.wikipedia.org/wiki/Manhattan_distance \"Manhattan distance\")。\n"
]
},
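{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the silhouette width of a single observation $i$ is conventionally defined as shown here. The symbols $a(i)$ and $b(i)$ are notation introduced for illustration and do not appear in the lesson's code: $a(i)$ is the mean distance from $i$ to the other observations in its own cluster, and $b(i)$ is the smallest mean distance from $i$ to the observations of any other cluster.\n",
"\n",
"$$s(i) = \\frac{b(i) - a(i)}{\\max(a(i),\\, b(i))}$$\n",
"\n",
"When $a(i)$ is much smaller than $b(i)$ the score approaches 1, when the two are about equal it is close to 0, and when $a(i)$ exceeds $b(i)$ it turns negative. The average of $s(i)$ over all observations is the value computed with `mean(ss[, 3])` in the next cell.\n"
]
},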
{
"cell_type": "code",
"metadata": {
"id": "Jn0McL28LXzB"
},
"source": [
"# Load cluster package\n",
"library(cluster)\n",
"\n",
"# Compute average silhouette score\n",
"ss <- silhouette(kclust$cluster,\n",
" # Compute euclidean distance\n",
" dist = dist(df_numeric_select))\n",
"mean(ss[, 3])\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "QyQRn97nLXzC"
},
"source": [
"我們的分數是 **.549**,正好位於中間位置。這表明我們的數據並不特別適合這種類型的聚類。讓我們看看是否可以通過視覺化來驗證這個猜測。[factoextra 套件](https://rpkgs.datanovia.com/factoextra/index.html) 提供了用於視覺化聚類的函數(`fviz_cluster()`)。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "7a6Km1_FLXzD"
},
"source": [
"library(factoextra)\n",
"\n",
"# Visualize clustering results\n",
"fviz_cluster(kclust, df_numeric_select)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "IBwCWt-0LXzD"
},
"source": [
"群集之間的重疊表明,我們的數據並不特別適合這種類型的群集,但我們還是繼續進行。\n",
"\n",
"## 4. 確定最佳群集數量\n",
"\n",
"在 K-Means 群集分析中,經常出現的一個基本問題是——在沒有已知類別標籤的情況下,如何知道應將數據分成多少個群集?\n",
"\n",
"我們可以嘗試的一種方法是使用數據樣本來`創建一系列群集模型`,並逐步增加群集的數量(例如從 1 到 10然後評估群集指標例如 **Silhouette 分數**。\n",
"\n",
"讓我們通過計算不同 *k* 值的群集算法來確定最佳群集數量,並評估 **群集內平方和**WCSS。群集內平方和WCSS衡量群集的緊密性我們希望它越小越好較低的值意味著數據點更接近。\n",
"\n",
"讓我們探索不同的 `k` 值(從 1 到 10對此群集的影響。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "hSeIiylDLXzE"
},
"source": [
"# Create a series of clustering models\n",
"kclusts <- tibble(k = 1:10) %>% \n",
" # Perform kmeans clustering for 1,2,3 ... ,10 clusters\n",
" mutate(model = map(k, ~ kmeans(df_numeric_select, centers = .x, nstart = 25)),\n",
" # Farm out clustering metrics eg WCSS\n",
" glanced = map(model, ~ glance(.x))) %>% \n",
" unnest(cols = glanced)\n",
" \n",
"\n",
"# View clustering rsulsts\n",
"kclusts\n"
],
"execution_count": null,
"outputs": []
},
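{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `tot.withinss` column returned by `glance()` is the WCSS: the sum, over all observations, of the squared Euclidean distance to the centroid of the assigned cluster. A minimal sketch that recomputes it by hand on simulated data (the `toy_` objects are hypothetical and independent of the lesson's dataset):\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Toy data (hypothetical): 50 random points in 2 dimensions\n",
"set.seed(2056)\n",
"toy_data <- matrix(rnorm(100), ncol = 2)\n",
"toy_fit <- kmeans(toy_data, centers = 3, nstart = 25)\n",
"\n",
"# Centroid of the assigned cluster, one row per observation\n",
"toy_assigned_centers <- toy_fit$centers[toy_fit$cluster, , drop = FALSE]\n",
"\n",
"# WCSS: squared Euclidean distance of each point to its own centroid, summed\n",
"toy_wcss <- sum((toy_data - toy_assigned_centers)^2)\n",
"\n",
"# Compare the hand-computed value with the one reported by kmeans()\n",
"all.equal(toy_wcss, toy_fit$tot.withinss)\n"
],
"execution_count": null,
"outputs": []
},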
{
"cell_type": "markdown",
"metadata": {
"id": "m7rS2U1eLXzE"
},
"source": [
"現在我們已經獲得每個聚類算法在中心 *k* 下的總體內部平方和 (tot.withinss),接下來我們使用[肘部法](https://en.wikipedia.org/wiki/Elbow_method_(clustering))來尋找最佳的聚類數量。此方法的步驟是將 WCSS 繪製成聚類數量的函數圖,並選擇[曲線的肘部](https://en.wikipedia.org/wiki/Elbow_of_the_curve \"Elbow of the curve\")作為要使用的聚類數量。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "o_DjHGItLXzF"
},
"source": [
"set.seed(2056)\n",
"# Use elbow method to determine optimum number of clusters\n",
"kclusts %>% \n",
" ggplot(mapping = aes(x = k, y = tot.withinss)) +\n",
" geom_line(size = 1.2, alpha = 0.8, color = \"#FF7F0EFF\") +\n",
" geom_point(size = 2, color = \"#FF7F0EFF\")\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "pLYyt5XSLXzG"
},
"source": [
"該圖顯示當群集數量從一個增加到兩個時WCSS因此*緊密性*)大幅減少,從兩個增加到三個群集時也有明顯的減少。之後,減少的幅度變得不那麼明顯,導致圖表在大約三個群集處出現一個「肘部」💪。這是一個很好的指標,表明數據點大致可以分為兩到三個相對獨立的群集。\n",
"\n",
"現在我們可以繼續提取 `k = 3` 的群集模型:\n",
"\n",
"> `pull()`: 用於提取單一列\n",
">\n",
"> `pluck()`: 用於索引像列表這樣的數據結構\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "JP_JPKBILXzG"
},
"source": [
"# Extract k = 3 clustering\n",
"final_kmeans <- kclusts %>% \n",
" filter(k == 3) %>% \n",
" pull(model) %>% \n",
" pluck(1)\n",
"\n",
"\n",
"final_kmeans\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "l_PDTu8tLXzI"
},
"source": [
"太好了!讓我們來看看獲得的群集。想用 `plotly` 增加一些互動性嗎?\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "dNcleFe-LXzJ"
},
"source": [
"# Add predicted cluster assignment to data set\n",
"results <- augment(final_kmeans, df_numeric_select) %>% \n",
" bind_cols(df_numeric %>% select(artist_top_genre)) \n",
"\n",
"# Plot cluster assignments\n",
"clust_plt <- results %>% \n",
" ggplot(mapping = aes(x = popularity, y = danceability, color = .cluster, shape = artist_top_genre)) +\n",
" geom_point(size = 2, alpha = 0.8) +\n",
" paletteer::scale_color_paletteer_d(\"ggthemes::Tableau_10\")\n",
"\n",
"ggplotly(clust_plt)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "6JUM_51VLXzK"
},
"source": [
"或許我們原本預期,每個群集(以不同顏色表示)會有明顯不同的類型(以不同形狀表示)。\n",
"\n",
"讓我們來看看模型的準確性。\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "HdIMUGq7LXzL"
},
"source": [
"# Assign genres to predefined integers\n",
"label_count <- results %>% \n",
" group_by(artist_top_genre) %>% \n",
" mutate(id = cur_group_id()) %>% \n",
" ungroup() %>% \n",
" summarise(correct_labels = sum(.cluster == id))\n",
"\n",
"\n",
"# Print results \n",
"cat(\"Result:\", label_count$correct_labels, \"out of\", nrow(results), \"samples were correctly labeled.\")\n",
"\n",
"cat(\"\\nAccuracy score:\", label_count$correct_labels/nrow(results))\n"
],
"execution_count": null,
"outputs": []
},
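{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keep in mind that the cluster numbers returned by `kmeans()` are arbitrary labels, so a cluster numbered 1 need not correspond to the genre whose group id is 1. A contingency table gives a view of the result that does not depend on the numbering. A minimal sketch with hypothetical toy vectors standing in for `.cluster` and `artist_top_genre` (the `toy_` names and values are made up for illustration):\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Hypothetical toy labels, standing in for `.cluster` and `artist_top_genre`\n",
"toy_cluster <- factor(c(1, 1, 2, 2, 3, 3, 3, 1))\n",
"toy_genre <- c(\"afropop\", \"afropop\", \"nigerian pop\", \"nigerian pop\",\n",
"               \"afro dancehall\", \"afro dancehall\", \"afropop\", \"afropop\")\n",
"\n",
"# Rows are clusters, columns are genres; each cell counts the matching songs\n",
"toy_crosstab <- table(cluster = toy_cluster, genre = toy_genre)\n",
"toy_crosstab\n",
"\n",
"# Share of songs that fall in the majority genre of their cluster\n",
"toy_purity <- sum(apply(toy_crosstab, 1, max)) / length(toy_cluster)\n",
"toy_purity\n"
],
"execution_count": null,
"outputs": []
},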
{
"cell_type": "markdown",
"metadata": {
"id": "C50wvaAOLXzM"
},
"source": [
"這個模型的準確性不算差,但也不算好。可能是因為這些數據不太適合使用 K-Means 分群。這些數據過於不平衡,相關性太低,而且欄位值之間的差異太大,導致分群效果不佳。事實上,形成的群集可能受到我們之前定義的三個類型分類的影響或偏斜。\n",
"\n",
"儘管如此,這仍然是一個相當有趣的學習過程!\n",
"\n",
"在 Scikit-learn 的文件中,你可以看到像這樣的模型,群集劃分不太清晰,存在「變異」問題:\n",
"\n",
"<p >\n",
" <img src=\"../../images/problems.png\"\n",
" width=\"500\"/>\n",
" <figcaption>來自 Scikit-learn 的資訊圖表</figcaption>\n",
"\n",
"\n",
"\n",
"## **變異**\n",
"\n",
"變異被定義為「與平均值的平方差的平均值」[來源](https://www.mathsisfun.com/data/standard-deviation.html)。在這個分群問題的背景下,它指的是數據集中數值偏離平均值的程度。\n",
"\n",
"✅ 這是一個很好的時機來思考如何解決這個問題。稍微調整數據?使用不同的欄位?使用不同的演算法?提示:試試[縮放你的數據](https://www.mygreatlearning.com/blog/learning-data-science-with-k-means-clustering/)以進行正規化,並測試其他欄位。\n",
"\n",
"> 試試這個「[變異計算器](https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php)」來更深入了解這個概念。\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"## **🚀挑戰**\n",
"\n",
"花些時間使用這個筆記本,調整參數。你能否通過更清理數據(例如移除異常值)來提高模型的準確性?你可以使用權重來給某些數據樣本更高的權重。還有什麼方法可以用來創建更好的群集?\n",
"\n",
"提示:試著縮放你的數據。筆記本中有註解的程式碼,添加了標準縮放以使數據欄位在範圍上更接近。你會發現,雖然輪廓分數下降了,但肘部圖中的「折點」變得更平滑。這是因為未縮放的數據允許變異較小的數據佔據更大的權重。可以在[這裡](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226)閱讀更多相關問題。\n",
"\n",
"## [**課後測驗**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/30/)\n",
"\n",
"## **回顧與自學**\n",
"\n",
"- 看看一個 K-Means 模擬器[例如這個](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/)。你可以使用這個工具來視覺化樣本數據點並確定其中心點。你可以編輯數據的隨機性、群集數量和中心點數量。這是否幫助你更好地理解數據如何分組?\n",
"\n",
"- 另外,看看[這份來自 Stanford 的 K-Means 手冊](https://stanford.edu/~cpiech/cs221/handouts/kmeans.html)。\n",
"\n",
"想要嘗試使用你新學到的分群技能來處理適合 K-Means 分群的數據集嗎?請參考:\n",
"\n",
"- [訓練和評估分群模型](https://rpubs.com/eR_ic/clustering),使用 Tidymodels 和相關工具\n",
"\n",
"- [K-Means 分群分析](https://uc-r.github.io/kmeans_clustering)UC 商業分析 R 編程指南\n",
"\n",
"- [使用整潔數據原則進行 K-Means 分群](https://www.tidymodels.org/learn/statistics/k-means/)\n",
"\n",
"## **作業**\n",
"\n",
"[嘗試不同的分群方法](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/2-K-Means/assignment.md)\n",
"\n",
"## 特別感謝:\n",
"\n",
"[Jen Looper](https://www.twitter.com/jenlooper) 創建了這個模組的原始 Python 版本 ♥️\n",
"\n",
"[`Allison Horst`](https://twitter.com/allison_horst/) 創作了令人驚嘆的插圖,使 R 更加友好和吸引人。可以在她的[畫廊](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM)找到更多插圖。\n",
"\n",
"祝學習愉快,\n",
"\n",
"[Eric](https://twitter.com/ericntay)Gold Microsoft Learn 學生大使。\n",
"\n",
"<p >\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\n",
" width=\"500\"/>\n",
" <figcaption>@allison_horst 的作品</figcaption>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**免責聲明** \n本文件已使用 AI 翻譯服務 [Co-op Translator](https://github.com/Azure/co-op-translator) 進行翻譯。儘管我們努力確保翻譯的準確性,但請注意,自動翻譯可能包含錯誤或不準確之處。原始文件的母語版本應被視為權威來源。對於關鍵信息,建議使用專業人工翻譯。我們對因使用此翻譯而引起的任何誤解或錯誤解釋不承擔責任。\n"
]
}
]
}