ML-For-Beginners/translations/tw/5-Clustering/2-K-Means/solution/R/lesson_15-R.ipynb

{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "anaconda-cloud": "",
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.4.1"
  },
  "colab": {
   "name": "lesson_14.ipynb",
   "provenance": [],
   "collapsed_sections": [],
   "toc_visible": true
  },
  "coopTranslator": {
   "original_hash": "ad65fb4aad0a156b42216e4929f490fc",
   "translation_date": "2025-09-03T20:19:29+00:00",
   "source_file": "5-Clustering/2-K-Means/solution/R/lesson_15-R.ipynb",
   "language_code": "tw"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GULATlQXLXyR"
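   },
   "source": [
    "## Explore K-Means clustering using R and Tidy data principles\n",
    "\n",
    "### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/29/)\n",
    "\n",
    "In this lesson, you will learn how to create clusters using the Tidymodels package and other packages in the R ecosystem (we'll call them friends 🧑‍🤝‍🧑), together with the Nigerian music dataset you imported earlier. We will cover the basics of K-Means clustering. Keep in mind that, as you learned in the earlier lesson, there are many ways to work with clusters, and the method you use depends on your data. We will try K-Means because it is the most common clustering technique. Let's get started!\n",
    "\n",
    "Terms you will learn about:\n",
    "\n",
    "-   Silhouette scoring\n",
    "\n",
    "-   Elbow method\n",
    "\n",
    "-   Inertia\n",
    "\n",
    "-   Variance\n",
    "\n",
    "### **Introduction**\n",
    "\n",
    "[K-Means clustering](https://wikipedia.org/wiki/K-means_clustering) is a method derived from the domain of signal processing. It is used to partition data into `k clusters` based on similarities in their features.\n",
    "\n",
    "The clusters can be visualized as [Voronoi diagrams](https://wikipedia.org/wiki/Voronoi_diagram), which include a point (or \"seed\") and its corresponding region.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/voronoi.png\"\n",
    "   width=\"500\"/>\n",
    "   <figcaption>Infographic by Jen Looper</figcaption>\n",
    "\n",
    "K-Means clustering has the following steps:\n",
    "\n",
    "1.  The data scientist starts by specifying the desired number of clusters to be created.\n",
    "\n",
    "2.  Next, the algorithm randomly selects K observations from the data set to serve as the initial centers for the clusters (i.e., the centroids).\n",
    "\n",
    "3.  Next, each of the remaining observations is assigned to its closest centroid.\n",
    "\n",
    "4.  Next, the new mean of each cluster is computed and the centroid is moved to the mean.\n",
    "\n",
    "5.  Now that the centers have been recalculated, every observation is checked again to see whether it might be closer to a different cluster. All the objects are reassigned using the updated cluster means. The cluster assignment and centroid update steps are repeated until the cluster assignments stop changing (i.e., when convergence is achieved). Typically, the algorithm terminates when each new iteration results in negligible movement of the centroids and the clusters become static.\n",
    "\n",
    "<div>\n",
    "\n",
    "> Note that because the initial k observations are chosen at random, we can get slightly different results each time we apply the procedure. For this reason, most algorithms use several *random starts* and choose the run with the lowest within-cluster sum of squares (WCSS). It is therefore strongly recommended to always run K-Means with an *nstart* value greater than 1, so that several random starts are tried and an *undesirable local optimum* is avoided.\n",
    "\n",
    "</div>\n",
    "\n",
    "This short animation, using the [artwork](https://github.com/allisonhorst/stats-illustrations) of Allison Horst, explains the clustering process:\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/kmeans.gif\"\n",
    "   width=\"550\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n",
    "\n",
    "A fundamental question that arises in clustering is this: how do you know how many clusters to separate your data into? One drawback of K-Means is that you need to establish `k`, the number of `centroids`. Fortunately, the `elbow method` helps to estimate a good starting value for `k`. You'll try it in a minute.\n",
    "\n",
    "### **Prerequisite**\n",
    "\n",
    "We'll pick up right where we left off in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb), where we analysed the data set, made lots of visualizations, and filtered the data set down to the observations of interest. Be sure to check it out!\n",
    "\n",
    "We'll need some packages to get through this module. You can install them with: `install.packages(c('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork'))`\n",
    "\n",
    "Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n"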
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "ah_tBi58LXyi"
   },
   "source": [
    "suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
    "\n",
    "pacman::p_load('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork')\n"
   ],
   "execution_count": null,
   "outputs": []
  },
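  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before working with the real data, it may help to see the five K-Means steps from the introduction written out by hand. What follows is only a minimal sketch in base R, not how `kmeans()` is implemented: the toy data and the helper name `simple_kmeans` are made up for illustration.\n",
    "\n",
    "```r\n",
    "# Toy data: two well-separated groups of 2-D points (made-up values)\n",
    "set.seed(42)\n",
    "x <- rbind(\n",
    "  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),\n",
    "  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)\n",
    ")\n",
    "\n",
    "# Hypothetical helper, not part of any package\n",
    "simple_kmeans <- function(x, k, max_iter = 100) {\n",
    "  # Step 2: pick k observations at random as the initial centroids\n",
    "  centroids <- x[sample(nrow(x), k), , drop = FALSE]\n",
    "  assignment <- rep(0L, nrow(x))\n",
    "  for (iter in seq_len(max_iter)) {\n",
    "    # Step 3: squared Euclidean distance from every point to every centroid\n",
    "    d <- vapply(\n",
    "      seq_len(k),\n",
    "      function(j) rowSums(sweep(x, 2, centroids[j, ])^2),\n",
    "      numeric(nrow(x))\n",
    "    )\n",
    "    new_assignment <- max.col(-d, ties.method = 'first')\n",
    "    # Step 5: stop once the assignments no longer change\n",
    "    if (all(new_assignment == assignment)) break\n",
    "    assignment <- new_assignment\n",
    "    # Step 4: move each centroid to the mean of its cluster\n",
    "    for (j in seq_len(k)) {\n",
    "      members <- x[assignment == j, , drop = FALSE]\n",
    "      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)\n",
    "    }\n",
    "  }\n",
    "  list(cluster = assignment, centers = centroids, iterations = iter)\n",
    "}\n",
    "\n",
    "fit <- simple_kmeans(x, k = 2)\n",
    "table(fit$cluster)\n",
    "fit$centers\n",
    "```\n",
    "\n",
    "In practice, prefer the built-in `kmeans()` with several random starts. This sketch uses a single random start and exists only to make the assign-and-update loop visible.\n"
   ]
  },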
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7e--UCUTLXym"
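   },
   "source": [
    "Let's hit the ground running!\n",
    "\n",
    "## 1. A dance with data: Narrow down to the 3 most popular music genres\n",
    "\n",
    "This is a recap of what we did in the previous lesson. Let's slice and dice some data!\n"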
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "Ycamx7GGLXyn"
   },
   "source": [
    "# Load the core tidyverse and make it available in your current R session\n",
    "library(tidyverse)\n",
    "\n",
    "# Import the data into a tibble\n",
    "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv\", show_col_types = FALSE)\n",
    "\n",
    "# Narrow down to top 3 popular genres\n",
    "nigerian_songs <- df %>% \n",
    "  # Concentrate on top 3 genres\n",
    "  filter(artist_top_genre %in% c(\"afro dancehall\", \"afropop\",\"nigerian pop\")) %>% \n",
    "  # Remove unclassified observations\n",
    "  filter(popularity != 0)\n",
    "\n",
    "\n",
    "\n",
    "# Visualize popular genres using bar plots\n",
    "theme_set(theme_light())\n",
    "nigerian_songs %>%\n",
    "  count(artist_top_genre) %>%\n",
    "  ggplot(mapping = aes(x = artist_top_genre, y = n,\n",
    "                       fill = artist_top_genre)) +\n",
    "  geom_col(alpha = 0.8) +\n",
    "  paletteer::scale_fill_paletteer_d(\"ggsci::category10_d3\") +\n",
    "  ggtitle(\"Top genres\") +\n",
    "  theme(plot.title = element_text(hjust = 0.5))\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "b5h5zmkPLXyp"
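   },
   "source": [
    "🤩 That went well!\n",
    "\n",
    "## 2. More data exploration\n",
    "\n",
    "How clean is this data? Let's check for outliers using box plots. We will concentrate on numeric columns with fewer outliers (although you could clean out the outliers). Box plots can show the range of the data and will help choose which columns to use. Note that box plots do not show variance, an important element of good clusterable data. See [this discussion](https://stats.stackexchange.com/questions/91536/deduce-variance-from-boxplot) for further reading.\n",
    "\n",
    "[Box plots](https://en.wikipedia.org/wiki/Box_plot) are used to graphically depict the distribution of `numeric` data, so let's start by *selecting* all numeric columns alongside the popular music genres.\n"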
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "HhNreJKLLXyq"
   },
   "source": [
    "# Select top genre column and all other numeric columns\n",
    "df_numeric <- nigerian_songs %>% \n",
    "  select(artist_top_genre, where(is.numeric)) \n",
    "\n",
    "# Display the data\n",
    "df_numeric %>% \n",
    "  slice_head(n = 5)\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uYXrwJRaLXyq"
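   },
   "source": [
    "See how the selection helper `where` makes this easy 💁? Explore other such functions [here](https://tidyselect.r-lib.org/).\n",
    "\n",
    "Since we'll be making a box plot for each numeric feature and we want to avoid using loops, let's reshape the data into a *longer* format that lets us take advantage of `facets`: subplots that each display one subset of the data.\n"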
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "gd5bR3f8LXys"
   },
   "source": [
    "# Pivot data from wide to long\n",
    "df_numeric_long <- df_numeric %>% \n",
    "  pivot_longer(!artist_top_genre, names_to = \"feature_names\", values_to = \"values\") \n",
    "\n",
    "# Print out data\n",
    "df_numeric_long %>% \n",
    "  slice_head(n = 15)\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-7tE1swnLXyv"
   },
   "source": [
    "Much longer! Now it's time for some `ggplots`! So which `geom` will we use?\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "r88bIsyuLXyy"
   },
   "source": [
    "# Make a box plot\n",
    "df_numeric_long %>% \n",
    "  ggplot(mapping = aes(x = feature_names, y = values, fill = feature_names)) +\n",
    "  geom_boxplot() +\n",
    "  facet_wrap(~ feature_names, ncol = 4, scales = \"free\") +\n",
    "  theme(legend.position = \"none\")\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EYVyKIUELXyz"
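   },
   "source": [
    "Now we can see that this data is a little noisy: the box plot for each column shows outliers. You could go through the data set and remove these outliers, but that would leave very little data.\n",
    "\n",
    "For now, let's choose which columns we will use for our clustering exercise. Let's pick the numeric columns with similar ranges. We could encode `artist_top_genre` as numeric, but we'll drop it for now.\n"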
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "-wkpINyZLXy0"
   },
   "source": [
    "# Select variables with similar ranges\n",
    "df_numeric_select <- df_numeric %>% \n",
    "  select(popularity, danceability, acousticness, loudness, energy) \n",
    "\n",
    "# Normalize data\n",
    "# df_numeric_select <- scale(df_numeric_select)\n"
   ],
   "execution_count": null,
   "outputs": []
  },
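  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The commented-out `scale()` call in the cell above is worth a closer look. As a small sketch on made-up numbers (the `toy` data frame is hypothetical, not part of the lesson data), this is what standardizing does to columns that live on very different scales:\n",
    "\n",
    "```r\n",
    "# Made-up values on very different scales\n",
    "toy <- data.frame(\n",
    "  popularity = c(10, 40, 70, 100),\n",
    "  danceability = c(0.2, 0.4, 0.6, 0.8)\n",
    ")\n",
    "\n",
    "# scale() centers each column on 0 and divides it by its standard deviation\n",
    "toy_scaled <- scale(toy)\n",
    "\n",
    "# After scaling, every column has mean 0 and standard deviation 1\n",
    "colMeans(toy_scaled)\n",
    "apply(toy_scaled, 2, sd)\n",
    "\n",
    "# Euclidean distance between the first two rows, before and after scaling\n",
    "dist(toy)[1]\n",
    "dist(toy_scaled)[1]\n",
    "```\n",
    "\n",
    "Before scaling, the distance between rows is driven almost entirely by the column with the larger numbers. After scaling, both columns contribute comparably, which matters because K-Means relies on these distances.\n"
   ]
  },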
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "D7dLzgpqLXy1"
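   },
   "source": [
    "## 3. Computing k-means clustering in R\n",
    "\n",
    "We can compute k-means in R with the built-in `kmeans` function, see `help(\"kmeans\")`. The primary argument of `kmeans()` is a data frame in which every column is numeric.\n",
    "\n",
    "The first step when using k-means clustering is to specify the number of clusters (k) that will be generated in the final solution. We know there are 3 song genres that we carved out of the data set, so let's try 3:\n"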
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "uC4EQ5w7LXy5"
   },
   "source": [
    "set.seed(2056)\n",
    "# Kmeans clustering for 3 clusters\n",
    "kclust <- kmeans(\n",
    "  df_numeric_select,\n",
    "  # Specify the number of clusters\n",
    "  centers = 3,\n",
    "  # How many random initial configurations\n",
    "  nstart = 25\n",
    ")\n",
    "\n",
    "# Display clustering object\n",
    "kclust\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hzfhscWrLXy-"
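   },
   "source": [
    "The kmeans object contains several pieces of information, which are explained in `help(\"kmeans\")`. For now, let's focus on a few. We see that the data has been grouped into 3 clusters of sizes 65, 110 and 111. The output also contains the cluster centers (means) for the 3 groups across the 5 variables.\n",
    "\n",
    "The clustering vector is the cluster assignment for each observation. Let's use the `augment` function (from the broom package, loaded with Tidymodels) to add the cluster assignment to the original data set.\n"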
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "0XwwpFGQLXy_"
   },
   "source": [
    "# Add predicted cluster assignment to data set\n",
    "augment(kclust, df_numeric_select) %>% \n",
    "  relocate(.cluster) %>% \n",
    "  slice_head(n = 10)\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NXIVXXACLXzA"
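   },
   "source": [
    "Perfect, we have just partitioned our data set into 3 groups. So, how good is our clustering 🤷? Let's take a look at the `Silhouette score`.\n",
    "\n",
    "### **Silhouette score**\n",
    "\n",
    "[Silhouette analysis](https://en.wikipedia.org/wiki/Silhouette_(clustering)) can be used to study the separation distance between the resulting clusters. The score ranges from -1 to 1. A score near 1 means the cluster is dense and well separated from other clusters. A value near 0 represents overlapping clusters, with samples very close to the decision boundary of the neighbouring clusters. [source](https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam).\n",
    "\n",
    "The average silhouette method computes the average silhouette of observations for different values of *k*. A high average silhouette score indicates a good clustering.\n",
    "\n",
    "Use the `silhouette` function in the cluster package to compute the silhouette width of each observation, then average the widths.\n",
    "\n",
    "> The silhouette can be calculated with any [distance](https://en.wikipedia.org/wiki/Distance \"Distance\") metric, such as the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance \"Euclidean distance\") or the [Manhattan distance](https://en.wikipedia.org/wiki/Manhattan_distance \"Manhattan distance\"), which we discussed in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb).\n"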
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "Jn0McL28LXzB"
   },
   "source": [
    "# Load cluster package\n",
    "library(cluster)\n",
    "\n",
    "# Compute average silhouette score\n",
    "ss <- silhouette(kclust$cluster,\n",
    "                 # Compute euclidean distance\n",
    "                 dist = dist(df_numeric_select))\n",
    "mean(ss[, 3])\n"
   ],
   "execution_count": null,
   "outputs": []
  },
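  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the number returned by `silhouette()` more concrete, here is a hedged base R sketch that computes silhouette widths by hand on made-up one-dimensional data. The vectors `x` and `cluster` and the helper `silhouette_width` are hypothetical; the helper follows the standard definition `(b - a) / max(a, b)` and assumes every cluster has at least two members.\n",
    "\n",
    "```r\n",
    "# Made-up 1-D data with an obvious two-cluster structure\n",
    "x <- c(1, 2, 3, 10, 11, 12)\n",
    "cluster <- c(1, 1, 1, 2, 2, 2)\n",
    "d <- as.matrix(dist(x))\n",
    "\n",
    "# Hypothetical helper: silhouette width of observation i\n",
    "silhouette_width <- function(i) {\n",
    "  same <- cluster == cluster[i]\n",
    "  same[i] <- FALSE\n",
    "  # a: mean distance to the other members of the point's own cluster\n",
    "  a <- mean(d[i, same])\n",
    "  # b: lowest mean distance to the members of any other cluster\n",
    "  others <- setdiff(unique(cluster), cluster[i])\n",
    "  b <- min(sapply(others, function(cl) mean(d[i, cluster == cl])))\n",
    "  (b - a) / max(a, b)\n",
    "}\n",
    "\n",
    "widths <- sapply(seq_along(x), silhouette_width)\n",
    "round(widths, 3)\n",
    "\n",
    "# The average silhouette width summarizes the whole clustering\n",
    "mean(widths)\n",
    "```\n",
    "\n",
    "Because the two toy groups are far apart, every width lands close to 1. Overlapping groups would pull the widths toward 0.\n"
   ]
  },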
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QyQRn97nLXzC"
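   },
   "source": [
    "Our score is **.549**, roughly halfway between 0 and 1. This indicates that our data is not particularly well suited to this type of clustering. Let's see whether we can confirm this hunch visually. The [factoextra package](https://rpkgs.datanovia.com/factoextra/index.html) provides a function (`fviz_cluster()`) to visualize clustering.\n"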
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "7a6Km1_FLXzD"
   },
   "source": [
    "library(factoextra)\n",
    "\n",
    "# Visualize clustering results\n",
    "fviz_cluster(kclust, df_numeric_select)\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IBwCWt-0LXzD"
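   },
   "source": [
    "The overlap between clusters indicates that our data is not particularly well suited to this type of clustering, but let's continue.\n",
    "\n",
    "## 4. Determining optimal clusters\n",
    "\n",
    "A fundamental question that often arises in K-Means clustering is this: without known class labels, how do you know how many clusters to separate your data into?\n",
    "\n",
    "One way to find out is to use a data sample to `create a series of clustering models` with an incrementing number of clusters (e.g. from 1 to 10), and evaluate clustering metrics such as the **Silhouette score**.\n",
    "\n",
    "Let's determine the optimal number of clusters by running the clustering algorithm for different values of *k* and evaluating the **Within Cluster Sum of Squares** (WCSS). The total WCSS measures the compactness of the clustering. We want it to be as small as possible, since lower values mean that the data points are closer to their cluster centers.\n",
    "\n",
    "Let's explore the effect of different choices of `k`, from 1 to 10, on this clustering.\n"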
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "hSeIiylDLXzE"
   },
   "source": [
    "# Seed the random starts so the series of models is reproducible\n",
    "set.seed(2056)\n",
    "\n",
    "# Create a series of clustering models\n",
    "kclusts <- tibble(k = 1:10) %>% \n",
    "  # Perform kmeans clustering for 1,2,3 ... ,10 clusters\n",
    "  mutate(model = map(k, ~ kmeans(df_numeric_select, centers = .x, nstart = 25)),\n",
    "  # Farm out clustering metrics eg WCSS\n",
    "         glanced = map(model, ~ glance(.x))) %>% \n",
    "  unnest(cols = glanced)\n",
    "  \n",
    "\n",
    "# View clustering results\n",
    "kclusts\n"
   ],
   "execution_count": null,
   "outputs": []
  },
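  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `tot.withinss` column is the WCSS described earlier. As a quick sanity check, the sketch below recomputes it by hand for a k-means fit on made-up data (the `toy` matrix is hypothetical) and compares the result with the value `kmeans()` reports.\n",
    "\n",
    "```r\n",
    "# Made-up 2-D data: two loose groups of 15 points each\n",
    "set.seed(2056)\n",
    "toy <- rbind(\n",
    "  matrix(rnorm(30, mean = 0), ncol = 2),\n",
    "  matrix(rnorm(30, mean = 4), ncol = 2)\n",
    ")\n",
    "\n",
    "fit <- kmeans(toy, centers = 2, nstart = 10)\n",
    "\n",
    "# Squared distance from each point to the center of its assigned cluster\n",
    "sq_dist <- rowSums((toy - fit$centers[fit$cluster, ])^2)\n",
    "\n",
    "# Summing them gives the total within-cluster sum of squares\n",
    "wcss_by_hand <- sum(sq_dist)\n",
    "all.equal(wcss_by_hand, fit$tot.withinss)\n",
    "```\n",
    "\n",
    "Adding more clusters can only shrink this sum, which is why the elbow method looks for the point where the improvement levels off rather than for the minimum.\n"
   ]
  },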
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "m7rS2U1eLXzE"
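   },
   "source": [
    "Now that we have the total within-cluster sum of squares (tot.withinss) for each clustering model with *k* centers, we use the [elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) to find the optimal number of clusters. The method consists of plotting the WCSS as a function of the number of clusters, and picking the [elbow of the curve](https://en.wikipedia.org/wiki/Elbow_of_the_curve \"Elbow of the curve\") as the number of clusters to use.\n"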
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "o_DjHGItLXzF"
   },
   "source": [
    "set.seed(2056)\n",
    "# Use elbow method to determine optimum number of clusters\n",
    "kclusts %>% \n",
    "  ggplot(mapping = aes(x = k, y = tot.withinss)) +\n",
    "  geom_line(size = 1.2, alpha = 0.8, color = \"#FF7F0EFF\") +\n",
    "  geom_point(size = 2, color = \"#FF7F0EFF\")\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pLYyt5XSLXzG"
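   },
   "source": [
    "The plot shows a large reduction in WCSS (so greater *tightness*) as the number of clusters increases from one to two, and a further noticeable reduction from two to three clusters. After that, the reduction is less pronounced, resulting in an `elbow` 💪 in the chart at around three clusters. This is a good indication that there are two to three reasonably well separated clusters of data points.\n",
    "\n",
    "We can now go ahead and extract the clustering model where `k = 3`:\n",
    "\n",
    "> `pull()`: used to extract a single column\n",
    ">\n",
    "> `pluck()`: used to index data structures such as lists\n"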
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "JP_JPKBILXzG"
   },
   "source": [
    "# Extract k = 3 clustering\n",
    "final_kmeans <- kclusts %>% \n",
    "  filter(k == 3) %>% \n",
    "  pull(model) %>% \n",
    "  pluck(1)\n",
    "\n",
    "\n",
    "final_kmeans\n"
   ],
   "execution_count": null,
   "outputs": []
  },
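  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why both `pull()` and `pluck()` were needed above, here is a small sketch on a made-up tibble with a list column. The `models` tibble and its contents are hypothetical stand-ins for `kclusts` and its fitted models.\n",
    "\n",
    "```r\n",
    "library(dplyr)\n",
    "library(purrr)\n",
    "library(tibble)\n",
    "\n",
    "# A tibble with a list column, similar in shape to kclusts\n",
    "models <- tibble(k = 1:3, model = list('one', 'two', 'three'))\n",
    "\n",
    "# pull() returns the column itself, which is still a list of length 1\n",
    "as_list <- models %>% filter(k == 3) %>% pull(model)\n",
    "class(as_list)\n",
    "\n",
    "# pluck(1) reaches inside that list and returns the stored element\n",
    "element <- as_list %>% pluck(1)\n",
    "element\n",
    "```\n",
    "\n",
    "Without the final `pluck(1)`, `final_kmeans` would be a one-element list wrapping the model rather than the kmeans object itself.\n"
   ]
  },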
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "l_PDTu8tLXzI"
   },
   "source": [
    "Great! Let's go ahead and visualize the clusters obtained. Care for some interactivity using `plotly`?\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "dNcleFe-LXzJ"
   },
   "source": [
    "# Add predicted cluster assignment to data set\n",
    "results <-  augment(final_kmeans, df_numeric_select) %>% \n",
    "  bind_cols(df_numeric %>% select(artist_top_genre)) \n",
    "\n",
    "# Plot cluster assignments\n",
    "clust_plt <- results %>% \n",
    "  ggplot(mapping = aes(x = popularity, y = danceability, color = .cluster, shape = artist_top_genre)) +\n",
    "  geom_point(size = 2, alpha = 0.8) +\n",
    "  paletteer::scale_color_paletteer_d(\"ggthemes::Tableau_10\")\n",
    "\n",
    "ggplotly(clust_plt)\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6JUM_51VLXzK"
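   },
   "source": [
    "Perhaps we would have expected each cluster (represented by different colors) to contain a distinct genre (represented by different shapes).\n",
    "\n",
    "Let's take a look at the model's accuracy.\n"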
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "HdIMUGq7LXzL"
   },
   "source": [
    "# Assign genres to predefined integers\n",
    "label_count <- results %>% \n",
    "  group_by(artist_top_genre) %>% \n",
    "  mutate(id = cur_group_id()) %>% \n",
    "  ungroup() %>% \n",
    "  summarise(correct_labels = sum(.cluster == id))\n",
    "\n",
    "\n",
    "# Print results  \n",
    "cat(\"Result:\", label_count$correct_labels, \"out of\", nrow(results), \"samples were correctly labeled.\")\n",
    "\n",
    "cat(\"\\nAccuracy score:\", label_count$correct_labels/nrow(results))\n"
   ],
   "execution_count": null,
   "outputs": []
  },
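  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One caveat about the calculation above: k-means numbers its clusters arbitrarily, so comparing `.cluster` directly with a genre id only counts a match when the two numberings happen to line up. The hedged sketch below uses made-up labels (the `truth` and `cluster` vectors are hypothetical) to show how a cross-tabulation, plus a search over the possible relabellings, gives a score that does not depend on the numbering.\n",
    "\n",
    "```r\n",
    "# Made-up labels: 'truth' stands in for genre ids, 'cluster' for k-means output\n",
    "truth   <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)\n",
    "cluster <- c(2, 2, 2, 3, 3, 1, 1, 1, 1)\n",
    "\n",
    "# Direct comparison treats cluster numbers as if they were genre ids\n",
    "direct_accuracy <- mean(cluster == truth)\n",
    "direct_accuracy\n",
    "\n",
    "# Cross-tabulation shows which cluster lines up with which label\n",
    "tab <- table(truth, cluster)\n",
    "tab\n",
    "\n",
    "# Try every one-to-one relabelling of the 3 clusters and keep the best\n",
    "perms <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),\n",
    "              c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))\n",
    "matched <- sapply(perms, function(p) sum(tab[cbind(1:3, p)]))\n",
    "best_accuracy <- max(matched) / length(truth)\n",
    "best_accuracy\n",
    "```\n",
    "\n",
    "With only three clusters there are six relabellings to try, so a brute-force search is fine. For many clusters an assignment solver would be the usual choice.\n"
   ]
  },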
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "C50wvaAOLXzM"
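   },
   "source": [
    "This model's accuracy is not bad, but not great. It may be that the data does not lend itself well to K-Means clustering. The data is too imbalanced, too weakly correlated, and there is too much variance between the column values to cluster well. In fact, the clusters that form are probably heavily influenced or skewed by the three genre categories we defined above.\n",
    "\n",
    "That was a learning process nonetheless!\n",
    "\n",
    "In Scikit-learn's documentation, you can see that a model like this one, with clusters that are not well demarcated, has a \"variance\" problem:\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/problems.png\"\n",
    "   width=\"500\"/>\n",
    "   <figcaption>Infographic from Scikit-learn</figcaption>\n",
    "\n",
    "\n",
    "\n",
    "## **Variance**\n",
    "\n",
    "Variance is defined as \"the average of the squared differences from the Mean\" [source](https://www.mathsisfun.com/data/standard-deviation.html). In the context of this clustering problem, it means that the values in our data set tend to diverge a bit too much from the mean.\n",
    "\n",
    "✅ This is a great moment to think about all the ways you could correct this issue. Tweak the data a bit more? Use different columns? Use a different algorithm? Hint: Try [scaling your data](https://www.mygreatlearning.com/blog/learning-data-science-with-k-means-clustering/) to normalize it and test other columns.\n",
    "\n",
    "> Try this \"[variance calculator](https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php)\" to understand the concept a bit more.\n",
    "\n",
    "------------------------------------------------------------------------\n",
    "\n",
    "## **🚀Challenge**\n",
    "\n",
    "Spend some time with this notebook, tweaking parameters. Can you improve the accuracy of the model by cleaning the data more (removing outliers, for example)? You can use weights to give more weight to given data samples. What else can you do to create better clusters?\n",
    "\n",
    "Hint: Try to scale your data. There is commented-out code in the notebook that adds standard scaling to make the data columns resemble each other more closely in terms of range. You'll find that while the silhouette score goes down, the \"kink\" in the elbow graph smooths out. This is because leaving the data unscaled lets the columns with larger variance carry more weight in the distance calculation. Read a bit more on this problem [here](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226).\n",
    "\n",
    "## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/30/)\n",
    "\n",
    "## **Review & Self Study**\n",
    "\n",
    "-   Take a look at a K-Means simulator [such as this one](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/). You can use this tool to visualize sample data points and determine their centroids. You can edit the data's randomness, the number of clusters, and the number of centroids. Does this help you get an idea of how the data can be grouped?\n",
    "\n",
    "-   Also, take a look at [this handout on K-Means](https://stanford.edu/~cpiech/cs221/handouts/kmeans.html) from Stanford.\n",
    "\n",
    "Want to try out your newly acquired clustering skills on data sets that lend themselves well to K-Means clustering? Please see:\n",
    "\n",
    "-   [Train and Evaluate Clustering Models](https://rpubs.com/eR_ic/clustering) using Tidymodels and friends\n",
    "\n",
    "-   [K-means Cluster Analysis](https://uc-r.github.io/kmeans_clustering), UC Business Analytics R Programming Guide\n",
    "\n",
    "- [K-means clustering with tidy data principles](https://www.tidymodels.org/learn/statistics/k-means/)\n",
    "\n",
    "## **Assignment**\n",
    "\n",
    "[Try different clustering methods](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/2-K-Means/assignment.md)\n",
    "\n",
    "## THANK YOU TO:\n",
    "\n",
    "[Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
    "\n",
    "[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
    "\n",
    "Happy Learning,\n",
    "\n",
    "[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/r_learners_sm.jpeg\"\n",
    "   width=\"500\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n"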
   ]
  },
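  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**Disclaimer**:  \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"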
   ]
  }
 ]
}