# K-Means clustering

[![Andrew Ng explains Clustering](https://img.youtube.com/vi/hDmNF9JG3lo/0.jpg)](https://youtu.be/hDmNF9JG3lo "Andrew Ng explains Clustering")

> 🎥 Click the image above for a video: Andrew Ng explains clustering

## [Pre-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/29/)

In this lesson, you will learn how to create clusters using Scikit-learn and the Nigerian music dataset you imported earlier. We will cover the basics of K-Means for clustering. Keep in mind that, as you learned in the previous lesson, there are many ways to work with clusters, and the right method depends on your data. We will try K-Means because it is the most common clustering technique. Let's get started!

Terms you will learn about:

- Silhouette scoring
- Elbow method
- Inertia
- Variance

## Introduction

[K-Means Clustering](https://wikipedia.org/wiki/K-means_clustering) is a method derived from the domain of signal processing. It partitions a set of observations into 'k' clusters, assigning each observation to the cluster whose 'mean' (center point) is nearest.

The clusters can be visualized as [Voronoi diagrams](https://wikipedia.org/wiki/Voronoi_diagram), which consist of points (or 'seeds') and their corresponding regions.

![voronoi diagram](../images/voronoi.png)

> infographic by [Jen Looper](https://twitter.com/jenlooper)

The K-Means clustering process [executes in a three-step process](https://scikit-learn.org/stable/modules/clustering.html#k-means):

1. The algorithm selects k center points by sampling from the dataset. After this, it loops:
    1. It assigns each sample to the nearest centroid.
    2. It creates new centroids by taking the mean of all the samples assigned to each previous centroid.
    3. It then calculates the difference between the new and old centroids and repeats until the centroids stabilize.

One drawback of K-Means is that you must choose 'k', the number of centroids, up front. Fortunately, the 'elbow method' helps to estimate a good starting value for 'k'. You'll try it in a minute.

## Prerequisite

You will work in this lesson's _notebook.ipynb_ file, which includes the data import and preliminary cleaning you did in the last lesson.

## Exercise - preparation

Start by taking another look at the songs data.

1. Create a boxplot, calling `boxplot()` for each column:

    ```python
    import matplotlib.pyplot as plt
    import seaborn as sns

    # df is the songs DataFrame loaded earlier in the notebook
    columns = ['popularity', 'acousticness', 'energy', 'instrumentalness',
               'liveness', 'loudness', 'speechiness', 'tempo',
               'time_signature', 'danceability', 'length', 'release_date']

    plt.figure(figsize=(20, 20), dpi=200)
    for i, column in enumerate(columns, start=1):
        plt.subplot(4, 3, i)  # 4 x 3 grid, one boxplot per column
        sns.boxplot(x=column, data=df)
    ```

    This data is a little noisy: by observing each column as a boxplot, you can see outliers.

    ![outliers](../images/boxplots.png)

    You could go through the dataset and remove these outliers, but that would leave very little data.

1. For now, choose the columns you will use for your clustering exercise.
    Pick ones with similar ranges and encode the `artist_top_genre` column as numeric data:

    ```python
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()

    # .copy() avoids modifying a view of df when the genre column is encoded below
    X = df.loc[:, ['artist_top_genre', 'popularity', 'danceability', 'acousticness', 'loudness', 'energy']].copy()

    y = df['artist_top_genre']

    # Encode genre names as integers, reusing the same mapping for X and y
    X['artist_top_genre'] = le.fit_transform(X['artist_top_genre'])

    y = le.transform(y)
    ```

1. Now you need to pick how many clusters to target. You know there are 3 song genres that we carved out of the dataset, so let's try 3:

    ```python
    from sklearn.cluster import KMeans

    nclusters = 3
    seed = 0

    km = KMeans(n_clusters=nclusters, random_state=seed)
    km.fit(X)

    # Predict the cluster for each data point
    y_cluster_kmeans = km.predict(X)
    y_cluster_kmeans
    ```

    You see an array printed out with the predicted cluster (0, 1, or 2) for each row of the dataframe.

1. Use this array to calculate a 'silhouette score':

    ```python
    from sklearn import metrics
    score = metrics.silhouette_score(X, y_cluster_kmeans)
    score
    ```

## Silhouette score

Look for a silhouette score close to 1. The score ranges from -1 to 1; a score near 1 means the clusters are dense and well separated from each other. A value near 0 represents overlapping clusters, with samples very close to the decision boundary of the neighboring clusters. [(Source)](https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam)

Our score is **.53**, so right in the middle. This indicates that our data is not particularly well suited to this type of clustering, but let's continue.

### Exercise - build a model

1. Import `KMeans` and start the clustering process.
    ```python
    from sklearn.cluster import KMeans
    wcss = []

    for i in range(1, 11):
        kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
    ```

    There are a few parts here that warrant explaining.

    > 🎓 range: The candidate numbers of clusters to try; `range(1, 11)` fits one model for each k from 1 to 10.

    > 🎓 random_state: "Determines random number generation for centroid initialization." [Source](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)

    > 🎓 WCSS: "within-cluster sums of squares" is the sum of the squared distances from each point in a cluster to that cluster's centroid. [Source](https://medium.com/@ODSC/unsupervised-learning-evaluating-clusters-bd47eed175ce)

    > 🎓 Inertia: K-Means algorithms attempt to choose centroids that minimize 'inertia', "a measure of how internally coherent clusters are." [Source](https://scikit-learn.org/stable/modules/clustering.html). The value is appended to the `wcss` variable on each iteration.

    > 🎓 k-means++: In [Scikit-learn](https://scikit-learn.org/stable/modules/clustering.html#k-means) you can use the 'k-means++' optimization, which initializes the centroids to be (generally) distant from each other, probably leading to better results than random initialization.

### Elbow method

Previously, you surmised that, because you targeted 3 song genres, you should choose 3 clusters. But is that the case?

1. Use the 'elbow method' to make sure.

    ```python
    plt.figure(figsize=(10,5))
    # Pass x and y as keywords; recent seaborn releases do not accept them positionally
    sns.lineplot(x=list(range(1, 11)), y=wcss, marker='o', color='red')
    plt.title('Elbow')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()
    ```

    Use the `wcss` variable that you built in the previous step to create a chart showing where the 'bend' in the elbow is, which indicates the optimum number of clusters. Maybe it **is** 3!
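
Reading the bend off a chart is subjective. As a rough numeric companion to the elbow plot, the sketch below builds the same kind of WCSS list on synthetic data and picks the k where the curve bends most sharply (the largest second difference). The blob centers, the `n_init=10` setting, and the second-difference heuristic are assumptions made for this illustration; none of them come from the lesson's notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the song features: three well-separated blobs (assumed centers)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Same loop as in the lesson: one model per k, collecting inertia (WCSS)
ks = list(range(1, 11))
wcss_demo = []
for k in ks:
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    model.fit(X_demo)
    wcss_demo.append(model.inertia_)

# Second difference of the WCSS curve: large where the drop flattens abruptly.
# np.diff(..., n=2)[j] is centered on ks[j + 1].
second_diff = np.diff(wcss_demo, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]

print('WCSS per k:', [round(w, 1) for w in wcss_demo])
print('Estimated elbow at k =', elbow_k)
```

On real data the bend is often much less sharp than on clean blobs, so treat a number like this as a hint to compare against the chart rather than as a decision.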
The elbow chart for the song data looks like this:

![elbow method](../images/elbow.png)

## Exercise - display the clusters

1. Try the process again, this time setting three clusters, and display the clusters as a scatterplot:

    ```python
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters = 3)
    kmeans.fit(X)
    labels = kmeans.predict(X)
    plt.scatter(df['popularity'], df['danceability'], c = labels)
    plt.xlabel('popularity')
    plt.ylabel('danceability')
    plt.show()
    ```

1. Check the model's accuracy:

    ```python
    labels = kmeans.labels_

    # Cluster ids are arbitrary, so matching them against encoded genres is only a rough check
    correct_labels = sum(y == labels)

    print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))

    print('Accuracy score: {0:0.2f}'.format(correct_labels/float(y.size)))
    ```

    This model's accuracy is not very good, and the shape of the clusters gives you a hint why.

    ![clusters](../images/clusters.png)

    This data is too imbalanced, too weakly correlated, and has too much variance between the column values to cluster well. In fact, the clusters that form are probably heavily influenced or skewed by the three genre categories we defined above. That was a learning process!

    In Scikit-learn's documentation, you can see that a model like this one, with clusters that are not well demarcated, has a 'variance' problem:

    ![problem models](../images/problems.png)
    > Infographic from Scikit-learn

## Variance

Variance is defined as "the average of the squared differences from the Mean" [(Source)](https://www.mathsisfun.com/data/standard-deviation.html). In the context of this clustering problem, it means that the values in our dataset tend to diverge a bit too much from the mean.

✅ This is a great moment to think about all the ways you could correct this issue. Tweak the data a bit more? Use different columns? Use a different algorithm?
Hint: Try [scaling your data](https://www.mygreatlearning.com/blog/learning-data-science-with-k-means-clustering/) to normalize it and test other columns.

> Try this '[variance calculator](https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php)' to understand the concept a bit more.

---

## 🚀 Challenge

Spend some time with this notebook, tweaking parameters. Can you improve the accuracy of the model by cleaning the data more (removing outliers, for example)? You can use weights to give more weight to given data samples. What else can you do to create better clusters?

Hint: Try to scale your data. There's commented code in the notebook that adds standard scaling to make the data columns resemble each other more closely in terms of range. You'll find that while the silhouette score goes down, the 'kink' in the elbow graph smooths out. This is because leaving the data unscaled allows columns with larger numeric ranges to carry more weight in the distance calculation. Read a bit more on this problem [here](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226).

## [Post-lecture quiz](https://white-water-09ec41f0f.azurestaticapps.net/quiz/30/)

## Review & Self Study

Take a look at a K-Means simulator [such as this one](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/). You can use this tool to visualize sample data points and determine their centroids. You can edit the data's randomness, the number of clusters, and the number of centroids. Does this help you get an idea of how the data can be grouped?

Also, take a look at [this handout on k-means](https://stanford.edu/~cpiech/cs221/handouts/kmeans.html) from Stanford.
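
To connect the simulator with the three-step process described in the introduction, here is a minimal NumPy sketch of the assign-and-update loop. It is a teaching aid under stated assumptions, not how Scikit-learn implements `KMeans`: the function name `kmeans_sketch`, its parameters, and the two synthetic blobs are all hypothetical, and initialization is plain random sampling rather than 'k-means++'.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    """Hypothetical helper: plain K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: sample k distinct rows of X as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have effectively stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two synthetic blobs around (0, 0) and (8, 8), 50 points each (assumed demo data)
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(50, 2)),
])

centroids, labels = kmeans_sketch(X_demo, k=2)
print('Centroids:\n', centroids)
print('Cluster sizes:', np.bincount(labels, minlength=2))
```

Changing `seed` changes the starting centroids. On less cleanly separated data, different seeds can settle on different clusterings, which is one reason `KMeans` exposes `random_state`.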
## Assignment

[Try different clustering methods](../assignment.md)