{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "## **Nigerian Music scraped from Spotify - an analysis**\n",
    "\n",
    "Clustering is a type of [Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning) that presumes a dataset is unlabeled, or that its inputs are not matched with predefined outputs. It uses various algorithms to sort through unlabeled data and provide groupings according to the patterns it discerns in the data.\n",
    "\n",
    "[**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/27/)\n",
    "\n",
    "### **Introduction**\n",
    "\n",
    "[Clustering](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_124) is very useful for data exploration. Let's see if it can help discover trends and patterns in the way Nigerian audiences consume music.\n",
    "\n",
    "> ✅ Take a minute to think about the uses of clustering. Do you notice moments when clustering happens in daily life? For example, when you sort your family's clothes after doing the laundry 🧦👕👖🩲. In data science, clustering is used to analyze a user's preferences or to determine the characteristics of unlabeled data. Clustering, in a way, helps make sense of chaos.\n",
    "\n",
    "In a professional setting, clustering can be used for market segmentation, for example to determine which age groups buy which items. It can also be used for anomaly detection, such as detecting fraud, or to find diseases such as cancer in medical scans.\n",
    "\n",
    "✅ Think for a minute about how you might have encountered clustering in a banking, e-commerce, or business setting.\n",
    "\n",
    "> 🎓 Interestingly, cluster analysis originated in the fields of Anthropology and Psychology in the 1930s. Can you imagine how it might have been used back then?\n",
    "\n",
    "Clustering can also be used to group search results, for example by shopping links, images, or reviews. It is useful when you have a large dataset that you want to reduce and then analyze in more granular detail.\n",
    "\n",
    "✅ Once your data is organized in clusters, you can assign it a cluster Id. Referring to a data point by its cluster Id can help preserve a dataset's privacy. Can you think of other benefits of using a cluster Id?\n",
    "\n",
    "### Getting started with clustering\n",
    "\n",
    "> 🎓 How we create clusters has a lot to do with how we gather the data points into groups. Let's unpack some vocabulary:\n",
    ">\n",
    "> 🎓 ['Transductive' vs. 'inductive'](https://wikipedia.org/wiki/Transduction_(machine_learning))\n",
    ">\n",
    "> Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules, which are only then applied to test cases.\n",
    ">\n",
    "> An example: imagine you have a dataset that is only partially labeled. With an inductive approach, you would train a model on the items labeled 'records' and 'cds' and then apply those labels to the unlabeled data. A transductive approach handles the unlabeled data more effectively, because it groups similar items together first and then labels each group.\n",
    ">\n",
    "> 🎓 ['Non-flat' vs. 'flat' geometry](https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering)\n",
    ">\n",
    "> 'Flat' vs. 'non-flat' refers to how distances between points are measured: flat geometry is [Euclidean](https://wikipedia.org/wiki/Euclidean_geometry), while non-flat geometry is non-Euclidean.\n",
    ">\n",
    "> 🎓 ['Distances'](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)\n",
    ">\n",
    "> Clusters are defined by their distance matrix, that is, the distances between points. Euclidean clusters are defined by a center point, the 'centroid', while non-Euclidean clusters are defined by a 'clustroid', the point closest to the other points.\n",
    ">\n",
    "> 🎓 ['Constrained'](https://wikipedia.org/wiki/Constrained_clustering)\n",
    ">\n",
    "> Constrained Clustering introduces semi-supervised learning into this unsupervised method. Relationships between data points are flagged as 'cannot-link' or 'must-link', so some rules are forced on the dataset.\n",
    ">\n",
    "> 🎓 'Density'\n",
    ">\n",
    "> Data that is 'noisy' is described here as 'dense': the points in its clusters can be more or less crowded together. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) explains the difference between K-Means and HDBSCAN.\n",
    "\n",
    "Deepen your understanding of clustering techniques in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-cluster-models?WT.mc_id=academic-77952-leestott).\n",
    "\n",
    "### **Clustering Algorithms**\n",
    "\n",
    "There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones:\n",
    "\n",
    "-   **Hierarchical Clustering**: Objects are grouped by their proximity to nearby objects, so clusters form according to the distances between their members.\n",
    "  \n",
    "-   **Centroid Clustering**: After 'k' (the number of clusters to form) is chosen, the algorithm determines the center point of each cluster and gathers data around it. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a very popular version.\n",
    "\n",
    "-   **Distribution-based Clustering**: Determines how likely it is that a data point belongs to a cluster and assigns it accordingly.\n",
    "\n",
    "-   **Density-based Clustering**: Data points are assigned to clusters based on their density, that is, how closely they group around each other.\n",
    "\n",
    "-   **Grid-based Clustering**: For multi-dimensional datasets, a grid is created and the data is divided among the grid's cells.\n",
    "\n",
    "The best way to learn about clustering is to try it for yourself.\n",
    "\n",
    "The packages needed to complete this module can be installed with `install.packages(c('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork'))`. Alternatively, the cell below uses `pacman::p_load()` to load them, installing any that are missing.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
    "\r\n",
    "pacman::p_load('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork')\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
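  {
   "cell_type": "markdown",
   "source": [
    "To make the 'centroid' and 'clustroid' terms from the vocabulary above a little more concrete, here is a minimal sketch in base R. The five points and the names `pts`, `dist_mat`, `centroid` and `clustroid` are made up for illustration and are not part of the lesson's dataset. The sketch assumes the clustroid is the existing point with the smallest total distance to the others, which is only one of several possible definitions.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Toy data (an assumption for illustration): five points in two dimensions\r\n",
    "pts <- matrix(c(1, 1,\r\n",
    "                2, 1,\r\n",
    "                1, 2,\r\n",
    "                2, 2,\r\n",
    "                6, 6),\r\n",
    "              ncol = 2, byrow = TRUE,\r\n",
    "              dimnames = list(paste0(\"p\", 1:5), c(\"x\", \"y\")))\r\n",
    "\r\n",
    "# Pairwise Euclidean distance matrix between the points\r\n",
    "dist_mat <- as.matrix(dist(pts, method = \"euclidean\"))\r\n",
    "\r\n",
    "# Centroid: the coordinate-wise mean; it need not be one of the points\r\n",
    "centroid <- colMeans(pts)\r\n",
    "\r\n",
    "# Clustroid (one possible definition): the existing point whose summed\r\n",
    "# distance to all other points is smallest\r\n",
    "clustroid <- names(which.min(rowSums(dist_mat)))\r\n",
    "\r\n",
    "centroid\r\n",
    "clustroid\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },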
  {
   "cell_type": "markdown",
   "source": [
    "## Exercise - Cluster your data\n",
    "\n",
    "Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which clustering method is most effective for the nature of this data.\n",
    "\n",
    "Let's get started by importing the data!\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Load the core tidyverse and make it available in your current R session\r\n",
    "library(tidyverse)\r\n",
    "\r\n",
    "# Import the data into a tibble\r\n",
    "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv\")\r\n",
    "\r\n",
    "# View the first 5 rows of the data set\r\n",
    "df %>% \r\n",
    "  slice_head(n = 5)\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Sometimes we may want a little more information about our data. We can look at the `data` and `its structure` by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Glimpse into the data set\r\n",
    "df %>% \r\n",
    "  glimpse()\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Good job!💪\n",
    "\n",
    "`glimpse()` gives the total number of rows (observations) and columns (variables), followed by the first few entries of each variable in a row after the variable name. In addition, the *data type* of each variable is shown inside `< >` immediately after the variable's name.\n",
    "\n",
    "`DataExplorer::introduce()` can summarize this information neatly.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Describe basic information for our data\r\n",
    "df %>% \r\n",
    "  introduce()\r\n",
    "\r\n",
    "# A visual display of the same\r\n",
    "df %>% \r\n",
    "  plot_intro()\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Awesome! We have just learned that our data has no missing values.\n",
    "\n",
    "While we are at it, we can explore common measures of central tendency (e.g. [mean](https://en.wikipedia.org/wiki/Arithmetic_mean) and [median](https://en.wikipedia.org/wiki/Median)) and measures of dispersion (e.g. [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)) using `summarytools::descr()`.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Describe common statistics\r\n",
    "df %>% \r\n",
    "  descr(stats = \"common\")\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's look at the general values of the data. Note that popularity can be `0`, which indicates songs that have no ranking. We'll remove those shortly.\n",
    "\n",
    "> 🤔 If we are working with clustering, an unsupervised method that does not require labeled data, why are we showing this data with labels? The labels come in handy in the data exploration phase, but they are not necessary for the clustering algorithms to work.\n",
    "\n",
    "### 1. Explore popular genres\n",
    "\n",
    "Let's find out which genres 🎶 are the most popular by counting how many times each one appears.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Popular genres\r\n",
    "top_genres <- df %>% \r\n",
    "  count(artist_top_genre, sort = TRUE) %>% \r\n",
    "  # Encode to categorical and reorder the levels according to count\r\n",
    "  mutate(artist_top_genre = factor(artist_top_genre) %>% fct_inorder())\r\n",
    "\r\n",
    "# Print the top genres\r\n",
    "top_genres\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "That went well! They say a picture is worth a thousand rows of a data frame (actually, nobody ever says that 😅). But you get the gist of it, right?\n",
    "\n",
    "One way to visualize categorical data (character or factor variables) is with barplots. Let's make a barplot of the top 10 genres:\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Change the default gray theme\r\n",
    "theme_set(theme_light())\r\n",
    "\r\n",
    "# Visualize popular genres\r\n",
    "top_genres %>%\r\n",
    "  slice(1:10) %>% \r\n",
    "  ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n",
    "                       fill = artist_top_genre)) +\r\n",
    "  geom_col(alpha = 0.8) +\r\n",
    "  paletteer::scale_fill_paletteer_d(\"rcartocolor::Vivid\") +\r\n",
    "  ggtitle(\"Top genres\") +\r\n",
    "  theme(plot.title = element_text(hjust = 0.5),\r\n",
    "        # Rotates the X markers (so we can read them)\r\n",
    "    axis.text.x = element_text(angle = 90))\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now it's much easier to see that we have `missing` genres 🧐!\n",
    "\n",
    "> A good visualisation will show you things that you did not expect, or raise new questions about the data - Hadley Wickham and Garrett Grolemund, [R For Data Science](https://r4ds.had.co.nz/introduction.html)\n",
    "\n",
    "Note that when the top genre is described as `Missing`, it means Spotify did not classify it, so let's get rid of it.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Visualize popular genres\r\n",
    "top_genres %>%\r\n",
    "  filter(artist_top_genre != \"Missing\") %>% \r\n",
    "  slice(1:10) %>% \r\n",
    "  ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n",
    "                       fill = artist_top_genre)) +\r\n",
    "  geom_col(alpha = 0.8) +\r\n",
    "  paletteer::scale_fill_paletteer_d(\"rcartocolor::Vivid\") +\r\n",
    "  ggtitle(\"Top genres\") +\r\n",
    "  theme(plot.title = element_text(hjust = 0.5),\r\n",
    "        # Rotates the X markers (so we can read them)\r\n",
    "    axis.text.x = element_text(angle = 90))\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "From this little bit of data exploration, we learn that the top three genres dominate this dataset. Let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, and additionally filter the dataset to remove anything with a popularity value of 0 (meaning it was not assigned a popularity in the dataset and can be considered noise for our purposes):\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "nigerian_songs <- df %>% \r\n",
    "  # Concentrate on top 3 genres\r\n",
    "  filter(artist_top_genre %in% c(\"afro dancehall\", \"afropop\",\"nigerian pop\")) %>% \r\n",
    "  # Remove unclassified observations\r\n",
    "  filter(popularity != 0)\r\n",
    "\r\n",
    "\r\n",
    "\r\n",
    "# Visualize popular genres\r\n",
    "nigerian_songs %>%\r\n",
    "  count(artist_top_genre) %>%\r\n",
    "  ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n",
    "                       fill = artist_top_genre)) +\r\n",
    "  geom_col(alpha = 0.8) +\r\n",
    "  paletteer::scale_fill_paletteer_d(\"ggsci::category10_d3\") +\r\n",
    "  ggtitle(\"Top genres\") +\r\n",
    "  theme(plot.title = element_text(hjust = 0.5))\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's see whether there is any apparent linear relationship among the numerical variables in our data set. This relationship is quantified mathematically by the [correlation statistic](https://en.wikipedia.org/wiki/Correlation).\n",
    "\n",
    "The correlation statistic is a value between -1 and 1 that indicates the strength of a relationship. Values above 0 indicate a *positive* correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a *negative* correlation (high values of one variable tend to coincide with low values of the other).\n"
   ],
   "metadata": {}
  },
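  {
   "cell_type": "markdown",
   "source": [
    "Before computing correlations on the song data, a tiny sketch with made-up vectors may help in reading the sign of the statistic described above. The names `toy`, `x`, `up`, `down` and `toy_corr` are hypothetical and exist only for this illustration; `cor()` is the same base R function the lesson applies to the real data. Because `up` and `down` are exact linear functions of `x`, the two printed values sit at the extremes of the range (1 and -1, up to floating-point rounding); real data usually lands somewhere in between.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Toy data (an assumption for illustration, not the lesson's dataset)\r\n",
    "toy <- data.frame(x = 1:10)\r\n",
    "toy$up   <- 2 * toy$x + 1   # rises as x rises\r\n",
    "toy$down <- -3 * toy$x + 5  # falls as x rises\r\n",
    "\r\n",
    "# Pearson correlation matrix of the three columns\r\n",
    "toy_corr <- cor(toy)\r\n",
    "\r\n",
    "# Positive correlation: x and up move together\r\n",
    "toy_corr[\"x\", \"up\"]\r\n",
    "\r\n",
    "# Negative correlation: down decreases as x increases\r\n",
    "toy_corr[\"x\", \"down\"]\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },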
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Narrow down to numeric variables and find correlation\r\n",
    "corr_mat <- nigerian_songs %>% \r\n",
    "  select(where(is.numeric)) %>% \r\n",
    "  cor()\r\n",
    "\r\n",
    "# Visualize correlation matrix\r\n",
    "corrplot(corr_mat, order = 'AOE', col = c('white', 'black'), bg = 'gold2')  \r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "The data is not strongly correlated except between `energy` and `loudness`, which makes sense, given that loud music is usually pretty energetic. `popularity` shows a correspondence with `release_date`, which also makes sense, as more recent songs are probably more popular. Length and energy seem to be correlated too.\n",
    "\n",
    "It will be interesting to see what a clustering algorithm can make of this data!\n",
    "\n",
    "> 🎓 Note that correlation does not imply causation! We have proof of correlation but no proof of causation. An [amusing web site](https://tylervigen.com/spurious-correlations) has some visuals that emphasize this point.\n",
    "\n",
    "### 2. Explore data distribution\n",
    "\n",
    "Let's ask some more subtle questions. Are the genres significantly different in the perception of their danceability, based on their popularity? Let's examine the distribution of popularity and danceability for our top three genres along a given x and y axis using [density plots](https://www.khanacademy.org/math/ap-statistics/density-curves-normal-distribution-ap/density-curves/v/density-curves).\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Perform 2D kernel density estimation\r\n",
    "density_estimate_2d <- nigerian_songs %>% \r\n",
    "  ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre)) +\r\n",
    "  geom_density_2d(bins = 5, size = 1) +\r\n",
    "  paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n",
    "  xlim(-20, 80) +\r\n",
    "  ylim(0, 1.2)\r\n",
    "\r\n",
    "# Density plot based on the popularity\r\n",
    "density_estimate_pop <- nigerian_songs %>% \r\n",
    "  ggplot(mapping = aes(x = popularity, fill = artist_top_genre, color = artist_top_genre)) +\r\n",
    "  geom_density(size = 1, alpha = 0.5) +\r\n",
    "  paletteer::scale_fill_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n",
    "  paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n",
    "  theme(legend.position = \"none\")\r\n",
    "\r\n",
    "# Density plot based on the danceability\r\n",
    "density_estimate_dance <- nigerian_songs %>% \r\n",
    "  ggplot(mapping = aes(x = danceability, fill = artist_top_genre, color = artist_top_genre)) +\r\n",
    "  geom_density(size = 1, alpha = 0.5) +\r\n",
    "  paletteer::scale_fill_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n",
    "  paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\")\r\n",
    "\r\n",
    "\r\n",
    "# Patch everything together\r\n",
    "library(patchwork)\r\n",
    "density_estimate_2d / (density_estimate_pop + density_estimate_dance)\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "We see that there are concentric circles that line up, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre?\n",
    "\n",
    "In general, the three genres align in terms of their popularity and danceability. Determining clusters in this loosely aligned data will be a challenge. Let's see whether a scatter plot can support this.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# A scatter plot of popularity and danceability\r\n",
    "scatter_plot <- nigerian_songs %>% \r\n",
    "  ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre, shape = artist_top_genre)) +\r\n",
    "  geom_point(size = 2, alpha = 0.8) +\r\n",
    "  paletteer::scale_color_paletteer_d(\"futurevisions::mars\")\r\n",
    "\r\n",
    "# Add a touch of interactivity\r\n",
    "ggplotly(scatter_plot)\r\n"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "A scatterplot of the same axes shows a similar pattern of convergence.\n",
    "\n",
    "In general, for clustering, you can use scatterplots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in it that seem to overlap in interesting ways.\n",
    "\n",
    "## **🚀 Challenge**\n",
    "\n",
    "In preparation for the next lesson, make a chart about the various clustering algorithms you might discover and use in a production environment. What kinds of problems is clustering trying to address?\n",
    "\n",
    "## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/28/)\n",
    "\n",
    "## **Review & Self Study**\n",
    "\n",
    "Before you apply clustering algorithms, as we have learned, it's a good idea to understand the nature of your dataset. Read more on this topic [here](https://www.kdnuggets.com/2019/10/right-clustering-algorithm.html).\n",
    "\n",
    "Deepen your understanding of clustering techniques:\n",
    "\n",
    "-   [Train and Evaluate Clustering Models using Tidymodels and friends](https://rpubs.com/eR_ic/clustering)\n",
    "\n",
    "-   Bradley Boehmke & Brandon Greenwell, [*Hands-On Machine Learning with R*](https://bradleyboehmke.github.io/HOML/)*.*\n",
    "\n",
    "## **Assignment**\n",
    "\n",
    "[Research other visualizations for clustering](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/assignment.md)\n",
    "\n",
    "## THANK YOU TO:\n",
    "\n",
    "[Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
    "\n",
    "[`Dasani Madipalli`](https://twitter.com/dasani_decoded) for creating the amazing illustrations that make machine learning concepts easier to understand.\n",
    "\n",
    "Happy Learning,\n",
    "\n",
    "[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**Disclaimer**:  \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": "",
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.4.1"
  },
  "coopTranslator": {
   "original_hash": "99c36449cad3708a435f6798cfa39972",
   "translation_date": "2025-09-06T12:10:28+00:00",
   "source_file": "5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb",
   "language_code": "my"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}