ML-For-Beginners/translations/my/4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb

{
 "nbformat": 4,
 "nbformat_minor": 2,
 "metadata": {
  "colab": {
   "name": "lesson_10-R.ipynb",
   "provenance": [],
   "collapsed_sections": []
  },
  "kernelspec": {
   "name": "ir",
   "display_name": "R"
  },
  "language_info": {
   "name": "R"
  },
  "coopTranslator": {
   "original_hash": "2621e24705e8100893c9bf84e0fc8aef",
   "translation_date": "2025-09-06T12:37:36+00:00",
   "source_file": "4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb",
   "language_code": "my"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "ItETB4tSFprR"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## အမျိုးအစားသတ်မှတ်ခြင်းကိုနားလည်ခြင်း - ဒေတာကိုသန့်စင်၊ ပြင်ဆင်၊ ရှုထောင့်မှကြည့်ရှုခြင်း\n",
    "\n",
    "ဒီသင်ခန်းစာလေး ၄ ခုမှာ သင်သည် ရိုးရာစက်မှုသင်ယူမှု၏ အခြေခံအချက်တစ်ခုဖြစ်သော *အမျိုးအစားသတ်မှတ်ခြင်း* ကိုလေ့လာပါမည်။ အာရှနှင့်အိန္ဒိယ၏ အံ့ဩဖွယ်အစားအစာများနှင့်ပတ်သက်သော ဒေတာစဉ်ကို အသုံးပြု၍ အမျိုးအစားသတ်မှတ်ခြင်းအယ်လဂိုရီသမ်များကို သင်ကြားပေးပါမည်။ အစားအသောက်အတွက် အဆာပြေဖို့ ပြင်ဆင်ထားပါ!\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/pinch.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>ဒီသင်ခန်းစာများတွင် အာရှအစားအစာများကို ကျေးဇူးတင်ပါ။ ဓာတ်ပုံ - Jen Looper</figcaption>\n",
    "\n",
    "<!--![Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper](../../../../../../4-Classification/1-Introduction/solution/R/images/pinch.png)-->\n",
    "\n",
    "အမျိုးအစားသတ်မှတ်ခြင်းသည် [supervised learning](https://wikipedia.org/wiki/Supervised_learning) ၏ အမျိုးအစားတစ်ခုဖြစ်ပြီး regression နည်းလမ်းများနှင့် ဆင်တူသော အချက်များစွာပါရှိသည်။ အမျိုးအစားသတ်မှတ်ခြင်းတွင် သင်သည် `category` တစ်ခုကို အရာဝတ္ထုတစ်ခုက ဘယ်အမျိုးအစားတွင် ပါဝင်မည်ကို ခန့်မှန်းရန် မော်ဒယ်ကို လေ့ကျင့်သည်။ စက်မှုသင်ယူမှုသည် ဒေတာစဉ်များကို အသုံးပြု၍ တန်ဖိုးများ သို့မဟုတ် အမည်များကို ခန့်မှန်းခြင်းနှင့် ပတ်သက်သည်ဆိုပါက အမျိုးအစားသတ်မှတ်ခြင်းသည် *binary classification* နှင့် *multiclass classification* ဆိုသော အုပ်စု ၂ ခုအတွင်းတွင် ကျရောက်သည်။\n",
    "\n",
    "သတိပြုပါ-\n",
    "\n",
    "-   **Linear regression** သည် variable များအကြား ဆက်နွယ်မှုများကို ခန့်မှန်းရန်နှင့် ဒေတာအချက်အလက်အသစ်တစ်ခုသည် အဆိုပါလိုင်းနှင့် ဆက်နွယ်မှုအတွင်း ဘယ်နေရာတွင် ကျရောက်မည်ကို မှန်ကန်စွာခန့်မှန်းရန် ကူညီပေးသည်။ ဥပမာအားဖြင့် *သွားရည်တစ်ခု၏ စျေးနှုန်းသည် စက်တင်ဘာနှင့် ဒီဇင်ဘာတွင် ဘယ်လိုဖြစ်မည်* ဆိုသည်ကို ခန့်မှန်းနိုင်သည်။\n",
    "\n",
    "-   **Logistic regression** သည် \"binary categories\" ကို ရှာဖွေရာတွင် ကူညီပေးသည်။ ဥပမာအားဖြင့် *ဤစျေးနှုန်းတွင် သွားရည်သည် လိမ္မော်ရောင်ဖြစ်မည် သို့မဟုတ် မဖြစ်မည်*?\n",
    "\n",
    "အမျိုးအစားသတ်မှတ်ခြင်းသည် ဒေတာအချက်အလက်၏ label သို့မဟုတ် class ကို သတ်မှတ်ရန် အခြားနည်းလမ်းများကို သတ်မှတ်ရန် အယ်လဂိုရီသမ်များကို အသုံးပြုသည်။ ဒီအစားအစာဒေတာကို အသုံးပြု၍ အဖွဲ့တစ်ခု၏ အစိတ်အပိုင်းများကို ကြည့်ရှုခြင်းဖြင့် အစားအစာ၏ မူရင်းကို သတ်မှတ်နိုင်မည်ဖြစ်သည်။\n",
    "\n",
    "### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)\n",
    "\n",
    "### **နိဒါန်း**\n",
    "\n",
    "အမျိုးအစားသတ်မှတ်ခြင်းသည် စက်မှုသင်ယူမှုသုတေသနရှင်နှင့် ဒေတာသိပ္ပံပညာရှင်၏ အခြေခံလုပ်ငန်းစဉ်များထဲမှ တစ်ခုဖြစ်သည်။ binary value (\"ဤအီးမေးလ်သည် spam ဖြစ်ပါသလား မဖြစ်ပါသလား\") ကို ရိုးရှင်းစွာ သတ်မှတ်ခြင်းမှစ၍ computer vision ကို အသုံးပြု၍ ရုပ်ပုံအမျိုးအစားသတ်မှတ်ခြင်းနှင့် segmentation အထိ၊ ဒေတာကို အမျိုးအစားများအလိုက် သတ်မှတ်ရန်နှင့် မေးခွန်းများမေးရန် အမြဲအသုံးဝင်သည်။\n",
    "\n",
    "သိပ္ပံပညာရပ်ဆန်သော နည်းလမ်းဖြင့် ပြောရမည်ဆိုပါက သင်၏ အမျိုးအစားသတ်မှတ်ခြင်းနည်းလမ်းသည် input variable များနှင့် output variable များအကြား ဆက်နွယ်မှုကို map လုပ်ရန် ခန့်မှန်းမော်ဒယ်တစ်ခုကို ဖန်တီးပေးသည်။\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/binary-multiclass.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>အမျိုးအစားသတ်မှတ်ခြင်းအယ်လဂိုရီသမ်များကို ကိုင်တွယ်ရန် binary နှင့် multiclass ပြဿနာများ။ Infographic - Jen Looper</figcaption>\n",
    "\n",
    "ဒေတာကို သန့်စင်ခြင်း၊ ရှုထောင့်မှကြည့်ရှုခြင်းနှင့် ML လုပ်ငန်းစဉ်များအတွက် ပြင်ဆင်ခြင်းလုပ်ငန်းစဉ်ကို စတင်မလုပ်မီ၊ ဒေတာကို အမျိုးအစားသတ်မှတ်ရန် စက်မှုသင်ယူမှုကို အသုံးပြုနိုင်သော နည်းလမ်းများအကြောင်းကို နည်းနည်းလေ့လာကြည့်ပါ။\n",
    "\n",
    "[statistics](https://wikipedia.org/wiki/Statistical_classification) မှ ဆင်းသက်လာသော classic machine learning ကို အသုံးပြု၍ classification သည် `smoker`, `weight`, `age` ကဲ့သို့သော features များကို အသုံးပြု၍ *X ရောဂါဖြစ်ပွားနိုင်မှု* ကို သတ်မှတ်သည်။ သင်၏ဒေတာသည် label လုပ်ထားပြီး ML အယ်လဂိုရီသမ်များသည် အဆိုပါ label များကို အသုံးပြု၍ ဒေတာစဉ်၏ အမျိုးအစားများ (သို့မဟုတ် 'features') ကို ခန့်မှန်းခြင်းနှင့် အုပ်စု သို့မဟုတ် ရလဒ်တစ်ခုသို့ assign လုပ်ပေးသည်။\n",
    "\n",
    "✅ အစားအစာများနှင့်ပတ်သက်သော ဒေတာစဉ်ကို စဉ်းစားရန် အချိန်ယူပါ။ multiclass မော်ဒယ်တစ်ခုက ဘာကို ဖြေရှင်းနိုင်မလဲ? binary မော်ဒယ်တစ်ခုက ဘာကို ဖြေရှင်းနိုင်မလဲ? fenugreek ကို အသုံးပြုမည်ဖြစ်သော အစားအစာကို သတ်မှတ်လိုပါက ဘာဖြစ်မည်? star anise, artichokes, cauliflower, horseradish တို့ပါဝင်သော အစားအစာအိတ်တစ်ခုကို သင်ရရှိပါက အိန္ဒိယအစားအစာတစ်ခုကို ဖန်တီးနိုင်မည်ဖြစ်ပါသလား?\n",
    "\n",
    "### **Hello 'classifier'**\n",
    "\n",
    "ဤအစားအစာဒေတာစဉ်အပေါ် မေးလိုသောမေးခွန်းသည် **multiclass question** တစ်ခုဖြစ်သည်၊ အမျိုးအစားများစွာနှင့်အလုပ်လုပ်ရန် အမျိုးအစားများစွာရှိသည်။ အစိတ်အပိုင်းများအစုတစ်ခုကို ကြည့်ရှု၍ အဆိုပါဒေတာသည် အမျိုးအစားများထဲမှ ဘယ်အမျိုးအစားတွင် ပါဝင်မည်ကို သတ်မှတ်နိုင်မည်။\n",
    "\n",
    "Tidymodels သည် အမျိုးအစားသတ်မှတ်ရန် သင်လိုချင်သော ပြဿနာအမျိုးအစားပေါ်မူတည်၍ အယ်လဂိုရီသမ်များစွာကို ပေးသည်။ နောက်ထပ်သင်ခန်းစာ ၂ ခုတွင် သင်သည် အယ်လဂိုရီသမ်များအကြောင်းကို လေ့လာပါမည်။\n",
    "\n",
    "#### **လိုအပ်ချက်**\n",
    "\n",
    "ဒီသင်ခန်းစာအတွက် ဒေတာကို သန့်စင်ခြင်း၊ ပြင်ဆင်ခြင်းနှင့် ရှုထောင့်မှကြည့်ရှုရန် အောက်ပါ packages များလိုအပ်ပါမည်-\n",
    "\n",
    "-   `tidyverse`: [tidyverse](https://www.tidyverse.org/) သည် [R packages](https://www.tidyverse.org/packages) များစုစည်းမှုဖြစ်ပြီး ဒေတာသိပ္ပံကို ပိုမိုလျင်မြန်စေပြီး ပိုမိုလွယ်ကူစေသည်။\n",
    "\n",
    "-   `tidymodels`: [tidymodels](https://www.tidymodels.org/) framework သည် [packages](https://www.tidymodels.org/packages/) များစုစည်းမှုဖြစ်ပြီး မော်ဒယ်ဖန်တီးခြင်းနှင့် စက်မှုသင်ယူမှုအတွက် အသုံးပြုသည်။\n",
    "\n",
    "-   `DataExplorer`: [DataExplorer package](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) သည် EDA လုပ်ငန်းစဉ်နှင့် အစီရင်ခံစာဖန်တီးမှုကို လွယ်ကူစေပြီး အလိုအလျောက်လုပ်ဆောင်သည်။\n",
    "\n",
    "-   `themis`: [themis package](https://themis.tidymodels.org/) သည် Unbalanced Data ကို ကိုင်တွယ်ရန် Extra Recipes Steps များပေးသည်။\n",
    "\n",
    "သင်သည် အောက်ပါအတိုင်း install လုပ်နိုင်သည်-\n",
    "\n",
    "`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"here\"))`\n",
    "\n",
    "အခြားနည်းလမ်းအနေနှင့် အောက်ပါ script သည် module ကို ပြီးစီးရန်လိုအပ်သော packages များရှိမရှိ စစ်ဆေးပြီး မရှိပါက install လုပ်ပေးပါမည်။\n"
   ],
   "metadata": {
    "id": "ri5bQxZ-Fz_0"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n",
    "\r\n",
    "pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
   ],
   "outputs": [],
   "metadata": {
    "id": "KIPxa4elGAPI"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "ကျွန်တော်တို့ ဒီအံ့သြဖွယ်ပက်ကေ့ဂျ်တွေကို နောက်ပိုင်းမှာ load လုပ်ပြီး လက်ရှိ R session မှာ အသုံးပြုနိုင်အောင် ပြင်ဆင်ပေးပါမယ်။ (ဒါက ဥပမာပြရန်သာဖြစ်ပြီး၊ `pacman::p_load()` က အဲဒီအလုပ်ကို ရှင်းပြီးသားဖြစ်ပါတယ်)\n"
   ],
   "metadata": {
    "id": "YkKAxOJvGD4C"
   }
  },
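  {
   "cell_type": "markdown",
   "source": [
    "For illustration only, here is a minimal sketch of what loading the packages explicitly could look like. It assumes the packages are already installed; since `pacman::p_load()` has already attached them, running this cell is optional.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Attach each package to the current R session, one library() call per package.\n",
    "# Unlike pacman::p_load(), library() does not install a missing package;\n",
    "# it stops with an error instead.\n",
    "library(tidyverse)\n",
    "library(tidymodels)\n",
    "library(DataExplorer)\n",
    "library(themis)\n",
    "library(here)"
   ],
   "outputs": [],
   "metadata": {}
  },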
  {
   "cell_type": "markdown",
   "source": [
    "## လေ့ကျင့်မှု - သင့်ဒေတာကို သန့်ရှင်းပြီး ညီမျှအောင် ပြုလုပ်ပါ\n",
    "\n",
    "ဒီပရောဂျက်ကို စတင်မတိုင်မီ ပထမဆုံးလုပ်ဆောင်ရမည့်အလုပ်က သင့်ဒေတာကို **သန့်ရှင်း**ပြီး **ညီမျှ**အောင် ပြုလုပ်ခြင်းဖြစ်ပါတယ်။ ဒါက ပိုမိုကောင်းမွန်တဲ့ရလဒ်တွေ ရရှိစေမှာပါ။\n",
    "\n",
    "အရင်ဆုံး ဒေတာနဲ့ မိတ်ဆက်ကြစို့! 🕵️\n"
   ],
   "metadata": {
    "id": "PFkQDlk0GN5O"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Import data\r\n",
    "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
    "\r\n",
    "# View the first 5 rows\r\n",
    "df %>% \r\n",
    "  slice_head(n = 5)\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "Qccw7okxGT0S"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "XrWnlgSrGVmR"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Basic information about the data\r\n",
    "df %>%\r\n",
    "  introduce()\r\n",
    "\r\n",
    "# Visualize basic information above\r\n",
    "df %>% \r\n",
    "  plot_intro(ggtheme = theme_light())"
   ],
   "outputs": [],
   "metadata": {
    "id": "4UcGmxRxGieA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "အထွေထွေ အချက်အလက်များအရ၊ ကျွန်ုပ်တို့တွင် `2448` အတန်းနှင့် `385` ကော်လံများရှိပြီး၊ `0` မရှိသောတန်ဖိုးများဖြစ်ကြောင်း မြင်နိုင်ပါသည်။ ထို့အပြင်၊ *cuisine* ဟုခေါ်သော ၁ ခုသော discrete ကော်လံလည်း ပါဝင်ပါသည်။\n",
    "\n",
    "## လေ့ကျင့်မှု - အစားအစာအမျိုးအစားများကို လေ့လာခြင်း\n",
    "\n",
    "ယခုအချိန်တွင် အလုပ်များ ပိုမိုစိတ်ဝင်စားဖွယ် ဖြစ်လာပါသည်။ အစားအစာအမျိုးအစားအလိုက် ဒေတာဖြန့်ဖြူးမှုကို ရှာဖွေကြည့်ရအောင်။\n"
   ],
   "metadata": {
    "id": "AaPubl__GmH5"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Count observations per cuisine\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(n)\r\n",
    "\r\n",
    "# Plot the distribution\r\n",
    "theme_set(theme_light())\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
    "  geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
    "  ylab(\"cuisine\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "FRsBVy5eGrrv"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "အစားအစာအမျိုးအစားများသည် အရေအတွက်ကန့်သတ်ထားပြီး၊ ဒေတာဖြန့်ဝေမှုမှာ မညီမျှပါ။ ဒါကို သင်ပြင်ဆင်နိုင်ပါတယ်! ပြင်ဆင်မလုပ်ခင်မှာ အရင်ဆုံး နည်းနည်းလေ့လာကြည့်ပါ။\n",
    "\n",
    "အခုတော့ အစားအစာအမျိုးအစားတစ်ခုချင်းစီကို သူ့ရဲ့ tibble ထဲမှာ သတ်မှတ်ပြီး၊ အစားအစာအမျိုးအစားတစ်ခုချင်းစီအတွက် ရရှိနိုင်တဲ့ ဒေတာအရေအတွက် (အတန်း၊ ကော်လံ) ကို ရှာဖွေကြည့်ပါ။\n",
    "\n",
    "> [tibble](https://tibble.tidyverse.org/) ဆိုတာ ခေတ်မီသော data frame တစ်ခုဖြစ်ပါတယ်။\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/dplyr_filter.jpg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>ပန်းချီရေးဆွဲသူ @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "vVvyDb1kG2in"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create individual tibble for the cuisines\r\n",
    "thai_df <- df %>% \r\n",
    "  filter(cuisine == \"thai\")\r\n",
    "japanese_df <- df %>% \r\n",
    "  filter(cuisine == \"japanese\")\r\n",
    "chinese_df <- df %>% \r\n",
    "  filter(cuisine == \"chinese\")\r\n",
    "indian_df <- df %>% \r\n",
    "  filter(cuisine == \"indian\")\r\n",
    "korean_df <- df %>% \r\n",
    "  filter(cuisine == \"korean\")\r\n",
    "\r\n",
    "\r\n",
    "# Find out how much data is available per cuisine\r\n",
    "cat(\" thai df:\", dim(thai_df), \"\\n\",\r\n",
    "    \"japanese df:\", dim(japanese_df), \"\\n\",\r\n",
    "    \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
    "    \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
    "    \"korean_df:\", dim(korean_df))"
   ],
   "outputs": [],
   "metadata": {
    "id": "0TvXUxD3G8Bk"
   }
  },
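  {
   "cell_type": "markdown",
   "source": [
    "The five near-identical `filter()` calls above can also be written more compactly. The sketch below is one possible alternative, not the lesson's method: it uses a small made-up table (`toy_df` and its values are assumptions for illustration, not the cuisines data) and base R's `split()` to build one tibble per cuisine in a named list.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "library(tidyverse)\n",
    "\n",
    "# Hypothetical toy data standing in for the cuisines table\n",
    "toy_df <- tibble(\n",
    "  cuisine = c(\"thai\", \"thai\", \"korean\", \"indian\"),\n",
    "  rice = c(1, 0, 1, 1)\n",
    ")\n",
    "\n",
    "# One tibble per cuisine, stored in a list named by cuisine\n",
    "by_cuisine <- split(toy_df, toy_df$cuisine)\n",
    "\n",
    "# Rows and columns available per cuisine\n",
    "map(by_cuisine, dim)"
   ],
   "outputs": [],
   "metadata": {}
  },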
  {
   "cell_type": "markdown",
   "source": [
    "## **လေ့ကျင့်ခန်း - dplyr ကို အသုံးပြု၍ အစားအစာအမျိုးအစားအလိုက် ထိပ်တန်းပါဝင်ပစ္စည်းများ ရှာဖွေခြင်း**\n",
    "\n",
    "အခုတော့ ဒေတာကို ပိုမိုနက်နက်ရှိုင်းရှိုင်း လေ့လာပြီး အစားအစာအမျိုးအစားတစ်ခုစီအတွက် သာမန်ပါဝင်ပစ္စည်းများကို သိနိုင်ပါပြီ။ အစားအစာအမျိုးအစားများအကြား ရှုပ်ထွေးမှုကို ဖြစ်စေသော ထပ်တလဲလဲ ဒေတာများကို ဖယ်ရှားသင့်ပါသည်။ ဒါကြောင့် ဒီပြဿနာအကြောင်းကို လေ့လာကြမယ်။\n",
    "\n",
    "R မှာ `create_ingredient()` ဆိုတဲ့ function တစ်ခုကို ဖန်တီးပြီး ပါဝင်ပစ္စည်းများအတွက် dataframe တစ်ခုကို ပြန်ပေးနိုင်ပါမယ်။ ဒီ function က အသုံးမဝင်တဲ့ column တစ်ခုကို drop လုပ်ပြီး ပါဝင်ပစ္စည်းများကို count အလိုက် စီမည်ဖြစ်သည်။\n",
    "\n",
    "R function တစ်ခုရဲ့ အခြေခံဖွဲ့စည်းပုံကတော့:\n",
    "\n",
    "`myFunction <- function(arglist){`\n",
    "\n",
    "**`...`**\n",
    "\n",
    "**`return`**`(value)`\n",
    "\n",
    "`}`\n",
    "\n",
    "R functions အကြောင်းကို tidy အနေနဲ့ မိတ်ဆက်ထားတဲ့ [ဒီနေရာ](https://skirmer.github.io/presentations/functions_with_r.html#1) မှာ ရှာဖွေကြည့်နိုင်ပါတယ်။\n",
    "\n",
    "အခုတော့ စတင်လိုက်ရအောင်! [dplyr verbs](https://dplyr.tidyverse.org/) ကို အသုံးပြုမယ်။ အရင်စာရင်းတွေမှာ သင်ယူခဲ့တဲ့အတိုင်း:\n",
    "\n",
    "-   `dplyr::select()`: **columns** များကို ထည့်သွင်းရန် သို့မဟုတ် ဖယ်ရှားရန် ကူညီပေးသည်။\n",
    "\n",
    "-   `dplyr::pivot_longer()`: ဒေတာကို \"အရှည်ပိုင်း\" ပြောင်းလဲရန် ကူညီပေးပြီး rows အရေအတွက်ကို တိုးစေပြီး columns အရေအတွက်ကို လျှော့စေသည်။\n",
    "\n",
    "-   `dplyr::group_by()` နှင့် `dplyr::summarise()`: အုပ်စုများအလိုက် အကျဉ်းချုပ် စာရင်းအင်းများကို ရှာဖွေပြီး အဆင်ပြေတဲ့ table တစ်ခုအဖြစ် ထည့်သွင်းပေးသည်။\n",
    "\n",
    "-   `dplyr::filter()`: သင့်ရဲ့ အခြေအနေများကို ဖြည့်ဆည်းသော rows များသာ ပါဝင်သော ဒေတာ subset တစ်ခုကို ဖန်တီးသည်။\n",
    "\n",
    "-   `dplyr::mutate()`: columns များကို ဖန်တီးရန် သို့မဟုတ် ပြင်ဆင်ရန် ကူညီပေးသည်။\n",
    "\n",
    "Allison Horst ရဲ့ [*အနုပညာ*-ပြည့် learnr tutorial](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) ကို ကြည့်ပါ။ dplyr *(Tidyverse ရဲ့ အစိတ်အပိုင်း)* မှာ အသုံးဝင်တဲ့ ဒေတာကို စီမံခန့်ခွဲနိုင်စေတဲ့ function များကို မိတ်ဆက်ပေးထားပါတယ်။\n"
   ],
   "metadata": {
    "id": "K3RF5bSCHC76"
   }
  },
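  {
   "cell_type": "markdown",
   "source": [
    "Before writing the function, it may help to try these verbs on a tiny table. The following is a minimal sketch: `toy_df`, its ingredient columns, and the derived `share` column are made up for illustration and are not part of the lesson's data.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "library(tidyverse)\n",
    "\n",
    "# Hypothetical toy data: one row per recipe, 0/1 flags per ingredient\n",
    "toy_df <- tibble(\n",
    "  id = 1:4,\n",
    "  cuisine = c(\"thai\", \"thai\", \"korean\", \"korean\"),\n",
    "  basil = c(1, 1, 0, 0),\n",
    "  soy_sauce = c(1, 0, 1, 1),\n",
    "  vanilla = c(0, 0, 0, 0)\n",
    ")\n",
    "\n",
    "toy_df %>%\n",
    "  # select(): drop the id column\n",
    "  select(-id) %>%\n",
    "  # pivot_longer(): one row per (recipe, ingredient) pair\n",
    "  pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>%\n",
    "  # group_by() + summarise(): total uses of each ingredient\n",
    "  group_by(ingredients) %>%\n",
    "  summarise(n_instances = sum(count)) %>%\n",
    "  # filter(): keep only ingredients that are actually used\n",
    "  filter(n_instances != 0) %>%\n",
    "  # mutate(): add a derived column with each ingredient's share of all uses\n",
    "  mutate(share = n_instances / sum(n_instances))"
   ],
   "outputs": [],
   "metadata": {}
  },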
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Creates a functions that returns the top ingredients by class\r\n",
    "\r\n",
    "create_ingredient <- function(df){\r\n",
    "  \r\n",
    "  # Drop the id column which is the first colum\r\n",
    "  ingredient_df = df %>% select(-1) %>% \r\n",
    "  # Transpose data to a long format\r\n",
    "    pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
    "  # Find the top most ingredients for a particular cuisine\r\n",
    "    group_by(ingredients) %>% \r\n",
    "    summarise(n_instances = sum(count)) %>% \r\n",
    "    filter(n_instances != 0) %>% \r\n",
    "  # Arrange by descending order\r\n",
    "    arrange(desc(n_instances)) %>% \r\n",
    "    mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
    "  \r\n",
    "  \r\n",
    "  return(ingredient_df)\r\n",
    "} # End of function"
   ],
   "outputs": [],
   "metadata": {
    "id": "uB_0JR82HTPa"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "အခုတော့ ဒီ function ကို အသုံးပြုပြီး အစားအစာအမျိုးအစားအလိုက် အများဆုံးလူကြိုက်များတဲ့ ပစ္စည်းအစိတ်အပိုင်း ၁၀ ခုကို သိနိုင်ပါပြီ။ `thai_df` နဲ့ စမ်းကြည့်ရအောင်!\n"
   ],
   "metadata": {
    "id": "h9794WF8HWmc"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Call create_ingredient and display popular ingredients\r\n",
    "thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
    "\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10)"
   ],
   "outputs": [],
   "metadata": {
    "id": "agQ-1HrcHaEA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "ယခင်အပိုင်းတွင် `geom_col()` ကိုအသုံးပြုခဲ့ပြီးဖြစ်သည်၊ `geom_bar` ကိုလည်း ဘားဇယားများဖန်တီးရန် မည်သို့အသုံးပြုနိုင်သည်ကို ကြည့်ရှုကြမည်။ နောက်ထပ်ဖတ်ရှုရန် `?geom_bar` ကိုအသုံးပြုပါ။\n"
   ],
   "metadata": {
    "id": "kHu9ffGjHdcX"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Make a bar chart for popular thai cuisines\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10) %>% \r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "fb3Bx_3DHj6e"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "RHP_xgdkHnvM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
    "create_ingredient(df = japanese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "019v8F0XHrRU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "တရုတ်အစားအစာတွေကော?\n"
   ],
   "metadata": {
    "id": "iIGM7vO8Hu3v"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
    "create_ingredient(df = chinese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lHd9_gd2HyzU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "ir8qyQbNH1c7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Indian cuisines and make bar chart\r\n",
    "create_ingredient(df = indian_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "ApukQtKjH5FO"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "qv30cwY1H-FM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Korean cuisines and make bar chart\r\n",
    "create_ingredient(df = korean_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lumgk9cHIBie"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "ဒေတာဗျူအာလိုင်းဇေးရှင်းများမှ, အခုတော့ `dplyr::select()` ကို အသုံးပြုပြီး ချက်ပြုတ်မှုအမျိုးအစားများအကြား ရှုပ်ထွေးမှု ဖြစ်စေသော အများဆုံးတွေ့ရသော ပစ္စည်းများကို ဖယ်ရှားနိုင်ပါပြီ။\n",
    "\n",
    "ဆန်၊ ကြက်သီးနဲ့ ဂျင်းကို လူတိုင်းချစ်ကြပါတယ်!\n"
   ],
   "metadata": {
    "id": "iO4veMXuIEta"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Drop id column, rice, garlic and ginger from our original data set\r\n",
    "df_select <- df %>% \r\n",
    "  select(-c(1, rice, garlic, ginger))\r\n",
    "\r\n",
    "# Display new data set\r\n",
    "df_select %>% \r\n",
    "  slice_head(n = 5)"
   ],
   "outputs": [],
   "metadata": {
    "id": "iHJPiG6rIUcK"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## အချက်အလက်များကို ကြိုတင်အဆင်သင့်ပြုလုပ်ခြင်း 👩‍🍳👨‍🍳 - အချက်အလက်မညီမျှမှုကို ကိုင်တွယ်ခြင်း ⚖️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/recipes.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>ပုံပန်းချီရေးဆွဲသူ @allison_horst</figcaption>\n",
    "\n",
    "ဒီသင်ခန်းစာက အစားအစာအမျိုးမျိုးနဲ့ ပတ်သက်တာဖြစ်တဲ့အတွက် `recipes` ကို အခြေခံပြီး ဆွေးနွေးရပါမယ်။\n",
    "\n",
    "Tidymodels က အချက်အလက်များကို ကြိုတင်အဆင်သင့်ပြုလုပ်ဖို့အတွက် `recipes` ဆိုတဲ့ အဆင်ပြေတဲ့ package တစ်ခုကို ထပ်မံပေးထားပါတယ်။\n"
   ],
   "metadata": {
    "id": "kkFd-JxdIaL6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "ကျွန်တော်တို့ရဲ့ အစားအစာအမျိုးအစားများရဲ့ ဖြန့်ဝေမှုကို နောက်တစ်ကြိမ် ပြန်လည်ကြည့်ရှုကြပါစို့။\n"
   ],
   "metadata": {
    "id": "6l2ubtTPJAhY"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "old_label_count <- df_select %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "old_label_count"
   ],
   "outputs": [],
   "metadata": {
    "id": "1e-E9cb7JDVi"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "အစားအစာအမျိုးအစားအရေအတွက်တွင် မညီမျှမှုများရှိနေသည်ကို သတိပြုမိပါသည်။ ကိုရီးယားအစားအစာများသည် ထိုင်းအစားအစာများထက် ၃ ဆနီးပါး ပိုများနေသည်။ မညီမျှသောဒေတာများသည် မော်ဒယ်၏စွမ်းဆောင်ရည်အပေါ် အနုတ်လက္ခဏာများပေးနိုင်သည်။ ဥပမာအားဖြင့် binary classification ကိုစဉ်းစားကြည့်ပါ။ ဒေတာအများစုသည် တစ်မျိုးတည်းသောအတန်းဖြစ်နေပါက ML မော်ဒယ်သည် အတန်းအမျိုးအစားကို ပိုမိုခန့်မှန်းမည်ဖြစ်ပြီး၊ ဒေတာများပိုမိုရှိနေသောကြောင့်ဖြစ်သည်။ ဒေတာကိုညီမျှအောင်လုပ်ခြင်းသည် skewed data များကိုဖယ်ရှားပြီး မညီမျှမှုကိုဖယ်ရှားပေးသည်။ မော်ဒယ်များအများစုသည် အချက်အလက်အရေအတွက်များညီမျှသောအခါ အကောင်းဆုံးစွမ်းဆောင်ရည်ပြသနိုင်ပြီး၊ မညီမျှသောဒေတာများနှင့်ရင်ဆိုင်ရသည့်အခါ အခက်အခဲများရှိတတ်သည်။\n",
    "\n",
    "မညီမျှသောဒေတာအစုများကို ကိုင်တွယ်ရန် နည်းလမ်းနှစ်မျိုးအဓိကရှိသည်-\n",
    "\n",
    "-   အနည်းဆုံးအတန်းအမျိုးအစားတွင် observation များထည့်ခြင်း: `Over-sampling` ဥပမာ SMOTE algorithm ကိုအသုံးပြုခြင်း\n",
    "\n",
    "-   အများဆုံးအတန်းအမျိုးအစားမှ observation များဖယ်ရှားခြင်း: `Under-sampling`\n",
    "\n",
    "အခုတော့ `recipe` ကိုအသုံးပြုပြီး မညီမျှသောဒေတာအစုများကို ကိုင်တွယ်ပုံကို ပြသပါမည်။ recipe ဆိုသည်မှာ ဒေတာအစုကို ဒေတာခွဲခြမ်းစိတ်ဖြာလုပ်ရန်အဆင့်များကို ဖော်ပြထားသော အခြေခံအစီအစဉ်တစ်ခုအဖြစ် စဉ်းစားနိုင်သည်။\n"
   ],
   "metadata": {
    "id": "soAw6826JKx9"
   }
  },
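  {
   "cell_type": "markdown",
   "source": [
    "As a warm-up on toy data, the two strategies listed above can be sketched side by side. This is an illustrative sketch under stated assumptions: `toy_df`, its class sizes (30 vs. 10), and the predictors `x1` and `x2` are made up. `step_smote()` is the over-sampling step used in this lesson, while `step_downsample()` is themis's random under-sampling step, which the lesson itself does not use.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "library(tidymodels)\n",
    "library(themis)\n",
    "\n",
    "# Hypothetical imbalanced data: 30 'major' rows and 10 'minor' rows\n",
    "set.seed(2021)\n",
    "toy_df <- tibble(\n",
    "  class = factor(rep(c(\"major\", \"minor\"), times = c(30, 10))),\n",
    "  x1 = rnorm(40),\n",
    "  x2 = rnorm(40)\n",
    ")\n",
    "\n",
    "# Over-sampling: SMOTE synthesizes new minority-class rows\n",
    "over_df <- recipe(class ~ ., data = toy_df) %>%\n",
    "  step_smote(class) %>%\n",
    "  prep() %>%\n",
    "  bake(new_data = NULL)\n",
    "\n",
    "# Under-sampling: randomly drop majority-class rows\n",
    "under_df <- recipe(class ~ ., data = toy_df) %>%\n",
    "  step_downsample(class) %>%\n",
    "  prep() %>%\n",
    "  bake(new_data = NULL)\n",
    "\n",
    "# With the default ratios, over-sampling grows the minority class to the\n",
    "# majority size, and under-sampling shrinks the majority class to the minority size\n",
    "count(over_df, class)\n",
    "count(under_df, class)"
   ],
   "outputs": [],
   "metadata": {}
  },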
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Load themis package for dealing with imbalanced data\r\n",
    "library(themis)\r\n",
    "\r\n",
    "# Create a recipe for preprocessing data\r\n",
    "cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
    "  step_smote(cuisine)\r\n",
    "\r\n",
    "cuisines_recipe"
   ],
   "outputs": [],
   "metadata": {
    "id": "HS41brUIJVJy"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "အကြိုတင်လုပ်ဆောင်မှုအဆင့်များကို ခွဲခြားကြည့်ပါစို့။\n",
    "\n",
    "-   `recipe()` ကို formula နဲ့ခေါ်သုံးတဲ့အခါ `df_select` ဒေတာကို အခြေခံပြီး variable တွေရဲ့ *roles* ကို recipe ကိုပြောပြပေးပါတယ်။ ဥပမာ `cuisine` column ကို `outcome` role အဖြစ် သတ်မှတ်ထားပြီး အခြား column တွေကို `predictor` role အဖြစ် သတ်မှတ်ထားပါတယ်။\n",
    "\n",
    "-   [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) က minority class ရဲ့ အသစ်ထပ်ထွက်လာတဲ့ ဥပမာတွေကို nearest neighbors ကို အသုံးပြုပြီး စက်မှုတုနည်းဖြင့် ဖန်တီးပေးတဲ့ recipe step ရဲ့ *specification* ကို ဖန်တီးပေးပါတယ်။\n",
    "\n",
    "အခုတော့ preprocessed data ကို ကြည့်ချင်ရင် [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) နဲ့ [**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) ကို အသုံးပြုရပါမယ်။\n",
    "\n",
    "`prep()`: training set ကနေ လိုအပ်တဲ့ parameters တွေကို ခန့်မှန်းပြီး နောက်ထပ် data set တွေမှာ အသုံးပြုနိုင်အောင် ပြင်ဆင်ပေးပါတယ်။\n",
    "\n",
    "`bake()`: prepped recipe ကို ယူပြီး operation တွေကို data set တစ်ခုခုမှာ အကောင်အထည်ဖော်ပေးပါတယ်။\n"
   ],
   "metadata": {
    "id": "Yb-7t7XcJaC8"
   }
  },
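  {
   "cell_type": "markdown",
   "source": [
    "The split between `prep()` and `bake()` is easier to see with a step that estimates something. The sketch below is illustrative only: `train_df`, `new_df`, and the choice of `step_normalize()` are assumptions for the example, not part of this lesson's recipe. `prep()` learns the mean and standard deviation of `x` from the training data, and `bake()` reuses those values on any data set.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "library(tidymodels)\n",
    "\n",
    "# Hypothetical training data and new data\n",
    "train_df <- tibble(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))\n",
    "new_df <- tibble(y = c(5, 6), x = c(25, 50))\n",
    "\n",
    "# Specify the recipe: center and scale x\n",
    "rec <- recipe(y ~ x, data = train_df) %>%\n",
    "  step_normalize(x)\n",
    "\n",
    "# prep(): estimate the mean and sd of x from train_df\n",
    "prepped <- prep(rec)\n",
    "\n",
    "# bake() with new_data = NULL: return the processed training data\n",
    "bake(prepped, new_data = NULL)\n",
    "\n",
    "# bake() with other data: scale new_df using the training mean and sd\n",
    "bake(prepped, new_data = new_df)"
   ],
   "outputs": [],
   "metadata": {}
  },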
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Prep and bake the recipe\r\n",
    "preprocessed_df <- cuisines_recipe %>% \r\n",
    "  prep() %>% \r\n",
    "  bake(new_data = NULL) %>% \r\n",
    "  relocate(cuisine)\r\n",
    "\r\n",
    "# Display data\r\n",
    "preprocessed_df %>% \r\n",
    "  slice_head(n = 5)\r\n",
    "\r\n",
    "# Quick summary stats\r\n",
    "preprocessed_df %>% \r\n",
    "  introduce()"
   ],
   "outputs": [],
   "metadata": {
    "id": "9QhSgdpxJl44"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "dmidELh_LdV7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "new_label_count <- preprocessed_df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "list(new_label_count = new_label_count,\r\n",
    "     old_label_count = old_label_count)"
   ],
   "outputs": [],
   "metadata": {
    "id": "aSh23klBLwDz"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "အရသာရှိတယ်! ဒေတာက သန့်ရှင်းပြီး၊ ထိန်းညှိထားပြီး၊ အရသာလည်း အရမ်းကောင်းပါတယ် 😋!\n",
    "\n",
    "> အများအားဖြင့်၊ recipe ဆိုတာက မော်ဒယ်တစ်ခုကို ပြင်ဆင်ဖို့အတွက် အဆင့်တွေကို သတ်မှတ်ပေးတဲ့ preprocessor အနေနဲ့ အသုံးပြုလေ့ရှိပါတယ်။ ဒီအခါမှာတော့ `workflow()` ကို အသုံးပြုလေ့ရှိပါတယ် (ကျွန်တော်တို့ရဲ့ အတန်းတွေမှာ ရှေ့မှာ ကြည့်ဖူးပြီးသား)၊ recipe ကို ကိုယ်တိုင် ခန့်မှန်းစရာမလိုဘဲ။\n",
    ">\n",
    "> ထို့ကြောင့် tidymodels ကို အသုံးပြုတဲ့အခါမှာ **`prep()`** နဲ့ **`bake()`** ကို မဖြစ်မနေ အသုံးပြုစရာမလိုပေမယ့်၊ recipe တွေက မျှော်လင့်ထားတဲ့အတိုင်း အလုပ်လုပ်နေလားဆိုတာ အတည်ပြုဖို့အတွက် အသုံးဝင်တဲ့ function တွေဖြစ်ပါတယ်၊ ကျွန်တော်တို့ရဲ့ အခန်းကဏ္ဍမှာလိုပဲ။\n",
    ">\n",
    "> **`new_data = NULL`** နဲ့ prepped recipe ကို **`bake()`** လုပ်တဲ့အခါမှာ၊ recipe ကို သတ်မှတ်တဲ့အချိန်မှာ ပေးထားတဲ့ ဒေတာကို ပြန်ရမှာဖြစ်ပေမယ့်၊ preprocessing အဆင့်တွေကို ဖြတ်သွားပြီးသား ဖြစ်ပါတယ်။\n",
    "\n",
    "အခုတော့ ဒီဒေတာကို နောက်အတန်းတွေမှာ အသုံးပြုဖို့အတွက် ကူးထားလိုက်ရအောင်:\n"
   ],
   "metadata": {
    "id": "HEu80HZ8L7ae"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Save preprocessed data\r\n",
    "write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "cBmCbIgrMOI6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "ဤအသစ်ထွက် CSV ကို အခု root data folder မှာ ရှာတွေ့နိုင်ပါပြီ။\n",
    "\n",
    "**🚀စိန်ခေါ်မှု**\n",
    "\n",
    "ဒီသင်ခန်းစာမှာ စိတ်ဝင်စားဖွယ် dataset အများအပြား ပါဝင်ပါတယ်။ `data` folder တွေကို စူးစမ်းကြည့်ပြီး binary classification ဒါမှမဟုတ် multi-class classification အတွက် သင့်လျော်တဲ့ dataset တွေ ရှိမရှိ စစ်ဆေးပါ။ ဒီ dataset ကို အသုံးပြုပြီး ဘယ်လိုမေးခွန်းတွေ မေးနိုင်မလဲ?\n",
    "\n",
    "## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)\n",
    "\n",
    "## **ပြန်လည်သုံးသပ်ခြင်းနှင့် ကိုယ်တိုင်လေ့လာခြင်း**\n",
    "\n",
    "-   [package themis](https://github.com/tidymodels/themis) ကို ကြည့်ပါ။ Imbalanced data ကို ကိုင်တွယ်ဖို့ ဘယ်လိုနည်းလမ်းတွေ အသုံးပြုနိုင်မလဲ?\n",
    "\n",
    "-   Tidy models [reference website](https://www.tidymodels.org/start/) ကို လေ့လာပါ။\n",
    "\n",
    "-   H. Wickham နှင့် G. Grolemund ရေးသားထားသော [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/) ကို ဖတ်ရှုပါ။\n",
    "\n",
    "#### ကျေးဇူးတင်စကား:\n",
    "\n",
    "[`Allison Horst`](https://twitter.com/allison_horst/) ကို R ကို ပိုမိုကြိုဆိုဖွယ်ကောင်းပြီး စိတ်ဝင်စားဖွယ်ကောင်းအောင် ဖန်တီးထားတဲ့ အံ့ဩဖွယ်ပုံရိပ်တွေကို ဖန်တီးပေးထားတဲ့အတွက် ကျေးဇူးတင်ပါတယ်။ သူမရဲ့ [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM) မှာ ပိုမိုများစွာသော ပုံရိပ်တွေကို ရှာဖွေကြည့်နိုင်ပါတယ်။\n",
    "\n",
    "[Cassie Breviu](https://www.twitter.com/cassieview) နှင့် [Jen Looper](https://www.twitter.com/jenlooper) ကို ဒီ module ရဲ့ Python version ကို ဖန်တီးပေးထားတဲ့အတွက် ♥️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/r_learners_sm.jpeg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "WQs5621pMGwf"
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**ဝက်ဘ်ဆိုက်မှတ်ချက်**:  \nဤစာရွက်စာတမ်းကို AI ဘာသာပြန်ဝန်ဆောင်မှု [Co-op Translator](https://github.com/Azure/co-op-translator) ကို အသုံးပြု၍ ဘာသာပြန်ထားပါသည်။ ကျွန်ုပ်တို့သည် တိကျမှန်ကန်မှုအတွက် ကြိုးစားနေပါသော်လည်း၊ အလိုအလျောက်ဘာသာပြန်ဆိုမှုများတွင် အမှားများ သို့မဟုတ် မမှန်ကန်မှုများ ပါဝင်နိုင်သည်ကို ကျေးဇူးပြု၍ သတိပြုပါ။ မူရင်းစာရွက်စာတမ်းကို ၎င်း၏ မူလဘာသာစကားဖြင့် အာဏာတည်သောရင်းမြစ်အဖြစ် သတ်မှတ်သင့်ပါသည်။ အရေးကြီးသောအချက်အလက်များအတွက် လူပညာရှင်များမှ လက်တွေ့ဘာသာပြန်ဆိုမှုကို အကြံပြုပါသည်။ ဤဘာသာပြန်ဆိုမှုကို အသုံးပြုခြင်းမှ ဖြစ်ပေါ်လာသော နားလည်မှုမှားများ သို့မဟုတ် အဓိပ္ပါယ်မှားများအတွက် ကျွန်ုပ်တို့သည် တာဝန်မယူပါ။\n"
   ]
  }
 ]
}