{
 "nbformat": 4,
 "nbformat_minor": 2,
 "metadata": {
  "colab": {
   "name": "lesson_10-R.ipynb",
   "provenance": [],
   "collapsed_sections": []
  },
  "kernelspec": {
   "name": "ir",
   "display_name": "R"
  },
  "language_info": {
   "name": "R"
  },
  "coopTranslator": {
   "original_hash": "2621e24705e8100893c9bf84e0fc8aef",
   "translation_date": "2025-09-06T15:00:03+00:00",
   "source_file": "4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb",
   "language_code": "th"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "ItETB4tSFprR"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Introduction to classification: clean, prep, and visualize your data\n",
    "\n",
    "In these four lessons, you will explore a fundamental focus of classic machine learning: *classification*. We will walk through using various classification algorithms with a dataset about the brilliant cuisines of Asia and India. Hope you're hungry!\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/pinch.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Celebrate pan-Asian cuisines in these lessons! Image by Jen Looper</figcaption>\n",
    "\n",
    "Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. In classification, you train a model to predict which `category` an item belongs to. If machine learning is about predicting values or names for things by using datasets, then classification generally falls into two groups: *binary classification* and *multiclass classification*.\n",
    "\n",
    "Remember:\n",
    "\n",
    "-   **Linear regression** helped you predict relationships between variables and make accurate predictions about where a new data point would fall in relation to that line. For example, you could predict a numeric value such as *what price a pumpkin would be in September vs. December*.\n",
    "\n",
    "-   **Logistic regression** helped you discover \"binary categories\": at this price point, *is this pumpkin orange or not-orange*?\n",
    "\n",
    "Classification uses various algorithms to determine other ways of assigning a label or class to a data point. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.\n",
    "\n",
    "### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)\n",
    "\n",
    "### **Introduction**\n",
    "\n",
    "Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value (\"is this email spam or not?\") to complex image classification and segmentation using computer vision, it is always useful to be able to sort data into classes and ask questions of it.\n",
    "\n",
    "To state the process more scientifically, your classification method creates a predictive model that lets you map the relationship between input variables and output variables.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/binary-multiclass.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Binary vs. multiclass problems for classification algorithms to handle. Infographic by Jen Looper</figcaption>\n",
    "\n",
    "Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be used to classify data.\n",
    "\n",
    "Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age`, to determine the *likelihood of developing X disease*. As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled, and the ML algorithms use those labels to learn to predict the class of each observation and assign it to a group or outcome.\n",
    "\n",
    "✅ Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see whether, given a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?\n",
    "\n",
    "### **Hello 'classifier'**\n",
    "\n",
    "The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?\n",
    "\n",
    "Tidymodels offers several different algorithms for classifying data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.\n",
    "\n",
    "#### **Prerequisite**\n",
    "\n",
    "For this lesson, we'll require the following packages to clean, prep, and visualize our data:\n",
    "\n",
    "-   `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more fun!\n",
    "\n",
    "-   `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
    "\n",
    "-   `DataExplorer`: The [DataExplorer package](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) is meant to simplify and automate the exploratory data analysis (EDA) process and report generation.\n",
    "\n",
    "-   `themis`: The [themis package](https://themis.tidymodels.org/) provides extra recipe steps for dealing with unbalanced data.\n",
    "\n",
    "You can install them with:\n",
    "\n",
    "`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"themis\", \"here\"))`\n",
    "\n",
    "Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you if any are missing.\n"
   ],
   "metadata": {
    "id": "ri5bQxZ-Fz_0"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n",
    "\r\n",
    "pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
   ],
   "outputs": [],
   "metadata": {
    "id": "KIPxa4elGAPI"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "These packages are now available in our current R session: `pacman::p_load()` installs any package that is missing and then loads it, so no separate `library()` calls are needed.\n"
   ],
   "metadata": {
    "id": "YkKAxOJvGD4C"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Exercise - Clean and balance your data\n",
    "\n",
    "The first task at hand, before starting this project, is to clean and **balance** your data to get better results.\n",
    "\n",
    "Let's meet the data!🕵️\n"
   ],
   "metadata": {
    "id": "PFkQDlk0GN5O"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Import data\r\n",
    "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
    "\r\n",
    "# View the first 5 rows\r\n",
    "df %>% \r\n",
    "  slice_head(n = 5)\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "Qccw7okxGT0S"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Interesting! From the looks of it, the first column is a kind of `id` column. Let's get a little more information about the data.\n"
   ],
   "metadata": {
    "id": "XrWnlgSrGVmR"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Basic information about the data\r\n",
    "df %>%\r\n",
    "  introduce()\r\n",
    "\r\n",
    "# Visualize basic information above\r\n",
    "df %>% \r\n",
    "  plot_intro(ggtheme = theme_light())"
   ],
   "outputs": [],
   "metadata": {
    "id": "4UcGmxRxGieA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "From the output, we can immediately see that we have `2448` rows, `385` columns, and `0` missing values. We also have 1 discrete column, *cuisine*.\n",
    "\n",
    "## Exercise - Learning about cuisines\n",
    "\n",
    "Now the work starts to become more interesting. Let's discover the distribution of data per cuisine.\n"
   ],
   "metadata": {
    "id": "AaPubl__GmH5"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Count observations per cuisine\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(n)\r\n",
    "\r\n",
    "# Plot the distribution\r\n",
    "theme_set(theme_light())\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
    "  geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
    "  ylab(\"cuisine\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "FRsBVy5eGrrv"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "There is a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.\n",
    "\n",
    "Next, let's assign each cuisine to its own tibble and find out how much data is available (rows, columns) per cuisine.\n",
    "\n",
    "> A [tibble](https://tibble.tidyverse.org/) is a modern data frame.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/dplyr_filter.jpg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "vVvyDb1kG2in"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create individual tibble for the cuisines\r\n",
    "thai_df <- df %>% \r\n",
    "  filter(cuisine == \"thai\")\r\n",
    "japanese_df <- df %>% \r\n",
    "  filter(cuisine == \"japanese\")\r\n",
    "chinese_df <- df %>% \r\n",
    "  filter(cuisine == \"chinese\")\r\n",
    "indian_df <- df %>% \r\n",
    "  filter(cuisine == \"indian\")\r\n",
    "korean_df <- df %>% \r\n",
    "  filter(cuisine == \"korean\")\r\n",
    "\r\n",
    "\r\n",
    "# Find out how much data is available per cuisine (rows, columns)\r\n",
    "cat(\" thai_df:\", dim(thai_df), \"\\n\",\r\n",
    "    \"japanese_df:\", dim(japanese_df), \"\\n\",\r\n",
    "    \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
    "    \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
    "    \"korean_df:\", dim(korean_df))"
   ],
   "outputs": [],
   "metadata": {
    "id": "0TvXUxD3G8Bk"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## **Exercise - Discovering top ingredients by cuisine using dplyr**\n",
    "\n",
    "Now you can dig deeper into the data and learn what the typical ingredients per cuisine are. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.\n",
    "\n",
    "Create a function `create_ingredient()` in R that returns an ingredient data frame. This function will start by dropping an unhelpful column and then sort ingredients by their count.\n",
    "\n",
    "The basic structure of a function in R is:\n",
    "\n",
    "`myFunction <- function(arglist){`\n",
    "\n",
    "**`...`**\n",
    "\n",
    "**`return`**`(value)`\n",
    "\n",
    "`}`\n",
    "\n",
    "A tidy introduction to R functions can be found [here](https://skirmer.github.io/presentations/functions_with_r.html#1).\n",
    "\n",
    "Let's get right into it! We'll make use of the [dplyr verbs](https://dplyr.tidyverse.org/) that we have been learning in our previous lessons, plus one reshaping function from tidyr. As a recap:\n",
    "\n",
    "-   `dplyr::select()`: helps you pick which **columns** to keep or exclude.\n",
    "\n",
    "-   `tidyr::pivot_longer()`: helps you \"lengthen\" data, increasing the number of rows and decreasing the number of columns.\n",
    "\n",
    "-   `dplyr::group_by()` and `dplyr::summarise()`: help you find summary statistics for different groups and put them in a tidy table.\n",
    "\n",
    "-   `dplyr::filter()`: creates a subset of the data containing only the rows that satisfy your conditions.\n",
    "\n",
    "-   `dplyr::mutate()`: helps you create or modify columns.\n",
    "\n",
    "Check out this [*art*-filled learnr tutorial](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) by Allison Horst, which introduces some useful data wrangling functions in dplyr *(part of the Tidyverse)*.\n"
   ],
   "metadata": {
    "id": "K3RF5bSCHC76"
   }
  },
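  {
   "cell_type": "markdown",
   "source": [
    "Before building the real function, the verbs recapped above can be tried on a tiny example. This is only a sketch: the `toy` tibble, its values, and the column names `ingredient`, `used`, `n_recipes`, and `share` are made up for illustration and are not part of the cuisines dataset.\n",
    "\n",
    "```r\n",
    "library(tidyverse)\n",
    "\n",
    "# Hypothetical toy data: 1 means the recipe uses the ingredient\n",
    "toy <- tibble(\n",
    "  id      = 1:4,\n",
    "  cuisine = c(\"thai\", \"thai\", \"korean\", \"korean\"),\n",
    "  basil   = c(1, 1, 0, 0),\n",
    "  sesame  = c(0, 1, 1, 1)\n",
    ")\n",
    "\n",
    "toy %>%\n",
    "  # Drop the id column\n",
    "  select(-id) %>%\n",
    "  # One row per recipe/ingredient pair\n",
    "  pivot_longer(!cuisine, names_to = \"ingredient\", values_to = \"used\") %>%\n",
    "  # Count recipes per cuisine/ingredient pair\n",
    "  group_by(cuisine, ingredient) %>%\n",
    "  summarise(n_recipes = sum(used), .groups = \"drop\") %>%\n",
    "  # Keep only ingredients that are actually used\n",
    "  filter(n_recipes > 0) %>%\n",
    "  # Each toy cuisine has 2 recipes\n",
    "  mutate(share = n_recipes / 2)\n",
    "```\n",
    "\n",
    "The result has three rows: the korean/basil pair is filtered out because no toy Korean recipe uses basil.\n"
   ],
   "metadata": {}
  },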
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create a function that returns the top ingredients by class\r\n",
    "\r\n",
    "create_ingredient <- function(df){\r\n",
    "  \r\n",
    "  # Drop the id column, which is the first column\r\n",
    "  ingredient_df <- df %>% select(-1) %>% \r\n",
    "  # Reshape the data to a long format\r\n",
    "    pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
    "  # Count how often each ingredient appears and drop unused ones\r\n",
    "    group_by(ingredients) %>% \r\n",
    "    summarise(n_instances = sum(count)) %>% \r\n",
    "    filter(n_instances != 0) %>% \r\n",
    "  # Arrange by descending order\r\n",
    "    arrange(desc(n_instances)) %>% \r\n",
    "    mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
    "  \r\n",
    "  \r\n",
    "  return(ingredient_df)\r\n",
    "} # End of function"
   ],
   "outputs": [],
   "metadata": {
    "id": "uB_0JR82HTPa"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now we can use the function to get an idea of the ten most popular ingredients by cuisine. Let's take it out for a spin with `thai_df`.\n"
   ],
   "metadata": {
    "id": "h9794WF8HWmc"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Call create_ingredient and display popular ingredients\r\n",
    "thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
    "\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10)"
   ],
   "outputs": [],
   "metadata": {
    "id": "agQ-1HrcHaEA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "In the previous section, we used `geom_col()`. Let's see how you can use `geom_bar` to create bar charts too: with `stat = \"identity\"`, it plots the values as they are, just like `geom_col()`. Use `?geom_bar` for further reading.\n"
   ],
   "metadata": {
    "id": "kHu9ffGjHdcX"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Make a bar chart of the most popular Thai ingredients\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10) %>% \r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "fb3Bx_3DHj6e"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "RHP_xgdkHnvM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
    "create_ingredient(df = japanese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "019v8F0XHrRU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "What about Chinese cuisine?\n"
   ],
   "metadata": {
    "id": "iIGM7vO8Hu3v"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
    "create_ingredient(df = chinese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lHd9_gd2HyzU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "ir8qyQbNH1c7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Indian cuisines and make bar chart\r\n",
    "create_ingredient(df = indian_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "ApukQtKjH5FO"
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "id": "qv30cwY1H-FM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Korean cuisines and make bar chart\r\n",
    "create_ingredient(df = korean_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lumgk9cHIBie"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Based on the data visualizations, we can now drop the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.\n",
    "\n",
    "Everyone loves rice, garlic, and ginger!\n"
   ],
   "metadata": {
    "id": "iO4veMXuIEta"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Drop id column, rice, garlic and ginger from our original data set\r\n",
    "df_select <- df %>% \r\n",
    "  select(-c(1, rice, garlic, ginger))\r\n",
    "\r\n",
    "# Display new data set\r\n",
    "df_select %>% \r\n",
    "  slice_head(n = 5)"
   ],
   "outputs": [],
   "metadata": {
    "id": "iHJPiG6rIUcK"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Preprocessing data using recipes 👩‍🍳👨‍🍳 - Dealing with imbalanced data ⚖️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/recipes.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n",
    "\n",
    "Given that this lesson is about cuisines, we have to put `recipes` into context.\n",
    "\n",
    "Tidymodels provides yet another neat package: `recipes`, a package for preprocessing data.\n"
   ],
   "metadata": {
    "id": "kkFd-JxdIaL6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's take a look at the distribution of our cuisines again.\n"
   ],
   "metadata": {
    "id": "6l2ubtTPJAhY"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "old_label_count <- df_select %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "old_label_count"
   ],
   "outputs": [],
   "metadata": {
    "id": "1e-E9cb7JDVi"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "As you can see, the number of recipes per cuisine is quite unequal. There are almost three times as many Korean recipes as Thai ones. Imbalanced data often has negative effects on model performance. Think about a binary classification: if most of your data belongs to one class, an ML model will tend to predict that class more often, simply because there is more data for it. Balancing the data removes this skew. Many models perform best when the number of observations per class is equal and tend to struggle with unbalanced data.\n",
    "\n",
    "There are two main ways of dealing with imbalanced datasets:\n",
    "\n",
    "-   adding observations to the minority class: `Over-sampling`, e.g. using the SMOTE algorithm\n",
    "\n",
    "-   removing observations from the majority class: `Under-sampling`\n",
    "\n",
    "Let's now demonstrate how to deal with imbalanced datasets using a `recipe`. A recipe can be thought of as a blueprint that describes which steps should be applied to a dataset to get it ready for data analysis.\n"
   ],
   "metadata": {
    "id": "soAw6826JKx9"
   }
  },
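  {
   "cell_type": "markdown",
   "source": [
    "For comparison with the over-sampling recipe that follows, under-sampling can be sketched by hand with dplyr. This is an illustration on made-up data, not the approach this lesson takes: the `toy` tibble and the name `n_min` are hypothetical. Within a recipe, themis offers `step_downsample()` for the same purpose.\n",
    "\n",
    "```r\n",
    "library(tidyverse)\n",
    "\n",
    "set.seed(2021)  # slice_sample() draws rows at random\n",
    "\n",
    "# Hypothetical imbalanced toy data: 6 korean rows, 2 thai rows\n",
    "toy <- tibble(\n",
    "  cuisine = rep(c(\"korean\", \"thai\"), times = c(6, 2)),\n",
    "  recipe  = 1:8\n",
    ")\n",
    "\n",
    "# Size of the smallest class\n",
    "n_min <- toy %>% count(cuisine) %>% pull(n) %>% min()\n",
    "\n",
    "# Keep n_min randomly chosen rows from every class, then recount\n",
    "toy %>%\n",
    "  group_by(cuisine) %>%\n",
    "  slice_sample(n = n_min) %>%\n",
    "  ungroup() %>%\n",
    "  count(cuisine)\n",
    "```\n",
    "\n",
    "Both classes end up with two rows each. The price of under-sampling is that four of the six Korean rows are thrown away, which is one reason over-sampling can be attractive when data is scarce.\n"
   ],
   "metadata": {}
  },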
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Load themis package for dealing with imbalanced data\r\n",
    "library(themis)\r\n",
    "\r\n",
    "# Create a recipe for preprocessing data\r\n",
    "cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
    "  step_smote(cuisine)\r\n",
    "\r\n",
    "cuisines_recipe"
   ],
   "outputs": [],
   "metadata": {
    "id": "HS41brUIJVJy"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's break down our preprocessing steps.\n",
    "\n",
    "-   The call to `recipe()` with a formula tells the recipe the *roles* of the variables, using `df_select` as the reference data. For instance, the `cuisine` column has been assigned an `outcome` role, while the rest of the columns have been assigned a `predictor` role.\n",
    "\n",
    "-   [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) creates a *specification* of a recipe step that synthetically generates new examples of the minority class using nearest neighbors of these cases.\n",
    "\n",
    "Now, if we want to see the preprocessed data, we have to [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) and [**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) our recipe.\n",
    "\n",
    "`prep()`: estimates the required parameters from a training set so that they can later be applied to other data sets.\n",
    "\n",
    "`bake()`: takes a prepped recipe and applies its operations to any data set.\n"
   ],
   "metadata": {
    "id": "Yb-7t7XcJaC8"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Prep and bake the recipe\r\n",
    "preprocessed_df <- cuisines_recipe %>% \r\n",
    "  prep() %>% \r\n",
    "  bake(new_data = NULL) %>% \r\n",
    "  relocate(cuisine)\r\n",
    "\r\n",
    "# Display data\r\n",
    "preprocessed_df %>% \r\n",
    "  slice_head(n = 5)\r\n",
    "\r\n",
    "# Quick summary stats\r\n",
    "preprocessed_df %>% \r\n",
    "  introduce()"
   ],
   "outputs": [],
   "metadata": {
    "id": "9QhSgdpxJl44"
   }
  },
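  {
   "cell_type": "markdown",
   "source": [
    "The baked data now includes synthetic rows created by `step_smote()`. As a rough illustration of the nearest-neighbor idea (a simplified sketch, not themis's actual implementation), SMOTE places a new point somewhere on the line segment between a minority example and one of its nearest minority neighbors. The vectors `x_i` and `x_nn` and the ingredient names below are made up for illustration.\n",
    "\n",
    "```r\n",
    "set.seed(123)\n",
    "\n",
    "# Two hypothetical minority-class recipes encoded as 0/1 ingredient indicators\n",
    "x_i  <- c(basil = 1, coconut = 0, lime = 1)  # a minority example\n",
    "x_nn <- c(basil = 1, coconut = 1, lime = 0)  # one of its nearest neighbors\n",
    "\n",
    "# Random position on the segment between the two points\n",
    "lambda <- runif(1)\n",
    "\n",
    "# Synthetic example: x_i moved a fraction lambda of the way toward x_nn\n",
    "x_new <- x_i + lambda * (x_nn - x_i)\n",
    "x_new\n",
    "```\n",
    "\n",
    "Where the two recipes agree (`basil`), the synthetic value is unchanged; where they differ, it falls between 0 and 1. This is why interpolated rows can hold fractional values even in columns that were originally 0/1 indicators.\n"
   ],
   "metadata": {}
  },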
  {
   "cell_type": "markdown",
   "source": [
    "Let's now check the distribution of our cuisines and compare it with the imbalanced data.\n"
   ],
   "metadata": {
    "id": "dmidELh_LdV7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "new_label_count <- preprocessed_df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "list(new_label_count = new_label_count,\r\n",
    "     old_label_count = old_label_count)"
   ],
   "outputs": [],
   "metadata": {
    "id": "aSh23klBLwDz"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Yum! The data is nice and clean, balanced, and very delicious 😋!\n",
    "\n",
    "> Normally, a recipe is used as a preprocessor for modeling, where it defines which steps should be applied to a dataset to get it ready for modeling. In that case, a `workflow()` is typically used (as we have already seen in our previous lessons) instead of manually estimating a recipe.\n",
    ">\n",
    "> As such, you don't typically need to **`prep()`** and **`bake()`** recipes when you use tidymodels, but they are helpful functions to have in your toolkit for confirming that recipes are doing what you expect, as in our case.\n",
    ">\n",
    "> When you **`bake()`** a prepped recipe with **`new_data = NULL`**, you get back the data that you provided when defining the recipe, but having undergone the preprocessing steps.\n",
    "\n",
    "Let's now save a copy of this data for use in future lessons:\n"
   ],
   "metadata": {
    "id": "HEu80HZ8L7ae"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Save preprocessed data\r\n",
    "write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "cBmCbIgrMOI6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The fresh CSV can now be found in the root data folder.\n",
    "\n",
    "**🚀Challenge**\n",
    "\n",
    "This curriculum contains several interesting datasets. Dig through the `data` folders and see if any contain datasets that would be appropriate for binary or multiclass classification. What questions would you ask of such a dataset?\n",
    "\n",
    "## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)\n",
    "\n",
    "## **Review & Self Study**\n",
    "\n",
    "-   Check out the [themis package](https://github.com/tidymodels/themis). What other techniques could we use to deal with imbalanced data?\n",
    "\n",
    "-   The tidymodels [reference website](https://www.tidymodels.org/start/).\n",
    "\n",
    "-   H. Wickham and G. Grolemund, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).\n",
    "\n",
    "#### THANK YOU TO:\n",
    "\n",
    "[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
    "\n",
    "[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/r_learners_sm.jpeg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>Artwork by @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "WQs5621pMGwf"
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**Disclaimer**:  \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"
   ]
  }
 ]
}