{
 "nbformat": 4,
 "nbformat_minor": 2,
 "metadata": {
  "colab": {
   "name": "lesson_10-R.ipynb",
   "provenance": [],
   "collapsed_sections": []
  },
  "kernelspec": {
   "name": "ir",
   "display_name": "R"
  },
  "language_info": {
   "name": "R"
  },
  "coopTranslator": {
   "original_hash": "2621e24705e8100893c9bf84e0fc8aef",
   "translation_date": "2025-09-04T08:55:59+00:00",
   "source_file": "4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb",
   "language_code": "he"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# בנה מודל סיווג: מטבחים אסייתיים והודיים טעימים\n"
   ],
   "metadata": {
    "id": "ItETB4tSFprR"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## מבוא לסיווג: ניקוי, הכנה והדמיה של הנתונים שלך\n",
    "\n",
    "בארבעת השיעורים הללו, תחקור את אחד הנושאים המרכזיים בלמידת מכונה קלאסית - *סיווג*. נעבור יחד על שימוש באלגוריתמים שונים לסיווג עם מערך נתונים על כל המטבחים המדהימים של אסיה והודו. מקווים שאתה רעב!\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/pinch.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>חגיגה של מטבחים פאן-אסייתיים בשיעורים הללו! תמונה מאת ג'ן לופר</figcaption>\n",
    "\n",
    "\n",
    "סיווג הוא צורה של [למידה מונחית](https://wikipedia.org/wiki/Supervised_learning) שיש לה הרבה מן המשותף עם טכניקות רגרסיה. בסיווג, אתה מאמן מודל כדי לחזות לאיזו `קטגוריה` פריט שייך. אם למידת מכונה עוסקת בניבוי ערכים או שמות לדברים באמצעות מערכי נתונים, אז סיווג בדרך כלל מתחלק לשתי קבוצות: *סיווג בינארי* ו-*סיווג רב-קטגורי*.\n",
    "\n",
    "זכור:\n",
    "\n",
    "-   **רגרסיה ליניארית** עזרה לך לחזות קשרים בין משתנים ולבצע תחזיות מדויקות על היכן נקודת נתונים חדשה תיפול ביחס לקו. כך, למשל, יכולת לחזות ערכים מספריים כמו *מה יהיה מחיר הדלעת בספטמבר לעומת דצמבר*.\n",
    "\n",
    "-   **רגרסיה לוגיסטית** עזרה לך לגלות \"קטגוריות בינאריות\": בנקודת מחיר זו, *האם הדלעת כתומה או לא-כתומה*?\n",
    "\n",
    "סיווג משתמש באלגוריתמים שונים כדי לקבוע דרכים אחרות לזיהוי התווית או הקטגוריה של נקודת נתונים. בואו נעבוד עם נתוני המטבחים הללו כדי לראות האם, על ידי התבוננות בקבוצת מרכיבים, נוכל לקבוע את מקור המטבח.\n",
    "\n",
    "### [**שאלון לפני השיעור**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/19/)\n",
    "\n",
    "### **מבוא**\n",
    "\n",
    "סיווג הוא אחת הפעילויות המרכזיות של חוקרי למידת מכונה ומדעני נתונים. החל מסיווג בסיסי של ערך בינארי (\"האם האימייל הזה הוא ספאם או לא?\"), ועד סיווג תמונות מורכב ופילוח באמצעות ראייה ממוחשבת, תמיד מועיל להיות מסוגל למיין נתונים לקבוצות ולשאול שאלות עליהם.\n",
    "\n",
    "במונחים מדעיים יותר, שיטת הסיווג שלך יוצרת מודל חיזוי שמאפשר לך למפות את הקשר בין משתני הקלט למשתני הפלט.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/binary-multiclass.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>בעיות בינאריות לעומת רב-קטגוריות עבור אלגוריתמי סיווג. אינפוגרפיקה מאת ג'ן לופר</figcaption>\n",
    "\n",
    "\n",
    "\n",
    "לפני שנתחיל בתהליך ניקוי הנתונים שלנו, הדמייתם והכנתם למשימות הלמידה שלנו, בואו נלמד מעט על הדרכים השונות שבהן ניתן להשתמש בלמידת מכונה כדי לסווג נתונים.\n",
    "\n",
    "נגזר מ[סטטיסטיקה](https://wikipedia.org/wiki/Statistical_classification), סיווג באמצעות למידת מכונה קלאסית משתמש בתכונות כמו `מעשן`, `משקל`, ו-`גיל` כדי לקבוע *סבירות לפתח מחלה X*. כטכניקת למידה מונחית הדומה לתרגילי הרגרסיה שביצעתם קודם לכן, הנתונים שלכם מתויגים והאלגוריתמים של הלמידה משתמשים בתוויות הללו כדי לסווג ולחזות קטגוריות (או 'תכונות') של מערך נתונים ולשייך אותם לקבוצה או תוצאה.\n",
    "\n",
    "✅ הקדש רגע לדמיין מערך נתונים על מטבחים. מה מודל רב-קטגורי יוכל לענות? מה מודל בינארי יוכל לענות? מה אם היית רוצה לקבוע האם מטבח מסוים נוטה להשתמש בחילבה? ומה אם היית רוצה לבדוק האם, בהינתן שקית מצרכים מלאה באניס כוכבים, ארטישוק, כרובית וחזרת, תוכל ליצור מנה הודית טיפוסית?\n",
    "\n",
    "### **שלום 'מסווג'**\n",
    "\n",
    "השאלה שאנחנו רוצים לשאול ממערך הנתונים של המטבחים היא למעשה שאלה **רב-קטגורית**, שכן יש לנו כמה מטבחים לאומיים פוטנציאליים לעבוד איתם. בהינתן קבוצת מרכיבים, לאיזו מהקטגוריות הרבות הנתונים יתאימו?\n",
    "\n",
    "Tidymodels מציעה מספר אלגוריתמים שונים לשימוש בסיווג נתונים, בהתאם לסוג הבעיה שברצונך לפתור. בשני השיעורים הבאים, תלמד על כמה מהאלגוריתמים הללו.\n",
    "\n",
    "#### **דרישות מקדימות**\n",
    "\n",
    "לשיעור זה, נדרוש את החבילות הבאות לניקוי, הכנה והדמיה של הנתונים שלנו:\n",
    "\n",
    "-   `tidyverse`: [tidyverse](https://www.tidyverse.org/) הוא [אוסף של חבילות R](https://www.tidyverse.org/packages) שנועד להפוך את מדע הנתונים למהיר, קל ומהנה יותר!\n",
    "\n",
    "-   `tidymodels`: [tidymodels](https://www.tidymodels.org/) הוא מסגרת [אוסף חבילות](https://www.tidymodels.org/packages/) למידול ולמידת מכונה.\n",
    "\n",
    "-   `DataExplorer`: [חבילת DataExplorer](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) נועדה לפשט ולהפוך את תהליך ה-EDA ודיווח אוטומטי.\n",
    "\n",
    "-   `themis`: [חבילת themis](https://themis.tidymodels.org/) מספקת שלבים נוספים לטיפול בנתונים לא מאוזנים.\n",
    "\n",
    "ניתן להתקין אותם כך:\n",
    "\n",
    "`install.packages(c(\"tidyverse\", \"tidymodels\", \"DataExplorer\", \"here\"))`\n",
    "\n",
    "לחילופין, הסקריפט הבא בודק האם יש לך את החבילות הנדרשות להשלמת מודול זה ומתקין אותן עבורך במקרה שהן חסרות.\n"
   ],
   "metadata": {
    "id": "ri5bQxZ-Fz_0"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n",
    "\r\n",
    "pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)"
   ],
   "outputs": [],
   "metadata": {
    "id": "KIPxa4elGAPI"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "בהמשך נטען את החבילות המדהימות הללו ונעשה אותן זמינות בסשן R הנוכחי שלנו. (זה רק להמחשה, `pacman::p_load()` כבר עשה זאת עבורך)\n"
   ],
   "metadata": {
    "id": "YkKAxOJvGD4C"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## תרגיל - ניקוי ואיזון הנתונים שלך\n",
    "\n",
    "המשימה הראשונה, לפני שמתחילים את הפרויקט הזה, היא לנקות ול**אזן** את הנתונים שלך כדי לקבל תוצאות טובות יותר.\n",
    "\n",
    "בואו נכיר את הנתונים! 🕵️\n"
   ],
   "metadata": {
    "id": "PFkQDlk0GN5O"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Import data\r\n",
    "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n",
    "\r\n",
    "# View the first 5 rows\r\n",
    "df %>% \r\n",
    "  slice_head(n = 5)\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "Qccw7okxGT0S"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "מעניין! מהמראה של זה, העמודה הראשונה היא סוג של עמודת `id`. בואו נקבל קצת יותר מידע על הנתונים.\n"
   ],
   "metadata": {
    "id": "XrWnlgSrGVmR"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Basic information about the data\r\n",
    "df %>%\r\n",
    "  introduce()\r\n",
    "\r\n",
    "# Visualize basic information above\r\n",
    "df %>% \r\n",
    "  plot_intro(ggtheme = theme_light())"
   ],
   "outputs": [],
   "metadata": {
    "id": "4UcGmxRxGieA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "מהתוצאה, אנו יכולים לראות מיד שיש לנו `2448` שורות ו-`385` עמודות ו-`0` ערכים חסרים. בנוסף, יש לנו עמודה אחת דיסקרטית, *cuisine*.\n",
    "\n",
    "## תרגיל - ללמוד על סוגי המטבחים\n",
    "\n",
    "עכשיו העבודה מתחילה להיות יותר מעניינת. בואו נגלה את התפלגות הנתונים, לפי סוג המטבח.\n"
   ],
   "metadata": {
    "id": "AaPubl__GmH5"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Count observations per cuisine\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(n)\r\n",
    "\r\n",
    "# Plot the distribution\r\n",
    "theme_set(theme_light())\r\n",
    "df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n",
    "  geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n",
    "  ylab(\"cuisine\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "FRsBVy5eGrrv"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "יש מספר מוגבל של מטבחים, אך חלוקת הנתונים אינה אחידה. אתם יכולים לתקן את זה! לפני כן, חקרו קצת יותר.\n",
    "\n",
    "כעת, בואו נייחס כל מטבח לטיבּל משלו ונגלה כמה נתונים זמינים (שורות, עמודות) לכל מטבח.\n",
    "\n",
    "> [טיבּל](https://tibble.tidyverse.org/) הוא מסגרת נתונים מודרנית.\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/dplyr_filter.jpg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>יצירה מאת @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "vVvyDb1kG2in"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Create individual tibble for the cuisines\r\n",
    "thai_df <- df %>% \r\n",
    "  filter(cuisine == \"thai\")\r\n",
    "japanese_df <- df %>% \r\n",
    "  filter(cuisine == \"japanese\")\r\n",
    "chinese_df <- df %>% \r\n",
    "  filter(cuisine == \"chinese\")\r\n",
    "indian_df <- df %>% \r\n",
    "  filter(cuisine == \"indian\")\r\n",
    "korean_df <- df %>% \r\n",
    "  filter(cuisine == \"korean\")\r\n",
    "\r\n",
    "\r\n",
    "# Find out how much data is available per cuisine\r\n",
    "cat(\" thai df:\", dim(thai_df), \"\\n\",\r\n",
    "    \"japanese df:\", dim(japanese_df), \"\\n\",\r\n",
    "    \"chinese_df:\", dim(chinese_df), \"\\n\",\r\n",
    "    \"indian_df:\", dim(indian_df), \"\\n\",\r\n",
    "    \"korean_df:\", dim(korean_df))"
   ],
   "outputs": [],
   "metadata": {
    "id": "0TvXUxD3G8Bk"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## **תרגיל - גילוי מרכיבים מובילים לפי מטבח באמצעות dplyr**\n",
    "\n",
    "עכשיו תוכלו להעמיק בנתונים ולגלות מהם המרכיבים האופייניים לכל מטבח. כדאי לנקות נתונים חוזרים שיוצרים בלבול בין מטבחים, אז בואו נלמד על הבעיה הזו.\n",
    "\n",
    "צרו פונקציה בשם `create_ingredient()` ב-R שמחזירה מסגרת נתונים של מרכיבים. הפונקציה תתחיל בהסרת עמודה שאינה מועילה ותמיין את המרכיבים לפי הספירה שלהם.\n",
    "\n",
    "המבנה הבסיסי של פונקציה ב-R הוא:\n",
    "\n",
    "`myFunction <- function(arglist){`\n",
    "\n",
    "**`...`**\n",
    "\n",
    "**`return`**`(value)`\n",
    "\n",
    "`}`\n",
    "\n",
    "הקדמה מסודרת לפונקציות ב-R ניתן למצוא [כאן](https://skirmer.github.io/presentations/functions_with_r.html#1).\n",
    "\n",
    "בואו נצלול ישר לעניין! נשתמש בפעולות של [dplyr](https://dplyr.tidyverse.org/) שלמדנו בשיעורים הקודמים. כתזכורת:\n",
    "\n",
    "-   `dplyr::select()`: עוזרת לכם לבחור אילו **עמודות** לשמור או להסיר.\n",
    "\n",
    "-   `dplyr::pivot_longer()`: עוזרת \"להאריך\" נתונים, להגדיל את מספר השורות ולהקטין את מספר העמודות.\n",
    "\n",
    "-   `dplyr::group_by()` ו-`dplyr::summarise()`: עוזרות לכם למצוא סטטיסטיקות סיכום עבור קבוצות שונות ולהציג אותן בטבלה מסודרת.\n",
    "\n",
    "-   `dplyr::filter()`: יוצרת תת-קבוצה של הנתונים שמכילה רק שורות שעומדות בתנאים שלכם.\n",
    "\n",
    "-   `dplyr::mutate()`: עוזרת לכם ליצור או לשנות עמודות.\n",
    "\n",
    "עיינו במדריך [*האמנותי*](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome) של אליסון הורסט, שמציג כמה פונקציות שימושיות לעיבוד נתונים ב-dplyr *(חלק מה-Tidyverse)*.\n"
   ],
   "metadata": {
    "id": "K3RF5bSCHC76"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Creates a functions that returns the top ingredients by class\r\n",
    "\r\n",
    "create_ingredient <- function(df){\r\n",
    "  \r\n",
    "  # Drop the id column which is the first colum\r\n",
    "  ingredient_df = df %>% select(-1) %>% \r\n",
    "  # Transpose data to a long format\r\n",
    "    pivot_longer(!cuisine, names_to = \"ingredients\", values_to = \"count\") %>% \r\n",
    "  # Find the top most ingredients for a particular cuisine\r\n",
    "    group_by(ingredients) %>% \r\n",
    "    summarise(n_instances = sum(count)) %>% \r\n",
    "    filter(n_instances != 0) %>% \r\n",
    "  # Arrange by descending order\r\n",
    "    arrange(desc(n_instances)) %>% \r\n",
    "    mutate(ingredients = factor(ingredients) %>% fct_inorder())\r\n",
    "  \r\n",
    "  \r\n",
    "  return(ingredient_df)\r\n",
    "} # End of function"
   ],
   "outputs": [],
   "metadata": {
    "id": "uB_0JR82HTPa"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "עכשיו נוכל להשתמש בפונקציה כדי לקבל מושג על עשרת המרכיבים הפופולריים ביותר לפי מטבח. בואו ננסה את זה עם `thai_df`.\n"
   ],
   "metadata": {
    "id": "h9794WF8HWmc"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Call create_ingredient and display popular ingredients\r\n",
    "thai_ingredient_df <- create_ingredient(df = thai_df)\r\n",
    "\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10)"
   ],
   "outputs": [],
   "metadata": {
    "id": "agQ-1HrcHaEA"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "בקטע הקודם השתמשנו ב-`geom_col()`, בואו נראה כיצד ניתן להשתמש גם ב-`geom_bar` ליצירת תרשימי עמודות. השתמשו ב-`?geom_bar` לקריאה נוספת.\n"
   ],
   "metadata": {
    "id": "kHu9ffGjHdcX"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Make a bar chart for popular thai cuisines\r\n",
    "thai_ingredient_df %>% \r\n",
    "  slice_head(n = 10) %>% \r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"steelblue\") +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "fb3Bx_3DHj6e"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "בואו נעשה את אותו הדבר עבור הנתונים היפניים\n"
   ],
   "metadata": {
    "id": "RHP_xgdkHnvM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Japanese cuisines and make bar chart\r\n",
    "create_ingredient(df = japanese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"darkorange\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")\r\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "019v8F0XHrRU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "מה לגבי המטבחים הסיניים?\n"
   ],
   "metadata": {
    "id": "iIGM7vO8Hu3v"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Chinese cuisines and make bar chart\r\n",
    "create_ingredient(df = chinese_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"cyan4\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lHd9_gd2HyzU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "בואו נסתכל על המטבחים ההודיים 🌶️.\n"
   ],
   "metadata": {
    "id": "ir8qyQbNH1c7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Indian cuisines and make bar chart\r\n",
    "create_ingredient(df = indian_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#041E42FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "ApukQtKjH5FO"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "לבסוף, שרטט את המרכיבים הקוריאניים.\n"
   ],
   "metadata": {
    "id": "qv30cwY1H-FM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get popular ingredients for Korean cuisines and make bar chart\r\n",
    "create_ingredient(df = korean_df) %>% \r\n",
    "  slice_head(n = 10) %>%\r\n",
    "  ggplot(aes(x = n_instances, y = ingredients)) +\r\n",
    "  geom_bar(stat = \"identity\", width = 0.5, fill = \"#852419FF\", alpha = 0.8) +\r\n",
    "  xlab(\"\") + ylab(\"\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "lumgk9cHIBie"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "מתוך הוויזואליזציות של הנתונים, כעת נוכל להסיר את המרכיבים הנפוצים ביותר שיוצרים בלבול בין מטבחים שונים, באמצעות `dplyr::select()`.\n",
    "\n",
    "כולם אוהבים אורז, שום וג'ינג'ר!\n"
   ],
   "metadata": {
    "id": "iO4veMXuIEta"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Drop id column, rice, garlic and ginger from our original data set\r\n",
    "df_select <- df %>% \r\n",
    "  select(-c(1, rice, garlic, ginger))\r\n",
    "\r\n",
    "# Display new data set\r\n",
    "df_select %>% \r\n",
    "  slice_head(n = 5)"
   ],
   "outputs": [],
   "metadata": {
    "id": "iHJPiG6rIUcK"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## עיבוד נתונים באמצעות מתכונים 👩‍🍳👨‍🍳 - התמודדות עם נתונים לא מאוזנים ⚖️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/recipes.png\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>יצירה מאת @allison_horst</figcaption>\n",
    "\n",
    "מכיוון שהשיעור הזה עוסק במטבחים, עלינו לשים את `recipes` בהקשר.\n",
    "\n",
    "Tidymodels מספקת עוד חבילה נהדרת: `recipes` - חבילה לעיבוד מקדים של נתונים.\n"
   ],
   "metadata": {
    "id": "kkFd-JxdIaL6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "בואו נבחן שוב את התפלגות המטבחים שלנו.\n"
   ],
   "metadata": {
    "id": "6l2ubtTPJAhY"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "old_label_count <- df_select %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "old_label_count"
   ],
   "outputs": [],
   "metadata": {
    "id": "1e-E9cb7JDVi"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "כפי שניתן לראות, ישנה חלוקה לא שוויונית במספר סוגי המטבחים. מטבח קוריאני מופיע כמעט פי שלושה ממטבח תאילנדי. נתונים לא מאוזנים משפיעים לרוב באופן שלילי על ביצועי המודל. חשבו על סיווג בינארי: אם רוב הנתונים שלכם שייכים למחלקה אחת, מודל למידת מכונה ייטה לנבא את המחלקה הזו בתדירות גבוהה יותר, פשוט כי יש יותר נתונים עבורה. איזון הנתונים לוקח נתונים מוטים ועוזר להסיר את חוסר האיזון הזה. רבים מהמודלים מתפקדים בצורה הטובה ביותר כאשר מספר התצפיות שווה, ולכן מתקשים עם נתונים לא מאוזנים.\n",
    "\n",
    "ישנן שתי דרכים עיקריות להתמודד עם מערכי נתונים לא מאוזנים:\n",
    "\n",
    "-   הוספת תצפיות למחלקה המיעוט: `Over-sampling`, לדוגמה שימוש באלגוריתם SMOTE\n",
    "\n",
    "-   הסרת תצפיות ממחלקת הרוב: `Under-sampling`\n",
    "\n",
    "כעת נדגים כיצד להתמודד עם מערכי נתונים לא מאוזנים באמצעות `recipe`. ניתן לחשוב על recipe כעל תכנית פעולה שמתארת אילו שלבים יש ליישם על מערך נתונים כדי להכין אותו לניתוח נתונים.\n"
   ],
   "metadata": {
    "id": "soAw6826JKx9"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Load themis package for dealing with imbalanced data\r\n",
    "library(themis)\r\n",
    "\r\n",
    "# Create a recipe for preprocessing data\r\n",
    "cuisines_recipe <- recipe(cuisine ~ ., data = df_select) %>% \r\n",
    "  step_smote(cuisine)\r\n",
    "\r\n",
    "cuisines_recipe"
   ],
   "outputs": [],
   "metadata": {
    "id": "HS41brUIJVJy"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "בואו נפרק את שלבי העיבוד המוקדם שלנו.\n",
    "\n",
    "-   הקריאה ל-`recipe()` עם נוסחה מגדירה ל-recipe את *התפקידים* של המשתנים תוך שימוש בנתוני `df_select` כנקודת ייחוס. לדוגמה, לעמודה `cuisine` הוקצה תפקיד של `outcome`, בעוד שלשאר העמודות הוקצה תפקיד של `predictor`.\n",
    "\n",
    "-   [`step_smote(cuisine)`](https://themis.tidymodels.org/reference/step_smote.html) יוצרת *מפרט* של שלב ב-recipe שמייצר באופן סינתטי דוגמאות חדשות של הקטגוריה המיעוטית תוך שימוש בשכנים הקרובים של המקרים הללו.\n",
    "\n",
    "עכשיו, אם נרצה לראות את הנתונים שעברו עיבוד מוקדם, נצטרך [**`prep()`**](https://recipes.tidymodels.org/reference/prep.html) ו-[**`bake()`**](https://recipes.tidymodels.org/reference/bake.html) עבור ה-recipe שלנו.\n",
    "\n",
    "`prep()`: מעריך את הפרמטרים הנדרשים מתוך קבוצת אימון, שניתן ליישם מאוחר יותר על קבוצות נתונים אחרות.\n",
    "\n",
    "`bake()`: לוקח recipe שעבר הכנה ומבצע את הפעולות על כל קבוצת נתונים.\n"
   ],
   "metadata": {
    "id": "Yb-7t7XcJaC8"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Prep and bake the recipe\r\n",
    "preprocessed_df <- cuisines_recipe %>% \r\n",
    "  prep() %>% \r\n",
    "  bake(new_data = NULL) %>% \r\n",
    "  relocate(cuisine)\r\n",
    "\r\n",
    "# Display data\r\n",
    "preprocessed_df %>% \r\n",
    "  slice_head(n = 5)\r\n",
    "\r\n",
    "# Quick summary stats\r\n",
    "preprocessed_df %>% \r\n",
    "  introduce()"
   ],
   "outputs": [],
   "metadata": {
    "id": "9QhSgdpxJl44"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "בואו נבדוק כעת את התפלגות המטבחים ונשווה אותם לנתונים הלא מאוזנים.\n"
   ],
   "metadata": {
    "id": "dmidELh_LdV7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Distribution of cuisines\r\n",
    "new_label_count <- preprocessed_df %>% \r\n",
    "  count(cuisine) %>% \r\n",
    "  arrange(desc(n))\r\n",
    "\r\n",
    "list(new_label_count = new_label_count,\r\n",
    "     old_label_count = old_label_count)"
   ],
   "outputs": [],
   "metadata": {
    "id": "aSh23klBLwDz"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "יאמי! הנתונים נקיים, מאוזנים, וממש טעימים 😋!\n",
    "\n",
    "> בדרך כלל, מתכון משמש כמעבד מקדים למידול, שבו הוא מגדיר אילו שלבים יש ליישם על מערך נתונים כדי להכין אותו למידול. במקרה כזה, בדרך כלל משתמשים ב-`workflow()` (כפי שכבר ראינו בשיעורים הקודמים) במקום להעריך מתכון באופן ידני.\n",
    ">\n",
    "> לכן, בדרך כלל אין צורך להשתמש ב-**`prep()`** ו-**`bake()`** כשמשתמשים ב-tidymodels, אבל אלו פונקציות מועילות שיהיה בארגז הכלים שלכם כדי לוודא שהמתכונים עושים את מה שציפיתם, כמו במקרה שלנו.\n",
    ">\n",
    "> כשאתם משתמשים ב-**`bake()`** על מתכון שעבר הכנה עם **`new_data = NULL`**, אתם מקבלים בחזרה את הנתונים שסיפקתם בעת הגדרת המתכון, אבל לאחר שעברו את שלבי העיבוד המקדים.\n",
    "\n",
    "בואו נשמור עכשיו עותק של הנתונים האלה לשימוש בשיעורים עתידיים:\n"
   ],
   "metadata": {
    "id": "HEu80HZ8L7ae"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Save preprocessed data\r\n",
    "write_csv(preprocessed_df, \"../../../data/cleaned_cuisines_R.csv\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "cBmCbIgrMOI6"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "קובץ ה-CSV החדש הזה נמצא כעת בתיקיית הנתונים הראשית.\n",
    "\n",
    "**🚀אתגר**\n",
    "\n",
    "תוכנית הלימודים הזו מכילה כמה מערכי נתונים מעניינים. חפשו בתיקיות `data` ובדקו אם יש מערכי נתונים שמתאימים לסיווג בינארי או רב-קטגורי. אילו שאלות הייתם שואלים על מערך הנתונים הזה?\n",
    "\n",
    "## [**שאלון לאחר ההרצאה**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/20/)\n",
    "\n",
    "## **סקירה ולימוד עצמי**\n",
    "\n",
    "-   בדקו את [חבילת themis](https://github.com/tidymodels/themis). אילו טכניקות נוספות אפשר להשתמש בהן כדי להתמודד עם נתונים לא מאוזנים?\n",
    "\n",
    "-   אתר [העזר למודלים מסודרים](https://www.tidymodels.org/start/).\n",
    "\n",
    "-   ה. וויקהאם וג. גרולמונד, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).\n",
    "\n",
    "#### תודה ל:\n",
    "\n",
    "[`אליסון הורסט`](https://twitter.com/allison_horst/) על יצירת האיורים המדהימים שהופכים את R למזמינה ומרתקת יותר. מצאו עוד איורים בגלריה שלה [כאן](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
    "\n",
    "[קסי ברוויו](https://www.twitter.com/cassieview) ו[ג'ן לופר](https://www.twitter.com/jenlooper) על יצירת הגרסה המקורית של המודול הזה בפייתון ♥️\n",
    "\n",
    "<p >\n",
    "   <img src=\"../../images/r_learners_sm.jpeg\"\n",
    "   width=\"600\"/>\n",
    "   <figcaption>יצירת אמנות מאת @allison_horst</figcaption>\n"
   ],
   "metadata": {
    "id": "WQs5621pMGwf"
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n---\n\n**כתב ויתור**:  \nמסמך זה תורגם באמצעות שירות תרגום מבוסס בינה מלאכותית [Co-op Translator](https://github.com/Azure/co-op-translator). למרות שאנו שואפים לדיוק, יש לקחת בחשבון שתרגומים אוטומטיים עשויים להכיל שגיאות או אי-דיוקים. המסמך המקורי בשפתו המקורית נחשב למקור הסמכותי. למידע קריטי, מומלץ להשתמש בתרגום מקצועי על ידי מתרגם אנושי. איננו נושאים באחריות לכל אי-הבנה או פרשנות שגויה הנובעת משימוש בתרגום זה.\n"
   ]
  }
 ]
}