diff --git a/4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb b/4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb index 251bbe082..4592429f9 100644 --- a/4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb +++ b/4-Classification/1-Introduction/solution/R/lesson_10-R.ipynb @@ -103,8 +103,8 @@ "cell_type": "code", "execution_count": null, "source": [ - "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n", - "\n", + "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\r\n", + "\r\n", "pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)" ], "outputs": [], @@ -138,12 +138,12 @@ "cell_type": "code", "execution_count": null, "source": [ - "# Import data\n", - "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\n", - "\n", - "# View the first 5 rows\n", - "df %>% \n", - " slice_head(n = 5)\n" + "# Import data\r\n", + "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\r\n", + "\r\n", + "# View the first 5 rows\r\n", + "df %>% \r\n", + " slice_head(n = 5)\r\n" ], "outputs": [], "metadata": { @@ -163,12 +163,12 @@ "cell_type": "code", "execution_count": null, "source": [ - "# Basic information about the data\n", - "df %>%\n", - " introduce()\n", - "\n", - "# Visualize basic information above\n", - "df %>% \n", + "# Basic information about the data\r\n", + "df %>%\r\n", + " introduce()\r\n", + "\r\n", + "# Visualize basic information above\r\n", + "df %>% \r\n", " plot_intro(ggtheme = theme_light())" ], "outputs": [], @@ -193,17 +193,17 @@ "cell_type": "code", "execution_count": null, "source": [ - "# Count observations per cuisine\n", - "df %>% \n", - " count(cuisine) %>% \n", - " arrange(n)\n", - "\n", - "# Plot the distribution\n", - "theme_set(theme_light())\n", - "df %>% \n", - " count(cuisine) %>% \n", - " ggplot(mapping = aes(x = 
n, y = reorder(cuisine, -n))) +\n", - " geom_col(fill = \"midnightblue\", alpha = 0.7) +\n", + "# Count observations per cuisine\r\n", + "df %>% \r\n", + " count(cuisine) %>% \r\n", + " arrange(n)\r\n", + "\r\n", + "# Plot the distribution\r\n", + "theme_set(theme_light())\r\n", + "df %>% \r\n", + " count(cuisine) %>% \r\n", + " ggplot(mapping = aes(x = n, y = reorder(cuisine, -n))) +\r\n", + " geom_col(fill = \"midnightblue\", alpha = 0.7) +\r\n", " ylab(\"cuisine\")" ], "outputs": [], @@ -214,15 +214,17 @@ { "cell_type": "markdown", "source": [ - "There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.\n", - "\n", - "Next, let's assign each cuisine into its individual table and find out how much data is available (rows, columns) per cuisine.\n", - "\n", - "
\n",
- "
\n",
- "
\r\n",
+ "
\r\n",
+ "
\n",
- "
\n",
- "
\r\n",
+ "
\r\n",
+ "
\n",
- "
\n",
- "
\r\n",
+ " \r\n",
+ " \r\n",
+ "
\r\n",
+ " \n",
+ "
\n"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " cuisine n \n",
+ "1 korean 799\n",
+ "2 indian 598\n",
+ "3 chinese 442\n",
+ "4 japanese 320\n",
+ "5 thai 289"
+ ],
+ "text/markdown": [
+ "\n",
+ "A tibble: 5 × 2\n",
+ "\n",
+ "| cuisine <fct> | n <int> |\n",
+ "|---|---|\n",
+ "| korean | 799 |\n",
+ "| indian | 598 |\n",
+ "| chinese | 442 |\n",
+ "| japanese | 320 |\n",
+ "| thai | 289 |\n",
+ "\n"
+ ],
+ "text/latex": [
+ "A tibble: 5 × 2\n",
+ "\\begin{tabular}{ll}\n",
+ " cuisine & n\\\\\n",
+ " \n",
+ "\tcuisine almond angelica anise anise_seed apple apple_brandy apricot armagnac artemisia ⋯ whiskey white_bread white_wine whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n",
+ "\n",
+ "\n",
+ "\t<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ⋯ <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> \n",
+ "\tindian 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 0 0 \n",
+ "\tindian 1 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 0 0 \n",
+ "\tindian 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 0 0 \n",
+ "\tindian 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 0 0 \n",
+ "\n",
+ "indian 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 1 0 \n",
+ "
\n"
+ ]
+ },
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 735
+ },
+ "id": "jhCrrH22IWVR",
+ "outputId": "d444a85c-1d8b-485f-bc4f-8be2e8f8217c"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Perfect! Now, time to split the data such that 70% of the data goes to training and 30% goes to testing. We'll also apply a `stratification` technique when splitting the data to `maintain the proportion of each cuisine` in the training and test datasets.\n",
+ "\n",
+ "[rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:"
+ ],
+ "metadata": {
+ "id": "AYTjVyajIdny"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "source": [
+ "# Load the core Tidymodels packages into R session\r\n",
+ "library(tidymodels)\r\n",
+ "\r\n",
+ "# Create split specification\r\n",
+ "set.seed(2056)\r\n",
+ "cuisines_split <- initial_split(data = df_select,\r\n",
+ " strata = cuisine,\r\n",
+ " prop = 0.7)\r\n",
+ "\r\n",
+ "# Extract the data in each split\r\n",
+ "cuisines_train <- training(cuisines_split)\r\n",
+ "cuisines_test <- testing(cuisines_split)\r\n",
+ "\r\n",
+ "# Print the number of cases in each split\r\n",
+ "cat(\"Training cases: \", nrow(cuisines_train), \"\\n\",\r\n",
+ " \"Test cases: \", nrow(cuisines_test), sep = \"\")\r\n",
+ "\r\n",
+ "# Display the first few rows of the training set\r\n",
+ "cuisines_train %>% \r\n",
+ " slice_head(n = 5)\r\n",
+ "\r\n",
+ "\r\n",
+ "# Display distribution of cuisines in the training set\r\n",
+ "cuisines_train %>% \r\n",
+ " count(cuisine) %>% \r\n",
+ " arrange(desc(n))"
+ ],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Training cases: 1712\n",
+ "Test cases: 736"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " cuisine almond angelica anise anise_seed apple apple_brandy apricot armagnac\n",
+ "1 chinese 0 0 0 0 0 0 0 0 \n",
+ "2 chinese 0 0 0 0 0 0 0 0 \n",
+ "3 chinese 0 0 0 0 0 0 0 0 \n",
+ "4 chinese 0 0 0 0 0 0 0 0 \n",
+ "5 chinese 0 0 0 0 0 0 0 0 \n",
+ " artemisia ⋯ whiskey white_bread white_wine whole_grain_wheat_flour wine wood\n",
+ "1 0 ⋯ 0 0 0 0 1 0 \n",
+ "2 0 ⋯ 0 0 0 0 1 0 \n",
+ "3 0 ⋯ 0 0 0 0 0 0 \n",
+ "4 0 ⋯ 0 0 0 0 0 0 \n",
+ "5 0 ⋯ 0 0 0 0 0 0 \n",
+ " yam yeast yogurt zucchini\n",
+ "1 0 0 0 0 \n",
+ "2 0 0 0 0 \n",
+ "3 0 0 0 0 \n",
+ "4 0 0 0 0 \n",
+ "5 0 0 0 0 "
+ ],
+ "text/markdown": [
+ "\n",
+ "A tibble: 5 × 381\n",
+ "\n",
+ "| cuisine <fct> | almond <dbl> | angelica <dbl> | anise <dbl> | anise_seed <dbl> | apple <dbl> | apple_brandy <dbl> | apricot <dbl> | armagnac <dbl> | artemisia <dbl> | ⋯ ⋯ | whiskey <dbl> | white_bread <dbl> | white_wine <dbl> | whole_grain_wheat_flour <dbl> | wine <dbl> | wood <dbl> | yam <dbl> | yeast <dbl> | yogurt <dbl> | zucchini <dbl> |\n",
+ "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
+ "| chinese | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |\n",
+ "| chinese | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |\n",
+ "| chinese | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
+ "| chinese | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
+ "| chinese | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
+ "\n"
+ ],
+ "text/latex": [
+ "A tibble: 5 × 381\n",
+ "\\begin{tabular}{lllllllllllllllllllll}\n",
+ " cuisine & almond & angelica & anise & anise\\_seed & apple & apple\\_brandy & apricot & armagnac & artemisia & ⋯ & whiskey & white\\_bread & white\\_wine & whole\\_grain\\_wheat\\_flour & wine & wood & yam & yeast & yogurt & zucchini\\\\\n",
+ " \n",
+ "\tcuisine n \n",
+ "\n",
+ "\n",
+ "\t<fct> <int> \n",
+ "\tkorean 799 \n",
+ "\tindian 598 \n",
+ "\tchinese 442 \n",
+ "\tjapanese 320 \n",
+ "\n",
+ "thai 289 \n",
+ "
\n"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " cuisine n \n",
+ "1 korean 559\n",
+ "2 indian 418\n",
+ "3 chinese 309\n",
+ "4 japanese 224\n",
+ "5 thai 202"
+ ],
+ "text/markdown": [
+ "\n",
+ "A tibble: 5 × 2\n",
+ "\n",
+ "| cuisine <fct> | n <int> |\n",
+ "|---|---|\n",
+ "| korean | 559 |\n",
+ "| indian | 418 |\n",
+ "| chinese | 309 |\n",
+ "| japanese | 224 |\n",
+ "| thai | 202 |\n",
+ "\n"
+ ],
+ "text/latex": [
+ "A tibble: 5 × 2\n",
+ "\\begin{tabular}{ll}\n",
+ " cuisine & n\\\\\n",
+ " \n",
+ "\tcuisine almond angelica anise anise_seed apple apple_brandy apricot armagnac artemisia ⋯ whiskey white_bread white_wine whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n",
+ "\n",
+ "\n",
+ "\t<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ⋯ <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> \n",
+ "\tchinese 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 1 0 0 0 0 0 \n",
+ "\tchinese 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 1 0 0 0 0 0 \n",
+ "\tchinese 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 0 0 \n",
+ "\tchinese 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 0 0 \n",
+ "\n",
+ "chinese 0 0 0 0 0 0 0 0 0 ⋯ 0 0 0 0 0 0 0 0 0 0 \n",
+ "
\n"
+ ]
+ },
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 535
+ },
+ "id": "w5FWIkEiIjdN",
+ "outputId": "2e195fd9-1a8f-4b91-9573-cce5582242df"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 2. Deal with imbalanced data\n",
+ "\n",
+ "As you might have noticed in the original data set as well as in our training set, the cuisines are quite unevenly distributed. Korean recipes are *almost* 3 times as numerous as Thai recipes. Imbalanced data often has negative effects on model performance. Many models perform best when the number of observations per class is roughly equal and, thus, tend to struggle with imbalanced data.\n",
+ "\n",
+ "There are two main ways of dealing with imbalanced data sets:\n",
+ "\n",
+ "- adding observations to the minority class: `Over-sampling`, e.g. using the SMOTE algorithm, which synthetically generates new examples of the minority class using the nearest neighbors of these cases.\n",
+ "\n",
+ "- removing observations from the majority class: `Under-sampling`\n",
+ "\n",
+ "In our previous lesson, we demonstrated how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set in order to get it ready for data analysis. In our case, we want to have an equal distribution in the number of our cuisines for our `training set`. Let's get right into it."
+ ],
+ "metadata": {
+ "id": "daBi9qJNIwqW"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "source": [
+ "# Load themis package for dealing with imbalanced data\r\n",
+ "library(themis)\r\n",
+ "\r\n",
+ "# Create a recipe for preprocessing training data\r\n",
+ "cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>% \r\n",
+ " step_smote(cuisine)\r\n",
+ "\r\n",
+ "# Print recipe\r\n",
+ "cuisines_recipe"
+ ],
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "Data Recipe\n",
+ "\n",
+ "Inputs:\n",
+ "\n",
+ " role #variables\n",
+ " outcome 1\n",
+ " predictor 380\n",
+ "\n",
+ "Operations:\n",
+ "\n",
+ "SMOTE based on cuisine"
+ ]
+ },
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 200
+ },
+ "id": "Az6LFBGxI1X0",
+ "outputId": "29d71d85-64b0-4e62-871e-bcd5398573b6"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "You can of course go ahead and confirm (using prep+bake) that the recipe works as you expect, with every cuisine label having `559` observations.\r\n",
+ "\r\n",
+ "Since we'll be using this recipe as a preprocessor for modeling, a `workflow()` will do all the prep and bake for us, so we won't have to manually estimate the recipe.\r\n",
+ "\r\n",
+ "Now we are ready to train a model 👩‍💻👨‍💻!\r\n",
+ "\r\n",
+ "## 3. Choosing your classifier\r\n",
+ "\r\n",
+ " \n",
+ "\tcuisine n \n",
+ "\n",
+ "\n",
+ "\t<fct> <int> \n",
+ "\tkorean 559 \n",
+ "\tindian 418 \n",
+ "\tchinese 309 \n",
+ "\tjapanese 224 \n",
+ "\n",
+ "thai 202
\r\n",
+ "
\r\n",
+ " \n",
+ "
\n"
+ ]
+ },
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 248
+ },
+ "id": "CqtckvtsKqax",
+ "outputId": "e57fe557-6a68-4217-fe82-173328c5436d"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Great job! In Tidymodels, evaluating model performance can be done using [yardstick](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics. As we did in our logistic regression lesson, let's begin by computing a confusion matrix."
+ ],
+ "metadata": {
+ "id": "8w5N6XsBKss7"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "source": [
+ "# Confusion matrix for categorical data\n",
+ "conf_mat(data = results, truth = cuisine, estimate = .pred_class)\n"
+ ],
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " Truth\n",
+ "Prediction chinese indian japanese korean thai\n",
+ " chinese 83 1 8 15 10\n",
+ " indian 4 163 1 2 6\n",
+ " japanese 21 5 73 25 1\n",
+ " korean 15 0 11 191 0\n",
+ " thai 10 11 3 7 70"
+ ]
+ },
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 133
+ },
+ "id": "YvODvsLkK0iG",
+ "outputId": "bb69da84-1266-47ad-b174-d43b88ca2988"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "When dealing with multiple classes, it's generally more intuitive to visualize this as a heat map, like this:"
+ ],
+ "metadata": {
+ "id": "c0HfPL16Lr6U"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "source": [
+ "update_geom_defaults(geom = \"tile\", new = list(color = \"black\", alpha = 0.7))\n",
+ "# Visualize confusion matrix\n",
+ "results %>% \n",
+ " conf_mat(cuisine, .pred_class) %>% \n",
+ " autoplot(type = \"heatmap\")"
+ ],
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "plot without title"
+ ],
+ "image/png": ""
+ },
+ "metadata": {
+ "image/png": {
+ "width": 420,
+ "height": 420
+ }
+ }
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 436
+ },
+ "id": "HsAtwukyLsvt",
+ "outputId": "3032a224-a2c8-4270-b4f2-7bb620317400"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The darker squares in the confusion matrix plot indicate high numbers of cases, and you can hopefully see a diagonal line of darker squares indicating cases where the predicted and actual label are the same.\n",
+ "\n",
+ "Let's now calculate summary statistics for the confusion matrix."
+ ],
+ "metadata": {
+ "id": "oOJC87dkLwPr"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "source": [
+ "# Summary stats for confusion matrix\n",
+ "conf_mat(data = results, truth = cuisine, estimate = .pred_class) %>% \n",
+ "summary()"
+ ],
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " .metric .estimator .estimate\n",
+ "1 accuracy multiclass 0.7880435\n",
+ "2 kap multiclass 0.7276583\n",
+ "3 sens macro 0.7780927\n",
+ "4 spec macro 0.9477598\n",
+ "5 ppv macro 0.7585583\n",
+ "6 npv macro 0.9460080\n",
+ "7 mcc multiclass 0.7292724\n",
+ "8 j_index macro 0.7258524\n",
+ "9 bal_accuracy macro 0.8629262\n",
+ "10 detection_prevalence macro 0.2000000\n",
+ "11 precision macro 0.7585583\n",
+ "12 recall macro 0.7780927\n",
+ "13 f_meas macro 0.7641862"
+ ],
+ "text/markdown": [
+ "\n",
+ "A tibble: 13 × 3\n",
+ "\n",
+ "| .metric <chr> | .estimator <chr> | .estimate <dbl> |\n",
+ "|---|---|---|\n",
+ "| accuracy | multiclass | 0.7880435 |\n",
+ "| kap | multiclass | 0.7276583 |\n",
+ "| sens | macro | 0.7780927 |\n",
+ "| spec | macro | 0.9477598 |\n",
+ "| ppv | macro | 0.7585583 |\n",
+ "| npv | macro | 0.9460080 |\n",
+ "| mcc | multiclass | 0.7292724 |\n",
+ "| j_index | macro | 0.7258524 |\n",
+ "| bal_accuracy | macro | 0.8629262 |\n",
+ "| detection_prevalence | macro | 0.2000000 |\n",
+ "| precision | macro | 0.7585583 |\n",
+ "| recall | macro | 0.7780927 |\n",
+ "| f_meas | macro | 0.7641862 |\n",
+ "\n"
+ ],
+ "text/latex": [
+ "A tibble: 13 × 3\n",
+ "\\begin{tabular}{lll}\n",
+ " .metric & .estimator & .estimate\\\\\n",
+ " \n",
+ "\tcuisine .pred_class \n",
+ "\n",
+ "\n",
+ "\t<fct> <fct> \n",
+ "\tindian thai \n",
+ "\tindian indian \n",
+ "\tindian indian \n",
+ "\tindian indian \n",
+ "\n",
+ "indian indian \n",
+ "
\n"
+ ]
+ },
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 494
+ },
+ "id": "OYqetUyzL5Wz",
+ "outputId": "6a84d65e-113d-4281-dfc1-16e8b70f37e6"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "If we narrow down to some metrics such as accuracy, sensitivity and ppv, we are not badly off for a start 🥳!\n",
+ "\n",
+ "## 4. Digging Deeper\n",
+ "\n",
+ "Let's ask one subtle question: What criterion is used to settle on a given type of cuisine as the predicted outcome?\n",
+ "\n",
+ "Well, statistical machine learning algorithms, like logistic regression, are based on `probability`; so what actually gets predicted by a classifier is a probability distribution over a set of possible outcomes. The class with the highest probability is then chosen as the most likely outcome for the given observation.\n",
+ "\n",
+ "Let's see this in action by making both hard class predictions and probabilities."
+ ],
+ "metadata": {
+ "id": "43t7vz8vMJtW"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "source": [
+ "# Make hard class prediction and probabilities\n",
+ "results_prob <- cuisines_test %>%\n",
+ " select(cuisine) %>% \n",
+ " bind_cols(mr_fit %>% predict(new_data = cuisines_test)) %>% \n",
+ " bind_cols(mr_fit %>% predict(new_data = cuisines_test, type = \"prob\"))\n",
+ "\n",
+ "# Print out results\n",
+ "results_prob %>% \n",
+ " slice_head(n = 5)"
+ ],
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " cuisine .pred_class .pred_chinese .pred_indian .pred_japanese .pred_korean\n",
+ "1 indian thai 1.551259e-03 0.4587877 5.988039e-04 2.428503e-04\n",
+ "2 indian indian 2.637133e-05 0.9999488 6.648651e-07 2.259993e-05\n",
+ "3 indian indian 1.049433e-03 0.9909982 1.060937e-03 1.644947e-05\n",
+ "4 indian indian 6.237482e-02 0.4763035 9.136702e-02 3.660913e-01\n",
+ "5 indian indian 1.431745e-02 0.9418551 2.945239e-02 8.721782e-03\n",
+ " .pred_thai \n",
+ "1 5.388194e-01\n",
+ "2 1.577948e-06\n",
+ "3 6.874989e-03\n",
+ "4 3.863391e-03\n",
+ "5 5.653283e-03"
+ ],
+ "text/markdown": [
+ "\n",
+ "A tibble: 5 × 7\n",
+ "\n",
+ "| cuisine <fct> | .pred_class <fct> | .pred_chinese <dbl> | .pred_indian <dbl> | .pred_japanese <dbl> | .pred_korean <dbl> | .pred_thai <dbl> |\n",
+ "|---|---|---|---|---|---|---|\n",
+ "| indian | thai | 1.551259e-03 | 0.4587877 | 5.988039e-04 | 2.428503e-04 | 5.388194e-01 |\n",
+ "| indian | indian | 2.637133e-05 | 0.9999488 | 6.648651e-07 | 2.259993e-05 | 1.577948e-06 |\n",
+ "| indian | indian | 1.049433e-03 | 0.9909982 | 1.060937e-03 | 1.644947e-05 | 6.874989e-03 |\n",
+ "| indian | indian | 6.237482e-02 | 0.4763035 | 9.136702e-02 | 3.660913e-01 | 3.863391e-03 |\n",
+ "| indian | indian | 1.431745e-02 | 0.9418551 | 2.945239e-02 | 8.721782e-03 | 5.653283e-03 |\n",
+ "\n"
+ ],
+ "text/latex": [
+ "A tibble: 5 × 7\n",
+ "\\begin{tabular}{lllllll}\n",
+ " cuisine & .pred\\_class & .pred\\_chinese & .pred\\_indian & .pred\\_japanese & .pred\\_korean & .pred\\_thai\\\\\n",
+ " \n",
+ "\t.metric .estimator .estimate \n",
+ "\n",
+ "\n",
+ "\t<chr> <chr> <dbl> \n",
+ "\taccuracy multiclass 0.7880435 \n",
+ "\tkap multiclass 0.7276583 \n",
+ "\tsens macro 0.7780927 \n",
+ "\tspec macro 0.9477598 \n",
+ "\tppv macro 0.7585583 \n",
+ "\tnpv macro 0.9460080 \n",
+ "\tmcc multiclass 0.7292724 \n",
+ "\tj_index macro 0.7258524 \n",
+ "\tbal_accuracy macro 0.8629262 \n",
+ "\tdetection_prevalence macro 0.2000000 \n",
+ "\tprecision macro 0.7585583 \n",
+ "\trecall macro 0.7780927 \n",
+ "\n",
+ "f_meas macro 0.7641862 \n",
+ "
\n"
+ ]
+ },
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 248
+ },
+ "id": "xdKNs-ZPMTJL",
+ "outputId": "68f6ac5a-725a-4eff-9ea6-481fef00e008"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Much better!\n",
+ "\n",
+ "✅ Can you explain why the model is pretty sure that the first observation is Thai?\n",
+ "\n",
+ "## **🚀Challenge**\n",
+ "\n",
+ "In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the [many options](https://www.tidymodels.org/find/parsnip/#models) Tidymodels provides to classify data and [other ways](https://parsnip.tidymodels.org/articles/articles/Examples.html#multinom_reg-models) to fit multinomial regression.\n",
+ "\n",
+ "#### THANK YOU TO:\n",
+ "\n",
+ "[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
+ "\n",
+ "[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
+ "\n",
+ " \n",
+ "\tcuisine .pred_class .pred_chinese .pred_indian .pred_japanese .pred_korean .pred_thai \n",
+ "\n",
+ "\n",
+ "\t<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> \n",
+ "\tindian thai 1.551259e-03 0.4587877 5.988039e-04 2.428503e-04 5.388194e-01 \n",
+ "\tindian indian 2.637133e-05 0.9999488 6.648651e-07 2.259993e-05 1.577948e-06 \n",
+ "\tindian indian 1.049433e-03 0.9909982 1.060937e-03 1.644947e-05 6.874989e-03 \n",
+ "\tindian indian 6.237482e-02 0.4763035 9.136702e-02 3.660913e-01 3.863391e-03 \n",
+ "\n",
+ "indian indian 1.431745e-02 0.9418551 2.945239e-02 8.721782e-03 5.653283e-03
\n",
+ "Would have thrown in some jokes but I donut understand food puns 😅.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "Happy Learning,\n",
+ "\n",
+ "[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n"
+ ],
+ "metadata": {
+ "id": "2tWVHMeLMYdM"
+ }
+ }
+ ]
+}
\ No newline at end of file
diff --git a/4-Classification/2-Classifiers-1/solution/R/lesson_11.Rmd b/4-Classification/2-Classifiers-1/solution/R/lesson_11.Rmd
new file mode 100644
index 000000000..a4221217b
--- /dev/null
+++ b/4-Classification/2-Classifiers-1/solution/R/lesson_11.Rmd
@@ -0,0 +1,349 @@
+---
+title: 'Build a classification model: Delicious Asian and Indian Cuisines'
+output:
+ html_document:
+ df_print: paged
+ theme: flatly
+ highlight: breezedark
+ toc: yes
+ toc_float: yes
+ code_download: yes
+---
+
+## Cuisine classifiers 1
+
+In this lesson, we'll explore a variety of classifiers to *predict a given national cuisine based on a group of ingredients.* While doing so, we'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
+
+### [**Pre-lecture quiz**](https://white-water-09ec41f0f.azurestaticapps.net/quiz/21/)
+
+### **Preparation**
+
+This lesson builds on our [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/1-Introduction/solution/lesson_10-R.ipynb) where we:
+
+- Gave a gentle introduction to classification using a dataset about all the brilliant cuisines of Asia and India.
+
+- Explored some [dplyr verbs](https://dplyr.tidyverse.org/) to prep and clean our data.
+
+- Made beautiful visualizations using ggplot2.
+
+- Demonstrated how to deal with imbalanced data by preprocessing it using [recipes](https://recipes.tidymodels.org/articles/Simple_Example.html).
+
+- Demonstrated how to `prep` and `bake` our recipe to confirm that it works as intended.
+
+#### **Prerequisite**
+
+For this lesson, we'll require the following packages to clean, prep and visualize our data:
+
+- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!
+
+- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.
+
+- `DataExplorer`: The [DataExplorer package](https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html) is meant to simplify and automate the EDA process and report generation.
+
+- `themis`: The [themis package](https://themis.tidymodels.org/) provides Extra Recipes Steps for Dealing with Unbalanced Data.
+
+- `nnet`: The [nnet package](https://cran.r-project.org/web/packages/nnet/nnet.pdf) provides functions for estimating feed-forward neural networks with a single hidden layer, and for multinomial logistic regression models.
+
+You can install them with:
+
+`install.packages(c("tidyverse", "tidymodels", "DataExplorer", "themis", "here"))`
+
+Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.
+
+```{r, message=F, warning=F}
+suppressWarnings(if (!require("pacman"))install.packages("pacman"))
+
+pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)
+```
+
+Now, let's hit the ground running!
+
+## 1. Split the data into training and test sets.
+
+We'll start by picking a few steps from our previous lesson.
+
+### Drop the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.
+
+Everyone loves rice, garlic and ginger!
+
+```{r recap_drop}
+# Load the original cuisines data
+df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")
+
+# Drop id column, rice, garlic and ginger from our original data set
+df_select <- df %>%
+ select(-c(1, rice, garlic, ginger)) %>%
+ # Encode cuisine column as categorical
+ mutate(cuisine = factor(cuisine))
+
+# Display new data set
+df_select %>%
+ slice_head(n = 5)
+
+# Display distribution of cuisines
+df_select %>%
+ count(cuisine) %>%
+ arrange(desc(n))
+```
+
+Perfect! Now, time to split the data such that 70% of the data goes to training and 30% goes to testing. We'll also apply a `stratification` technique when splitting the data to `maintain the proportion of each cuisine` in the training and test datasets.
+
+[rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
+
+```{r data_split}
+# Load the core Tidymodels packages into R session
+library(tidymodels)
+
+# Create split specification
+set.seed(2056)
+cuisines_split <- initial_split(data = df_select,
+ strata = cuisine,
+ prop = 0.7)
+
+# Extract the data in each split
+cuisines_train <- training(cuisines_split)
+cuisines_test <- testing(cuisines_split)
+
+# Print the number of cases in each split
+cat("Training cases: ", nrow(cuisines_train), "\n",
+ "Test cases: ", nrow(cuisines_test), sep = "")
+
+# Display the first few rows of the training set
+cuisines_train %>%
+ slice_head(n = 5)
+
+
+# Display distribution of cuisines in the training set
+cuisines_train %>%
+ count(cuisine) %>%
+ arrange(desc(n))
+
+
+```
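+
+As a quick sanity check on the stratified split, the per-cuisine counts for the full data set and for the training set can be compared directly. This is only a sketch: the counts are typed in by hand from the `count(cuisine)` outputs recorded in the notebook version of this lesson (they are not recomputed from `df_select`), and `full_counts`, `train_counts` and `train_share` are names made up for the illustration.
+
+```{r stratify_check}
+# Counts per cuisine, copied by hand from the count(cuisine) outputs
+# (full data set first, then the training set)
+full_counts  <- c(korean = 799, indian = 598, chinese = 442, japanese = 320, thai = 289)
+train_counts <- c(korean = 559, indian = 418, chinese = 309, japanese = 224, thai = 202)
+
+# Share of each cuisine that landed in the training set; with strata = cuisine
+# and prop = 0.7, every share should be close to 0.7
+train_share <- train_counts / full_counts
+round(train_share, 3)
+
+# Sizes of the two splits: training cases and the remaining test cases
+c(train = sum(train_counts), test = sum(full_counts) - sum(train_counts))
+```
+
+Every cuisine keeps roughly the same 70/30 ratio, which is what stratification is meant to guarantee.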
+
+## 2. Deal with imbalanced data
+
+As you might have noticed in the original data set as well as in our training set, the cuisines are quite unevenly distributed. Korean recipes are *almost* 3 times as numerous as Thai recipes. Imbalanced data often has negative effects on model performance. Many models perform best when the number of observations per class is roughly equal and, thus, tend to struggle with imbalanced data.
+
+There are two main ways of dealing with imbalanced data sets:
+
+- adding observations to the minority class: `Over-sampling`, e.g. using the SMOTE algorithm, which synthetically generates new examples of the minority class using the nearest neighbors of these cases.
+
+- removing observations from the majority class: `Under-sampling`
+
+In our previous lesson, we demonstrated how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set in order to get it ready for data analysis. In our case, we want to have an equal distribution in the number of our cuisines for our `training set`. Let's get right into it.
+
+```{r recap_balance}
+# Load themis package for dealing with imbalanced data
+library(themis)
+
+# Create a recipe for preprocessing training data
+cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%
+ step_smote(cuisine)
+
+# Print recipe
+cuisines_recipe
+
+```
+
+You can of course go ahead and confirm (using prep+bake) that the recipe works as you expect, with every cuisine label having `559` observations.
+
+Since we'll be using this recipe as a preprocessor for modeling, a `workflow()` will do all the prep and bake for us, so we won't have to manually estimate the recipe.
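+
+For readers who do want to check by hand, the prep+bake pattern can be sketched on a small toy data set. Everything named `toy*` below is hypothetical and only stands in for `cuisines_train` and `cuisines_recipe`; the sketch assumes the `tidymodels` and `themis` packages are installed and that `step_smote()` runs with its default settings.
+
+```{r smote_check}
+library(tidymodels)
+library(themis)
+
+# Hypothetical imbalanced toy data: 40 rows of class "a", 10 rows of class "b"
+set.seed(2056)
+toy <- tibble(
+  label = factor(rep(c("a", "b"), times = c(40, 10))),
+  x1 = rnorm(50),
+  x2 = rnorm(50)
+)
+
+# Same shape as cuisines_recipe: SMOTE applied to the outcome column
+toy_recipe <- recipe(label ~ ., data = toy) %>%
+  step_smote(label)
+
+# prep() estimates the recipe on its training data; bake(new_data = NULL)
+# returns that training data with every step, including SMOTE, applied
+toy_balanced <- toy_recipe %>%
+  prep() %>%
+  bake(new_data = NULL)
+
+# Count rows per class after over-sampling
+toy_balanced %>%
+  count(label)
+```
+
+With the default over-sampling ratio, both classes should end up with the same number of rows. Swapping in `cuisines_recipe` for `toy_recipe` gives the check described above.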
+
+Now we are ready to train a model 👩‍💻👨‍💻!
+
+## 3. Choosing your classifier
+
+{width="600"}
+
+Now we have to decide which algorithm to use for the job 🤔.
+
+In Tidymodels, the [`parsnip package`](https://parsnip.tidymodels.org/index.html) provides a consistent interface for working with models across different engines (packages). Please see the parsnip documentation to explore [model types & engines](https://www.tidymodels.org/find/parsnip/#models) and their corresponding [model arguments](https://www.tidymodels.org/find/parsnip/#model-args). The variety is quite bewildering at first sight. For instance, the following methods all include classification techniques:
+
+- C5.0 Rule-Based Classification Models
+
+- Flexible Discriminant Models
+
+- Linear Discriminant Models
+
+- Regularized Discriminant Models
+
+- Logistic Regression Models
+
+- Multinomial Regression Models
+
+- Naive Bayes Models
+
+- Support Vector Machines
+
+- Nearest Neighbors
+
+- Decision Trees
+
+- Ensemble methods
+
+- Neural Networks
+
+The list goes on!
+
+### **What classifier to go with?**
+
+So, which classifier should you choose? Often, trying several and comparing their results is a reasonable way to decide.
+
+> AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-15963-cxa)
+
+Also, the choice of classifier depends on our problem. For instance, when the outcome can be categorized into `more than two classes`, like in our case, you must use a `multiclass classification algorithm` as opposed to `binary classification`.
+
+### **A better approach**
+
+A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-15963-cxa). Here, we discover that, for our multiclass problem, we have some choices:
+
+{width="500"}
+
+### **Reasoning**
+
+Let's see if we can reason our way through different approaches given the constraints we have:
+
+- **Deep Neural networks are too heavy**. Given our clean, but minimal dataset, and the fact that we are running training locally via notebooks, deep neural networks are too heavyweight for this task.
+
+- **No two-class classifier**. We do not use a two-class classifier, so that rules out one-vs-all.
+
+- **Decision tree or logistic regression could work**. A decision tree might work, or multinomial regression/multiclass logistic regression for multiclass data.
+
+- **Multiclass Boosted Decision Trees solve a different problem**. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.
+
+Also, before embarking on more complex machine learning models, e.g. ensemble methods, it's normally a good idea to build the simplest possible model to get an idea of what is going on. So for this lesson, we'll start with a `multinomial logistic regression` model.
+
+> Logistic regression is a technique used when the outcome variable is categorical (or nominal). For binary logistic regression, the outcome variable has two categories, whereas for multinomial logistic regression it has more than two. See [Advanced Regression Methods](https://bookdown.org/chua/ber642_advanced_regression/multinomial-logistic-regression.html) for further reading.
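+
+For reference, one common way to write the multinomial logistic (softmax) model is shown below. The symbols are notation introduced here for illustration, not taken from the lesson: $K$ is the number of classes, $\mathbf{x}$ is the vector of predictors for one observation, and $\boldsymbol{\beta}_k$ is the coefficient vector for class $k$.
+
+$$
+P(y = k \mid \mathbf{x}) = \frac{\exp(\boldsymbol{\beta}_k^\top \mathbf{x})}{\sum_{j=1}^{K} \exp(\boldsymbol{\beta}_j^\top \mathbf{x})}, \qquad k = 1, \dots, K
+$$
+
+With $K = 2$ and one class taken as the reference (its coefficients fixed at zero), this reduces to the familiar binary logistic regression.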
+
+## 4. Train and evaluate a Multinomial logistic regression model.
+
+In Tidymodels, `parsnip::multinom_reg()` defines a model that uses linear predictors to predict multiclass data using the multinomial distribution. See `?multinom_reg()` for the different ways/engines you can use to fit this model.
+
+For this example, we'll fit a Multinomial regression model via the default [nnet](https://cran.r-project.org/web/packages/nnet/nnet.pdf) engine.
+
+> I picked a value for `penalty` somewhat arbitrarily. There are better ways to choose this value, namely `resampling` and `tuning` the model, which we'll discuss later.
+>
+> See [Tidymodels: Get Started](https://www.tidymodels.org/start/tuning/) in case you want to learn more on how to tune model hyperparameters.
+
+```{r multinorm_reg}
+# Create a multinomial regression model specification
+mr_spec <- multinom_reg(penalty = 1) %>%
+ set_engine("nnet", MaxNWts = 2086) %>%
+ set_mode("classification")
+
+# Print model specification
+mr_spec
+
+```
+
+Great job 🥳! Now that we have a recipe and a model specification, we need a way to bundle them into an object that first preprocesses the data, then fits the model on the preprocessed data, and also allows for potential post-processing. In Tidymodels, this object is called a [`workflow`](https://workflows.tidymodels.org/), and it conveniently holds your modeling components. This is what we'd call a *pipeline* in *Python*.
+
+So let's bundle everything up into a workflow! 📦
+
+```{r workflow}
+# Bundle recipe and model specification
+mr_wf <- workflow() %>%
+ add_recipe(cuisines_recipe) %>%
+ add_model(mr_spec)
+
+# Print out workflow
+mr_wf
+
+```
+
+Workflows! A **`workflow()`** can be fit in much the same way a model can. So, time to train a model!
+
+```{r train}
+# Train a multinomial regression model
+mr_fit <- fit(object = mr_wf, data = cuisines_train)
+
+mr_fit
+```
+
+The output shows the coefficients that the model learned during training.
+
+### Evaluate the Trained Model
+
+It's time to see how the model performed by evaluating it on a test set! Let's begin by making predictions on the test set.
+
+```{r test}
+# Make predictions on the test set
+results <- cuisines_test %>% select(cuisine) %>%
+ bind_cols(mr_fit %>% predict(new_data = cuisines_test))
+
+# Print out results
+results %>%
+ slice_head(n = 5)
+
+```
+
+Great job! In Tidymodels, evaluating model performance can be done using [yardstick](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics. As we did in our logistic regression lesson, let's begin by computing a confusion matrix.
+
+```{r conf_mat}
+# Confusion matrix for categorical data
+conf_mat(data = results, truth = cuisine, estimate = .pred_class)
+
+
+```
+
+When dealing with multiple classes, it's generally more intuitive to visualize this as a heat map, like this:
+
+```{r conf_viz}
+update_geom_defaults(geom = "tile", new = list(color = "black", alpha = 0.7))
+# Visualize confusion matrix
+results %>%
+ conf_mat(cuisine, .pred_class) %>%
+ autoplot(type = "heatmap")
+```
+
+The darker squares in the confusion matrix plot indicate high numbers of cases, and you can hopefully see a diagonal line of darker squares indicating cases where the predicted and actual labels are the same.
+
+Let's now calculate summary statistics for the confusion matrix.
+
+```{r conf_stats}
+# Summary stats for confusion matrix
+conf_mat(data = results, truth = cuisine, estimate = .pred_class) %>% summary()
+```
+
+If we narrow down to a few metrics such as accuracy, sensitivity, and ppv, we are not badly off for a start 🥳!
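+
+To see where these numbers come from, here is a minimal base R sketch that derives accuracy, per-class sensitivity, and per-class ppv from a small confusion matrix. The labels in `truth` and `predicted` are hypothetical toy values, not the lesson's data, and the variable names are introduced only for this illustration.
+
+```r
+# Hypothetical toy labels (NOT the cuisines data)
+lvls <- c("indian", "korean", "thai")
+truth <- factor(c("thai", "thai", "thai", "indian", "indian", "korean"), levels = lvls)
+predicted <- factor(c("thai", "thai", "indian", "indian", "indian", "korean"), levels = lvls)
+
+# Confusion matrix: rows are predictions, columns are the true classes
+cm <- table(predicted = predicted, truth = truth)
+
+# Accuracy: share of all cases on the diagonal (predicted == truth)
+accuracy <- sum(diag(cm)) / sum(cm)
+
+# Sensitivity per class: correct predictions / number of true cases of that class
+sensitivity <- diag(cm) / colSums(cm)
+
+# PPV per class: correct predictions / number of times that class was predicted
+ppv <- diag(cm) / rowSums(cm)
+
+accuracy
+sensitivity
+ppv
+```
+
+Averaging the per-class values, e.g. `mean(sensitivity)`, gives a single macro-averaged figure for the whole model.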
+
+## 5. Digging Deeper
+
+Let's ask one subtle question: what criterion is used to settle on a given type of cuisine as the predicted outcome?
+
+Well, statistical machine learning algorithms, like logistic regression, are based on `probability`; so what actually gets predicted by a classifier is a probability distribution over a set of possible outcomes. The class with the highest probability is then chosen as the most likely outcome for the given observation.
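+
+As a rough illustration of that idea, the base R sketch below turns a set of per-class scores into probabilities with a softmax and then picks the most probable class. The scores and class labels are made-up values for one hypothetical observation, and `softmax()` is a helper defined here, not a Tidymodels function.
+
+```r
+# Hypothetical linear-predictor scores for one observation (made-up values)
+scores <- c(chinese = 0.2, indian = 1.1, japanese = -0.5, korean = 0.0, thai = 3.0)
+
+# Softmax: exponentiate and normalize so the values sum to 1
+# (subtracting max(z) first keeps exp() numerically stable)
+softmax <- function(z) {
+  ez <- exp(z - max(z))
+  ez / sum(ez)
+}
+
+probs <- softmax(scores)
+
+# The hard class prediction is the class with the highest probability
+predicted_class <- names(probs)[which.max(probs)]
+
+round(probs, 3)
+predicted_class
+```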
+
+Let's see this in action by making both hard class predictions and probabilities.
+
+```{r pred_prob}
+# Make hard class prediction and probabilities
+results_prob <- cuisines_test %>%
+ select(cuisine) %>%
+ bind_cols(mr_fit %>% predict(new_data = cuisines_test)) %>%
+ bind_cols(mr_fit %>% predict(new_data = cuisines_test, type = "prob"))
+
+# Print out results
+results_prob %>%
+ slice_head(n = 5)
+
+
+```
+
+Much better!
+
+✅ Can you explain why the model is pretty sure that the first observation is Thai?
+
+## **🚀Challenge**
+
+In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the [many options](https://www.tidymodels.org/find/parsnip/#models) Tidymodels provides to classify data and [other ways](https://parsnip.tidymodels.org/articles/articles/Examples.html#multinom_reg-models) to fit multinomial regression.
+
+#### THANK YOU TO:
+
+[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://github.com/allisonhorst/stats-illustrations).
+
+[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️
+
+Happy Learning,
+
+[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.