You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
ML-For-Beginners/4-Classification/3-Classifiers-2/solution/R/lesson_12-R.ipynb

645 lines
26 KiB

{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "lesson_12-R.ipynb",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "jsFutf_ygqSx"
},
"source": [
"# Build a classification model: Delicious Asian and Indian Cuisines"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HD54bEefgtNO"
},
"source": [
"## Cuisine classifiers 2\n",
"\n",
"In this second classification lesson, we will explore `more ways` to classify categorical data. We will also learn about the ramifications for choosing one classifier over the other.\n",
"\n",
"### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/23/)\n",
"\n",
"### **Prerequisite**\n",
"\n",
"We assume that you have completed the previous lessons since we will be carrying forward some concepts we learned before.\n",
"\n",
"For this lesson, we'll require the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `themis`: The [themis package](https://themis.tidymodels.org/) provides Extra Recipes Steps for Dealing with Unbalanced Data.\n",
"\n",
"You can have them installed as:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"kernlab\", \"themis\", \"ranger\", \"xgboost\", \"kknn\"))`\n",
"\n",
"Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing."
]
},
{
"cell_type": "code",
"metadata": {
"id": "vZ57IuUxgyQt"
},
"source": [
"suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, themis, kernlab, ranger, xgboost, kknn)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "z22M-pj4g07x"
},
"source": [
"Now, let's hit the ground running!\n",
"\n",
"## **1. A classification map**\n",
"\n",
"In our [previous lesson](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification/2-Classifiers-1), we tried to address the question: how do we choose between multiple models? To a great extent, it depends on the characteristics of the data and the type of problem we want to solve (for instance classification or regression?)\n",
"\n",
"Previously, we learned about the various options you have when classifying data using Microsoft's cheat sheet. Python's Machine Learning framework, Scikit-learn, offers a similar but more granular cheat sheet that can further help narrow down your estimators (another term for classifiers):\n",
"\n",
"<p >\n",
" <img src=\"../../images/map.png\"\n",
" width=\"700\"/>\n",
" <figcaption></figcaption>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u1i3xRIVg7vG"
},
"source": [
"> Tip: [visit this map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click along the path to read documentation.\n",
">\n",
"> The [Tidymodels reference site](https://www.tidymodels.org/find/parsnip/#models) also provides an excellent documentation about different types of model.\n",
"\n",
"### **The plan** 🗺️\n",
"\n",
"This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:\n",
"\n",
"- We have \\>50 samples\n",
"\n",
"- We want to predict a category\n",
"\n",
"- We have labeled data\n",
"\n",
"- We have fewer than 100K samples\n",
"\n",
"- ✨ We can choose a Linear SVC\n",
"\n",
"- If that doesn't work, since we have numeric data\n",
"\n",
" - We can try a ✨ KNeighbors Classifier\n",
"\n",
" - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers\n",
"\n",
"This is a very helpful trail to follow. Now, let's get right into it using the [tidymodels](https://www.tidymodels.org/) modelling framework: a consistent and flexible collection of R packages developed to encourage good statistical practice 😊.\n",
"\n",
"## 2. Split the data and deal with imbalanced data set.\n",
"\n",
"From our previous lessons, we learnt that there were a set of common ingredients across our cuisines. Also, there was quite an unequal distribution in the number of cuisines.\n",
"\n",
"We'll deal with these by\n",
"\n",
"- Dropping the most common ingredients that create confusion between distinct cuisines, using `dplyr::select()`.\n",
"\n",
"- Use a `recipe` that preprocesses the data to get it ready for modelling by applying an `over-sampling` algorithm.\n",
"\n",
"We already looked at the above in the previous lesson so this should be a breeze 🥳!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "6tj_rN00hClA"
},
"source": [
"# Load the core Tidyverse and Tidymodels packages\n",
"library(tidyverse)\n",
"library(tidymodels)\n",
"\n",
"# Load the original cuisines data\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\n",
"\n",
"# Drop id column, rice, garlic and ginger from our original data set\n",
"df_select <- df %>% \n",
" select(-c(1, rice, garlic, ginger)) %>%\n",
" # Encode cuisine column as categorical\n",
" mutate(cuisine = factor(cuisine))\n",
"\n",
"\n",
"# Create data split specification\n",
"set.seed(2056)\n",
"cuisines_split <- initial_split(data = df_select,\n",
" strata = cuisine,\n",
" prop = 0.7)\n",
"\n",
"# Extract the data in each split\n",
"cuisines_train <- training(cuisines_split)\n",
"cuisines_test <- testing(cuisines_split)\n",
"\n",
"# Display distribution of cuisines in the training set\n",
"cuisines_train %>% \n",
" count(cuisine) %>% \n",
" arrange(desc(n))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "zFin5yw3hHb1"
},
"source": [
"### Deal with imbalanced data\n",
"\n",
"Imbalanced data often has negative effects on the model performance. Many models perform best when the number of observations is equal and, thus, tend to struggle with unbalanced data.\n",
"\n",
"There are majorly two ways of dealing with imbalanced data sets:\n",
"\n",
"- adding observations to the minority class: `Over-sampling` e.g using a SMOTE algorithm which synthetically generates new examples of the minority class using nearest neighbors of these cases.\n",
"\n",
"- removing observations from majority class: `Under-sampling`\n",
"\n",
"In our previous lesson, we demonstrated how to deal with imbalanced data sets using a `recipe`. A recipe can be thought of as a blueprint that describes what steps should be applied to a data set in order to get it ready for data analysis. In our case, we want to have an equal distribution in the number of our cuisines for our `training set`. Let's get right into it."
]
},
{
"cell_type": "code",
"metadata": {
"id": "cRzTnHolhLWd"
},
"source": [
"# Load themis package for dealing with imbalanced data\n",
"library(themis)\n",
"\n",
"# Create a recipe for preprocessing training data\n",
"cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%\n",
" step_smote(cuisine) \n",
"\n",
"# Print recipe\n",
"cuisines_recipe"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "KxOQ2ORhhO81"
},
"source": [
"Now we are ready to train models 👩‍💻👨‍💻!\n",
"\n",
"## 3. Beyond multinomial regression models\n",
"\n",
"In our previous lesson, we looked at multinomial regression models. Let's explore some more flexible models for classification.\n",
"\n",
"### Support Vector Machines.\n",
"\n",
"In the context of classification, `Support Vector Machines` is a machine learning technique that tries to find a *hyperplane* that \"best\" separates the classes. Let's look at a simple example:\n",
"\n",
"<p >\n",
" <img src=\"../../images/svm.png\"\n",
" width=\"300\"/>\n",
" <figcaption>https://commons.wikimedia.org/w/index.php?curid=22877598</figcaption>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C4Wsd0vZhXYu"
},
"source": [
"H1~ does not separate the classes. H2~ does, but only with a small margin. H3~ separates them with the maximal margin.\n",
"\n",
"#### Linear Support Vector Classifier\n",
"\n",
"Support-Vector clustering (SVC) is a child of the Support-Vector machines family of ML techniques. In SVC, the hyperplane is chosen to correctly separate `most` of the training observations, but `may misclassify` a few observations. By allowing some points to be on the wrong side, the SVM becomes more robust to outliers hence better generalization to new data. The parameter that regulates this violation is referred to as `cost` which has a default value of 1 (see `help(\"svm_poly\")`).\n",
"\n",
"Let's create a linear SVC by setting `degree = 1` in a polynomial SVM model."
]
},
{
"cell_type": "code",
"metadata": {
"id": "vJpp6nuChlBz"
},
"source": [
"# Make a linear SVC specification\n",
"svc_linear_spec <- svm_poly(degree = 1) %>% \n",
" set_engine(\"kernlab\") %>% \n",
" set_mode(\"classification\")\n",
"\n",
"# Bundle specification and recipe into a worklow\n",
"svc_linear_wf <- workflow() %>% \n",
" add_recipe(cuisines_recipe) %>% \n",
" add_model(svc_linear_spec)\n",
"\n",
"# Print out workflow\n",
"svc_linear_wf"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "rDs8cWNkhoqu"
},
"source": [
"Now that we have captured the preprocessing steps and model specification into a *workflow*, we can go ahead and train the linear SVC and evaluate results while at it. For performance metrics, let's create a metric set that will evaluate: `accuracy`, `sensitivity`, `Positive Predicted Value` and `F Measure`\n",
"\n",
"> `augment()` will add column(s) for predictions to the given data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "81wiqcwuhrnq"
},
"source": [
"# Train a linear SVC model\n",
"svc_linear_fit <- svc_linear_wf %>% \n",
" fit(data = cuisines_train)\n",
"\n",
"# Create a metric set\n",
"eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)\n",
"\n",
"\n",
"# Make predictions and Evaluate model performance\n",
"svc_linear_fit %>% \n",
" augment(new_data = cuisines_test) %>% \n",
" eval_metrics(truth = cuisine, estimate = .pred_class)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "0UFQvHf-huo3"
},
"source": [
"#### Support Vector Machine\n",
"\n",
"The support vector machine (SVM) is an extension of the support vector classifier in order to accommodate a non-linear boundary between the classes. In essence, SVMs use the *kernel trick* to enlarge the feature space to adapt to nonlinear relationships between classes. One popular and extremely flexible kernel function used by SVMs is the *Radial basis function.* Let's see how it will perform on our data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "-KX4S8mzhzmp"
},
"source": [
"set.seed(2056)\n",
"\n",
"# Make an RBF SVM specification\n",
"svm_rbf_spec <- svm_rbf() %>% \n",
" set_engine(\"kernlab\") %>% \n",
" set_mode(\"classification\")\n",
"\n",
"# Bundle specification and recipe into a worklow\n",
"svm_rbf_wf <- workflow() %>% \n",
" add_recipe(cuisines_recipe) %>% \n",
" add_model(svm_rbf_spec)\n",
"\n",
"\n",
"# Train an RBF model\n",
"svm_rbf_fit <- svm_rbf_wf %>% \n",
" fit(data = cuisines_train)\n",
"\n",
"\n",
"# Make predictions and Evaluate model performance\n",
"svm_rbf_fit %>% \n",
" augment(new_data = cuisines_test) %>% \n",
" eval_metrics(truth = cuisine, estimate = .pred_class)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "QBFSa7WSh4HQ"
},
"source": [
"Much better 🤩!\n",
"\n",
"> ✅ Please see:\n",
">\n",
"> - [*Support Vector Machines*](https://bradleyboehmke.github.io/HOML/svm.html), Hands-on Machine Learning with R\n",
">\n",
"> - [*Support Vector Machines*](https://www.statlearning.com/), An Introduction to Statistical Learning with Applications in R\n",
">\n",
"> for further reading.\n",
"\n",
"### Nearest Neighbor classifiers\n",
"\n",
"*K*-nearest neighbor (KNN) is an algorithm in which each observation is predicted based on its *similarity* to other observations.\n",
"\n",
"Let's fit one to our data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "k4BxxBcdh9Ka"
},
"source": [
"# Make a KNN specification\n",
"knn_spec <- nearest_neighbor() %>% \n",
" set_engine(\"kknn\") %>% \n",
" set_mode(\"classification\")\n",
"\n",
"# Bundle recipe and model specification into a workflow\n",
"knn_wf <- workflow() %>% \n",
" add_recipe(cuisines_recipe) %>% \n",
" add_model(knn_spec)\n",
"\n",
"# Train a boosted tree model\n",
"knn_wf_fit <- knn_wf %>% \n",
" fit(data = cuisines_train)\n",
"\n",
"\n",
"# Make predictions and Evaluate model performance\n",
"knn_wf_fit %>% \n",
" augment(new_data = cuisines_test) %>% \n",
" eval_metrics(truth = cuisine, estimate = .pred_class)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "HaegQseriAcj"
},
"source": [
"It appears that this model is not performing that well. Probably changing the model's arguments (see `help(\"nearest_neighbor\")` will improve model performance. Be sure to try it out.\n",
"\n",
"> ✅ Please see:\n",
">\n",
"> - [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)\n",
">\n",
"> - [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)\n",
">\n",
"> to learn more about *K*-Nearest Neighbors classifiers.\n",
"\n",
"### Ensemble classifiers\n",
"\n",
"Ensemble algorithms work by combining multiple base estimators to produce an optimal model either by:\n",
"\n",
"`bagging`: applying an *averaging function* to a collection of base models\n",
"\n",
"`boosting`: building a sequence of models that build on one another to improve predictive performance.\n",
"\n",
"Let's start by trying out a Random Forest model, which builds a large collection of decision trees then applies an averaging function to for a better overall model."
]
},
{
"cell_type": "code",
"metadata": {
"id": "49DPoVs6iK1M"
},
"source": [
"# Make a random forest specification\n",
"rf_spec <- rand_forest() %>% \n",
" set_engine(\"ranger\") %>% \n",
" set_mode(\"classification\")\n",
"\n",
"# Bundle recipe and model specification into a workflow\n",
"rf_wf <- workflow() %>% \n",
" add_recipe(cuisines_recipe) %>% \n",
" add_model(rf_spec)\n",
"\n",
"# Train a random forest model\n",
"rf_wf_fit <- rf_wf %>% \n",
" fit(data = cuisines_train)\n",
"\n",
"\n",
"# Make predictions and Evaluate model performance\n",
"rf_wf_fit %>% \n",
" augment(new_data = cuisines_test) %>% \n",
" eval_metrics(truth = cuisine, estimate = .pred_class)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "RGVYwC_aiUWc"
},
"source": [
"Good job 👏!\n",
"\n",
"Let's also experiment with a Boosted Tree model.\n",
"\n",
"Boosted Tree defines an ensemble method that creates a series of sequential decision trees where each tree depends on the results of previous trees in an attempt to incrementally reduce the error. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.\n",
"\n",
"There are different ways to fit this model (see `help(\"boost_tree\")`). In this example, we'll fit Boosted trees via `xgboost` engine."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Py1YWo-micWs"
},
"source": [
"# Make a boosted tree specification\n",
"boost_spec <- boost_tree(trees = 200) %>% \n",
" set_engine(\"xgboost\") %>% \n",
" set_mode(\"classification\")\n",
"\n",
"# Bundle recipe and model specification into a workflow\n",
"boost_wf <- workflow() %>% \n",
" add_recipe(cuisines_recipe) %>% \n",
" add_model(boost_spec)\n",
"\n",
"# Train a boosted tree model\n",
"boost_wf_fit <- boost_wf %>% \n",
" fit(data = cuisines_train)\n",
"\n",
"\n",
"# Make predictions and Evaluate model performance\n",
"boost_wf_fit %>% \n",
" augment(new_data = cuisines_test) %>% \n",
" eval_metrics(truth = cuisine, estimate = .pred_class)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "zNQnbuejigZM"
},
"source": [
"\n",
"> ✅ Please see:\n",
">\n",
"> - [Machine Learning for Social Scientists](https://cimentadaj.github.io/ml_socsci/tree-based-methods.html#random-forests)\n",
">\n",
"> - [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)\n",
">\n",
"> - [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)\n",
">\n",
"> - <https://algotech.netlify.app/blog/xgboost/> - Explores the AdaBoost model which is a good alternative to xgboost.\n",
">\n",
"> to learn more about Ensemble classifiers.\n",
"\n",
"## 4. Extra - comparing multiple models\n",
"\n",
"We have fitted quite a number of models in this lab 🙌. It can become tedious or onerous to create a lot of workflows from different sets of preprocessors and/or model specifications and then calculate the performance metrics one by one.\n",
"\n",
"Let's see if we can address this by creating a function that fits a list of workflows on the training set then returns the performance metrics based on the test set. We'll get to use `map()` and `map_dfr()` from the [purrr](https://purrr.tidyverse.org/) package to apply functions to each element in list.\n",
"\n",
"> [`map()`](https://purrr.tidyverse.org/reference/map.html) functions allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the [`map()`](https://purrr.tidyverse.org/reference/map.html) functions is the [iteration chapter](http://r4ds.had.co.nz/iteration.html) in R for data science."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Qzb7LyZnimd2"
},
"source": [
"set.seed(2056)\n",
"\n",
"# Create a metric set\n",
"eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)\n",
"\n",
"# Define a function that returns performance metrics\n",
"compare_models <- function(workflow_list, train_set, test_set){\n",
" \n",
" suppressWarnings(\n",
" # Fit each model to the train_set\n",
" map(workflow_list, fit, data = train_set) %>% \n",
" # Make predictions on the test set\n",
" map_dfr(augment, new_data = test_set, .id = \"model\") %>%\n",
" # Select desired columns\n",
" select(model, cuisine, .pred_class) %>% \n",
" # Evaluate model performance\n",
" group_by(model) %>% \n",
" eval_metrics(truth = cuisine, estimate = .pred_class) %>% \n",
" ungroup()\n",
" )\n",
" \n",
"} # End of function"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Fwa712sNisDA"
},
"source": [
"Let's call our function and compare the accuracy across the models."
]
},
{
"cell_type": "code",
"metadata": {
"id": "3i4VJOi2iu-a"
},
"source": [
"# Make a list of workflows\n",
"workflow_list <- list(\n",
" \"svc\" = svc_linear_wf,\n",
" \"svm\" = svm_rbf_wf,\n",
" \"knn\" = knn_wf,\n",
" \"random_forest\" = rf_wf,\n",
" \"xgboost\" = boost_wf)\n",
"\n",
"# Call the function\n",
"set.seed(2056)\n",
"perf_metrics <- compare_models(workflow_list = workflow_list, train_set = cuisines_train, test_set = cuisines_test)\n",
"\n",
"# Print out performance metrics\n",
"perf_metrics %>% \n",
" group_by(.metric) %>% \n",
" arrange(desc(.estimate)) %>% \n",
" slice_head(n=7)\n",
"\n",
"# Compare accuracy\n",
"perf_metrics %>% \n",
" filter(.metric == \"accuracy\") %>% \n",
" arrange(desc(.estimate))\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "KuWK_lEli4nW"
},
"source": [
"\n",
"[**workflowset**](https://workflowsets.tidymodels.org/) package allow users to create and easily fit a large number of models but is mostly designed to work with resampling techniques such as `cross-validation`, an approach we are yet to cover.\n",
"\n",
"## **🚀Challenge**\n",
"\n",
"Each of these techniques has a large number of parameters that you can tweak for instance `cost` in SVMs, `neighbors` in KNN, `mtry` (Randomly Selected Predictors) in Random Forest.\n",
"\n",
"Research each one's default parameters and think about what tweaking these parameters would mean for the model's quality.\n",
"\n",
"To find out more about a particular model and its parameters, use: `help(\"model\")` e.g `help(\"rand_forest\")`\n",
"\n",
"> In practice, we usually *estimate* the *best values* for these by training many models on a `simulated data set` and measuring how well all these models perform. This process is called **tuning**.\n",
"\n",
"### [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/24/)\n",
"\n",
"### **Review & Self Study**\n",
"\n",
"There's a lot of jargon in these lessons, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful terminology!\n",
"\n",
"#### THANK YOU TO:\n",
"\n",
"[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
"\n",
"[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
"\n",
"Happy Learning,\n",
"\n",
"[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n",
"\n",
"<p >\n",
" <img src=\"../../images/r_learners_sm.jpeg\"\n",
" width=\"569\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
]
}
]
}