{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "lesson_12-R.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "ir",
      "display_name": "R"
    },
    "language_info": {
      "name": "R"
    },
    "coopTranslator": {
      "original_hash": "fab50046ca413a38939d579f8432274f",
      "translation_date": "2025-11-18T19:25:36+00:00",
      "source_file": "4-Classification/3-Classifiers-2/solution/R/lesson_12-R.ipynb",
      "language_code": "pcm"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jsFutf_ygqSx"
      },
      "source": [
        "# Build classification model: Sweet Asian and Indian Food\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HD54bEefgtNO"
      },
      "source": [
        "## Cuisine classifiers 2\n",
        "\n",
        "For dis second lesson wey dey about classification, we go look `more ways` wey we fit take classify categorical data. We go still learn wetin fit happen if we choose one classifier instead of another one.\n",
        "\n",
        "### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/23/)\n",
        "\n",
        "### **Prerequisite**\n",
        "\n",
        "We dey assume say you don finish di previous lessons because we go dey use some concepts wey we don learn before.\n",
        "\n",
        "For dis lesson, we go need di following packages:\n",
        "\n",
        "- `tidyverse`: Di [tidyverse](https://www.tidyverse.org/) na [collection of R packages](https://www.tidyverse.org/packages) wey dem design to make data science fast, easy and fun!\n",
        "\n",
        "- `tidymodels`: Di [tidymodels](https://www.tidymodels.org/) framework na [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
        "\n",
        "- `themis`: Di [themis package](https://themis.tidymodels.org/) dey provide Extra Recipes Steps to handle unbalanced data.\n",
        "\n",
        "You fit install dem like dis:\n",
        "\n",
        "`install.packages(c(\"tidyverse\", \"tidymodels\", \"kernlab\", \"themis\", \"ranger\", \"xgboost\", \"kknn\"))`\n",
        "\n",
        "Or, di script wey dey below go check whether you get di packages wey you need to complete dis module, and e go install dem for you if dem no dey.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "vZ57IuUxgyQt"
      },
      "source": [
        "suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n",
        "\n",
        "pacman::p_load(tidyverse, tidymodels, themis, kernlab, ranger, xgboost, kknn)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z22M-pj4g07x"
      },
      "source": [
        "Make we start dey waka sharp sharp!\n",
        "\n",
        "## **1. Map wey dey show classification**\n",
        "\n",
        "For our [last lesson](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification/2-Classifiers-1), we bin try answer dis question: how we go fit choose between plenty models? E dey depend well well on di kind data wey we get and di type problem wey we wan solve (like classification or regression?)\n",
        "\n",
        "Before, we don learn about di different options wey dey available to classify data using Microsoft's cheat sheet. Python Machine Learning framework, Scikit-learn, get one cheat sheet wey dey more detailed wey fit help you narrow down your estimators (another name for classifiers):\n",
        "\n",
        "<p>\n",
        "  <img src=\"../../../../../../translated_images/map.e963a6a51349425a.pcm.png\"\n",
        "  width=\"700\"/>\n",
        "  <figcaption></figcaption>\n",
        "</p>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "u1i3xRIVg7vG"
      },
      "source": [
        "> Tip: [go check dis map online](https://scikit-learn.org/stable/tutorial/machine_learning_map/) and click for di road to read di documentation.\n",
        ">\n",
        "> Di [Tidymodels reference site](https://www.tidymodels.org/find/parsnip/#models) sef get beta documentation about di different kain model.\n",
        "\n",
        "### **Di plan** 🗺️\n",
        "\n",
        "Dis map go help well well once you sabi your data well, as you fit 'waka' follow di road to make decision:\n",
        "\n",
        "- We get \\>50 samples\n",
        "\n",
        "- We wan predict one category\n",
        "\n",
        "- We get labeled data\n",
        "\n",
        "- We get less than 100K samples\n",
        "\n",
        "- ✨ We fit choose Linear SVC\n",
        "\n",
        "- If e no work, since we get numeric data\n",
        "\n",
        "    - We fit try ✨ KNeighbors Classifier\n",
        "\n",
        "        - If e no work, try ✨ SVC and ✨ Ensemble Classifiers\n",
        "\n",
        "Dis na very helpful road to follow. Now, make we jump enter am using di [tidymodels](https://www.tidymodels.org/) modelling framework: one consistent and flexible collection of R packages wey dem develop to encourage beta statistical practice 😊.\n",
        "\n",
        "## 2. Share di data and handle imbalanced data set\n",
        "\n",
        "From di lesson wey we do before, we learn say some common ingredients dey across di cuisines. Plus, di number of cuisines no dey balance well.\n",
        "\n",
        "We go handle dis one by:\n",
        "\n",
        "- Comot di most common ingredients wey dey cause wahala between di different cuisines, using `dplyr::select()`.\n",
        "\n",
        "- Use one `recipe` wey go preprocess di data to make am ready for modelling by applying one `over-sampling` algorithm.\n",
        "\n",
        "We don already look dis one for di lesson wey we do before so e go easy well well 🥳!\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6tj_rN00hClA"
      },
      "source": [
        "# Load the core Tidyverse and Tidymodels packages\n",
        "library(tidyverse)\n",
        "library(tidymodels)\n",
        "\n",
        "# Load the original cuisines data\n",
        "df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv\")\n",
        "\n",
        "# Drop id column, rice, garlic and ginger from our original data set\n",
        "df_select <- df %>% \n",
        "  select(-c(1, rice, garlic, ginger)) %>%\n",
        "  # Encode cuisine column as categorical\n",
        "  mutate(cuisine = factor(cuisine))\n",
        "\n",
        "\n",
        "# Create data split specification\n",
        "set.seed(2056)\n",
        "cuisines_split <- initial_split(data = df_select,\n",
        "                                strata = cuisine,\n",
        "                                prop = 0.7)\n",
        "\n",
        "# Extract the data in each split\n",
        "cuisines_train <- training(cuisines_split)\n",
        "cuisines_test <- testing(cuisines_split)\n",
        "\n",
        "# Display distribution of cuisines in the training set\n",
        "cuisines_train %>% \n",
        "  count(cuisine) %>% \n",
        "  arrange(desc(n))"
      ],
      "execution_count": null,
      "outputs": []
    },
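    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Make we quickly check say di stratified split work as we expect. Dis na one small optional sketch (e no dey required for di rest of di lesson): we go compare di proportion of each cuisine for di train and test sets — di two columns suppose look alike.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Optional sketch: compare cuisine proportions in the train and test sets\n",
        "# to confirm that the stratified split worked (for checking only)\n",
        "bind_rows(\n",
        "  cuisines_train %>% count(cuisine) %>% mutate(split = \"train\", prop = n / sum(n)),\n",
        "  cuisines_test %>% count(cuisine) %>% mutate(split = \"test\", prop = n / sum(n))\n",
        ") %>% \n",
        "  select(cuisine, split, prop) %>% \n",
        "  pivot_wider(names_from = split, values_from = prop)\n"
      ],
      "execution_count": null,
      "outputs": []
    },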
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zFin5yw3hHb1"
      },
      "source": [
        "### How to handle imbalanced data\n",
        "\n",
        "Imbalanced data fit spoil model performance. Many models dey work well when di number of observations dey equal, so dem dey struggle when data no balance.\n",
        "\n",
        "Two main ways dey to handle imbalanced data sets:\n",
        "\n",
        "- Add more observations to di minority class: `Over-sampling` e.g use SMOTE algorithm wey dey create new examples for di minority class by using nearest neighbors of di cases.\n",
        "\n",
        "- Remove some observations from di majority class: `Under-sampling`\n",
        "\n",
        "For di previous lesson, we show how to handle imbalanced data sets using `recipe`. Recipe na like blueprint wey dey describe di steps wey you go follow for data set to prepare am for data analysis. For our case, we wan make di number of cuisines for our `training set` dey equal. Make we start.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cRzTnHolhLWd"
      },
      "source": [
        "# Load themis package for dealing with imbalanced data\n",
        "library(themis)\n",
        "\n",
        "# Create a recipe for preprocessing training data\n",
        "cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%\n",
        "  step_smote(cuisine)\n",
        "\n",
        "# Print recipe\n",
        "cuisines_recipe"
      ],
      "execution_count": null,
      "outputs": []
    },
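    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Make we confirm say di recipe dey do wetin we want. Dis na one small optional sketch: we go `prep()` di recipe come `bake()` am to peep di processed training data — after SMOTE, di cuisine counts suppose dey equal. (Di workflow go handle prep/bake for us automatically later, so dis one na just for checking.)\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Optional sketch: prep the recipe and inspect the processed training data.\n",
        "# After step_smote, every cuisine should have the same number of rows.\n",
        "cuisines_recipe %>% \n",
        "  prep() %>% \n",
        "  bake(new_data = NULL) %>% \n",
        "  count(cuisine)\n"
      ],
      "execution_count": null,
      "outputs": []
    },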
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KxOQ2ORhhO81"
      },
      "source": [
        "Now we don ready to train models 👩‍💻👨‍💻!\n",
        "\n",
        "## 3. Beyond multinomial regression models\n",
        "\n",
        "For di last lesson, we bin look multinomial regression models. Make we check some models wey dey more flexible for classification.\n",
        "\n",
        "### Support Vector Machines\n",
        "\n",
        "For classification matter, `Support Vector Machines` na one machine learning method wey dey try find one *hyperplane* wey go \"best\" separate di classes. Make we see one simple example:\n",
        "\n",
        "<p>\n",
        "  <img src=\"../../../../../../translated_images/svm.621ae7b516d678e0.pcm.png\"\n",
        "  width=\"300\"/>\n",
        "  <figcaption>https://commons.wikimedia.org/w/index.php?curid=22877598</figcaption>\n",
        "</p>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "C4Wsd0vZhXYu"
      },
      "source": [
        "H1 no dey separate di classes. H2 dey separate dem, but e get small space. H3 dey separate dem wit di biggest space.\n",
        "\n",
        "#### Linear Support Vector Classifier\n",
        "\n",
        "Support-Vector clustering (SVC) na one pikin for di Support-Vector machines family for ML techniques. For SVC, di hyperplane wey dem go choose go fit separate `most` of di training observations well, but e fit `misclassify` some observations. If dem allow some points dey for di wrong side, di SVM go strong pass for outliers and e go fit generalize well well for new data. Di parameter wey dey control dis violation na wetin dem dey call `cost` and e get default value of 1 (check `help(\"svm_poly\")`).\n",
        "\n",
        "Make we create linear SVC by setting `degree = 1` for polynomial SVM model.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "vJpp6nuChlBz"
      },
      "source": [
        "# Make a linear SVC specification\n",
        "svc_linear_spec <- svm_poly(degree = 1) %>% \n",
        "  set_engine(\"kernlab\") %>% \n",
        "  set_mode(\"classification\")\n",
        "\n",
        "# Bundle specification and recipe into a workflow\n",
        "svc_linear_wf <- workflow() %>% \n",
        "  add_recipe(cuisines_recipe) %>% \n",
        "  add_model(svc_linear_spec)\n",
        "\n",
        "# Print out workflow\n",
        "svc_linear_wf"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rDs8cWNkhoqu"
      },
      "source": [
        "Now wey we don capture di preprocessing steps and model specification inside one *workflow*, we fit go ahead train di linear SVC and check di results as we dey do am. For performance metrics, make we create one metric set wey go check: `accuracy`, `sensitivity`, `Positive Predicted Value` and `F Measure`.\n",
        "\n",
        "> `augment()` go add column(s) for predictions to di data wey dem give.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "81wiqcwuhrnq"
      },
      "source": [
        "# Train a linear SVC model\n",
        "svc_linear_fit <- svc_linear_wf %>% \n",
        "  fit(data = cuisines_train)\n",
        "\n",
        "# Create a metric set\n",
        "eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)\n",
        "\n",
        "\n",
        "# Make predictions and evaluate model performance\n",
        "svc_linear_fit %>% \n",
        "  augment(new_data = cuisines_test) %>% \n",
        "  eval_metrics(truth = cuisine, estimate = .pred_class)"
      ],
      "execution_count": null,
      "outputs": []
    },
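    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we talk before, di `cost` parameter dey control how much di classifier fit allow misclassification. Dis na one hedged sketch of how to change am — di value `cost = 2` na just assumption for illustration, no be tuned value:\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Hedged sketch: refit the linear SVC with a different cost value.\n",
        "# cost = 2 is an arbitrary illustrative choice, not a tuned value.\n",
        "svc_linear_cost_fit <- svc_linear_wf %>% \n",
        "  update_model(svm_poly(degree = 1, cost = 2) %>% \n",
        "                 set_engine(\"kernlab\") %>% \n",
        "                 set_mode(\"classification\")) %>% \n",
        "  fit(data = cuisines_train)\n",
        "\n",
        "# Evaluate with the same metric set as before\n",
        "svc_linear_cost_fit %>% \n",
        "  augment(new_data = cuisines_test) %>% \n",
        "  eval_metrics(truth = cuisine, estimate = .pred_class)\n"
      ],
      "execution_count": null,
      "outputs": []
    },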
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0UFQvHf-huo3"
      },
      "source": [
        "#### Support Vector Machine\n",
        "\n",
        "Support vector machine (SVM) na extension of support vector classifier wey fit handle non-linear boundary between di classes. Di main idea be say SVMs dey use *kernel trick* to expand di feature space so e go fit work well with nonlinear relationships between di classes. One popular and very flexible kernel function wey SVMs dey use na *Radial basis function*. Make we see how e go perform for our data.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-KX4S8mzhzmp"
      },
      "source": [
        "set.seed(2056)\n",
        "\n",
        "# Make an RBF SVM specification\n",
        "svm_rbf_spec <- svm_rbf() %>% \n",
        "  set_engine(\"kernlab\") %>% \n",
        "  set_mode(\"classification\")\n",
        "\n",
        "# Bundle specification and recipe into a workflow\n",
        "svm_rbf_wf <- workflow() %>% \n",
        "  add_recipe(cuisines_recipe) %>% \n",
        "  add_model(svm_rbf_spec)\n",
        "\n",
        "\n",
        "# Train an RBF model\n",
        "svm_rbf_fit <- svm_rbf_wf %>% \n",
        "  fit(data = cuisines_train)\n",
        "\n",
        "\n",
        "# Make predictions and evaluate model performance\n",
        "svm_rbf_fit %>% \n",
        "  augment(new_data = cuisines_test) %>% \n",
        "  eval_metrics(truth = cuisine, estimate = .pred_class)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QBFSa7WSh4HQ"
      },
      "source": [
        "Much beta 🤩!\n",
        "\n",
        "> ✅ Abeg check:\n",
        ">\n",
        "> - [*Support Vector Machines*](https://bradleyboehmke.github.io/HOML/svm.html), Hands-on Machine Learning with R\n",
        ">\n",
        "> - [*Support Vector Machines*](https://www.statlearning.com/), An Introduction to Statistical Learning with Applications in R\n",
        ">\n",
        "> for more tori.\n",
        "\n",
        "### Nearest Neighbor classifiers\n",
        "\n",
        "*K*-nearest neighbor (KNN) na one algorithm wey dey use similarity between observations to predict each one.\n",
        "\n",
        "Make we fit am to our data.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "k4BxxBcdh9Ka"
      },
      "source": [
        "# Make a KNN specification\n",
        "knn_spec <- nearest_neighbor() %>% \n",
        "  set_engine(\"kknn\") %>% \n",
        "  set_mode(\"classification\")\n",
        "\n",
        "# Bundle recipe and model specification into a workflow\n",
        "knn_wf <- workflow() %>% \n",
        "  add_recipe(cuisines_recipe) %>% \n",
        "  add_model(knn_spec)\n",
        "\n",
        "# Train a KNN model\n",
        "knn_wf_fit <- knn_wf %>% \n",
        "  fit(data = cuisines_train)\n",
        "\n",
        "\n",
        "# Make predictions and evaluate model performance\n",
        "knn_wf_fit %>% \n",
        "  augment(new_data = cuisines_test) %>% \n",
        "  eval_metrics(truth = cuisine, estimate = .pred_class)"
      ],
      "execution_count": null,
      "outputs": []
    },
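    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Make we also try one version wey we adjust di model arguments. Dis na hedged sketch — `neighbors = 10` and `weight_func = \"triangular\"` na just assumptions for illustration, dem no be tuned values:\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Hedged sketch: refit KNN with manually adjusted arguments.\n",
        "# neighbors = 10 and weight_func = \"triangular\" are illustrative guesses.\n",
        "knn_adjusted_fit <- knn_wf %>% \n",
        "  update_model(nearest_neighbor(neighbors = 10, weight_func = \"triangular\") %>% \n",
        "                 set_engine(\"kknn\") %>% \n",
        "                 set_mode(\"classification\")) %>% \n",
        "  fit(data = cuisines_train)\n",
        "\n",
        "# Evaluate the adjusted model with the same metric set\n",
        "knn_adjusted_fit %>% \n",
        "  augment(new_data = cuisines_test) %>% \n",
        "  eval_metrics(truth = cuisine, estimate = .pred_class)\n"
      ],
      "execution_count": null,
      "outputs": []
    },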
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HaegQseriAcj"
      },
      "source": [
        "E be like say di default model no dey perform too well. Di hedged sketch wey dey above show one way to change di model argument dem — check `help(\"nearest_neighbor\")` to see all di options, and make sure say you try different values yourself to see whether di model go perform better.\n",
        "\n",
        "> ✅ Abeg check:\n",
        ">\n",
        "> - [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)\n",
        ">\n",
        "> - [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)\n",
        ">\n",
        "> to sabi more about *K*-Nearest Neighbors classifiers.\n",
        "\n",
        "### Ensemble classifiers\n",
        "\n",
        "Ensemble algorithm dem dey work by join plenty base estimator dem together to make one better model either by:\n",
        "\n",
        "- `bagging`: wey go use *averaging function* for di collection of base model dem\n",
        "\n",
        "- `boosting`: wey go build model dem one after di other to take improve di predictive performance.\n",
        "\n",
        "Make we start by try Random Forest model, wey dey build plenty decision tree dem, come use averaging function join dem together to make di overall model better.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "49DPoVs6iK1M"
      },
      "source": [
        "# Make a random forest specification\n",
        "rf_spec <- rand_forest() %>% \n",
        "  set_engine(\"ranger\") %>% \n",
        "  set_mode(\"classification\")\n",
        "\n",
        "# Bundle recipe and model specification into a workflow\n",
        "rf_wf <- workflow() %>% \n",
        "  add_recipe(cuisines_recipe) %>% \n",
        "  add_model(rf_spec)\n",
        "\n",
        "# Train a random forest model\n",
        "rf_wf_fit <- rf_wf %>% \n",
        "  fit(data = cuisines_train)\n",
        "\n",
        "\n",
        "# Make predictions and evaluate model performance\n",
        "rf_wf_fit %>% \n",
        "  augment(new_data = cuisines_test) %>% \n",
        "  eval_metrics(truth = cuisine, estimate = .pred_class)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RGVYwC_aiUWc"
      },
      "source": [
        "Good job 👏!\n",
        "\n",
        "Make we try Boosted Tree model too.\n",
        "\n",
        "Boosted Tree na one kind ensemble method wey dey build series of decision trees one after di other. Each tree go use di result of di tree wey dey before am to try reduce di error small small. E dey focus on di weight of items wey dem classify wrong and e go adjust di fit for di next classifier to correct am.\n",
        "\n",
        "Different ways dey to fit dis model (check `help(\"boost_tree\")`). For dis example, we go fit Boosted trees with `xgboost` engine.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Py1YWo-micWs"
      },
      "source": [
        "# Make a boosted tree specification\n",
        "boost_spec <- boost_tree(trees = 200) %>% \n",
        "  set_engine(\"xgboost\") %>% \n",
        "  set_mode(\"classification\")\n",
        "\n",
        "# Bundle recipe and model specification into a workflow\n",
        "boost_wf <- workflow() %>% \n",
        "  add_recipe(cuisines_recipe) %>% \n",
        "  add_model(boost_spec)\n",
        "\n",
        "# Train a boosted tree model\n",
        "boost_wf_fit <- boost_wf %>% \n",
        "  fit(data = cuisines_train)\n",
        "\n",
        "\n",
        "# Make predictions and evaluate model performance\n",
        "boost_wf_fit %>% \n",
        "  augment(new_data = cuisines_test) %>% \n",
        "  eval_metrics(truth = cuisine, estimate = .pred_class)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zNQnbuejigZM"
      },
      "source": [
        "> ✅ Abeg check:\n",
        ">\n",
        "> - [Machine Learning for Social Scientists](https://cimentadaj.github.io/ml_socsci/tree-based-methods.html#random-forests)\n",
        ">\n",
        "> - [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)\n",
        ">\n",
        "> - [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)\n",
        ">\n",
        "> - <https://algotech.netlify.app/blog/xgboost/> - Dem dey talk about AdaBoost model wey fit work well as alternative to xgboost.\n",
        ">\n",
        "> to sabi more about Ensemble classifiers.\n",
        "\n",
        "## 4. Extra - compare plenty models\n",
        "\n",
        "We don fit plenty models for dis lab 🙌. E fit dey tire person or hard to dey create plenty workflows from different preprocessors and/or model specifications, come dey calculate performance metrics one by one.\n",
        "\n",
        "Make we see if we fit solve dis mata by creating one function wey go fit one list of workflows to di training set, then e go return performance metrics based on di test set. We go use `map()` and `map_dfr()` from di [purrr](https://purrr.tidyverse.org/) package to apply functions to each element for di list.\n",
        "\n",
        "> [`map()`](https://purrr.tidyverse.org/reference/map.html) functions dey help you replace plenty for loops with code wey short and easy to read. Di best place to sabi about [`map()`](https://purrr.tidyverse.org/reference/map.html) functions na di [iteration chapter](http://r4ds.had.co.nz/iteration.html) for R for Data Science.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Qzb7LyZnimd2"
      },
      "source": [
        "set.seed(2056)\n",
        "\n",
        "# Create a metric set\n",
        "eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)\n",
        "\n",
        "# Define a function that returns performance metrics\n",
        "compare_models <- function(workflow_list, train_set, test_set){\n",
        "  \n",
        "  suppressWarnings(\n",
        "    # Fit each model to the train_set\n",
        "    map(workflow_list, fit, data = train_set) %>% \n",
        "      # Make predictions on the test set\n",
        "      map_dfr(augment, new_data = test_set, .id = \"model\") %>%\n",
        "      # Select desired columns\n",
        "      select(model, cuisine, .pred_class) %>% \n",
        "      # Evaluate model performance\n",
        "      group_by(model) %>% \n",
        "      eval_metrics(truth = cuisine, estimate = .pred_class) %>% \n",
        "      ungroup()\n",
        "  )\n",
        "  \n",
        "} # End of function"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Fwa712sNisDA"
      },
      "source": [
        "Make we call our function and check di accuracy for di models.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3i4VJOi2iu-a"
      },
      "source": [
        "# Make a list of workflows\n",
        "workflow_list <- list(\n",
        "  \"svc\" = svc_linear_wf,\n",
        "  \"svm\" = svm_rbf_wf,\n",
        "  \"knn\" = knn_wf,\n",
        "  \"random_forest\" = rf_wf,\n",
        "  \"xgboost\" = boost_wf)\n",
        "\n",
        "# Call the function\n",
        "set.seed(2056)\n",
        "perf_metrics <- compare_models(workflow_list = workflow_list, train_set = cuisines_train, test_set = cuisines_test)\n",
        "\n",
        "# Print out performance metrics\n",
        "perf_metrics %>% \n",
        "  group_by(.metric) %>% \n",
        "  arrange(desc(.estimate)) %>% \n",
        "  slice_head(n = 7)\n",
        "\n",
        "# Compare accuracy\n",
        "perf_metrics %>% \n",
        "  filter(.metric == \"accuracy\") %>% \n",
        "  arrange(desc(.estimate))\n"
      ],
      "execution_count": null,
      "outputs": []
    },
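    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Table dey useful, but small picture fit make di comparison clear one time. Dis na one optional sketch wey dey use `ggplot2` (e dey come with di tidyverse) to plot di accuracy of each model:\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Optional sketch: visualize model accuracy with a bar chart (ggplot2)\n",
        "perf_metrics %>% \n",
        "  filter(.metric == \"accuracy\") %>% \n",
        "  ggplot(mapping = aes(x = reorder(model, .estimate), y = .estimate)) +\n",
        "  geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
        "  coord_flip() +\n",
        "  labs(x = \"model\", y = \"accuracy\")\n"
      ],
      "execution_count": null,
      "outputs": []
    },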
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KuWK_lEli4nW"
      },
      "source": [
        "Di [**workflowsets**](https://workflowsets.tidymodels.org/) package dey allow people create and fit plenty models easily, but e dey mostly work well with resampling techniques like `cross-validation`, wey we never talk about yet.\n",
        "\n",
        "## **🚀Challenge**\n",
        "\n",
        "Each of dis techniques get plenty parameters wey you fit adjust, like `cost` for SVMs, `neighbors` for KNN, `mtry` (Randomly Selected Predictors) for Random Forest.\n",
        "\n",
        "Check how each one default parameters be and think about wetin go happen if you adjust these parameters for di model quality. One hedged sketch dey for di code cell wey follow.\n",
        "\n",
        "To sabi more about any model and e parameters, use: `help(\"model\")` e.g `help(\"rand_forest\")`\n",
        "\n",
        "> For real life, we dey usually *estimate* the *best values* for these by training plenty models on top `simulated data set` and check how well all these models dey perform. Dis process na wetin dem dey call **tuning**.\n"
      ]
    },
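    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Dis na one hedged sketch of wetin 'adjusting parameters' fit look like — di values `mtry = 5` and `trees = 1000` na just assumptions for illustration, no be tuned values:\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Hedged sketch: manually adjust random forest arguments and refit.\n",
        "# mtry = 5 and trees = 1000 are illustrative guesses, not tuned values.\n",
        "rf_adjusted_fit <- rf_wf %>% \n",
        "  update_model(rand_forest(mtry = 5, trees = 1000) %>% \n",
        "                 set_engine(\"ranger\") %>% \n",
        "                 set_mode(\"classification\")) %>% \n",
        "  fit(data = cuisines_train)\n",
        "\n",
        "# Compare against the default random forest results from earlier\n",
        "rf_adjusted_fit %>% \n",
        "  augment(new_data = cuisines_test) %>% \n",
        "  eval_metrics(truth = cuisine, estimate = .pred_class)\n"
      ],
      "execution_count": null,
      "outputs": []
    },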
"### [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/24/)\n",
|
|
"\n",
|
|
"### **Review & Self Study**\n",
|
|
"\n",
|
|
"Plenty big grammar dey for these lessons, so take small time review [dis list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful words!\n",
|
|
"\n",
|
|
"#### THANK YOU TO:\n",
|
|
"\n",
|
|
"[`Allison Horst`](https://twitter.com/allison_horst/) for the amazing drawings wey make R more friendly and fun. You fit see more of her drawings for her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n",
|
|
"\n",
|
|
"[Cassie Breviu](https://www.twitter.com/cassieview) and [Jen Looper](https://www.twitter.com/jenlooper) for creating the first Python version of dis module ♥️\n",
|
|
"\n",
|
|
"Enjoy your learning,\n",
|
|
"\n",
|
|
"[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n",
|
|
"\n",
|
|
"<p >\n",
|
|
" <img src=\"../../../../../../translated_images/r_learners_sm.f9199f76f1e2e493.pcm.jpeg\"\n",
|
|
" width=\"569\"/>\n",
|
|
" <figcaption>Artwork by @allison_horst</figcaption>\n"
|
|
]
|
|
},
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n\n<!-- CO-OP TRANSLATOR DISCLAIMER START -->\n**Disclaimer**: \nDis dokyument don use AI transleto service [Co-op Translator](https://github.com/Azure/co-op-translator) do di translation. Even as we dey try make am correct, abeg sabi say machine translation fit get mistake or no dey accurate well. Di original dokyument wey dey for im native language na di main source wey you go trust. For important mata, e better make professional human transleto check am. We no go fit take blame for any misunderstanding or wrong interpretation wey fit happen because you use dis translation.\n<!-- CO-OP TRANSLATOR DISCLAIMER END -->\n"
      ]
    }
  ]
}