diff --git a/4-Classification/2-Classifiers-1/solution/R/lesson_11.Rmd b/4-Classification/2-Classifiers-1/solution/R/lesson_11.Rmd index 695ac7e55..2bc9b9988 100644 --- a/4-Classification/2-Classifiers-1/solution/R/lesson_11.Rmd +++ b/4-Classification/2-Classifiers-1/solution/R/lesson_11.Rmd @@ -18,7 +18,7 @@ In this lesson, we'll explore a variety of classifiers to *predict a given natio ### **Preparation** -This lesson builds up on our [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/1-Introduction/solution/lesson_10-R.ipynb) where we: +This lesson builds up on our [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/1-Introduction/solution/R/lesson_10.html) where we: - Made a gentle introduction to classifications using a dataset about all the brilliant cuisines of Asia and India 😋. diff --git a/4-Classification/2-Classifiers-1/solution/R/lesson_11.html b/4-Classification/2-Classifiers-1/solution/R/lesson_11.html new file mode 100644 index 000000000..20a9d3b90 --- /dev/null +++ b/4-Classification/2-Classifiers-1/solution/R/lesson_11.html @@ -0,0 +1,3560 @@ + + + + +
+ + + + + + + + +In this lesson, we’ll explore a variety of classifiers to predict +a given national cuisine based on a group of ingredients. While +doing so, we’ll learn more about some of the ways that algorithms can be +leveraged for classification tasks.
+ +This lesson builds on our previous +lesson where we:
+Made a gentle introduction to classification using a dataset +about all the brilliant cuisines of Asia and India 😋.
Explored some dplyr +verbs to prep and clean our data.
Made beautiful visualizations using ggplot2.
Demonstrated how to deal with imbalanced data by preprocessing it +using recipes.
Demonstrated how to prep and bake our
+recipe to confirm that it will work as expected.
For this lesson, we’ll require the following packages to clean, prep +and visualize our data:
+tidyverse: The tidyverse is a collection of R packages
+designed to make data science faster, easier and more fun!
tidymodels: The tidymodels framework is a collection of packages
+for modeling and machine learning.
DataExplorer: The DataExplorer
+package is meant to simplify and automate the EDA process and report
+generation.
themis: The themis package provides Extra
+Recipes Steps for Dealing with Unbalanced Data.
nnet: The nnet
+package provides functions for estimating feed-forward neural
+networks with a single hidden layer, and for multinomial logistic
+regression models.
You can have them installed as:
+install.packages(c("tidyverse", "tidymodels", "DataExplorer", "here"))
Alternatively, the script below checks whether you have the packages +required to complete this module and installs them for you in case they +are missing.
+suppressWarnings(if (!require("pacman"))install.packages("pacman"))
+
+pacman::p_load(tidyverse, tidymodels, DataExplorer, themis, here)Now, let’s hit the ground running!
+We’ll start by picking a few steps from our previous lesson.
+Dropping the most common ingredients that create confusion between distinct cuisines, using dplyr::select(). Everyone loves rice, garlic and ginger!
+# Load the original cuisines data
+df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")## New names:
+## Rows: 2448 Columns: 385
+## ── Column specification
+## ──────────────────────────────────────────────────────── Delimiter: "," chr
+## (1): cuisine dbl (384): ...1, almond, angelica, anise, anise_seed, apple,
+## apple_brandy, a...
+## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
+## Specify the column types or set `show_col_types = FALSE` to quiet this message.
+## • `` -> `...1`
+# Drop id column, rice, garlic and ginger from our original data set
+df_select <- df %>%
+ select(-c(1, rice, garlic, ginger)) %>%
+ # Encode cuisine column as categorical
+ mutate(cuisine = factor(cuisine))
+
+# Display new data set
+df_select %>%
+ slice_head(n = 5)Perfect! Now, time to split the data such that 70% of the data goes
+to training and 30% goes to testing. We’ll also apply a
+stratification technique when splitting the data to
+maintain the proportion of each cuisine in the training and
+test datasets.
rsample, a package in +Tidymodels, provides infrastructure for efficient data splitting and +resampling:
+# Load the core Tidymodels packages into R session
+library(tidymodels)
+
+# Create split specification
+set.seed(2056)
+cuisines_split <- initial_split(data = df_select,
+ strata = cuisine,
+ prop = 0.7)
+
+# Extract the data in each split
+cuisines_train <- training(cuisines_split)
+cuisines_test <- testing(cuisines_split)
+
+# Print the number of cases in each split
+cat("Training cases: ", nrow(cuisines_train), "\n",
+ "Test cases: ", nrow(cuisines_test), sep = "")## Training cases: 1712
+## Test cases: 736
+
+# Display distribution of cuisines in the training set
+cuisines_train %>%
+ count(cuisine) %>%
+ arrange(desc(n))As you might have noticed in the original data set as well as in our +training set, there is quite an unequal distribution in the number of +cuisines: there are almost three times as many Korean recipes as Thai ones. +Imbalanced data often has negative effects on the model performance. +Many models perform best when the number of observations is equal and, +thus, tend to struggle with unbalanced data.
There are two main ways of dealing with imbalanced data sets:
+adding observations to the minority class:
+Over-sampling e.g using a SMOTE algorithm which
+synthetically generates new examples of the minority class using nearest
+neighbors of these cases.
removing observations from majority class:
+Under-sampling
In our previous lesson, we demonstrated how to deal with imbalanced
+data sets using a recipe. A recipe can be thought of as a
+blueprint that describes what steps should be applied to a data set in
+order to get it ready for data analysis. In our case, we want to have an
+equal distribution in the number of our cuisines for our
+training set. Let’s get right into it.
# Load themis package for dealing with imbalanced data
+library(themis)
+
+# Create a recipe for preprocessing training data
+cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%
+ step_smote(cuisine)
+
+# Print recipe
+cuisines_recipe##
+## ── Recipe ──────────────────────────────────────────────────────────────────────
+##
+## ── Inputs
+## Number of variables by role
+## outcome: 1
+## predictor: 380
+##
+## ── Operations
+## • SMOTE based on: cuisine
+You can of course go ahead and confirm (using prep+bake) that the
+recipe will work as you expect - with all the cuisine labels having
+559 observations.
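If you want to check this yourself, a minimal sketch (reusing the cuisines_recipe and cuisines_train objects created above) could look like this:
# Estimate the recipe on the training data, apply it, and count the classes
cuisines_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(cuisine)
Here bake(new_data = NULL) returns the preprocessed training set itself, so each cuisine should now appear the same number of times.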
Since we’ll be using this recipe as a preprocessor for modeling, a
+workflow() will do all the prep and bake for us, so we
+won’t have to manually estimate the recipe.
Now we are ready to train a model 👩💻👨💻!
+Now we have to decide which algorithm to use for the job 🤔.
+In Tidymodels, the parsnip package
+provides a consistent interface for working with models across different
+engines (packages). Please see the parsnip documentation to explore model types &
+engines and their corresponding model
+arguments. The variety is quite bewildering at first sight. For
+instance, the following methods all include classification
+techniques:
C5.0 Rule-Based Classification Models
Flexible Discriminant Models
Linear Discriminant Models
Regularized Discriminant Models
Logistic Regression Models
Multinomial Regression Models
Naive Bayes Models
Support Vector Machines
Nearest Neighbors
Decision Trees
Ensemble methods
Neural Networks
The list goes on!
So, which classifier should you choose? Often, running through +several and comparing their results is a practical way to decide.
+++AutoML solves this problem neatly by running these comparisons in the +cloud, allowing you to choose the best algorithm for your data. Try it +here
+
The choice of classifier also depends on our problem. For instance,
+when the outcome can be categorized into
+more than two classes, like in our case, you must use a
+multiclass classification algorithm as opposed to
+binary classification.
A better way than wildly guessing, however, is to follow the ideas on +this downloadable ML +Cheat sheet. Here, we discover that, for our multiclass problem, we +have some choices:
+Let’s see if we can reason our way through different approaches given +the constraints we have:
+Deep Neural networks are too heavy. Given our +clean, but minimal dataset, and the fact that we are running training +locally via notebooks, deep neural networks are too heavyweight for this +task.
No two-class classifier. We do not use a +two-class classifier, so that rules out one-vs-all.
Decision tree or logistic regression could work. +A decision tree might work, or multinomial regression/multiclass +logistic regression for multiclass data.
Multiclass Boosted Decision Trees solve a different +problem. The multiclass boosted decision tree is most suitable +for nonparametric tasks, e.g. tasks designed to build rankings, so it is +not useful for us.
Also, normally before embarking on more complex machine learning
+models e.g ensemble methods, it’s a good idea to build the simplest
+possible model to get an idea of what is going on. So for this lesson,
+we’ll start with a multinomial logistic regression
+model.
++Logistic regression is a technique used when the outcome variable is +categorical (or nominal). For Binary logistic regression the number of +outcome variables is two, whereas the number of outcome variables for +multinomial logistic regression is more than two. See Advanced +Regression Methods for further reading.
+
In Tidymodels, parsnip::multinom_reg() defines a model
+that uses linear predictors to predict multiclass data using the
+multinomial distribution. See ?multinom_reg() for the
+different ways/engines you can use to fit this model.
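If you would rather list the available engines programmatically, parsnip provides a small helper; a quick sketch:
# List the engines registered for multinomial regression
library(parsnip)
show_engines("multinom_reg")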
For this example, we’ll fit a Multinomial regression model via the +default nnet +engine.
+++I picked a value for
+penaltysort of randomly. There are +better ways to choose this value that is, by using +resamplingandtuningthe model which we’ll +discuss later.See Tidymodels: +Get Started in case you want to learn more on how to tune model +hyperparameters.
+
# Create a multinomial regression model specification
+mr_spec <- multinom_reg(penalty = 1) %>%
+ set_engine("nnet", MaxNWts = 2086) %>%
+ set_mode("classification")
+
+# Print model specification
+mr_spec## Multinomial Regression Model Specification (classification)
+##
+## Main Arguments:
+## penalty = 1
+##
+## Engine-Specific Arguments:
+## MaxNWts = 2086
+##
+## Computational engine: nnet
+Great job 🥳! Now that we have a recipe and a model specification, we
+need to find a way of bundling them together into an object that will
+first preprocess the data then fit the model on the preprocessed data
+and also allow for potential post-processing activities. In Tidymodels,
+this convenient object is called a workflow and
+conveniently holds your modeling components! This is what we’d call
+pipelines in Python.
So let’s bundle everything up into a workflow!📦
+# Bundle recipe and model specification
+mr_wf <- workflow() %>%
+ add_recipe(cuisines_recipe) %>%
+ add_model(mr_spec)
+
+# Print out workflow
+mr_wf## ══ Workflow ════════════════════════════════════════════════════════════════════
+## Preprocessor: Recipe
+## Model: multinom_reg()
+##
+## ── Preprocessor ────────────────────────────────────────────────────────────────
+## 1 Recipe Step
+##
+## • step_smote()
+##
+## ── Model ───────────────────────────────────────────────────────────────────────
+## Multinomial Regression Model Specification (classification)
+##
+## Main Arguments:
+## penalty = 1
+##
+## Engine-Specific Arguments:
+## MaxNWts = 2086
+##
+## Computational engine: nnet
+Workflows 👌👌! A workflow() can be fit
+in much the same way a model can. So, time to train a model!
# Train a multinomial regression model
+mr_fit <- fit(object = mr_wf, data = cuisines_train)
+
+mr_fit## ══ Workflow [trained] ══════════════════════════════════════════════════════════
+## Preprocessor: Recipe
+## Model: multinom_reg()
+##
+## ── Preprocessor ────────────────────────────────────────────────────────────────
+## 1 Recipe Step
+##
+## • step_smote()
+##
+## ── Model ───────────────────────────────────────────────────────────────────────
+## Call:
+## nnet::multinom(formula = ..y ~ ., data = data, decay = ~1, MaxNWts = ~2086,
+## trace = FALSE)
+##
+## Coefficients:
+## (Intercept) almond angelica anise anise_seed apple
+## indian 0.19723325 0.2409661 0 -5.004955e-05 -0.1657635 -0.05769734
+## japanese 0.13961959 -0.6262400 0 -1.169155e-04 -0.4893596 -0.08585717
+## korean 0.22377347 -0.1833485 0 -5.560395e-05 -0.2489401 -0.15657804
+## thai -0.04336577 -0.6106258 0 4.903828e-04 -0.5782866 0.63451105
+## apple_brandy apricot armagnac artemisia artichoke asparagus
+## indian 0 0.37042636 0 -0.09122797 0 -0.27181970
+## japanese 0 0.28895643 0 -0.12651100 0 0.14054037
+## korean 0 -0.07981259 0 0.55756709 0 -0.66979948
+## thai 0 -0.33160904 0 -0.10725182 0 -0.02602152
+## avocado bacon baked_potato balm banana barley
+## indian -0.46624197 0.16008055 0 0 -0.2838796 0.2230625
+## japanese 0.90341344 0.02932727 0 0 -0.4142787 2.0953906
+## korean -0.06925382 -0.35804134 0 0 -0.2686963 -0.7233404
+## thai -0.21473955 -0.75594439 0 0 0.6784880 -0.4363320
+## bartlett_pear basil bay bean beech
+## indian 0 -0.7128756 0.1011587 -0.8777275 -0.0004380795
+## japanese 0 0.1288697 0.9425626 -0.2380748 0.3373437611
+## korean 0 -0.2445193 -0.4744318 -0.8957870 -0.0048784496
+## thai 0 1.5365848 0.1333256 0.2196970 -0.0113078024
+## beef beef_broth beef_liver beer beet
+## indian -0.7985278 0.2430186 -0.035598065 -0.002173738 0.01005813
+## japanese 0.2241875 -0.3653020 -0.139551027 0.128905553 0.04923911
+## korean 0.5366515 -0.6153237 0.213455197 -0.010828645 0.27325423
+## thai 0.1570012 -0.9364154 -0.008032213 -0.035063746 -0.28279823
+## bell_pepper bergamot berry bitter_orange black_bean
+## indian 0.49074330 0 0.58947607 0.191256164 -0.1945233
+## japanese 0.09074167 0 -0.25917977 -0.118915977 -0.3442400
+## korean -0.57876763 0 -0.07874180 -0.007729435 -0.5220672
+## thai 0.92554006 0 -0.07210196 -0.002983296 -0.4614426
+## black_currant black_mustard_seed_oil black_pepper black_raspberry
+## indian 0 0.38935801 -0.4453495 0
+## japanese 0 -0.05452887 -0.5440869 0
+## korean 0 -0.03929970 0.8025454 0
+## thai 0 -0.21498372 -0.9854806 0
+## black_sesame_seed black_tea blackberry blackberry_brandy
+## indian -0.2759246 0.3079977 0.191256164 0
+## japanese -0.6101687 -0.1671913 -0.118915977 0
+## korean 1.5197674 -0.3036261 -0.007729435 0
+## thai -0.1755656 -0.1487033 -0.002983296 0
+## blue_cheese blueberry bone_oil bourbon_whiskey brandy
+## indian 0 0.216164294 -0.2276744 0 0.22427587
+## japanese 0 -0.119186087 0.3913019 0 -0.15595599
+## korean 0 -0.007821986 0.2854487 0 -0.02562342
+## thai 0 -0.004947048 -0.0253658 0 -0.05715244
+##
+## ...
+## and 308 more lines.
+The output shows the coefficients that the model learned during +training.
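If you prefer the coefficients as a tidy tibble rather than a printed matrix, one possible sketch (relying on broom's tidier for nnet::multinom fits) is:
# Pull the underlying nnet fit out of the workflow and tidy its coefficients
mr_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  slice_head(n = 10)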
+It’s time to see how the model performed 📏 by evaluating it on a +test set! Let’s begin by making predictions on the test set.
+# Make predictions on the test set
+results <- cuisines_test %>% select(cuisine) %>%
+ bind_cols(mr_fit %>% predict(new_data = cuisines_test))
+
+# Print out results
+results %>%
+ slice_head(n = 5)Great job! In Tidymodels, evaluating model performance can be done +using yardstick - a +package used to measure the effectiveness of models using performance +metrics. As we did in our logistic regression lesson, let’s begin by +computing a confusion matrix.
+# Confusion matrix for categorical data
+conf_mat(data = results, truth = cuisine, estimate = .pred_class)## Truth
+## Prediction chinese indian japanese korean thai
+## chinese 83 1 8 15 10
+## indian 4 163 1 2 6
+## japanese 21 5 73 25 1
+## korean 15 0 11 191 0
+## thai 10 11 3 7 70
+When dealing with multiple classes, it’s generally more intuitive to +visualize this as a heat map, like this:
+update_geom_defaults(geom = "tile", new = list(color = "black", alpha = 0.7))
+# Visualize confusion matrix
+results %>%
+ conf_mat(cuisine, .pred_class) %>%
+ autoplot(type = "heatmap")The darker squares in the confusion matrix plot indicate high numbers +of cases, and you can hopefully see a diagonal line of darker squares +indicating cases where the predicted and actual label are the same.
+Let’s now calculate summary statistics for the confusion matrix.
+# Summary stats for confusion matrix
+conf_mat(data = results, truth = cuisine, estimate = .pred_class) %>% summary()If we narrow down to some metrics such as accuracy, sensitivity, ppv, +we are not badly off for a start 🥳!
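To zoom in on just those metrics, one possible sketch is to filter the summary output:
# Keep only accuracy, sensitivity and positive predictive value
conf_mat(data = results, truth = cuisine, estimate = .pred_class) %>%
  summary() %>%
  filter(.metric %in% c("accuracy", "sens", "ppv"))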
Let’s ask one subtle question: What criterion is used to settle on a +given type of cuisine as the predicted outcome?
+Well, Statistical machine learning algorithms, like logistic
+regression, are based on probability; so what actually gets
+predicted by a classifier is a probability distribution over a set of
+possible outcomes. The class with the highest probability is then chosen
+as the most likely outcome for the given observations.
Let’s see this in action by making both hard class predictions and +probabilities.
+# Make hard class prediction and probabilities
+results_prob <- cuisines_test %>%
+ select(cuisine) %>%
+ bind_cols(mr_fit %>% predict(new_data = cuisines_test)) %>%
+ bind_cols(mr_fit %>% predict(new_data = cuisines_test, type = "prob"))
+
+# Print out results
+results_prob %>%
+ slice_head(n = 5)Much better!
+✅ Can you explain why the model is pretty sure that the first +observation is Thai?
+In this lesson, you used your cleaned data to build a machine +learning model that can predict a national cuisine based on a series of +ingredients. Take some time to read through the many options +Tidymodels provides to classify data and other +ways to fit multinomial regression.
+Allison Horst
+for creating the amazing illustrations that make R more welcoming and
+engaging. Find more illustrations at her gallery.
Cassie Breviu and Jen Looper for creating the +original Python version of this module ♥️
+Happy Learning,
+Eric, Gold Microsoft Learn +Student Ambassador.
+In this second classification lesson, we will explore
+more ways to classify categorical data. We will also learn
+about the ramifications of choosing one classifier over the other.
We assume that you have completed the previous lessons since we will +be carrying forward some concepts we learned before.
+For this lesson, we’ll require the following packages:
+tidyverse: The tidyverse is a collection of R packages
+designed to make data science faster, easier and more fun!
tidymodels: The tidymodels framework is a collection of packages
+for modeling and machine learning.
themis: The themis package provides Extra
+Recipes Steps for Dealing with Unbalanced Data.
You can have them installed as:
+install.packages(c("tidyverse", "tidymodels", "kernlab", "themis", "ranger", "xgboost", "kknn"))
Alternatively, the script below checks whether you have the packages +required to complete this module and installs them for you in case they +are missing.
+suppressWarnings(if (!require("pacman"))install.packages("pacman"))
+
+pacman::p_load(tidyverse, tidymodels, themis, kernlab, ranger, xgboost, kknn)##
+## The downloaded binary packages are in
+## /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmpE2TSCy/downloaded_packages
+##
+## The downloaded binary packages are in
+## /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmpE2TSCy/downloaded_packages
+##
+## The downloaded binary packages are in
+## /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmpE2TSCy/downloaded_packages
+##
+## The downloaded binary packages are in
+## /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmpE2TSCy/downloaded_packages
+Now, let’s hit the ground running!
+In our previous +lesson, we tried to address the question: how do we choose between +multiple models? To a great extent, it depends on the characteristics of +the data and the type of problem we want to solve (for instance +classification or regression?)
+Previously, we learned about the various options you have when +classifying data using Microsoft’s cheat sheet. Python’s Machine +Learning framework, Scikit-learn, offers a similar but more granular +cheat sheet that can further help narrow down your estimators (another +term for classifiers):
+
+
++Tip: visit +this map online and click along the path to read documentation.
+The Tidymodels +reference site also provides an excellent documentation about +different types of model.
+
This map is very helpful once you have a clear grasp of your data, as +you can ‘walk’ along its paths to a decision:
+We have >50 samples
We want to predict a category
We have labeled data
We have fewer than 100K samples
✨ We can choose a Linear SVC
If that doesn’t work, since we have numeric data
+We can try a ✨ KNeighbors Classifier
+This is a very helpful trail to follow. Now, let’s get right into it +using the tidymodels modelling +framework: a consistent and flexible collection of R packages developed +to encourage good statistical practice 😊.
From our previous lessons, we learnt that there was a set of common
+We’ll deal with these by
+Dropping the most common ingredients that create confusion
+between distinct cuisines, using dplyr::select().
Use a recipe that preprocesses the data to get it
+ready for modelling by applying an over-sampling
+algorithm.
We already looked at the above in the previous lesson so this should +be a breeze 🥳!
+# Load the core Tidyverse and Tidymodels packages
+library(tidyverse)
+library(tidymodels)
+
+# Load the original cuisines data
+df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/4-Classification/data/cuisines.csv")## New names:
+## Rows: 2448 Columns: 385
+## ── Column specification
+## ──────────────────────────────────────────────────────── Delimiter: "," chr
+## (1): cuisine dbl (384): ...1, almond, angelica, anise, anise_seed, apple,
+## apple_brandy, a...
+## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
+## Specify the column types or set `show_col_types = FALSE` to quiet this message.
+## • `` -> `...1`
+# Drop id column, rice, garlic and ginger from our original data set
+df_select <- df %>%
+ select(-c(1, rice, garlic, ginger)) %>%
+ # Encode cuisine column as categorical
+ mutate(cuisine = factor(cuisine))
+
+
+# Create data split specification
+set.seed(2056)
+cuisines_split <- initial_split(data = df_select,
+ strata = cuisine,
+ prop = 0.7)
+
+# Extract the data in each split
+cuisines_train <- training(cuisines_split)
+cuisines_test <- testing(cuisines_split)
+
+# Display distribution of cuisines in the training set
+cuisines_train %>%
+ count(cuisine) %>%
+ arrange(desc(n))Imbalanced data often has negative effects on the model performance. +Many models perform best when the number of observations is equal and, +thus, tend to struggle with unbalanced data.
+There are majorly two ways of dealing with imbalanced data sets:
+adding observations to the minority class:
+Over-sampling e.g using a SMOTE algorithm which
+synthetically generates new examples of the minority class using nearest
+neighbors of these cases.
removing observations from majority class:
+Under-sampling
In our previous lesson, we demonstrated how to deal with imbalanced
+data sets using a recipe. A recipe can be thought of as a
+blueprint that describes what steps should be applied to a data set in
+order to get it ready for data analysis. In our case, we want to have an
+equal distribution in the number of our cuisines for our
+training set. Let’s get right into it.
# Load themis package for dealing with imbalanced data
+library(themis)
+
+# Create a recipe for preprocessing training data
+cuisines_recipe <- recipe(cuisine ~ ., data = cuisines_train) %>%
+ step_smote(cuisine)
+
+# Print recipe
+cuisines_recipe##
+## ── Recipe ──────────────────────────────────────────────────────────────────────
+##
+## ── Inputs
+## Number of variables by role
+## outcome: 1
+## predictor: 380
+##
+## ── Operations
+## • SMOTE based on: cuisine
+Now we are ready to train models 👩💻👨💻!
+In our previous lesson, we looked at multinomial regression models. +Let’s explore some more flexible models for classification.
+In the context of classification,
+a Support Vector Machine is a machine learning technique
+that tries to find a hyperplane that “best” separates the
+classes. Let’s look at a simple example:
H1 does not separate the classes. H2 does, but +only with a small margin. H3 separates them with the maximal +margin.
Support-Vector Classification (SVC) is a child of the Support-Vector
+machines family of ML techniques. In SVC, the hyperplane is chosen to
+correctly separate most of the training observations, but
+may misclassify a few observations. By allowing some points
+to be on the wrong side, the SVM becomes more robust to outliers and
+hence generalizes better to new data. The parameter that regulates this
+violation is referred to as cost which has a default value
+of 1 (see help("svm_poly")).
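For instance, a sketch of a specification with a non-default cost (the value 2 here is arbitrary rather than a tuned choice):
# A polynomial SVM specification with a larger cost for margin violations
svm_poly(degree = 1, cost = 2) %>%
  set_engine("kernlab") %>%
  set_mode("classification")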
Let’s create a linear SVC by setting degree = 1 in a
+polynomial SVM model.
# Make a linear SVC specification
+svc_linear_spec <- svm_poly(degree = 1) %>%
+ set_engine("kernlab") %>%
+ set_mode("classification")
+
+# Bundle specification and recipe into a workflow
+svc_linear_wf <- workflow() %>%
+ add_recipe(cuisines_recipe) %>%
+ add_model(svc_linear_spec)
+
+# Print out workflow
+svc_linear_wf## ══ Workflow ════════════════════════════════════════════════════════════════════
+## Preprocessor: Recipe
+## Model: svm_poly()
+##
+## ── Preprocessor ────────────────────────────────────────────────────────────────
+## 1 Recipe Step
+##
+## • step_smote()
+##
+## ── Model ───────────────────────────────────────────────────────────────────────
+## Polynomial Support Vector Machine Model Specification (classification)
+##
+## Main Arguments:
+## degree = 1
+##
+## Computational engine: kernlab
+Now that we have captured the preprocessing steps and model
+specification into a workflow, we can go ahead and train the
+linear SVC and evaluate results while at it. For performance metrics,
+let’s create a metric set that will evaluate: accuracy,
+sensitivity, Positive Predictive Value and
+F Measure
++ ++
augment()will add column(s) for predictions to the +given data.
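The workflow still has to be fitted before we can make predictions; a minimal sketch (the object name svc_linear_fit is what the evaluation below assumes):
# Train a linear SVC model
svc_linear_fit <- svc_linear_wf %>%
  fit(data = cuisines_train)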
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
+# Create a metric set
+eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)
+
+
+# Make predictions and Evaluate model performance
+svc_linear_fit %>%
+ augment(new_data = cuisines_test) %>%
+ eval_metrics(truth = cuisine, estimate = .pred_class)The support vector machine (SVM) is an extension of the support +vector classifier in order to accommodate a non-linear boundary between +the classes. In essence, SVMs use the kernel trick to enlarge +the feature space to adapt to nonlinear relationships between classes. +One popular and extremely flexible kernel function used by SVMs is the +Radial basis function. Let’s see how it will perform on our +data.
+set.seed(2056)
+
+# Make an RBF SVM specification
+svm_rbf_spec <- svm_rbf() %>%
+ set_engine("kernlab") %>%
+ set_mode("classification")
+
+# Bundle specification and recipe into a workflow
+svm_rbf_wf <- workflow() %>%
+ add_recipe(cuisines_recipe) %>%
+ add_model(svm_rbf_spec)
+
+
+# Train an RBF model
+svm_rbf_fit <- svm_rbf_wf %>%
+ fit(data = cuisines_train)## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
+# Make predictions and Evaluate model performance
+svm_rbf_fit %>%
+ augment(new_data = cuisines_test) %>%
+ eval_metrics(truth = cuisine, estimate = .pred_class)Much better 🤩!
+++✅ Please see:
++
+- +
Support Vector +Machines, Hands-on Machine Learning with R
- +
Support Vector +Machines, An Introduction to Statistical Learning with +Applications in R
for further reading.
+
K-nearest neighbor (KNN) is an algorithm in which each +observation is predicted based on its similarity to other +observations.
+Let’s fit one to our data.
+# Make a KNN specification
+knn_spec <- nearest_neighbor() %>%
+ set_engine("kknn") %>%
+ set_mode("classification")
+
+# Bundle recipe and model specification into a workflow
+knn_wf <- workflow() %>%
+ add_recipe(cuisines_recipe) %>%
+ add_model(knn_spec)
+
+# Train a KNN model
+knn_wf_fit <- knn_wf %>%
+ fit(data = cuisines_train)
+
+
+# Make predictions and Evaluate model performance
+knn_wf_fit %>%
+ augment(new_data = cuisines_test) %>%
+ eval_metrics(truth = cuisine, estimate = .pred_class)It appears that this model is not performing that well. Probably
+changing the model’s arguments (see
+help("nearest_neighbor") will improve model performance. Be
+sure to try it out.
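For example, a sketch of a specification with a few arguments changed (these values are only illustrative, not tuned):
# A KNN specification with more neighbors and distance-weighted votes
nearest_neighbor(neighbors = 15, weight_func = "triangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")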
++✅ Please see:
+ +to learn more about K-Nearest Neighbors classifiers.
+
Ensemble algorithms work by combining multiple base estimators to +produce an optimal model either by:
+bagging: applying an averaging function to a
+collection of base models
boosting: building a sequence of models that build on
+one another to improve predictive performance.
Let’s start by trying out a Random Forest model, which builds a large +collection of decision trees then applies an averaging function to for a +better overall model.
+# Make a random forest specification
+rf_spec <- rand_forest() %>%
+ set_engine("ranger") %>%
+ set_mode("classification")
+
+# Bundle recipe and model specification into a workflow
+rf_wf <- workflow() %>%
+ add_recipe(cuisines_recipe) %>%
+ add_model(rf_spec)
+
+# Train a random forest model
+rf_wf_fit <- rf_wf %>%
+ fit(data = cuisines_train)
+
+
+# Make predictions and Evaluate model performance
+rf_wf_fit %>%
+ augment(new_data = cuisines_test) %>%
+ eval_metrics(truth = cuisine, estimate = .pred_class)Good job 👏!
+Let’s also experiment with a Boosted Tree model.
+Boosted Tree defines an ensemble method that creates a series of +sequential decision trees where each tree depends on the results of +previous trees in an attempt to incrementally reduce the error. It +focuses on the weights of incorrectly classified items and adjusts the +fit for the next classifier to correct.
+There are different ways to fit this model (see
+help("boost_tree")). In this example, we’ll fit Boosted
+trees via xgboost engine.
# Make a boosted tree specification
+boost_spec <- boost_tree(trees = 200) %>%
+ set_engine("xgboost") %>%
+ set_mode("classification")
+
+# Bundle recipe and model specification into a workflow
+boost_wf <- workflow() %>%
+ add_recipe(cuisines_recipe) %>%
+ add_model(boost_spec)
+
+# Train a boosted tree model
+boost_wf_fit <- boost_wf %>%
+ fit(data = cuisines_train)
+
+
+# Make predictions and Evaluate model performance
+boost_wf_fit %>%
+ augment(new_data = cuisines_test) %>%
+ eval_metrics(truth = cuisine, estimate = .pred_class)++✅ Please see:
++
+- +
- +
- +
An Introduction to +Statistical Learning with Applications in R
- +
https://algotech.netlify.app/blog/xgboost/ - Explores +the AdaBoost model which is a good alternative to xgboost.
to learn more about Ensemble classifiers.
+
We have fitted quite a number of models in this lab 🙌. It can become +tedious or onerous to create a lot of workflows from different sets of +preprocessors and/or model specifications and then calculate the +performance metrics one by one.
+Let’s see if we can address this by creating a function that fits a
+list of workflows on the training set then returns the performance
+metrics based on the test set. We’ll get to use map() and
+map_dfr() from the purrr package to apply functions
+to each element in list.
+++
map()+functions allow you to replace many for loops with code that is both +more succinct and easier to read. The best place to learn about themap()+functions is the iteration chapter in R +for data science.
set.seed(2056)
+
+# Create a metric set
+eval_metrics <- metric_set(ppv, sens, accuracy, f_meas)
+
+# Define a function that returns performance metrics
+compare_models <- function(workflow_list, train_set, test_set){
+
+ suppressWarnings(
+ # Fit each model to the train_set
+ map(workflow_list, fit, data = train_set) %>%
+ # Make predictions on the test set
+ map_dfr(augment, new_data = test_set, .id = "model") %>%
+ # Select desired columns
+ select(model, cuisine, .pred_class) %>%
+ # Evaluate model performance
+ group_by(model) %>%
+ eval_metrics(truth = cuisine, estimate = .pred_class) %>%
+ ungroup()
+ )
+
+} # End of functionLet’s call our function and compare the accuracy across the +models.
+# Make a list of workflows
+workflow_list <- list(
+ "svc" = svc_linear_wf,
+ "svm" = svm_rbf_wf,
+ "knn" = knn_wf,
+ "random_forest" = rf_wf,
+ "xgboost" = boost_wf)
+
+# Call the function
+set.seed(2056)
+perf_metrics <- compare_models(workflow_list = workflow_list, train_set = cuisines_train, test_set = cuisines_test)
+
+# Print out performance metrics
+perf_metrics %>%
+ group_by(.metric) %>%
+ arrange(desc(.estimate)) %>%
+ slice_head(n=7)workflowset
+package allow users to create and easily fit a large number of models
+but is mostly designed to work with resampling techniques such as
+cross-validation, an approach we are yet to cover.
Each of these techniques has a large number of parameters that you
+can tweak for instance cost in SVMs, neighbors
+in KNN, mtry (Randomly Selected Predictors) in Random
+Forest.
Research each one’s default parameters and think about what tweaking +these parameters would mean for the model’s quality.
+To find out more about a particular model and its parameters, use:
+help("model") e.g help("rand_forest")
++ +In practice, we usually estimate the best values +for these by training many models on a
+simulated data set+and measuring how well all these models perform. This process is called +tuning.
There’s a lot of jargon in these lessons, so take a minute to review +this +list of useful terminology!
+Allison Horst
+for creating the amazing illustrations that make R more welcoming and
+engaging. Find more illustrations at her gallery.
Cassie Breviu and Jen Looper for creating the +original Python version of this module ♥️
+Happy Learning,
+Eric, Gold Microsoft Learn +Student Ambassador.
+Clustering is a type of Unsupervised +Learning that presumes that a dataset is unlabelled or that its +inputs are not matched with predefined outputs. It uses various +algorithms to sort through unlabeled data and provide groupings +according to patterns it discerns in the data.
+ +Clustering +is very useful for data exploration. Let’s see if it can help discover +trends and patterns in the way Nigerian audiences consume music.
+++✅ Take a minute to think about the uses of clustering. In real life, +clustering happens whenever you have a pile of laundry and need to sort +out your family members’ clothes 🧦👕👖🩲. In data science, clustering +happens when trying to analyze a user’s preferences, or determine the +characteristics of any unlabeled dataset. Clustering, in a way, helps +make sense of chaos, like a sock drawer.
+
In a professional setting, clustering can be used to determine things +like market segmentation, determining what age groups buy what items, +for example. Another use would be anomaly detection, perhaps to detect +fraud from a dataset of credit card transactions. Or you might use +clustering to determine tumors in a batch of medical scans.
+✅ Think a minute about how you might have encountered clustering ‘in +the wild’, in a banking, e-commerce, or business setting.
+++🎓 Interestingly, cluster analysis originated in the fields of +Anthropology and Psychology in the 1930s. Can you imagine how it might +have been used?
+
Alternately, you could use it for grouping search results - by +shopping links, images, or reviews, for example. Clustering is useful +when you have a large dataset that you want to reduce and on which you +want to perform more granular analysis, so the technique can be used to +learn about data before other models are constructed.
+✅ Once your data is organized in clusters, you assign it a cluster +Id, and this technique can be useful when preserving a dataset’s +privacy; you can instead refer to a data point by its cluster id, rather +than by more revealing identifiable data. Can you think of other reasons +why you’d refer to a cluster Id rather than other elements of the +cluster to identify it?
+++🎓 How we create clusters has a lot to do with how we gather up the +data points into groups. Let’s unpack some vocabulary:
+🎓 ‘Transductive’ +vs. ‘inductive’
+Transductive inference is derived from observed training cases that +map to specific test cases. Inductive inference is derived from training +cases that map to general rules which are only then applied to test +cases.
+An example: Imagine you have a dataset that is only partially +labelled. Some things are ‘records’, some ‘cds’, and some are blank. +Your job is to provide labels for the blanks. If you choose an inductive +approach, you’d train a model looking for ‘records’ and ‘cds’, and apply +those labels to your unlabeled data. This approach will have trouble +classifying things that are actually ‘cassettes’. A transductive +approach, on the other hand, handles this unknown data more effectively +as it works to group similar items together and then applies a label to +a group. In this case, clusters might reflect ‘round musical things’ and +‘square musical things’.
+🎓 ‘Non-flat’ +vs. ‘flat’ geometry
+Derived from mathematical terminology, non-flat vs. flat geometry +refers to the measure of distances between points by either ‘flat’ (Euclidean) or +‘non-flat’ (non-Euclidean) geometrical methods.
+‘Flat’ in this context refers to Euclidean geometry (parts of which +are taught as ‘plane’ geometry), and non-flat refers to non-Euclidean +geometry. What does geometry have to do with machine learning? Well, as +two fields that are rooted in mathematics, there must be a common way to +measure distances between points in clusters, and that can be done in a +‘flat’ or ‘non-flat’ way, depending on the nature of the data. Euclidean +distances are measured as the length of a line segment between two +points. Non-Euclidean +distances are measured along a curve. If your data, visualized, +seems to not exist on a plane, you might need to use a specialized +algorithm to handle it.
+
+ ++Clusters are defined by their distance matrix, e.g. the distances +between points. This distance can be measured a few ways. Euclidean +clusters are defined by the average of the point values, and contain a +‘centroid’ or center point. Distances are thus measured by the distance +to that centroid. Non-Euclidean distances refer to ‘clustroids’, the +point closest to other points. Clustroids in turn can be defined in +various ways.
+ +Constrained +Clustering introduces ‘semi-supervised’ learning into this +unsupervised method. The relationships between points are flagged as +‘cannot link’ or ‘must-link’ so some rules are forced on the +dataset.
+An example: If an algorithm is set free on a batch of unlabelled or +semi-labelled data, the clusters it produces may be of poor quality. In +the example above, the clusters might group ‘round music things’ and +‘square music things’ and ‘triangular things’ and ‘cookies’. If given +some constraints, or rules to follow (“the item must be made of +plastic”, “the item needs to be able to produce music”) this can help +‘constrain’ the algorithm to make better choices.
+🎓 ‘Density’
+Data that is ‘noisy’ is considered to be ‘dense’. The distances +between points in each of its clusters may prove, on examination, to be +more or less dense, or ‘crowded’ and thus this data needs to be analyzed +with the appropriate clustering method. This +article demonstrates the difference between using K-Means clustering +vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster +density.
+
Deepen your understanding of clustering techniques in this Learn +module
+There are over 100 clustering algorithms, and their use depends on +the nature of the data at hand. Let’s discuss some of the major +ones:
+Centroid clustering. This popular algorithm
+requires the choice of ‘k’, or the number of clusters to form, after
+which the algorithm determines the center point of a cluster and gathers
+data around that point. K-means
+clustering is a popular version of centroid clustering which
+separates a data set into pre-defined K groups. The center is determined
+by the nearest mean, thus the name. The squared distance from the
+cluster is minimized.
Distribution-based clustering. Based in +statistical modeling, distribution-based clustering centers on +determining the probability that a data point belongs to a cluster, and +assigning it accordingly. Gaussian mixture methods belong to this +type.
Density-based clustering. Data points are +assigned to clusters based on their density, or their grouping around +each other. Data points far from the group are considered outliers or +noise. DBSCAN, Mean-shift and OPTICS belong to this type of +clustering.
Grid-based clustering. For multi-dimensional +datasets, a grid is created and the data is divided amongst the grid’s +cells, thereby creating clusters.
The best way to learn about clustering is to try it for yourself, so +that’s what you’ll do in this exercise.
+We’ll require some packages to knock-off this module. You can have
+them installed as:
+install.packages(c('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork'))
Alternatively, the script below checks whether you have the packages +required to complete this module and installs them for you in case some +are missing.
+ +## Loading required package: pacman
+pacman::p_load('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork')##
+## The downloaded binary packages are in
+## /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmplRAI5s/downloaded_packages
+##
+## summarytools installed
+## Warning in pacman::p_load("tidyverse", "tidymodels", "DataExplorer", "summarytools", : Failed to install/load:
+## summarytools
+
+Clustering as a technique is greatly aided by proper visualization, +so let’s get started by visualizing our music data. This exercise will +help us decide which of the methods of clustering we should most +effectively use for the nature of this data.
+Let’s hit the ground running by importing the data.
+# Load the core tidyverse and make it available in your current R session
+library(tidyverse)
+
+# Import the data into a tibble
+df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv")
+
+# View the first 5 rows of the data set
+df %>%
+ slice_head(n = 5)Sometimes, we may want some little more information on our data. We
+can have a look at the data and its structure
+by using the glimpse()
+function:
## Rows: 530
+## Columns: 16
+## $ name <chr> "Sparky", "shuga rush", "LITT!", "Confident / Feeling…
+## $ album <chr> "Mandy & The Jungle", "EVERYTHING YOU HEARD IS TRUE",…
+## $ artist <chr> "Cruel Santino", "Odunsi (The Engine)", "AYLØ", "Lady…
+## $ artist_top_genre <chr> "alternative r&b", "afropop", "indie r&b", "nigerian …
+## $ release_date <dbl> 2019, 2020, 2018, 2019, 2018, 2020, 2018, 2018, 2019,…
+## $ length <dbl> 144000, 89488, 207758, 175135, 152049, 184800, 202648…
+## $ popularity <dbl> 48, 30, 40, 14, 25, 26, 29, 27, 36, 30, 33, 35, 46, 2…
+## $ danceability <dbl> 0.666, 0.710, 0.836, 0.894, 0.702, 0.803, 0.818, 0.80…
+## $ acousticness <dbl> 0.8510, 0.0822, 0.2720, 0.7980, 0.1160, 0.1270, 0.452…
+## $ energy <dbl> 0.420, 0.683, 0.564, 0.611, 0.833, 0.525, 0.587, 0.30…
+## $ instrumentalness <dbl> 5.34e-01, 1.69e-04, 5.37e-04, 1.87e-04, 9.10e-01, 6.6…
+## $ liveness <dbl> 0.1100, 0.1010, 0.1100, 0.0964, 0.3480, 0.1290, 0.590…
+## $ loudness <dbl> -6.699, -5.640, -7.127, -4.961, -6.044, -10.034, -9.8…
+## $ speechiness <dbl> 0.0829, 0.3600, 0.0424, 0.1130, 0.0447, 0.1970, 0.199…
+## $ tempo <dbl> 133.015, 129.993, 130.005, 111.087, 105.115, 100.103,…
+## $ time_signature <dbl> 5, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4,…
+Good job!💪
+We can observe that glimpse() will give you the total
+number of rows (observations) and columns (variables), then, the first
+few entries of each variable in a row after the variable name. In
+addition, the data type of the variable is given immediately
+after each variable’s name inside < >.
DataExplorer::introduce() can summarize this information
+neatly:
Awesome! We have just learnt that our data has no missing values.
+While we are at it, we can explore common central tendency statistics
+(e.g mean
+and median) and
+measures of dispersion (e.g standard
+deviation) using summarytools::descr()
# Describe common statistics
+df %>% descr(stats = "common")
+
+Let’s look at the general values of the data. Note that popularity
+can be 0, which indicates songs that have no ranking. We’ll
+remove those shortly.
++🤔 If we are working with clustering, an unsupervised method that +does not require labeled data, why are we showing this data with labels? +In the data exploration phase, they come in handy, but they are not +necessary for the clustering algorithms to work.
+
Let’s go ahead and find out the most popular genres 🎶 by making a +count of the instances it appears.
+# Popular genres
+top_genres <- df %>%
+ count(artist_top_genre, sort = TRUE) %>%
+# Encode to categorical and reorder according to count
+ mutate(artist_top_genre = factor(artist_top_genre) %>% fct_inorder())
+
+# Print the top genres
+top_genresThat went well! They say a picture is worth a thousand rows of a data +frame (actually nobody ever says that 😅). But you get the gist of it, +right?
+One way to visualize categorical data (character or factor variables) +is using barplots. Let’s make a barplot of the top 10 genres:
+# Change the default gray theme
+theme_set(theme_light())
+
+# Visualize popular genres
+top_genres %>%
+ slice(1:10) %>%
+ ggplot(mapping = aes(x = artist_top_genre, y = n,
+ fill = artist_top_genre)) +
+ geom_col(alpha = 0.8) +
+ paletteer::scale_fill_paletteer_d("rcartocolor::Vivid") +
+ ggtitle("Top genres") +
+ theme(plot.title = element_text(hjust = 0.5),
+ # Rotates the X markers (so we can read them)
+ axis.text.x = element_text(angle = 90))Now it’s way easier to identify that we have missing
+genres 🧐!
++A good visualisation will show you things that you did not expect, or +raise new questions about the data - Hadley Wickham and Garrett +Grolemund, R For Data +Science
+
Note, when the top genre is described as Missing, that
+means that Spotify did not classify it, so let’s get rid of it.
# Visualize popular genres
+top_genres %>%
+ filter(artist_top_genre != "Missing") %>%
+ slice(1:10) %>%
+ ggplot(mapping = aes(x = artist_top_genre, y = n,
+ fill = artist_top_genre)) +
+ geom_col(alpha = 0.8) +
+ paletteer::scale_fill_paletteer_d("rcartocolor::Vivid") +
+ ggtitle("Top genres") +
+ theme(plot.title = element_text(hjust = 0.5),
+ # Rotates the X markers (so we can read them)
+ axis.text.x = element_text(angle = 90))From the little data exploration, we learn that the top three genres
+dominate this dataset. Let’s concentrate on afro dancehall,
+afropop, and nigerian pop, and additionally filter
+the dataset to remove anything with a 0 popularity value (meaning it was
+not classified with a popularity in the dataset and can be considered
+noise for our purposes):
nigerian_songs <- df %>%
+ # Concentrate on top 3 genres
+ filter(artist_top_genre %in% c("afro dancehall", "afropop","nigerian pop")) %>%
+ # Remove unclassified observations
+ filter(popularity != 0)
+
+
+
+# Visualize popular genres
+nigerian_songs %>%
+ count(artist_top_genre) %>%
+ ggplot(mapping = aes(x = artist_top_genre, y = n,
+ fill = artist_top_genre)) +
+ geom_col(alpha = 0.8) +
+ paletteer::scale_fill_paletteer_d("ggsci::category10_d3") +
+ ggtitle("Top genres") +
+ theme(plot.title = element_text(hjust = 0.5))Let’s see whether there is any apparent linear relationship among the +numerical variables in our data set. This relationship is quantified +mathematically by the correlation +statistic.
+The correlation statistic is a value between -1 and 1 that indicates +the strength of a relationship. Values above 0 indicate a +positive correlation (high values of one variable tend to +coincide with high values of the other), while values below 0 indicate a +negative correlation (high values of one variable tend to +coincide with low values of the other).
+# Narrow down to numeric variables and fid correlation
+corr_mat <- nigerian_songs %>%
+ select(where(is.numeric)) %>%
+ cor()
+
+# Visualize correlation matrix
+corrplot(corr_mat, order = 'AOE', col = c('white', 'black'), bg = 'gold2') The data is not strongly correlated except between
+energy and loudness, which makes sense, given
+that loud music is usually pretty energetic. Popularity has
+a correspondence to release date, which also makes sense,
+as more recent songs are probably more popular. Length and energy seem
+to have a correlation too.
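If you would rather read the strongest relationships off a table than a plot, one possible sketch is:
# View the strongest pairwise correlations as a tidy table
corr_mat %>%
  as_tibble(rownames = "feature_1") %>%
  pivot_longer(-feature_1, names_to = "feature_2", values_to = "correlation") %>%
  # Keep each pair once and drop self-correlations
  filter(feature_1 < feature_2) %>%
  arrange(desc(abs(correlation))) %>%
  slice_head(n = 5)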
It will be interesting to see what a clustering algorithm can make of +this data!
+++🎓 Note that correlation does not imply causation! We have proof of +correlation but no proof of causation. An amusing web site +has some visuals that emphasize this point.
+
Let’s ask some more subtle questions. Are the genres significantly +different in the perception of their danceability, based on their +popularity? Let’s examine our top three genres data distribution for +popularity and danceability along a given x and y axis using density +plots.
+# Perform 2D kernel density estimation
+density_estimate_2d <- nigerian_songs %>%
+ ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre)) +
+ geom_density_2d(bins = 5, size = 1) +
+ paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry") +
+ xlim(-20, 80) +
+ ylim(0, 1.2)
+
+# Density plot based on the popularity
+density_estimate_pop <- nigerian_songs %>%
+ ggplot(mapping = aes(x = popularity, fill = artist_top_genre, color = artist_top_genre)) +
+ geom_density(size = 1, alpha = 0.5) +
+ paletteer::scale_fill_paletteer_d("RSkittleBrewer::wildberry") +
+ paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry") +
+ theme(legend.position = "none")
+
+# Density plot based on the danceability
+density_estimate_dance <- nigerian_songs %>%
+ ggplot(mapping = aes(x = danceability, fill = artist_top_genre, color = artist_top_genre)) +
+ geom_density(size = 1, alpha = 0.5) +
+ paletteer::scale_fill_paletteer_d("RSkittleBrewer::wildberry") +
+ paletteer::scale_color_paletteer_d("RSkittleBrewer::wildberry")
+
+
+# Patch everything together
+library(patchwork)
+density_estimate_2d / (density_estimate_pop + density_estimate_dance)We see that there are concentric circles that line up, regardless of +genre. Could it be that Nigerian tastes converge at a certain level of +danceability for this genre?
+In general, the three genres align in terms of their popularity and +danceability. Determining clusters in this loosely-aligned data will be +a challenge. Let’s see whether a scatter plot can support this.
+# A scatter plot of popularity and danceability
+scatter_plot <- nigerian_songs %>%
+ ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre, shape = artist_top_genre)) +
+ geom_point(size = 2, alpha = 0.8) +
+ paletteer::scale_color_paletteer_d("futurevisions::mars")
+
+# Add a touch of interactivity
+ggplotly(scatter_plot)A scatterplot of the same axes shows a similar pattern of +convergence.
+In general, for clustering, you can use scatterplots to show clusters +of data, so mastering this type of visualization is very useful. In the +next lesson, we will take this filtered data and use k-means clustering +to discover groups in this data that see to overlap in interesting +ways.
+In preparation for the next lesson, make a chart about the various +clustering algorithms you might discover and use in a production +environment. What kinds of problems is the clustering trying to +address?
+Before you apply clustering algorithms, as we have learned, it’s a +good idea to understand the nature of your dataset. Read more on this +topic here
+Deepen your understanding of clustering techniques:
+Train and Evaluate +Clustering Models using Tidymodels and friends
Bradley Boehmke & Brandon Greenwell, Hands-On Machine +Learning with R.
Jen Looper for +creating the original Python version of this module ♥️
+Dasani Madipalli
+for creating the amazing illustrations that make machine learning
+concepts more interpretable and easier to understand.
Happy Learning,
+Eric, Gold Microsoft Learn +Student Ambassador.
+In this lesson, you will learn how to create clusters using the +Tidymodels package and other packages in the R ecosystem (we’ll call +them friends 🧑🤝🧑), and the Nigerian music dataset you imported earlier. +We will cover the basics of K-Means for Clustering. Keep in mind that, +as you learned in the earlier lesson, there are many ways to work with +clusters and the method you use depends on your data. We will try +K-Means as it’s the most common clustering technique. Let’s get +started!
+Terms you will learn about:
+Silhouette scoring
Elbow method
Inertia
Variance
K-Means
+Clustering is a method derived from the domain of signal processing.
+It is used to divide and partition groups of data into
+k clusters based on similarities in their features.
The clusters can be visualized as Voronoi diagrams, +which include a point (or ‘seed’) and its corresponding region.
+K-Means clustering has the following steps:
+The data scientist starts by specifying the desired number of +clusters to be created.
Next, the algorithm randomly selects K observations from the data +set to serve as the initial centers for the clusters (i.e., +centroids).
Next, each of the remaining observations is assigned to its +closest centroid.
Next, the new means of each cluster is computed and the centroid +is moved to the mean.
Now that the centers have been recalculated, every observation is +checked again to see if it might be closer to a different cluster. All +the objects are reassigned again using the updated cluster means. The +cluster assignment and centroid update steps are iteratively repeated +until the cluster assignments stop changing (i.e., when convergence is +achieved). Typically, the algorithm terminates when each new iteration +results in negligible movement of centroids and the clusters become +static.
++Note that due to randomization of the initial k observations used as +the starting centroids, we can get slightly different results each time +we apply the procedure. For this reason, most algorithms use several +random starts and choose the iteration with the lowest WCSS. As +such, it is strongly recommended to always run K-Means with several +values of nstart to avoid an undesirable local +optimum.
+
This short animation using the artwork +of Allison Horst explains the clustering process:
+A fundamental question that arises in clustering is this: how do you
+know how many clusters to separate your data into? One drawback of using
+K-Means includes the fact that you will need to establish
+k, that is the number of centroids.
+Fortunately the elbow method helps to estimate a good
+starting value for k. You’ll try it in a minute.
Prerequisite
+We’ll pick off right from where we stopped in the previous +lesson, where we analysed the data set, made lots of visualizations +and filtered the data set to observations of interest. Be sure to check +it out!
We’ll require some packages to complete this module. You can have
+them installed as:
+install.packages(c('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork'))
Alternatively, the script below checks whether you have the packages +required to complete this module and installs them for you in case some +are missing.
+suppressWarnings(if(!require("pacman")) install.packages("pacman",repos = "http://cran.us.r-project.org"))## Loading required package: pacman
+pacman::p_load('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork')##
+## The downloaded binary packages are in
+## /var/folders/c9/r3f6t3kj3wv9jrh50g63hp1r0000gn/T//RtmpHKd9vp/downloaded_packages
+##
+## summarytools installed
+## Warning in pacman::p_load("tidyverse", "tidymodels", "cluster", "summarytools", : Failed to install/load:
+## summarytools
+Let’s hit the ground running!
+This is a recap of what we did in the previous lesson. Let’s slice +and dice some data!
+# Load the core tidyverse and make it available in your current R session
+library(tidyverse)
+
+# Import the data into a tibble
+df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv", show_col_types = FALSE)
+
+# Narrow down to top 3 popular genres
+nigerian_songs <- df %>%
+ # Concentrate on top 3 genres
+ filter(artist_top_genre %in% c("afro dancehall", "afropop","nigerian pop")) %>%
+ # Remove unclassified observations
+ filter(popularity != 0)
+
+
+
+# Visualize popular genres using bar plots
+theme_set(theme_light())
+nigerian_songs %>%
+ count(artist_top_genre) %>%
+ ggplot(mapping = aes(x = artist_top_genre, y = n,
+ fill = artist_top_genre)) +
+ geom_col(alpha = 0.8) +
+ paletteer::scale_fill_paletteer_d("ggsci::category10_d3") +
+ ggtitle("Top genres") +
+ theme(plot.title = element_text(hjust = 0.5))🤩 That went well!
+How clean is this data? Let’s check for outliers using box plots. We will concentrate on numeric columns with fewer outliers (although you could clean out the outliers). Boxplots can show the range of the data and will help choose which columns to use. Note that boxplots do not show variance, an important element of good clusterable data. Please see this discussion for further reading.
+Boxplots are
+used to graphically depict the distribution of numeric
+data, so let’s start by selecting all numeric columns alongside
+the popular music genres.
# Select top genre column and all other numeric columns
+df_numeric <- nigerian_songs %>%
+ select(artist_top_genre, where(is.numeric))
+
+# Display the data
+df_numeric %>%
+ slice_head(n = 5)See how the selection helper where makes this easy 💁?
+Explore other such functions here.
Since we’ll be making a boxplot for each numeric feature and we want
+to avoid using loops, let’s reformat our data into a longer
+format that will allow us to take advantage of facets -
+subplots that each display one subset of the data.
# Pivot data from wide to long
+df_numeric_long <- df_numeric %>%
+ pivot_longer(!artist_top_genre, names_to = "feature_names", values_to = "values")
+
+# Print out data
+df_numeric_long %>%
+ slice_head(n = 15)Much longer! Now time for some ggplots! So what
+geom will we use?
# Make a box plot
+df_numeric_long %>%
+ ggplot(mapping = aes(x = feature_names, y = values, fill = feature_names)) +
+ geom_boxplot() +
+ facet_wrap(~ feature_names, ncol = 4, scales = "free") +
+ theme(legend.position = "none")Easy-gg!
+Now we can see this data is a little noisy: by observing each column +as a boxplot, you can see outliers. You could go through the dataset and +remove these outliers, but that would make the data pretty minimal.
+For now, let’s choose which columns we will use for our clustering
+exercise. Let’s pick the numeric columns with similar ranges. We could
+encode the artist_top_genre as numeric but we’ll drop it
+for now.
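+Here is a minimal sketch of that selection step, assuming we keep the five similarly ranged columns that appear in the cluster output further below (this creates the df_numeric_select data frame used by the kmeans() call):
+# Keep the five numeric columns with comparable ranges (assumed selection)
+df_numeric_select <- df_numeric %>%
+  select(popularity, danceability, acousticness, loudness, energy)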
We can compute k-means in R with the built-in kmeans() function; see help("kmeans"). The kmeans() function accepts a data frame with all numeric columns as its primary argument.
The first step when using k-means clustering is to specify the number +of clusters (k) that will be generated in the final solution. We know +there are 3 song genres that we carved out of the dataset, so let’s try +3:
+set.seed(2056)
+# Kmeans clustering for 3 clusters
+kclust <- kmeans(
+ df_numeric_select,
+ # Specify the number of clusters
+ centers = 3,
+ # How many random initial configurations
+ nstart = 25
+)
+
+# Display clustering object
+kclust## K-means clustering with 3 clusters of sizes 65, 111, 110
+##
+## Cluster means:
+## popularity danceability acousticness loudness energy
+## 1 53.40000 0.7698615 0.2684248 -5.081200 0.7167231
+## 2 31.28829 0.7310811 0.2558767 -5.159550 0.7589279
+## 3 10.12727 0.7458727 0.2720171 -4.586418 0.7906091
+##
+## Clustering vector:
+## [1] 2 3 2 2 2 2 2 2 2 3 2 2 3 2 1 2 3 3 1 3 1 1 1 3 1 2 1 1 2 2 3 3 1 2 2 2 2
+## [38] 3 3 1 2 1 2 1 2 1 1 3 3 2 3 1 1 2 2 2 2 3 3 1 3 2 2 3 2 2 3 2 3 2 2 3 3 3
+## [75] 3 3 2 3 2 2 1 2 3 3 3 2 2 2 2 3 2 2 2 2 3 3 2 3 3 2 3 2 3 2 3 2 2 3 2 1 3
+## [112] 3 2 3 3 2 2 2 2 2 2 2 1 3 3 3 3 1 3 2 3 2 3 2 2 2 1 2 3 3 3 2 3 1 3 2 2 3
+## [149] 3 3 1 3 2 2 2 3 3 1 3 2 3 3 3 3 2 1 1 1 3 1 1 1 1 1 1 2 1 3 1 1 3 1 1 2 1
+## [186] 1 3 3 2 1 2 2 1 2 2 3 3 1 3 3 1 1 3 1 2 1 3 1 2 1 1 2 2 2 3 3 3 3 3 1 2 2
+## [223] 2 2 2 3 3 3 3 3 2 2 3 3 1 3 3 3 1 2 2 2 3 3 1 1 3 3 2 1 1 1 1 1 2 1 1 2 3
+## [260] 3 3 2 2 2 3 2 3 2 3 3 3 1 2 2 2 3 2 3 1 3 2 3 3 3 2 3
+##
+## Within cluster sum of squares by cluster:
+## [1] 3550.293 4559.358 4889.010
+## (between_SS / total_SS = 85.8 %)
+##
+## Available components:
+##
+## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
+## [6] "betweenss" "size" "iter" "ifault"
+The kmeans object contains several bits of information which are well explained in help("kmeans"). For now, let’s focus on a few. We see that the data has been grouped into 3 clusters of sizes 65, 111, 110. The output also contains the cluster centers (means) for the 3 groups across the 5 variables.
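+If you want to inspect those pieces directly, each of the components listed under “Available components” can be pulled out of the kmeans object with $, for example:
+# Cluster sizes (number of observations per cluster)
+kclust$size
+
+# Cluster centers: the mean of each variable within each cluster
+kclust$centers
+
+# Total within-cluster sum of squares across all clusters
+kclust$tot.withinss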
The clustering vector is the cluster assignment for each observation.
+Let’s use the augment function to add the cluster assignment to the original data set.
# Add predicted cluster assignment to data set
+augment(kclust, df_numeric_select) %>%
+ relocate(.cluster) %>%
+ slice_head(n = 10)Perfect, we have just partitioned our data set into a set of 3
+groups. So, how good is our clustering 🤷? Let’s take a look at the
+Silhouette score.
Silhouette +analysis can be used to study the separation distance between the +resulting clusters. This score varies from -1 to 1, and if the score is +near 1, the cluster is dense and well-separated from other clusters. A +value near 0 represents overlapping clusters with samples very close to +the decision boundary of the neighboring clusters. (Source).
+The average silhouette method computes the average silhouette of +observations for different values of k. A high average +silhouette score indicates a good clustering.
+We’ll use the silhouette() function in the cluster package to compute the average silhouette width.
++The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance which we discussed in +the previous +lesson.
+
# Load cluster package
+library(cluster)
+
+# Compute average silhouette score
+ss <- silhouette(kclust$cluster,
+ # Compute euclidean distance
+ dist = dist(df_numeric_select))
+mean(ss[, 3])## [1] 0.5494668
+Our score is .549, so right in the middle. This
+indicates that our data is not particularly well-suited to this type of
+clustering. Let’s see whether we can confirm this hunch visually. The factoextra
+package provides functions (fviz_cluster()) to
+visualize clustering.
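+For example, a minimal fviz_cluster() call on our kmeans object might look like this (a sketch; with more than two variables, the observations are projected onto the first two principal components):
+# Load the factoextra package (installed via pacman earlier)
+library(factoextra)
+
+# Visualize the cluster assignments
+fviz_cluster(kclust, data = df_numeric_select)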
The overlap in clusters indicates that our data is not particularly +well-suited to this type of clustering but let’s continue.
+A fundamental question that often arises in K-Means clustering is +this - without known class labels, how do you know how many clusters to +separate your data into?
+One way we can try to find out is to use a data sample to
+create a series of clustering models with an incrementing
+number of clusters (e.g. from 1 to 10), and evaluate clustering metrics such
+as the Silhouette score.
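+For instance, a small illustrative sketch (not part of the original notebook) that records the average silhouette width for a range of k values could look like this:
+# Average silhouette width for k = 2 to 10 (k = 1 has no silhouette)
+avg_sil <- purrr::map_dbl(2:10, function(k) {
+  km <- kmeans(df_numeric_select, centers = k, nstart = 25)
+  mean(silhouette(km$cluster, dist = dist(df_numeric_select))[, 3])
+})
+
+# Higher values suggest better-separated clusters
+avg_sil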
Let’s determine the optimal number of clusters by computing the +clustering algorithm for different values of k and evaluating +the Within Cluster Sum of Squares (WCSS). The total +within-cluster sum of square (WCSS) measures the compactness of the +clustering and we want it to be as small as possible, with lower values +meaning that the data points are closer.
+Let’s explore the effect of different choices of k, from
+1 to 10, on this clustering.
# Create a series of clustering models
+kclusts <- tibble(k = 1:10) %>%
+ # Perform kmeans clustering for 1,2,3 ... ,10 clusters
+ mutate(model = map(k, ~ kmeans(df_numeric_select, centers = .x, nstart = 25)),
+ # Farm out clustering metrics eg WCSS
+ glanced = map(model, ~ glance(.x))) %>%
+ unnest(cols = glanced)
+
+
+# View clustering results
+kclustsNow that we have the total within-cluster sum-of-squares +(tot.withinss) for each clustering algorithm with center k, we +use the elbow +method to find the optimal number of clusters. The method consists +of plotting the WCSS as a function of the number of clusters, and +picking the elbow of the curve as the number of +clusters to use.
+set.seed(2056)
+# Use elbow method to determine optimum number of clusters
+kclusts %>%
+ ggplot(mapping = aes(x = k, y = tot.withinss)) +
+ geom_line(size = 1.2, alpha = 0.8, color = "#FF7F0EFF") +
+ geom_point(size = 2, color = "#FF7F0EFF")## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+## ℹ Please use `linewidth` instead.
+## This warning is displayed once every 8 hours.
+## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+## generated.
+The plot shows a large reduction in WCSS (so greater
+tightness) as the number of clusters increases from one to two,
+and a further noticeable reduction from two to three clusters. After
+that, the reduction is less pronounced, resulting in an
+elbow 💪 in the chart at around three clusters. This is a
+good indication that there are two to three reasonably well separated
+clusters of data points.
We can now go ahead and extract the clustering model where
+k = 3:
+++
pull(): used to extract a single column+
pluck(): used to index data structures such as lists
# Extract k = 3 clustering
+final_kmeans <- kclusts %>%
+ filter(k == 3) %>%
+ pull(model) %>%
+ pluck(1)
+
+
+final_kmeans## K-means clustering with 3 clusters of sizes 111, 110, 65
+##
+## Cluster means:
+## popularity danceability acousticness loudness energy
+## 1 31.28829 0.7310811 0.2558767 -5.159550 0.7589279
+## 2 10.12727 0.7458727 0.2720171 -4.586418 0.7906091
+## 3 53.40000 0.7698615 0.2684248 -5.081200 0.7167231
+##
+## Clustering vector:
+## [1] 1 2 1 1 1 1 1 1 1 2 1 1 2 1 3 1 2 2 3 2 3 3 3 2 3 1 3 3 1 1 2 2 3 1 1 1 1
+## [38] 2 2 3 1 3 1 3 1 3 3 2 2 1 2 3 3 1 1 1 1 2 2 3 2 1 1 2 1 1 2 1 2 1 1 2 2 2
+## [75] 2 2 1 2 1 1 3 1 2 2 2 1 1 1 1 2 1 1 1 1 2 2 1 2 2 1 2 1 2 1 2 1 1 2 1 3 2
+## [112] 2 1 2 2 1 1 1 1 1 1 1 3 2 2 2 2 3 2 1 2 1 2 1 1 1 3 1 2 2 2 1 2 3 2 1 1 2
+## [149] 2 2 3 2 1 1 1 2 2 3 2 1 2 2 2 2 1 3 3 3 2 3 3 3 3 3 3 1 3 2 3 3 2 3 3 1 3
+## [186] 3 2 2 1 3 1 1 3 1 1 2 2 3 2 2 3 3 2 3 1 3 2 3 1 3 3 1 1 1 2 2 2 2 2 3 1 1
+## [223] 1 1 1 2 2 2 2 2 1 1 2 2 3 2 2 2 3 1 1 1 2 2 3 3 2 2 1 3 3 3 3 3 1 3 3 1 2
+## [260] 2 2 1 1 1 2 1 2 1 2 2 2 3 1 1 1 2 1 2 3 2 1 2 2 2 1 2
+##
+## Within cluster sum of squares by cluster:
+## [1] 4559.358 4889.010 3550.293
+## (between_SS / total_SS = 85.8 %)
+##
+## Available components:
+##
+## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
+## [6] "betweenss" "size" "iter" "ifault"
+Great! Let’s go ahead and visualize the clusters obtained. Care for
+some interactivity using plotly?
# Add predicted cluster assignment to data set
+results <- augment(final_kmeans, df_numeric_select) %>%
+ bind_cols(df_numeric %>% select(artist_top_genre))
+
+# Plot cluster assignments
+clust_plt <- results %>%
+ ggplot(mapping = aes(x = popularity, y = danceability, color = .cluster, shape = artist_top_genre)) +
+ geom_point(size = 2, alpha = 0.8) +
+ paletteer::scale_color_paletteer_d("ggthemes::Tableau_10")
+
+ggplotly(clust_plt)Perhaps we would have expected that each cluster (represented by +different colors) would have distinct genres (represented by different +shapes).
+Let’s take a look at the model’s accuracy.
+# Assign genres to predefined integers
+label_count <- results %>%
+ group_by(artist_top_genre) %>%
+ mutate(id = cur_group_id()) %>%
+ ungroup() %>%
+ summarise(correct_labels = sum(.cluster == id))
+
+
+# Print results
+cat("Result:", label_count$correct_labels, "out of", nrow(results), "samples were correctly labeled.")## Result: 109 out of 286 samples were correctly labeled.
+
+##
+## Accuracy score: 0.3811189
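+That accuracy figure is simply the proportion of correctly labeled samples, 109 / 286 ≈ 0.381; a minimal sketch of how it could be computed from the objects above:
+# Proportion of samples whose cluster id matches the genre id
+cat("\nAccuracy score:", label_count$correct_labels / nrow(results))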
+This model’s accuracy is not bad, but not great. It may be that the data does not lend itself well to K-Means clustering. This data is too imbalanced, too weakly correlated, and there is too much variance between the column values to cluster well. In fact, the clusters that form are probably heavily influenced or skewed by the three genre categories we defined above.
+Nevertheless, that was quite a learning process!
+In Scikit-learn’s documentation, you can see that a model like this +one, with clusters not very well demarcated, has a ‘variance’ +problem:
+Variance is defined as “the average of the squared differences from the Mean” (Source). In the context of this clustering problem, it refers to the fact that the numbers in our dataset tend to diverge a bit too much from the mean.
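+As a quick standalone illustration of that definition (not part of the original notebook), you can compute the variance of one of our columns by hand and compare it with R’s var(), which divides by n - 1 rather than n:
+# "Average of the squared differences from the mean" (population variance)
+x <- df_numeric_select$popularity
+mean((x - mean(x))^2)
+
+# R's built-in var() uses the sample formula (divides by n - 1)
+var(x)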
+✅ This is a great moment to think about all the ways you could +correct this issue. Tweak the data a bit more? Use different columns? +Use a different algorithm? Hint: Try scaling +your data to normalize it and test other columns.
+++Try this ‘variance +calculator’ to understand the concept a bit more.
+
Spend some time with this notebook, tweaking parameters. Can you +improve the accuracy of the model by cleaning the data more (removing +outliers, for example)? You can use weights to give more weight to given +data samples. What else can you do to create better clusters?
+Hint: Try to scale your data. There’s commented code in the notebook +that adds standard scaling to make the data columns resemble each other +more closely in terms of range. You’ll find that while the silhouette +score goes down, the ‘kink’ in the elbow graph smooths out. This is +because leaving the data unscaled allows data with less variance to +carry more weight. Read a bit more on this problem here.
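+If you want to try that hint, a minimal sketch of the scaling step (the exact commented code in the notebook may differ) standardizes the columns before re-running the clustering:
+# Standardize each column to mean 0 and standard deviation 1
+df_numeric_scaled <- as.data.frame(scale(df_numeric_select))
+
+# Re-run the clustering on the scaled data and compare the silhouette score
+kclust_scaled <- kmeans(df_numeric_scaled, centers = 3, nstart = 25)
+mean(silhouette(kclust_scaled$cluster, dist = dist(df_numeric_scaled))[, 3])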
+Take a look at a K-Means Simulator such as this one. You can use this tool to visualize sample data points and determine their centroids. You can edit the data’s randomness, numbers of clusters and numbers of centroids. Does this help you get an idea of how the data can be grouped?
Also, take a look at this +handout on K-Means from Stanford.
Want to try out your newly acquired clustering skills on data sets that lend themselves well to K-Means clustering? Please see:
+Train and Evaluate +Clustering Models using Tidymodels and friends
K-means +Cluster Analysis, UC Business Analytics R Programming Guide
Jen Looper for +creating the original Python version of this module ♥️
+Allison Horst
+for creating the amazing illustrations that make R more welcoming and
+engaging. Find more illustrations at her gallery.
Happy Learning,
+Eric, Gold Microsoft Learn +Student Ambassador.
+#{r include=FALSE} #library(here) #library(rmd2jupyter) #rmd2jupyter("lesson_14.Rmd") #
+
Here γ is the so-called **discount factor** that determines to which extent you should prefer the current reward over the future reward and vice versa.
@@ -316,4 +316,5 @@ Overall, it is important to remember that the success and quality of the learnin
## [Post-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/46/)
-## Assignment [A More Realistic World](assignment.md)
+## Assignment
+[A More Realistic World](assignment.md)
diff --git a/8-Reinforcement/1-QLearning/images/bellman-equation.png b/8-Reinforcement/1-QLearning/images/bellman-equation.png
new file mode 100644
index 000000000..60ff3c97b
Binary files /dev/null and b/8-Reinforcement/1-QLearning/images/bellman-equation.png differ
diff --git a/8-Reinforcement/2-Gym/README.md b/8-Reinforcement/2-Gym/README.md
index de3eac373..b5e4237a6 100644
--- a/8-Reinforcement/2-Gym/README.md
+++ b/8-Reinforcement/2-Gym/README.md
@@ -1,7 +1,7 @@
# CartPole Skating
The problem we have been solving in the previous lesson might seem like a toy problem, not really applicable for real life scenarios. This is not the case, because many real world problems also share this scenario - including playing Chess or Go. They are similar, because we also have a board with given rules and a **discrete state**.
-https://white-water-09ec41f0f.azurestaticapps.net/
+
## [Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/47/)
## Introduction
@@ -331,10 +331,11 @@ You should see something like this:
## [Post-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/48/)
-## Assignment: [Train a Mountain Car](assignment.md)
+## Assignment
+[Train a Mountain Car](assignment.md)
## Conclusion
We have now learned how to train agents to achieve good results just by providing them a reward function that defines the desired state of the game, and by giving them an opportunity to intelligently explore the search space. We have successfully applied the Q-Learning algorithm in the cases of discrete and continuous environments, but with discrete actions.
-It's important to also study situations where action state is also continuous, and when observation space is much more complex, such as the image from the Atari game screen. In those problems we often need to use more powerful machine learning techniques, such as neural networks, in order to achieve good results. Those more advanced topics are the subject of our forthcoming more advanced AI course.
\ No newline at end of file
+It's important to also study situations where action state is also continuous, and when observation space is much more complex, such as the image from the Atari game screen. In those problems we often need to use more powerful machine learning techniques, such as neural networks, in order to achieve good results. Those more advanced topics are the subject of our forthcoming more advanced AI course.
diff --git a/9-Real-World/1-Applications/README.md b/9-Real-World/1-Applications/README.md
index 3cf3e16d2..36643172c 100644
--- a/9-Real-World/1-Applications/README.md
+++ b/9-Real-World/1-Applications/README.md
@@ -19,16 +19,14 @@ The finance sector offers many opportunities for machine learning. Many problems
We learned about [k-means clustering](../../5-Clustering/2-K-Means/README.md) earlier in the course, but how can it be used to solve problems related to credit card fraud?
K-means clustering comes in handy during a credit card fraud detection technique called **outlier detection**. Outliers, or deviations in observations about a set of data, can tell us if a credit card is being used in a normal capacity or if something unusual is going on. As shown in the paper linked below, you can sort credit card data using a k-means clustering algorithm and assign each transaction to a cluster based on how much of an outlier it appears to be. Then, you can evaluate the riskiest clusters for fraudulent versus legitimate transactions.
-
-https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.680.1195&rep=rep1&type=pdf
+[Reference](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.680.1195&rep=rep1&type=pdf)
### Wealth management
In wealth management, an individual or firm handles investments on behalf of their clients. Their job is to sustain and grow wealth in the long-term, so it is essential to choose investments that perform well.
One way to evaluate how a particular investment performs is through statistical regression. [Linear regression](../../2-Regression/1-Tools/README.md) is a valuable tool for understanding how a fund performs relative to some benchmark. We can also deduce whether or not the results of the regression are statistically significant, or how much they would affect a client's investments. You could even further expand your analysis using multiple regression, where additional risk factors can be taken into account. For an example of how this would work for a specific fund, check out the paper below on evaluating fund performance using regression.
-
-http://www.brightwoodventures.com/evaluating-fund-performance-using-regression/
+[Reference](http://www.brightwoodventures.com/evaluating-fund-performance-using-regression/)
## 🎓 Education
@@ -37,14 +35,12 @@ The educational sector is also a very interesting area where ML can be applied.
### Predicting student behavior
[Coursera](https://coursera.com), an online open course provider, has a great tech blog where they discuss many engineering decisions. In this case study, they plotted a regression line to try to explore any correlation between a low NPS (Net Promoter Score) rating and course retention or drop-off.
-
-https://medium.com/coursera-engineering/controlled-regression-quantifying-the-impact-of-course-quality-on-learner-retention-31f956bd592a
+[Reference](https://medium.com/coursera-engineering/controlled-regression-quantifying-the-impact-of-course-quality-on-learner-retention-31f956bd592a)
### Mitigating bias
[Grammarly](https://grammarly.com), a writing assistant that checks for spelling and grammar errors, uses sophisticated [natural language processing systems](../../6-NLP/README.md) throughout its products. They published an interesting case study in their tech blog about how they dealt with gender bias in machine learning, which you learned about in our [introductory fairness lesson](../../1-Introduction/3-fairness/README.md).
-
-https://www.grammarly.com/blog/engineering/mitigating-gender-bias-in-autocorrect/
+[Reference](https://www.grammarly.com/blog/engineering/mitigating-gender-bias-in-autocorrect/)
## 👜 Retail
@@ -53,14 +49,12 @@ The retail sector can definitely benefit from the use of ML, with everything fro
### Personalizing the customer journey
At Wayfair, a company that sells home goods like furniture, helping customers find the right products for their taste and needs is paramount. In this article, engineers from the company describe how they use ML and NLP to "surface the right results for customers". Notably, their Query Intent Engine has been built to use entity extraction, classifier training, asset and opinion extraction, and sentiment tagging on customer reviews. This is a classic use case of how NLP works in online retail.
-
-https://www.aboutwayfair.com/tech-innovation/how-we-use-machine-learning-and-natural-language-processing-to-empower-search
+[Reference](https://www.aboutwayfair.com/tech-innovation/how-we-use-machine-learning-and-natural-language-processing-to-empower-search)
### Inventory management
Innovative, nimble companies like [StitchFix](https://stitchfix.com), a box service that ships clothing to consumers, rely heavily on ML for recommendations and inventory management. Their styling teams work together with their merchandising teams, in fact: "one of our data scientists tinkered with a genetic algorithm and applied it to apparel to predict what would be a successful piece of clothing that doesn't exist today. We brought that to the merchandise team and now they can use that as a tool."
-
-https://www.zdnet.com/article/how-stitch-fix-uses-machine-learning-to-master-the-science-of-styling/
+[Reference](https://www.zdnet.com/article/how-stitch-fix-uses-machine-learning-to-master-the-science-of-styling/)
## 🏥 Health Care
@@ -69,20 +63,17 @@ The health care sector can leverage ML to optimize research tasks and also logis
### Managing clinical trials
Toxicity in clinical trials is a major concern to drug makers. How much toxicity is tolerable? In this study, analyzing various clinical trial methods led to the development of a new approach for predicting the odds of clinical trial outcomes. Specifically, they were able to use random forest to produce a [classifier](../../4-Classification/README.md) that is able to distinguish between groups of drugs.
-
-https://www.sciencedirect.com/science/article/pii/S2451945616302914
+[Reference](https://www.sciencedirect.com/science/article/pii/S2451945616302914)
### Hospital readmission management
Hospital care is costly, especially when patients have to be readmitted. This paper discusses a company that uses ML to predict readmission potential using [clustering](../../5-Clustering/README.md) algorithms. These clusters help analysts to "discover groups of readmissions that may share a common cause".
-
-https://healthmanagement.org/c/healthmanagement/issuearticle/hospital-readmissions-and-machine-learning
+[Reference](https://healthmanagement.org/c/healthmanagement/issuearticle/hospital-readmissions-and-machine-learning)
### Disease management
The recent pandemic has shone a bright light on the ways that machine learning can aid in stopping the spread of disease. In this article, you'll recognize the use of ARIMA, logistic curves, linear regression, and SARIMA. "This work is an attempt to calculate the rate of spread of this virus and thus to predict the deaths, recoveries, and confirmed cases, so that it may help us to prepare better and survive."
-
-https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7979218/
+[Reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7979218/)
## 🌲 Ecology and Green Tech
@@ -93,22 +84,19 @@ Nature and ecology consists of many sensitive systems where the interplay betwee
You learned about [Reinforcement Learning](../../8-Reinforcement/README.md) in previous lessons. It can be very useful when trying to predict patterns in nature. In particular, it can be used to track ecological problems like forest fires and the spread of invasive species. In Canada, a group of researchers used Reinforcement Learning to build forest wildfire dynamics models from satellite images. Using an innovative "spatially spreading process (SSP)", they envisioned a forest fire as "the agent at any cell in the landscape." "The set of actions the fire can take from a location at any point in time includes spreading north, south, east, or west or not spreading.
This approach inverts the usual RL setup since the dynamics of the corresponding Markov Decision Process (MDP) is a known function for immediate wildfire spread." Read more about the classic algorithms used by this group at the link below.
-
-https://www.frontiersin.org/articles/10.3389/fict.2018.00006/full
+[Reference](https://www.frontiersin.org/articles/10.3389/fict.2018.00006/full)
### Motion sensing of animals
While deep learning has created a revolution in visually tracking animal movements (you can build your own [polar bear tracker](https://docs.microsoft.com/learn/modules/build-ml-model-with-azure-stream-analytics/?WT.mc_id=academic-77952-leestott) here), classic ML still has a place in this task.
Sensors to track movements of farm animals and IoT make use of this type of visual processing, but more basic ML techniques are useful to preprocess data. For example, in this paper, sheep postures were monitored and analyzed using various classifier algorithms. You might recognize the ROC curve on page 335.
-
-https://druckhaus-hofmann.de/gallery/31-wj-feb-2020.pdf
+[Reference](https://druckhaus-hofmann.de/gallery/31-wj-feb-2020.pdf)
### ⚡️ Energy Management
In our lessons on [time series forecasting](../../7-TimeSeries/README.md), we invoked the concept of smart parking meters to generate revenue for a town based on understanding supply and demand. This article discusses in detail how clustering, regression and time series forecasting combined to help predict future energy use in Ireland, based off of smart metering.
-
-https://www-cdn.knime.com/sites/default/files/inline-images/knime_bigdata_energy_timeseries_whitepaper.pdf
+[Reference](https://www-cdn.knime.com/sites/default/files/inline-images/knime_bigdata_energy_timeseries_whitepaper.pdf)
## 💼 Insurance
@@ -117,8 +105,7 @@ The insurance sector is another sector that uses ML to construct and optimize vi
### Volatility Management
MetLife, a life insurance provider, is forthcoming with the way they analyze and mitigate volatility in their financial models. In this article you'll notice binary and ordinal classification visualizations. You'll also discover forecasting visualizations.
-
-https://investments.metlife.com/content/dam/metlifecom/us/investments/insights/research-topics/macro-strategy/pdf/MetLifeInvestmentManagement_MachineLearnedRanking_070920.pdf
+[Reference](https://investments.metlife.com/content/dam/metlifecom/us/investments/insights/research-topics/macro-strategy/pdf/MetLifeInvestmentManagement_MachineLearnedRanking_070920.pdf)
## 🎨 Arts, Culture, and Literature
@@ -127,8 +114,7 @@ In the arts, for example in journalism, there are many interesting problems. Det
### Fake news detection
Detecting fake news has become a game of cat and mouse in today's media. In this article, researchers suggest that a system combining several of the ML techniques we have studied can be tested and the best model deployed: "This system is based on natural language processing to extract features from the data and then these features are used for the training of machine learning classifiers such as Naive Bayes, Support Vector Machine (SVM), Random Forest (RF), Stochastic Gradient Descent (SGD), and Logistic Regression(LR)."
-
-https://www.irjet.net/archives/V7/i6/IRJET-V7I6688.pdf
+[Reference](https://www.irjet.net/archives/V7/i6/IRJET-V7I6688.pdf)
This article shows how combining different ML domains can produce interesting results that can help stop fake news from spreading and creating real damage; in this case, the impetus was the spread of rumors about COVID treatments that incited mob violence.
@@ -137,16 +123,14 @@ This article shows how combining different ML domains can produce interesting re
Museums are at the cusp of an AI revolution in which cataloging and digitizing collections and finding links between artifacts is becoming easier as technology advances. Projects such as [In Codice Ratio](https://www.sciencedirect.com/science/article/abs/pii/S0306457321001035#:~:text=1.,studies%20over%20large%20historical%20sources.) are helping unlock the mysteries of inaccessible collections such as the Vatican Archives. But, the business aspect of museums benefits from ML models as well.
For example, the Art Institute of Chicago built models to predict what audiences are interested in and when they will attend expositions. The goal is to create individualized and optimized visitor experiences each time the user visits the museum. "During fiscal 2017, the model predicted attendance and admissions within 1 percent of accuracy, says Andrew Simnick, senior vice president at the Art Institute."
-
-https://www.chicagobusiness.com/article/20180518/ISSUE01/180519840/art-institute-of-chicago-uses-data-to-make-exhibit-choices
+[Reference](https://www.chicagobusiness.com/article/20180518/ISSUE01/180519840/art-institute-of-chicago-uses-data-to-make-exhibit-choices)
## 🏷 Marketing
### Customer segmentation
The most effective marketing strategies target customers in different ways based on various groupings. In this article, the uses of Clustering algorithms are discussed to support differentiated marketing. Differentiated marketing helps companies improve brand recognition, reach more customers, and make more money.
-
-https://ai.inqline.com/machine-learning-for-marketing-customer-segmentation/
+[Reference](https://ai.inqline.com/machine-learning-for-marketing-customer-segmentation/)
## 🚀 Challenge
diff --git a/README.md b/README.md
index 1a40ccbd4..4cc946f7b 100644
--- a/README.md
+++ b/README.md
@@ -100,11 +100,11 @@ By ensuring that the content aligns with projects, the process is made more enga
| 08 | North American pumpkin prices 🎃 | [Regression](2-Regression/README.md) | Build a logistic regression model |