Merge pull request #1 from jasleen101010/main

Remaining tasks on Logistic Regression Lesson 4
pull/667/head
Vidushi Gupta 1 year ago committed by GitHub
commit 9e572bb7c0
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23


@ -72,21 +72,6 @@ Logistic regression does not offer the same features as linear regression. The f
![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png){width="600"}
-#### **Other classifications**
-There are other types of logistic regression, including multinomial and ordinal:
-- **Multinomial**, which involves having more than one category - "Orange, White, and Striped".
-- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).
-![Multinomial vs ordinal regression](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/4-Logistic/images/multinomial-vs-ordinal.png)
-\
-**It's still linear**
-Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
#### **Variables DO NOT have to correlate**
Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.
@ -179,36 +164,52 @@ baked_pumpkins %>%
  slice_head(n = 5)
```
-Now let's compare the feature distributions for each label value using box plots. We'll begin by formatting the data to a *long* format to make it somewhat easier to make multiple `facets`.
-```{r pivot}
-# Pivot data to long format
-baked_pumpkins_long <- baked_pumpkins %>%
-  pivot_longer(!color, names_to = "features", values_to = "values")
-# Print out restructured data
-baked_pumpkins_long %>%
-  slice_head(n = 10)
+Now, let's make a categorical plot showing the distribution of the predictors with respect to the outcome color!
+```{r cat plot pumpkins-colors-variety}
+# Specify colors for each value of the hue variable
+palette <- c(ORANGE = "orange", WHITE = "wheat")
+# Create the bar plot
+ggplot(pumpkins_select, aes(y = variety, fill = color)) +
+  geom_bar(position = "dodge") +
+  scale_fill_manual(values = palette) +
+  labs(y = "Variety", fill = "Color") +
+  theme_minimal()
```
+Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.
-Now, let's make some boxplots showing the distribution of the predictors with respect to the outcome color!
-```{r boxplots}
-theme_set(theme_light())
-#Make a box plot for each predictor feature
-baked_pumpkins_long %>%
-  mutate(color = factor(color)) %>%
-  ggplot(mapping = aes(x = color, y = values, fill = features)) +
-  geom_boxplot() +
-  facet_wrap(~ features, scales = "free", ncol = 3) +
-  scale_color_viridis_d(option = "cividis", end = .8) +
-  theme(legend.position = "none")
+### **Analysing relationships between features and label**
+```{r}
+# Define the color palette
+palette <- c(ORANGE = "orange", WHITE = "wheat")
+# We need the encoded Item Size column to use it as the x-axis values in the plot
+pumpkins_select_plot<-pumpkins_select
+pumpkins_select_plot$item_size <- baked_pumpkins$item_size
+# Create the grouped box plot
+ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +
+  geom_boxplot() +
+  facet_grid(variety ~ ., scales = "free_x") +
+  scale_fill_manual(values = palette) +
+  labs(x = "Item Size", y = "") +
+  theme_minimal() +
+  theme(strip.text = element_text(size = 12)) +
+  theme(axis.text.x = element_text(size = 10)) +
+  theme(axis.title.x = element_text(size = 12)) +
+  theme(axis.title.y = element_blank()) +
+  theme(legend.position = "bottom") +
+  guides(fill = guide_legend(title = "Color")) +
+  theme(panel.spacing = unit(0.5, "lines"))+
+  theme(strip.text.y = element_text(size = 4, hjust = 0))
```
-Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.
+Let's now focus on a specific relationship: Item Size and Color!
#### **Use a swarm plot**
@ -227,19 +228,10 @@ baked_pumpkins %>%
  scale_color_brewer(palette = "Dark2", direction = -1) +
  theme(legend.position = "none")
```
Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
## 3. Build your model
-> **🧮 Show Me The Math**
->
-> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
->
-> ![](../../images/sigmoid.png)
->
-> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.
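The sigmoid described in the math note above can be sketched in a few lines of base R. This is only an illustration, assuming the general logistic form with maximum value `L`, steepness `k` and midpoint `x0`; the function names `sigmoid` and `classify` are hypothetical and do not appear in the lesson.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, L).
# Assumed defaults: L = 1 (maximum value), k = 1 (steepness), x0 = 0 (midpoint).
sigmoid <- function(x, L = 1, k = 1, x0 = 0) {
  L / (1 + exp(-k * (x - x0)))
}

# Apply the 0.5 decision rule: class 1 if the output is more than 0.5, else class 0.
classify <- function(p, threshold = 0.5) {
  as.integer(p > threshold)
}

x <- c(-4, 0, 4)
p <- sigmoid(x)       # values strictly between 0 and 1
labels <- classify(p) # binary class for each input
```

Note that an input exactly at the midpoint gives 0.5, which is not "more than 0.5" and therefore falls into class 0 under this rule.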
Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.
It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
@ -257,8 +249,6 @@ pumpkins_test <- testing(pumpkins_split)
# Print out the first 5 rows of the training set
pumpkins_train %>%
  slice_head(n = 5)
```
🙌 We are now ready to train a model by fitting the training features to the training label (color).
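For intuition, the hold-out idea behind the split can also be sketched in base R without rsample. This is a rough illustration, not the lesson's method: the `toy_pumpkins` data frame, its values, the seed and the 80/20 proportion are all assumptions made for the example.

```r
# Toy stand-in for the pumpkins data (hypothetical values, for illustration only)
set.seed(123)
toy_pumpkins <- data.frame(
  item_size = sample(1:6, size = 100, replace = TRUE),
  color = sample(c("ORANGE", "WHITE"), size = 100, replace = TRUE, prob = c(0.8, 0.2))
)

# Randomly pick 80% of the row indices for the training set (assumed proportion)
n_rows <- nrow(toy_pumpkins)
train_idx <- sample(seq_len(n_rows), size = floor(0.8 * n_rows))

# Rows in train_idx go to training; every remaining row is held out for testing
toy_train <- toy_pumpkins[train_idx, ]
toy_test <- toy_pumpkins[-train_idx, ]
```

Because the two sets share no rows, scores computed on `toy_test` come from data the model has never seen during training.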
@ -294,11 +284,11 @@ log_reg_wf <- workflow() %>%
# Print out the workflow
log_reg_wf
```
After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow will estimate a recipe and preprocess the data before training, so we won't have to manually do that using prep and bake.
```{r train}
# Train the model
wf_fit <- log_reg_wf %>%
@ -307,6 +297,7 @@ wf_fit <- log_reg_wf %>%
# Print the trained workflow
wf_fit
```
The model print out shows the coefficients learned during training.
@ -338,8 +329,6 @@ The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat
```{r conf_mat}
# Confusion matrix for prediction results
conf_mat(data = results, truth = color, estimate = .pred_class)
```
Let's interpret the confusion matrix. Our model is asked to classify pumpkins between two binary categories, category `white` and category `not-white`
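To make the four cells of a binary confusion matrix concrete, here is a small base-R sketch. It does not use the lesson's `results` tibble: the `truth` and `estimate` vectors are made-up labels, and `WHITE` is assumed to be the positive class.

```r
# Hypothetical true and predicted labels (not taken from the pumpkins data)
lvls <- c("WHITE", "ORANGE")
truth <- factor(c("WHITE", "WHITE", "WHITE", "ORANGE", "ORANGE", "ORANGE", "ORANGE", "ORANGE"),
                levels = lvls)
estimate <- factor(c("WHITE", "WHITE", "ORANGE", "ORANGE", "ORANGE", "ORANGE", "WHITE", "ORANGE"),
                   levels = lvls)

# Rows are predictions, columns are the true labels
cm <- table(Prediction = estimate, Truth = truth)

tp <- cm["WHITE", "WHITE"]   # predicted white, truly white
fp <- cm["WHITE", "ORANGE"]  # predicted white, truly not-white
fn <- cm["ORANGE", "WHITE"]  # predicted not-white, truly white
tn <- cm["ORANGE", "ORANGE"] # predicted not-white, truly not-white

accuracy <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
```

Reading the counts this way shows why accuracy alone can mislead on imbalanced labels: precision and recall look only at the positive class.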
@ -418,5 +407,3 @@ But for now, congratulations 🎉🎉🎉! You've completed these regression les
You R awesome!
![Artwork by \@allison_horst](../../images/r_learners_sm.jpeg)
