added plot for relationship bw features and labels

pull/667/head
Jasleen Sondhi 2 years ago
parent 435f1ed598
commit 8b7d50a9ec

@ -177,21 +177,6 @@ baked_pumpkins %>%
  slice_head(n = 5)
```
Now let's compare the feature distributions for each label value using box plots. We'll begin by pivoting the data to a *long* format, which makes it easier to draw multiple `facets`.
```{r pivot}
# Pivot data to long format
baked_pumpkins_long <- baked_pumpkins %>%
  pivot_longer(!color, names_to = "features", values_to = "values")
# Print out restructured data
baked_pumpkins_long %>%
  slice_head(n = 10)
```
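As a rough illustration of why the long format helps with faceting, here is a minimal sketch on toy data. The `toy_pumpkins` tibble and its `package` column are hypothetical stand-ins, not the lesson's `baked_pumpkins` data, and the plot layout is an assumption rather than the author's chart.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for baked_pumpkins: one label and two numeric features
toy_pumpkins <- tibble(
  color     = factor(c("ORANGE", "ORANGE", "WHITE", "WHITE", "ORANGE", "WHITE")),
  item_size = c(3, 4, 1, 2, 5, 1),
  package   = c(2, 3, 1, 1, 3, 2)
)

# Long format: one row per (observation, feature) pair
toy_long <- toy_pumpkins %>%
  pivot_longer(!color, names_to = "features", values_to = "values")

# With a single `features` column, one facet per feature is a one-liner
p_long <- ggplot(toy_long, aes(x = color, y = values, fill = color)) +
  geom_boxplot() +
  facet_wrap(~ features, scales = "free_y") +
  theme(legend.position = "none")
```

In the wide layout each feature would need its own plotting call; after pivoting, `facet_wrap()` can split on the `features` column directly.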
Now, let's make a categorical plot showing the distribution of the predictors with respect to the outcome color!
```{r cat plot pumpkins-colors-variety}
@ -208,6 +193,36 @@ ggplot(pumpkins, aes(y = Variety, fill = Color)) +
Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.
### **Analysing relationships between features and label**
```{r}
# Define the color palette
palette <- c(ORANGE = "orange", WHITE = "wheat")
# We need the encoded Item Size column to use it as the x-axis values in the plot
pumpkins_select$item_size <- baked_pumpkins$item_size
# Create the grouped box plot
ggplot(pumpkins_select, aes(x = `item_size`, y = color, fill = color)) +
  geom_boxplot() +
  facet_grid(variety ~ ., scales = "free_x") +
  scale_fill_manual(values = palette) +
  labs(x = "Item Size", y = "") +
  theme_minimal() +
  theme(strip.text = element_text(size = 12)) +
  theme(axis.text.x = element_text(size = 10)) +
  theme(axis.title.x = element_text(size = 12)) +
  theme(axis.title.y = element_blank()) +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = "Color")) +
  theme(panel.spacing = unit(2.0, "lines")) +
  theme(strip.text.y = element_text(size = 4, hjust = 0))
```
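A named vector passed to `scale_fill_manual()` is matched to the fill variable by name, so the names need to agree with the factor levels. The sketch below checks that on toy data; `toy`, `pumpkin_palette`, and the ORANGE / WHITE level spellings are assumptions for illustration and may not match the real `pumpkins_select` data.

```r
library(ggplot2)

# Hypothetical toy data; the level spellings are an assumption
toy <- data.frame(
  color     = factor(c("ORANGE", "ORANGE", "ORANGE", "WHITE", "WHITE", "WHITE")),
  item_size = c(3, 4, 5, 1, 2, 2)
)

# Named palette: each name should correspond to one level of `color`
pumpkin_palette <- c(ORANGE = "orange", WHITE = "wheat")

# Check that every level has a palette entry before plotting
missing_levels <- setdiff(levels(toy$color), names(pumpkin_palette))

p <- ggplot(toy, aes(x = item_size, y = color, fill = color)) +
  geom_boxplot() +
  scale_fill_manual(values = pumpkin_palette)
```

If `missing_levels` is non-empty, the unmatched groups would not receive one of the palette colors, which is worth checking when label casing differs between data frames.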
Let's now focus on a specific relationship: Item Size and Color!
#### **Use a swarm plot**
Color is a binary category (Orange or Not), which makes it `categorical data`. There are various other ways of [visualizing categorical data](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar).
@ -230,8 +245,6 @@ baked_pumpkins %>%
Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
### **Analysing relationships between features and label**
## 3. Build your model
Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.
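As a minimal sketch of such a split, assuming the `rsample` package (part of tidymodels) is installed: the `toy` data frame, the seed, and the 80/20 proportion below are illustrative assumptions, not values confirmed by this diff.

```r
library(rsample)

set.seed(2056)  # arbitrary seed so the split is reproducible

# Hypothetical toy data standing in for the prepared pumpkin data
toy <- data.frame(
  color     = factor(rep(c("ORANGE", "WHITE"), times = c(80, 20))),
  item_size = sample(1:7, 100, replace = TRUE)
)

# Hold out roughly 20% of the rows for testing
toy_split <- initial_split(toy, prop = 0.8)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)
```

Every row lands in exactly one of the two sets, so the test set can act as unseen data when the fitted classifier is evaluated.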
