Now let's compare the feature distributions for each label value using box plots. We'll begin by reshaping the data into a *long* format, which makes it easier to create multiple `facets`.
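Below is a minimal sketch of that step. It assumes the preprocessed data lives in a data frame called `baked_pumpkins` (the name used later in this lesson), with a `color` label column and numeric feature columns:

```{r}
# Sketch only: the data frame name and column names are assumptions
library(tidyverse)

# Pivot every feature column (everything except the label) into long format
baked_pumpkins_long <- baked_pumpkins %>% 
  pivot_longer(!color, names_to = "features", values_to = "values")

# Compare the distribution of each feature for each color label
baked_pumpkins_long %>% 
  ggplot(aes(x = color, y = values, fill = color)) +
  geom_boxplot() +
  facet_wrap(~ features, scales = "free") +
  theme(legend.position = "none")
```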
Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.
### **Analysing relationships between features and label**

Let's now focus on a specific relationship: Item Size and Color!

#### **Use a swarm plot**

Color is a binary category (Orange or Not), so it is `categorical data`. There are various other ways of [visualizing categorical data](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar).

We'll need the encoded Item Size column to use as the x-axis values in the plot, along with a palette that maps each color label to a plotting color.

```{r}
# Define the color palette
palette <- c(ORANGE = "orange", WHITE = "wheat")
```
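Here's a minimal sketch of the plot using the `ggbeeswarm` package (one of several ways to draw a swarm-style plot in R). It assumes `item_size` has already been encoded as a numeric column in `baked_pumpkins` and that the label values match the palette names:

```{r}
# Sketch only: assumes a numeric item_size column and ORANGE/WHITE label values
library(tidyverse)
library(ggbeeswarm)

baked_pumpkins %>% 
  mutate(color = factor(color)) %>% 
  ggplot(aes(x = item_size, y = color, color = color)) +
  # Spread the points within each color category to show their distribution
  geom_quasirandom() +
  scale_color_manual(values = palette) +
  theme(legend.position = "none")
```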
Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
## 3. Build your model
Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.
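Here's a minimal sketch of that split using the `rsample` package from tidymodels (the data frame name and the stratification column are assumptions):

```{r}
# Sketch only: assumes the tidymodels packages and the baked_pumpkins data frame
library(tidymodels)

# Split the data, stratifying on color so both sets keep a similar
# proportion of orange and white pumpkins
set.seed(2056)
pumpkins_split <- initial_split(baked_pumpkins, prop = 0.8, strata = color)

# Extract the training and test sets
pumpkins_train <- training(pumpkins_split)
pumpkins_test <- testing(pumpkins_split)
```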