@ -243,6 +243,27 @@ Now that we have an idea of the relationship between the binary categories of co
## 3. Build your model
Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.
It is best practice to hold out some of your data for **testing**. Comparing the model's predicted labels with the already known labels in the test set gives a better estimate of how your model will perform on new data. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
```{r split_data}
# Split data into 80% for training and 20% for testing
set.seed(2056)
pumpkins_split <- pumpkins_select %>%
  initial_split(prop = 0.8)
# Extract the data in each split
pumpkins_train <- training(pumpkins_split)
pumpkins_test <- testing(pumpkins_split)
# Print out the first 5 rows of the training set
pumpkins_train %>%
  slice_head(n = 5)
```
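As a quick sanity check, it can help to confirm that a split has the sizes you expect and that the label is reasonably balanced across both sets. The sketch below is illustrative only: `toy_pumpkins` is a hypothetical stand-in for `pumpkins_select`, with made-up values, so that the chunk runs on its own.

```{r split_check}
library(rsample)

# Hypothetical toy data standing in for `pumpkins_select` (values are made up)
toy_pumpkins <- data.frame(
  item_size = rep(c("sml", "med", "lge", "xlge"), times = 25),
  color     = rep(c("ORANGE", "WHITE"), each = 50)
)

# Same 80/20 split as above, applied to the toy data
set.seed(2056)
toy_split <- initial_split(toy_pumpkins, prop = 0.8)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# The two sets together should account for every original row
nrow(toy_train)
nrow(toy_test)

# Compare the share of each label value in the two sets
prop.table(table(toy_train$color))
prop.table(table(toy_test$color))
```

If the label proportions differ noticeably between the two sets, `initial_split()` also accepts a `strata` argument (for example `strata = color`) that samples within each label value. Whether that is needed for the pumpkin data is a judgment call this lesson does not make.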
🙌 We are now ready to train a model by fitting the training features to the training label (color).
We'll begin by creating a recipe that specifies the preprocessing steps needed to get our data ready for modelling, i.e., encoding categorical variables into a set of integers. Just like `baked_pumpkins`, we create a `pumpkins_recipe`, but we do not `prep` and `bake` it, since it will be bundled into a workflow, which you will see in just a few steps.