diff --git a/2-Regression/4-Logistic/solution/R/lesson_4.Rmd b/2-Regression/4-Logistic/solution/R/lesson_4.Rmd
index 09aca861..aee56f4e 100644
--- a/2-Regression/4-Logistic/solution/R/lesson_4.Rmd
+++ b/2-Regression/4-Logistic/solution/R/lesson_4.Rmd
@@ -243,6 +243,27 @@ Now that we have an idea of the relationship between the binary categories of co
 
 ## 3. Build your model
 
+Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.
+
+It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
+
+```{r split_data}
+# Split data into 80% for training and 20% for testing
+set.seed(2056)
+pumpkins_split <- pumpkins_select %>%
+  initial_split(prop = 0.8)
+
+# Extract the data in each split
+pumpkins_train <- training(pumpkins_split)
+pumpkins_test <- testing(pumpkins_split)
+
+# Print out the first 5 rows of the training set
+pumpkins_train %>%
+  slice_head(n = 5)
+
+
+```
+
 🙌 We are now ready to train a model by fitting the training features to the training label (color).
 
 We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling i.e: encoding categorical variables into a set of integers. Just like `baked_pumpkins`, we create a `pumpkins_recipe` but do not `prep` and `bake` since it would be bundled into a workflow, which you will see in just a few steps from now.