I’ve been learning the Tidymodels framework for building Machine Learning models in R pioneered by Max Kuhn and Julia Silge. After spending a few weeks playing around with it, I can see the benefits of having a standardised modelling pipeline similar to sklearn in Python. In this post, I’d like to walk through a standard example of implementing the CatBoost algorithm using the Tidymodels framework.

CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. It is especially designed for datasets that have many categorical/ nominal features. It’s been shown to be on par with boosting methods like XGBoost. For more information about the algorithm you can checkout the CatBoost documentation in the references. There has already been an excellent CatBoost with Tidymodels tutorial by Roel M. Hogervorst on his blog in 2020 which I will link in the references section but the Tidymodels framework has added some new features since then so this post will demonstrate these additions. 

For those unfamiliar with the Tidymodels framework, I’ll link Julia Silge’s blog and the Tidymodels book which proved to be excellent learning resources for me. All the code used in this post is on GitHub. 

Getting Started – Installing Dependencies and Loading Packages

Before we look at any of the Tidymodels code, let’s use the renv package to load the dependencies I was using. After you’ve cloned the Github repo, run the following command to initialise renv environment for the R project renv::init(). Once you do this, you should get a message that states 

The project library is out of sync with the lockfile.
* Use `renv::restore()` to install packages recorded in the lockfile.

run the renv::restore() command to install the packages required, answer yes to any prompts R studio gives in the console and we are good to go.

Now we can open the catboost_workflow_ameshousing.R script and load the packages with minimal friction. Run the library commands to install the packages required for this walkthrough.

library(AmesHousing)
library(janitor)
library(tidyverse)
library(tidymodels)
library(doParallel)
library(finetune)
library(butcher)
library(knitr)
library(remotes)
library(devtools)

#remotes::install_github("curso-r/treesnip")
library(treesnip)
#devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
library(catboost)

Loading and Splitting Data

We’re going to use the Ames Housing data for this blog post. Our outcome variable, sale_price is continuous so our modelling constitutes a regression problem. We want to use all the remaining variables we have available in the dataset as predictors to help us predict sale_price, we have a mix of categorical and continuous predictors which makes this dataset a good candidate for the CatBoost algorithm.

Moreover, we will partition a small sample of this data to act as subsequent or future data. We do this because we want to later show that a trained workflow can be saved and loaded to predict on data that it hasn’t been exposed to in the model training process. This simulates a production environment where we want a deployed model to predict on data that will be available subsequent to the model building phase. 

Subsequently, we split the data available into training and testing sets ready for the modelling process. We keep 80% of the data for training and 20% for testing.

# Load Data ----
set.seed(123)
ames_data <- make_ames() %>% 
  janitor::clean_names()

## Partition the dataframe to keep some data to act as new data
ames_future <- tail(ames_data, 100)
ames_current <- head(ames_data, 2830)

# Split Data ----
ames_split <- initial_split(ames_current, prop = 0.8, strata = sale_price)

ames_train <- training(ames_split)
ames_test <- testing(ames_split)

Data Preprocessing – Creating a Recipe

Recipes allow us to apply preprocessing steps in sequence to our training data. This is a compact way of preparing data for modelling.

# Create Recipe ----
ames_recipe <- recipe(sale_price ~ ., data = ames_train) %>% 
  step_other(all_nominal_predictors(), threshold = 0.01) %>% 
  step_nzv(all_nominal_predictors())

All recipe specifications begin with a formula which defines the predictors and outcome variable as well as the dataset upon which the subsequent preprocessing steps will be applied. In our case, our outcome variable is sale_price and we will take all other variables in the dataset as predictors. We apply these steps on the ames_train data partition.

The step_* functions are the preprocessing we apply to the dataset in sequence. All step_* functions first take a function like all_nominal_predictors() which tells the recipe object the type of variables in the dataset upon which to apply the present preprocessing step. step_other takes all the nominal/categorical predictors in the data and rolls up all categories which are fewer than 1% into a new category called ‘other’. step_nzv will remove nominal variables that are highly sparse or unbalanced. 

Defining Model and Workflow

Next, we define the CatBoost model specification we want to train and then connect our defined model and recipe in a single object called a workflow. Note, defining a model in R doesn’t train the model but merely describes it’s structure; Python users will be familiar with this type of model specification but this is new in R where usually we and train the model in one go. 

# Define Model -----
## Add catboost
#treesnip::add_boost_tree_catboost()

catboost_spec <- boost_tree(
  trees = 1000,
  min_n = tune(),
  learn_rate = tune(),
  tree_depth = tune()) %>% 
  set_engine("catboost", loss_function = "RMSE", nthread = 6) %>% 
  set_mode("regression")

## Constrain the learn rate to be between 0.001 and 0.005
catboost_param <- catboost_spec %>% 
  parameters() %>% 
  update(learn_rate = threshold(c(0.001, 0.005)))

In the code snippet above, I’ve initialised the CatBoost model with some hyperparameters, the tune() function acts as a placeholder which tells R to perform hyperparameter tuning while model training for the respective hyperparameter. The set_engine function defines the model type and loss function we want to use and the set_mode function tells R that we are using the CatBoost model for regression as opposed to classification. 

We want to set a constraint on the learn_rate parameter to be between the range [0.001, 0.005] so we use the update function to do this. There is no particular reason for this constraint other than to illustrate how this can be done for a given set of hyperparameters in Tidymodels. 

We build a workflow that connects our recipe and model in a single object, this can be thought of as a modelling pipeline where raw data is fed in and it is cleaned, processed and modelled by calling functions held in a single object.

# Define Workflow to connect Recipe and Model ----
catboost_workflow <- workflow() %>% 
  add_recipe(ames_recipe) %>% 
  add_model(catboost_spec)

Training the Model

Now, we train and tune the model we’ve specified on our training data.

# Train and Tune Model ----
## Define a random grid for hyperparameters to vary over
set.seed(123)

catboost_grid <- grid_max_entropy(
  catboost_param,
  size = 10
)

## Make Cross Validation Folds
set.seed(123)
ames_folds <- vfold_cv(data = ames_train, v = 5)

## Tune Model using Parallel Processing
cores <- parallel::detectCores(logical = FALSE)
cl <- makeForkCluster(cores - 1)  
doParallel::registerDoParallel(cl) # Register Backend

set.seed(123)
catboost_tuned <- catboost_workflow %>% 
  tune_grid(resamples = ames_folds, 
            grid = catboost_grid,
            control = control_grid(save_pred = TRUE, verbose = TRUE),
            metrics = metric_set(rmse, rsq, mae))

Firstly, we specify a grid over which the CatBoost tuning parameters can vary. There are various grid construction methods Tidymodels affords and it’s worth playing around with them to see what a good set of hyperparameters would be.

Next, we make 5 cross validation folds out of our training data to fit the model. I’m not going to go over cross validation in the post but cross validation with Tidymodels has the benefit of applying preprocessing recipes to analysis folds only and not on the assessment folds. This means that as the model is being trained, preprocessing is only being performed on the portion of data that is used to train the model (analysis set) but not on the portion of data left out which will be used for model assessment (assessment set). This gives a more accurate representation of model performance. Not doing this is a common pitfall in model fitting and Tidymodels takes care of this for you. 

Finally, we use parallel processing to train the multiple candidate models using cross validation. I’ve used forking in my example which works well for Mac; Windows users may consider using the makePSOCKcluster function instead. The tune_grid function applies the model workflow on the cross validation folds and tunes over the hyperparameters specified in the catboost_grid. To track the performance of each model, we define a metric set of RMSE (Root Mean Square Error), R^2 and MAE (Mean Absolute Error). For each model fit in training, Tidymodels will report these 3 metrics which we can use for comparison. 

Model Performance and Best Model

We can view the performance of all the models fit in a Tibble or via plots.

## View performance metrics across all hyperparameter permutations
catboost_tuned %>% 
  collect_metrics()

# Plot for tuning parameter performance ----
## RMSE
catboost_tuned %>% 
  show_best(metric = "rmse", n = 10) %>% 
  tidyr::pivot_longer(min_n:learn_rate, names_to="variable",values_to="value" ) %>% 
  ggplot(mapping = aes(value, mean)) +
  geom_line(alpha=1/2)+
  geom_point()+
  facet_wrap(~variable, scales = "free")+
  ggtitle("Best parameters for RMSE")

## MAE
catboost_tuned %>% 
  show_best(metric = "mae", n = 10) %>% 
  tidyr::pivot_longer(min_n:learn_rate, names_to="variable", values_to="value" ) %>% 
  ggplot(mapping = aes(value,mean)) +
  geom_line(alpha=1/2)+
  geom_point()+
  facet_wrap(~variable, scales = "free")+
  ggtitle("Best parameters for MAE")

Now that we have candidate models and their performance metrics, we want to select a best model in terms of performance and we want to fit that to the entire training set. This is the model we will use to make predictions. 

# Finalise Model ----
## Update workflow with the best model found
catboost_best_model <- catboost_tuned %>% 
  select_best(metric = "rmse")

final_catboost_workflow <- catboost_workflow %>% 
  finalize_workflow(catboost_best_model)

## Fit the final model to all the training data
final_catboost_model <- final_catboost_workflow %>% 
  fit(data = ames_train)

The select_best function looks at the tibble of candidate model results in catboost_tuned and extracts the hyperparameters of the best model according to a supplied performance metric – the metric supplied here is RMSE. 

Next, we finalise the workflow by updating the model inside with the hyperparameters of the best model. Essentially we are defining the workflow with a single model where we freeze the hyperparameter to be the optimal hyperparameters found via cross validation. Finally, we fit the final workflow on the entire training data. 

Saving Workflows

Once we have a trained workflow we want to be able to save and load it as an R object. This will be useful if we are pushing a model to production, we can instead push a workflow and call our preprocessing and model in one go. Before we save the workflow object, we perform a last_fit, which fits a workflow on the training data and predicts once on the test set.

# Fit model to test data ----
catboost_fit_final <- final_catboost_model %>% 
  last_fit(ames_split)

Before we save the workflow, we want to remove any extraneous elements in the workflow to make the object size as small as possible. Luckily, the butcher() function from the butcher library allows us to make the workflow smaller.

# Save Workflow ----
## Extract final fitted workflow
catboost_wf_model <- catboost_fit_final$.workflow[[1]]

## Butcher Workflow to reduce extraneous elements in workflow
catboost_wf_model_reduced <- butcher(catboost_wf_model)

## Compute difference between butchered and non butchered workflow objects
lobstr::obj_size(catboost_wf_model) # 5,709,912 B
lobstr::obj_size(catboost_wf_model_reduced) # 3,170,152 B

## Save Model as RDS object
saveRDS(catboost_wf_model_reduced, file = "catboost_saved_model.rds")

The code above extracts the fitted workflow from the last fit object and applies the butcher function. We can see that the object size has reduced slightly from 5,709,912 bytes to 3,170,152 bytes. Once we’ve reduced the workflow size, we can save the workflow object as a RDS object. 

Loading Workflows

Let’s pretend that some time has elapsed since we’ve built the CatBoost on the AmesHousing dataset. We have new data come in which is held in the ames_future dataframe. To generate predictions on this dataset, we can load our previously trained workflow. 

# Load Workflow and predict on unseen data ----
catboost_loaded_wf <- readRDS(file = "catboost_saved_model.rds")

## Predict workflow on ames_future data
catboost_loaded_wf %>% 
  augment(ames_future) 

That is the end of the Tidymodels with CatBoost journey! The full script is in the linked Github repository and replicated below. 

Full Code

library(AmesHousing)
library(janitor)
library(tidyverse)
library(tidymodels)
library(doParallel)
library(finetune)
library(butcher)
library(knitr)
library(remotes)
library(devtools)

#remotes::install_github("curso-r/treesnip")
library(treesnip)
#devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
library(catboost)


# Load Data ----
set.seed(123)
ames_data <- make_ames() %>% 
  janitor::clean_names()

## Partition the dataframe to keep some data to act as new data
ames_future <- tail(ames_data, 100)
ames_current <- head(ames_data, 2830)

# Split Data ----
ames_split <- initial_split(ames_current, prop = 0.8, strata = sale_price)

ames_train <- training(ames_split)
ames_test <- testing(ames_split)

# Create Recipe ----
ames_recipe <- recipe(sale_price ~ ., data = ames_train) %>% 
  step_other(all_nominal_predictors(), threshold = 0.01) %>% 
  step_nzv(all_nominal_predictors())

# Define Model -----
## Add catboost
#treesnip::add_boost_tree_catboost()

catboost_spec <- boost_tree(
  trees = 1000,
  min_n = tune(),
  learn_rate = tune(),
  tree_depth = tune()) %>% 
  set_engine("catboost", loss_function = "RMSE", nthread = 6) %>% 
  set_mode("regression")

## Constrain the learn rate to be between 0.001 and 0.005
catboost_param <- catboost_spec %>% 
  parameters() %>% 
  update(learn_rate = threshold(c(0.001, 0.005)))

# Define Workflow to connect Recipe and Model ----
catboost_workflow <- workflow() %>% 
  add_recipe(ames_recipe) %>% 
  add_model(catboost_spec)

# Train and Tune Model ----
## Define a random grid for hyperparameters to vary over
set.seed(123)

catboost_grid <- grid_max_entropy(
  catboost_param,
  size = 10
)

## Make Cross Validation Folds
set.seed(123)
ames_folds <- vfold_cv(data = ames_train, v = 5)

## Tune Model using Parallel Processing
cores <- parallel::detectCores(logical = FALSE)
cl <- makeForkCluster(cores - 1)  
doParallel::registerDoParallel(cl) # Register Backend

set.seed(123)
catboost_tuned <- catboost_workflow %>% 
  tune_grid(resamples = ames_folds, 
            grid = catboost_grid,
            control = control_grid(save_pred = TRUE, verbose = TRUE),
            metrics = metric_set(rmse, rsq, mae))


## View performance metrics across all hyperparameter permutations
catboost_tuned %>% 
  collect_metrics()

# Plot for tuning parameter performance ----
## RMSE
catboost_tuned %>% 
  show_best(metric = "rmse", n = 10) %>% 
  tidyr::pivot_longer(min_n:learn_rate, names_to="variable",values_to="value" ) %>% 
  ggplot(mapping = aes(value, mean)) +
  geom_line(alpha=1/2)+
  geom_point()+
  facet_wrap(~variable, scales = "free")+
  ggtitle("Best parameters for RMSE")

## MAE
catboost_tuned %>% 
  show_best(metric = "mae", n = 10) %>% 
  tidyr::pivot_longer(min_n:learn_rate, names_to="variable", values_to="value" ) %>% 
  ggplot(mapping = aes(value,mean)) +
  geom_line(alpha=1/2)+
  geom_point()+
  facet_wrap(~variable, scales = "free")+
  ggtitle("Best parameters for MAE")

# Finalise Model ----
## Update workflow with the best model found
catboost_best_model <- catboost_tuned %>% 
  select_best(metric = "rmse")

final_catboost_workflow <- catboost_workflow %>% 
  finalize_workflow(catboost_best_model)

## Fit the final model to all the training data
final_catboost_model <- final_catboost_workflow %>% 
  fit(data = ames_train)

# Fit model to test data ----
catboost_fit_final <- final_catboost_model %>% 
  last_fit(ames_split)

# Model Evaluation ----
## Metrics on test set
catboost_fit_final %>% 
  collect_metrics()

## Predictions on test set
catboost_fit_final %>% 
  collect_predictions() %>% 
  dplyr::select(starts_with(".pred")) %>% 
  bind_cols(ames_test)

## Residuals on training set
final_catboost_model %>% 
  predict(new_data = ames_train) %>% 
  bind_cols(ames_train) %>% 
  mutate(residuals = sale_price - .pred) %>% 
  ggplot(mapping = aes(x = .pred, y = residuals)) + 
  geom_point()

## Residuals on test set
catboost_fit_final %>% 
  collect_predictions() %>% 
  mutate(residuals = sale_price - .pred) %>% 
  ggplot(mapping = aes(x = .pred, y = residuals)) + 
  geom_point()

# Yardstick prediction metrics ----
## Training set
final_catboost_model %>% 
  predict(new_data = ames_train) %>% 
  bind_cols(ames_train) %>% 
  yardstick::metrics(sale_price, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ",")) %>%
  knitr::kable()

## Test set
catboost_fit_final %>% 
  collect_predictions() %>% 
  yardstick::metrics(sale_price, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ",")) %>%
  knitr::kable()

# Save Workflow ----
## Extract final fitted workflow
catboost_wf_model <- catboost_fit_final$.workflow[[1]]

## Butcher Workflow to reduce extraneous elements in workflow
catboost_wf_model_reduced <- butcher(catboost_wf_model)

## Compute difference between butchered and non butchered workflow objects
lobstr::obj_size(catboost_wf_model)
lobstr::obj_size(catboost_wf_model_reduced)

## Save Model as RDS object
saveRDS(catboost_wf_model_reduced, file = "catboost_saved_model.rds")

# Load Workflow and predict on unseen data ----
catboost_loaded_wf <- readRDS(file = "catboost_saved_model.rds")

## Predict workflow on ames_future data
catboost_loaded_wf %>% 
  augment(ames_future) 

# Stop Cluster ----
parallel::stopCluster(cl)

References

  1. https://github.com/JunaidMB/tidymodels_series
  2. https://blog.rmhogervorst.nl/blog/2020/08/28/how-to-use-catboost-with-tidymodels/
  3. https://juliasilge.com/blog/
  4. https://catboost.ai/docs
  5. https://rstudio.github.io/renv/articles/collaborating.html
  6. https://www.tmwr.org/

Leave a Reply