Parallelize 'caret' functions


The futurize package allows you to easily turn sequential code into parallel code by piping a function call to futurize(). Easy!

TL;DR

library(futurize)
plan(multisession)
library(caret)

ctrl <- trainControl(method = "cv", number = 10)
model <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl) |> futurize()

Introduction

This vignette demonstrates how to use futurize() to parallelize caret functions such as train().

The caret package provides a rich set of machine-learning tools with a unified API. The train() function fits models using cross-validation or bootstrap resampling, making it an excellent candidate for parallelization.

Example: Training a random forest with cross-validation

The train() function fits models across multiple resampling iterations:

library(caret)

## Set up 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

## Train a random forest model
model <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)

Here train() evaluates sequentially, but we can easily make it evaluate in parallel by piping to futurize():

library(futurize)
library(caret)

ctrl <- trainControl(method = "cv", number = 10)
model <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl) |> futurize()

This distributes the cross-validation folds across the available parallel workers, provided that we have set up a parallel backend first, e.g.

plan(multisession)
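By default, multisession uses the number of cores available on the machine. If you want more control, you can specify the number of workers explicitly, check how many workers are active, and revert to sequential processing when you are done. A minimal sketch using the future package's plan() and nbrOfWorkers() functions:

```r
library(future)

## Launch a fixed number of local background R sessions
plan(multisession, workers = 2)
nbrOfWorkers()  ## 2

## Revert to sequential processing when done
plan(sequential)
nbrOfWorkers()  ## 1
```

Setting plan() once at the top of a script is usually all that is needed; the rest of the code does not have to change.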

The built-in multisession backend parallelizes on your local computer and works on all operating systems. There are [other parallel backends] to choose from, including alternatives that parallelize locally as well as backends that distribute work across remote machines, e.g.

plan(future.mirai::mirai_multisession)

and

plan(future.batchtools::batchtools_slurm)
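Whichever backend you pick, the plan() call is the only line that differs; the downstream code is backend-agnostic. A minimal sketch using the future package directly (no caret needed) to illustrate that the same future-based code runs unchanged under any plan:

```r
library(future)

## Choose a backend once; swap for multisession, mirai, batchtools, ...
plan(sequential)

## The same code runs unchanged regardless of the plan above
f <- future({
  sum(1:10)  ## stand-in for an expensive computation
})
value(f)  ## 55
```

This is what makes piping to futurize() safe across backends: the parallelization strategy is decided by plan(), not by the training code itself.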

Supported Functions

The following caret functions are supported by futurize():