This function acts as a user-friendly interface to build a random forest based on individual rpart trees.

rforest(
  formula,
  data,
  method,
  weights = NULL,
  parms = NULL,
  control = NULL,
  ncand,
  ntrees,
  subsample = 1,
  track_oob = FALSE,
  keep_data = FALSE,
  red_mem = FALSE
)

Arguments

formula

object of the class formula with a symbolic description of the model, of the form response ~ var1 + var2 + var3, without interaction terms. Do not apply transformation functions to the response inside the formula; add the transformed variable to the data beforehand instead. Two exceptions exist: see method = 'poisson' and method = 'exp' below.

data

data frame containing the training data observations.

method

string specifying the type of forest to build. Options are:

'class'

classification forest (OOB error tracking is only implemented for binary classification).

'anova'

standard regression forest with a squared error loss.

'poisson'

Poisson regression forest for count data. The left-hand side of formula can be specified as cbind(observation_time, number_of_events) to include time exposures.
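The two-column response can be illustrated on a single tree. The sketch below uses plain rpart, which accepts the same cbind(time, events) left-hand side for method = "poisson"; the data frame toy and its columns expo, age and nclaims are hypothetical and simulated only for this illustration.

```r
library(rpart)

set.seed(1)
n <- 500
# Hypothetical toy data: exposure time, one covariate, simulated event counts
toy <- data.frame(
  expo = runif(n, 0.2, 1),
  age  = sample(18:80, n, replace = TRUE)
)
toy$nclaims <- rpois(n, lambda = toy$expo * ifelse(toy$age < 30, 0.4, 0.1))

# Two-column response: observation time first, number of events second
fit <- rpart(cbind(expo, nclaims) ~ age, data = toy, method = "poisson",
             control = rpart.control(cp = 0.001))

# For a Poisson tree, predict() returns the estimated event rate
# per unit of observation time
rates <- predict(fit, newdata = toy)
```

The same left-hand side would be passed in formula when calling rforest with method = 'poisson'.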

'gamma'

gamma regression forest for strictly positive long-tailed data.

'lognormal'

lognormal regression forest for strictly positive long-tailed data.

'exp'

exponential scaling for survival data. The left-hand side of formula is specified as Surv(observation_time, event_indicator) to include time exposures.

weights

optional name of the variable in data to use as case weights. It can be given either as a string or as the bare variable name.

parms

optional parameters for the splitting function, see rpart for the details and allowed options.

control

list of options that control the fitting details of the rpart trees. Use rpart.control to set this up.
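As a minimal sketch, a control list can be set up with rpart.control as follows. The specific values are illustrative assumptions, not recommended settings.

```r
library(rpart)

# Illustrative values only; tune them for the data at hand
ctrl <- rpart.control(
  minsplit  = 20,  # minimum observations in a node before a split is attempted
  minbucket = 7,   # minimum observations in any terminal node
  cp        = 0,   # no complexity threshold, so splits are not rejected for small gains
  maxdepth  = 5,   # maximum depth of any node in the tree
  xval      = 0    # skip rpart's internal cross-validation
)
```

The resulting list would then be supplied as control = ctrl.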

ncand

integer specifying the number of randomly chosen variable candidates to consider at each node to find the optimal split.

ntrees

integer specifying the number of trees in the ensemble.

subsample

numeric in the range [0,1]. Each tree in the ensemble is built on a random sample of the data of size subsample * nrow(data).
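The per-tree sample size can be sketched in base R. The text above does not state whether rows are drawn with or without replacement; the sketch assumes sampling with replacement, and the names in_bag and out_bag are hypothetical.

```r
set.seed(42)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
subsample <- 0.75

# Number of rows drawn for a single tree
n_sub <- floor(subsample * nrow(data))

# Assumption: rows are drawn with replacement (bootstrap-style sampling)
in_bag <- sample(nrow(data), size = n_sub, replace = TRUE)

# Rows never drawn are out-of-bag for this tree
out_bag <- setdiff(seq_len(nrow(data)), in_bag)

tree_data <- data[in_bag, ]
```

Under this assumption, subsample = 1 corresponds to a classical bootstrap sample, which leaves roughly 37% of the rows out-of-bag for each tree.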

track_oob

boolean to indicate whether the out-of-bag errors should be tracked (TRUE) or not (FALSE). This option is not implemented for method = 'exp' or multi-class classification. For the other methods, these errors are tracked:

'class'

Matthews correlation coefficient for binary classification.

'anova'

mean squared error.

'poisson'

Poisson deviance.

'gamma'

gamma deviance.

'lognormal'

mean squared error.

All these errors are evaluated in a weighted version if weights are supplied.
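The error measures listed above can be written out in a few lines of base R. These are the standard textbook definitions; the function names are hypothetical, and normalising by the sum of the weights is an assumption about how the weighted versions are formed, not a statement of the package's implementation.

```r
# Weighted mean squared error
wmse <- function(y, pred, w = rep(1, length(y))) {
  sum(w * (y - pred)^2) / sum(w)
}

# Weighted mean Poisson deviance; y * log(y / pred) is taken as 0 when y == 0
poisson_dev <- function(y, pred, w = rep(1, length(y))) {
  term <- ifelse(y == 0, 0, y * log(y / pred))
  2 * sum(w * (term - (y - pred))) / sum(w)
}

# Weighted mean gamma deviance (requires y > 0 and pred > 0)
gamma_dev <- function(y, pred, w = rep(1, length(y))) {
  2 * sum(w * ((y - pred) / pred - log(y / pred))) / sum(w)
}

# Matthews correlation coefficient from weighted confusion counts (labels 0/1)
mcc <- function(y, pred, w = rep(1, length(y))) {
  tp <- sum(w[y == 1 & pred == 1])
  tn <- sum(w[y == 0 & pred == 0])
  fp <- sum(w[y == 0 & pred == 1])
  fn <- sum(w[y == 1 & pred == 0])
  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) 0 else (tp * tn - fp * fn) / denom
}
```

Note that the Matthews correlation coefficient is a score where higher is better, whereas the squared error and the deviances are losses where lower is better.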

keep_data

boolean to indicate whether the data should be saved with the fit. Setting this to TRUE is not advised for large data sets.

red_mem

boolean to indicate whether to reduce the memory footprint of the rpart trees by eliminating non-essential elements from the fits. It is advised to set this to TRUE for large values of ntrees.

Value

object of the class rforest, which is a list containing the following elements:

trees

list of length equal to ntrees, containing the individual rpart trees.

oob_error

numeric vector of length equal to ntrees, containing the OOB error at each iteration (if track_oob = TRUE).

data

the training data (if keep_data = TRUE).
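To illustrate how a list of rpart trees such as the trees element can be used, the sketch below builds a simplified stand-in with plain rpart on bootstrap samples and averages the per-tree predictions. It is not the package's implementation: stock rpart has no per-node candidate sampling, so ncand is not reproduced, and the toy data and all object names are hypothetical.

```r
library(rpart)

set.seed(123)
n <- 300
# Hypothetical toy regression data
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 2 * toy$x1 + rnorm(n, sd = 0.1)

ntrees <- 10

# Simplified stand-in for 'trees': one rpart fit per bootstrap sample
trees <- lapply(seq_len(ntrees), function(i) {
  idx <- sample(n, size = n, replace = TRUE)
  rpart(y ~ x1 + x2, data = toy[idx, ], method = "anova")
})

# One column of predictions per tree, then average across trees
pred_matrix   <- sapply(trees, predict, newdata = toy)
ensemble_pred <- rowMeans(pred_matrix)
```

Because each element is an ordinary rpart object, the usual rpart tools such as predict and print can be applied to individual trees.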