Build a random forest — rforest • distRforest

This function acts as a user-friendly interface to build a random forest based on individual rpart trees.

rforest(
  formula,
  data,
  method,
  weights = NULL,
  parms = NULL,
  control = NULL,
  ncand,
  ntrees,
  subsample = 1,
  track_oob = FALSE,
  keep_data = FALSE,
  red_mem = FALSE
)
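A minimal call might look like the sketch below. It assumes the `distRforest` package is installed and uses the built-in `mtcars` data purely as a stand-in training set; the variable names `train`, `subsample` and `n_per_tree` are illustrative, and `control` is left at its default. The call is guarded so the snippet still runs when the package is absent.

```r
# Toy regression data shipped with base R; any data frame would do.
train <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]

# Documented relation: each tree is built on subsample * nrow(data) rows.
subsample <- 0.75
n_per_tree <- subsample * nrow(train)

fit <- NULL
if (requireNamespace("distRforest", quietly = TRUE)) {
  # Assumption: default `control`, `parms` and `weights` are acceptable here.
  fit <- distRforest::rforest(
    formula   = mpg ~ cyl + disp + hp + wt,  # no interactions, untransformed response
    data      = train,
    method    = "anova",                     # squared error regression forest
    ncand     = 2,                           # candidate variables tried per node
    ntrees    = 20,                          # number of trees in the ensemble
    subsample = subsample,
    track_oob = TRUE                         # record the OOB mean squared error
  )
}
```

With 32 rows in `mtcars` and `subsample = 0.75`, each tree would be grown on 24 observations.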

Arguments

formula	object of the class `formula` with a symbolic description of the form `response ~ var1 + var2 + var3` without interactions. Do not apply transformation functions to the response inside the formula; instead, add the transformed variable to `data` beforehand. There are two exceptions: see `method = 'poisson'` and `method = 'exp'` below.
data	data frame containing the training data observations.
method	string specifying the type of forest to build. Options are: 'class': classification forest (OOB error tracking is only implemented for binary classification); 'anova': standard regression forest with a squared error loss; 'poisson': Poisson regression forest for count data, where the left-hand side of `formula` can be specified as `cbind(observation_time, number_of_events)` to include time exposures; 'gamma': gamma regression forest for strictly positive long-tailed data; 'lognormal': lognormal regression forest for strictly positive long-tailed data; 'exp': exponential scaling for survival data, where the left-hand side of `formula` is specified as `Surv(observation_time, event_indicator)` to include time exposures.
weights	optional name of the variable in `data` to use as case weights. It can be supplied either as a string or as the bare variable name.
parms	optional parameters for the splitting function, see `rpart` for the details and allowed options.
control	list of options that control the fitting details of the `rpart` trees. Use `rpart.control` to set this up.
ncand	integer specifying the number of randomly chosen variable candidates to consider at each node to find the optimal split.
ntrees	integer specifying the number of trees in the ensemble.
subsample	numeric in the range [0,1]. Each tree in the ensemble is built on randomly sampled data of size `subsample * nrow(data)`.
track_oob	boolean to indicate whether the out-of-bag errors should be tracked (`TRUE`) or not (`FALSE`). This option is not implemented for `method = 'exp'` or multi-class classification. For the other methods, these errors are tracked: 'class': Matthews correlation coefficient for binary classification; 'anova': mean squared error; 'poisson': Poisson deviance; 'gamma': gamma deviance; 'lognormal': mean squared error. All these errors are evaluated in a weighted version if `weights` are supplied.
keep_data	boolean to indicate whether the `data` should be saved with the fit. Setting this to `TRUE` is not advised for large data sets.
red_mem	boolean to indicate whether to reduce the memory footprint of the `rpart` trees by eliminating non-essential elements from the fits. It is advised to set this to `TRUE` for large values of `ntrees`.
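To make the error measures listed under `track_oob` concrete, the base R sketch below writes out textbook definitions of the weighted mean squared error, the Poisson and gamma deviances, and the (unweighted) Matthews correlation coefficient. The function names are hypothetical, and normalising by the sum of the weights is an assumption; the package's internal implementation may scale or weight these quantities differently.

```r
# Weighted mean squared error ('anova' and 'lognormal').
weighted_mse <- function(y, pred, w = rep(1, length(y))) {
  sum(w * (y - pred)^2) / sum(w)
}

# Weighted mean Poisson deviance ('poisson').
poisson_deviance <- function(y, pred, w = rep(1, length(y))) {
  # By convention y * log(y / pred) is taken as 0 when y == 0.
  log_term <- ifelse(y == 0, 0, y * log(y / pred))
  2 * sum(w * (log_term - (y - pred))) / sum(w)
}

# Weighted mean gamma deviance ('gamma'); requires y > 0 and pred > 0.
gamma_deviance <- function(y, pred, w = rep(1, length(y))) {
  2 * sum(w * ((y - pred) / pred - log(y / pred))) / sum(w)
}

# Matthews correlation coefficient for 0/1 labels ('class', binary case).
# Shown unweighted for brevity.
mcc <- function(actual, predicted) {
  tp <- as.numeric(sum(actual == 1 & predicted == 1))
  tn <- as.numeric(sum(actual == 0 & predicted == 0))
  fp <- as.numeric(sum(actual == 0 & predicted == 1))
  fn <- as.numeric(sum(actual == 1 & predicted == 0))
  (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
}
```

Note the different orientations: the deviances and the mean squared error are zero for a perfect fit and grow as the fit worsens, whereas the Matthews correlation coefficient ranges from -1 to 1 with 1 indicating perfect classification.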

Value

object of the class `rforest`, which is a list containing the following elements:

trees

list of length equal to `ntrees`, containing the individual `rpart` trees.

oob_error

numeric vector of length equal to `ntrees`, containing the OOB error at each iteration (if `track_oob = TRUE`).

data

the training data (if `keep_data = TRUE`).
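As an illustration of how the returned list could be inspected, the sketch below defines a small helper that reads the three documented elements. The helper name `summarise_forest` is hypothetical, and `mock_fit` is a hand-built stand-in with the documented shape rather than a real fit. The helper also assumes that `oob_error` and `data` are `NULL` or absent when `track_oob` or `keep_data` is `FALSE`, which the documentation does not state explicitly.

```r
# Summarise a list shaped like the documented rforest return value.
summarise_forest <- function(fit) {
  oob <- fit$oob_error
  list(
    n_trees   = length(fit$trees),                 # should equal ntrees
    has_oob   = !is.null(oob),                     # TRUE when OOB errors were tracked
    final_oob = if (length(oob) > 0) oob[length(oob)] else NA_real_,
    has_data  = !is.null(fit$data)                 # TRUE when the data was kept
  )
}

# Stand-in object for illustration only: three empty "trees" and made-up errors.
mock_fit <- structure(
  list(trees = vector("list", 3), oob_error = c(0.90, 0.70, 0.65), data = NULL),
  class = "rforest"
)

forest_summary <- summarise_forest(mock_fit)
```

Looking at how `oob_error` evolves across iterations is one way to judge whether adding more trees still improves the ensemble.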