Automated tuning process for the penalty parameter lambda, with built-in feature selection. Lambda directly controls the granularity of the segmentation: low values result in a fine segmentation, while high values result in a coarse one.

autotune(
  mfit,
  data,
  vars,
  target,
  max_ngrps = 15,
  hcut = 0.75,
  ignr_intr = NULL,
  pred_fun = NULL,
  lambdas = as.vector(outer(seq(1, 10, 0.1), 10^(-7:3))),
  nfolds = 5,
  strat_vars = NULL,
  glm_par = alist(),
  err_fun = mse,
  ncores = -1,
  out_pds = FALSE
)

Arguments

mfit

Fitted model object (e.g., a "gbm" or "randomForest" object).

data

Data frame containing the original training data.

vars

Character vector specifying the features in data to use.

target

String specifying the target (or response) variable to model.

max_ngrps

Integer specifying the maximum number of groups that each feature's values/levels are allowed to be grouped into.

hcut

Numeric in the range [0, 1] specifying the cut-off value for the normalized cumulative H-statistic over all two-way interactions between the features in vars, ordered from most to least important. Setting hcut = 0 considers only the single most important interaction, while hcut = 1 considers all possible two-way interactions. The special value hcut = -1 restricts the tuning to main effects only (no interactions).
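
As a small illustration of this selection rule (the interaction names and H-statistic values below are made up, and the snippet is not the package's internal code):

h_stats <- c('ageph:bm' = 0.40, 'bm:fuel' = 0.25, 'ageph:fuel' = 0.20, 'sex:use' = 0.15)
h_norm  <- sort(h_stats, decreasing = TRUE) / sum(h_stats)
cutoff  <- which(cumsum(h_norm) >= 0.75)[1]
names(h_norm)[seq_len(cutoff)]  # with hcut = 0.75, the three most important interactions are retained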

ignr_intr

Optional character vector specifying features to ignore when searching for meaningful interactions to incorporate in the GLM.

pred_fun

Optional prediction function to calculate feature effects for the model in mfit. Requires two arguments: object and newdata. See the documentation of pdp::partial for details, and the function gbm_fun in the Examples below.
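
For instance, a prediction function for a regression random forest could look as follows (the rf_fun helper is purely illustrative and not part of the package):

rf_fun <- function(object, newdata) {
  mean(predict(object, newdata = newdata))
}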

lambdas

Numeric vector with the possible lambda values to explore. The search grid is created automatically via lambda_grid such that it contains only those values of lambda that result in a unique grouping of the full set of features. A separate grid is generated for main and interaction effects, due to the difference in scale between the two types of effects.

nfolds

Integer for the number of folds in K-fold cross-validation.

strat_vars

Character (vector) specifying the feature(s) to use for stratified sampling. The default NULL implies no stratification is applied.

glm_par

Named list, constructed via alist, containing arguments to be passed on to glm. Examples are: family, weights or offset. Note that formula will be ignored as the GLM formula is determined by the specified target and the automatic feature selection in the tuning process.
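
For example, a Poisson frequency model with the exposure as an offset (mirroring the Examples section below, where expo denotes the exposure column in the data) can be specified as:

glm_par <- alist(family = poisson(link = 'log'),
                 offset = log(expo))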

err_fun

Error function to calculate the prediction errors on the validation folds. This must be an R function that outputs a single number and takes two vectors y_pred and y_true as input, containing the predicted and true target values respectively. An additional input vector w_case is allowed for the use of case weights in the error function; the weights are determined automatically from the weights field supplied to glm_par. A sketch of a custom error function is given after the list of predefined functions below. Examples already included in the package are:

mse

mean squared error loss function (default).

wgt_mse

weighted mean squared error loss function.

poi_dev

Poisson deviance loss function.

See err_fun for details on these predefined functions.
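
As a sketch, a custom error function with optional case weights could be defined as follows (the mae helper below is an illustration and not part of the package):

mae <- function(y_pred, y_true, w_case = NULL) {
  if (is.null(w_case)) w_case <- rep(1, length(y_true))
  sum(w_case * abs(y_pred - y_true)) / sum(w_case)
}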

ncores

Integer specifying the number of cores to use. The default ncores = -1 uses all available physical cores (not logical threads), as determined by parallel::detectCores(logical = FALSE).

out_pds

Logical indicating whether to add the calculated partial dependence (PD) effects for the selected features to the output list.

Value

List with the following elements:

slct_feat

named vector containing the selected features (names) and the optimal number of groups for each feature (values).

best_surr

the optimal GLM surrogate, which is fit to all observations in data. The segmented data can be obtained via the $data attribute of the GLM fit.
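
For example, assuming the tuning output is stored in tuned, the surrogate can be inspected with standard glm accessors (a sketch, not package code):

surro <- tuned$best_surr
head(surro$data)                          # training data with the segmented features
head(predict(surro, type = 'response'))   # surrogate predictions on the training data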

tune_main

the cross-validation results for the main effects as a tidy data frame. The column cv_err contains the cross-validated error, while the columns 1:nfolds contain the error on the validation folds.

tune_intr

the cross-validation results for the interaction effects, in the same format as tune_main.

pd_fx

List with the PD effects for the selected features (only present if out_pds = TRUE).

Examples

if (FALSE) {
data('mtpl_be')
features <- setdiff(names(mtpl_be), c('id', 'nclaims', 'expo', 'long', 'lat'))
set.seed(12345)
gbm_fit <- gbm::gbm(as.formula(paste('nclaims ~', paste(features, collapse = ' + '))),
                    distribution = 'poisson',
                    data = mtpl_be,
                    n.trees = 50,
                    interaction.depth = 3,
                    shrinkage = 0.1)
gbm_fun <- function(object, newdata) {
  mean(predict(object, newdata, n.trees = object$n.trees, type = 'response'))
}
gbm_fit %>% autotune(data = mtpl_be,
                     vars = c('ageph', 'bm', 'coverage', 'fuel', 'sex', 'fleet', 'use'),
                     target = 'nclaims',
                     hcut = 0.75,
                     pred_fun = gbm_fun,
                     lambdas = as.vector(outer(seq(1, 10, 1), 10^(-6:-2))),
                     nfolds = 5,
                     strat_vars = c('nclaims', 'expo'),
                     glm_par = alist(family = poisson(link = 'log'),
                                     offset = log(expo)),
                     err_fun = poi_dev,
                     ncores = -1)
}