Automated tuning process for the penalty parameter lambda, with built-in feature selection. Lambda directly controls the granularity of the segmentation: low values result in a fine segmentation, while high values result in a coarse one.

autotune(
  mfit,
  data,
  vars,
  target,
  max_ngrps = 15,
  hcut = 0.75,
  ignr_intr = NULL,
  pred_fun = NULL,
  lambdas = as.vector(outer(seq(1, 10, 0.1), 10^(-7:3))),
  nfolds = 5,
  strat_vars = NULL,
  glm_par = alist(),
  err_fun = mse,
  ncores = -1,
  out_pds = FALSE
)

Arguments

mfit

Fitted model object (e.g., a "gbm" or "randomForest" object).

data

Data frame containing the original training data.

vars

Character vector specifying the features in data to use.

target

String specifying the target (or response) variable to model.

max_ngrps

Integer specifying the maximum number of groups that each feature's values/levels are allowed to be grouped into.

hcut

Numeric in the range [0, 1] specifying the cut-off value for the normalized cumulative H-statistic over all two-way interactions between the features in vars, ordered from most to least important. Setting hcut = 0 considers only the single most important interaction, while hcut = 1 considers all possible two-way interactions. The special value hcut = -1 restricts the tuning to main effects only (no interactions).
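
As a small illustration of this selection rule (the interaction names and H-statistic values below are made up, and the snippet is not the package's internal code):

h_stats <- c('ageph:bm' = 0.40, 'bm:fuel' = 0.25, 'ageph:fuel' = 0.20, 'sex:use' = 0.15)
h_norm  <- sort(h_stats, decreasing = TRUE) / sum(h_stats)
cutoff  <- which(cumsum(h_norm) >= 0.75)[1]
names(h_norm)[seq_len(cutoff)]  # with hcut = 0.75, the three most important interactions are retained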

ignr_intr

Optional character vector specifying features to ignore when searching for meaningful interactions to incorporate in the GLM.

pred_fun

Optional prediction function to calculate feature effects for the model in mfit. Requires two arguments: object and newdata. See the documentation of pdp::partial for details, and the function gbm_fun in the Examples below.
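
For instance, a prediction function for a regression random forest could look as follows (the rf_fun helper is purely illustrative and not part of the package):

rf_fun <- function(object, newdata) {
  mean(predict(object, newdata = newdata))
}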

lambdas

Numeric vector with the possible lambda values to explore. The search grid is created automatically via lambda_grid such that it contains only those values of lambda that result in a unique grouping of the full set of features. A separate grid is generated for main and interaction effects, due to the difference in scale between the two types of effects.

nfolds

Integer for the number of folds in K-fold cross-validation.

strat_vars

Character (vector) specifying the feature(s) to use for stratified sampling. The default NULL implies no stratification is applied.

glm_par

Named list, constructed via alist, containing arguments to be passed on to glm. Examples are: family, weights or offset. Note that formula will be ignored as the GLM formula is determined by the specified target and the automatic feature selection in the tuning process.
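
For example, a Poisson frequency model with the exposure as an offset (mirroring the Examples section below, where expo denotes the exposure column in the data) can be specified as:

glm_par <- alist(family = poisson(link = 'log'),
                 offset = log(expo))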

err_fun

Error function to calculate the prediction errors on the validation folds. This must be an R function that outputs a single number and takes two vectors y_pred and y_true as input, containing the predicted and true target values respectively. An additional input vector w_case is allowed for the use of case weights in the error function; the weights are determined automatically from the weights field supplied to glm_par. A sketch of a custom error function is given after the list of predefined functions below. Examples already included in the package are:

mse

mean squared error loss function (default).

wgt_mse

weighted mean squared error loss function.

poi_dev

Poisson deviance loss function.

See err_fun for details on these predefined functions.
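
As a sketch, a custom error function with optional case weights could be defined as follows (the mae helper below is an illustration and not part of the package):

mae <- function(y_pred, y_true, w_case = NULL) {
  if (is.null(w_case)) w_case <- rep(1, length(y_true))
  sum(w_case * abs(y_pred - y_true)) / sum(w_case)
}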

ncores

Integer specifying the number of cores to use. The default ncores = -1 uses all available physical cores (not logical threads), as determined by parallel::detectCores(logical = FALSE).

out_pds

Logical indicating whether to add the calculated partial dependence (PD) effects for the selected features to the output list.

Value

List with the following elements:

slct_feat

named vector containing the selected features (names) and the optimal number of groups for each feature (values).

best_surr

the optimal GLM surrogate, which is fit to all observations in data. The segmented data can be obtained via the $data attribute of the GLM fit.
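
For example, assuming the tuning output is stored in tuned, the surrogate can be inspected with standard glm accessors (a sketch, not package code):

surro <- tuned$best_surr
head(surro$data)                          # training data with the segmented features
head(predict(surro, type = 'response'))   # surrogate predictions on the training data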

tune_main

the cross-validation results for the main effects as a tidy data frame. The column cv_err contains the cross-validated error, while the columns 1:nfolds contain the error on the validation folds.

tune_intr

the cross-validation results for the interaction effects, in the same format as tune_main.

pd_fx

List with the PD effects for the selected features (only present if out_pds = TRUE).

Examples

if (FALSE) {
data('mtpl_be')
features <- setdiff(names(mtpl_be), c('id', 'nclaims', 'expo', 'long', 'lat'))
set.seed(12345)
gbm_fit <- gbm::gbm(as.formula(paste('nclaims ~', paste(features, collapse = ' + '))),
                    distribution = 'poisson',
                    data = mtpl_be,
                    n.trees = 50,
                    interaction.depth = 3,
                    shrinkage = 0.1)
gbm_fun <- function(object, newdata) {
  mean(predict(object, newdata, n.trees = object$n.trees, type = 'response'))
}
gbm_fit %>% autotune(data = mtpl_be,
                     vars = c('ageph', 'bm', 'coverage', 'fuel', 'sex', 'fleet', 'use'),
                     target = 'nclaims',
                     hcut = 0.75,
                     pred_fun = gbm_fun,
                     lambdas = as.vector(outer(seq(1, 10, 1), 10^(-6:-2))),
                     nfolds = 5,
                     strat_vars = c('nclaims', 'expo'),
                     glm_par = alist(family = poisson(link = 'log'),
                                     offset = log(expo)),
                     err_fun = poi_dev,
                     ncores = -1)
}