Automated tuning of the penalty parameter lambda, with built-in feature selection. Lambda directly controls the granularity of the segmentation: low values result in a fine segmentation (many groups), while high values result in a coarse segmentation (few groups).
```r
autotune(
  mfit,
  data,
  vars,
  target,
  max_ngrps = 15,
  hcut = 0.75,
  ignr_intr = NULL,
  pred_fun = NULL,
  lambdas = as.vector(outer(seq(1, 10, 0.1), 10^(-7:3))),
  nfolds = 5,
  strat_vars = NULL,
  glm_par = alist(),
  err_fun = mse,
  ncores = -1,
  out_pds = FALSE
)
```
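The default `lambdas` grid already covers both extremes of the segmentation granularity described above. A minimal sketch in plain R, using only the default shown in the usage, makes its construction explicit:

```r
# Default search grid for lambda: mantissas 1.0, 1.1, ..., 10.0 crossed with
# powers of ten from 1e-07 up to 1e+03, flattened into a single numeric vector.
lambdas <- as.vector(outer(seq(1, 10, 0.1), 10^(-7:3)))

length(lambdas)  # 91 mantissas x 11 magnitudes = 1001 candidate values
range(lambdas)   # from 1e-07 (fine segmentation) to 1e+04 (coarse segmentation)
```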
| Argument | Description |
|---|---|
| mfit | Fitted model object (e.g., a "gbm" or "randomForest" object). |
| data | Data frame containing the original training data. |
| vars | Character vector specifying the features in `data` to consider. |
| target | String specifying the target (or response) variable to model. |
| max_ngrps | Integer specifying the maximum number of groups that each feature's values/levels are allowed to be grouped into. |
| hcut | Numeric in the range [0, 1] specifying the cut-off value for the normalized cumulative H-statistic over all two-way interactions, ordered from most to least important, between the features in `vars`. |
| ignr_intr | Optional character string specifying features to ignore when searching for meaningful interactions to incorporate in the GLM. |
| pred_fun | Optional prediction function to calculate feature effects for the model in `mfit` (a sketch of such a function follows this table). |
| lambdas | Numeric vector with the possible lambda values to explore. The search grid is created automatically from these values. |
| nfolds | Integer for the number of folds in K-fold cross-validation. |
| strat_vars | Character (vector) specifying the feature(s) to use for stratified sampling. The default NULL implies no stratification is applied. |
| glm_par | Named list, constructed via `alist()`, of additional arguments to pass to the GLM surrogate fit, such as the `family` and the `offset` (see the example below). |
| err_fun | Error function to calculate the prediction errors on the validation folds. This must be an R function which outputs a single number and takes two vectors (the predicted and the observed target values) as input. See `mse` (the default) and `poi_dev` for examples; a sketch of a custom error function follows this table. |
| ncores | Integer specifying the number of cores to use. The default is `ncores = -1`. |
| out_pds | Boolean to indicate whether to add the calculated PD effects for the selected features to the output list. |
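Both `pred_fun` and `err_fun` are ordinary R functions supplied by the user. The sketch below is illustrative only: the names `rf_fun` and `mae`, and the argument names `y_pred` and `y_true`, are assumptions and not part of the package.

```r
# Hypothetical prediction function for a "randomForest" model, mirroring the
# gbm_fun shown in the example at the bottom of this page: it returns the
# average response prediction for the supplied newdata.
rf_fun <- function(object, newdata) {
  mean(predict(object, newdata, type = "response"))
}

# Hypothetical error function: takes two vectors (assumed here to be the
# predicted and the observed target values) and returns a single number,
# in this case the mean absolute error on the validation fold.
mae <- function(y_pred, y_true) {
  mean(abs(y_pred - y_true))
}
```

These would then be supplied as `pred_fun = rf_fun` and `err_fun = mae`, analogous to `gbm_fun` and `poi_dev` in the example below.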
List with the following elements:

- a named vector containing the selected features (names) and the optimal number of groups for each feature (values),
- the optimal GLM surrogate, which is fit to all observations in `data`; the segmented data can be obtained via the `$data` attribute of the GLM fit,
- the cross-validation results for the main effects as a tidy data frame; the column `cv_err` contains the cross-validated error, while the columns `1:nfolds` contain the error on the validation folds,
- the cross-validation results for the interaction effects,
- a list with the PD effects for the selected features (only present if `out_pds = TRUE`).
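The names of the returned list elements are not preserved in this rendering, so the sketch below inspects the result generically; `tuned` is a hypothetical name for the output of `autotune()`, for instance from the call in the example that follows.

```r
# `tuned` is assumed to hold the list returned by autotune() (hypothetical name).
str(tuned, max.level = 1)   # overview of the returned elements

# The optimal surrogate is an ordinary GLM fit, so it can be picked out by
# class and inspected with the usual accessors; its $data attribute holds the
# segmented training data, as documented above.
surro <- Filter(function(x) inherits(x, "glm"), tuned)[[1]]
summary(surro)              # coefficients of the surrogate GLM
head(surro$data)            # segmented version of the training data
```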
```r
# Not run:
library(magrittr)  # provides the %>% pipe used below

data('mtpl_be')
features <- setdiff(names(mtpl_be), c('id', 'nclaims', 'expo', 'long', 'lat'))

# Fit a Poisson GBM for the claim counts on all remaining features.
set.seed(12345)
gbm_fit <- gbm::gbm(
  as.formula(paste('nclaims ~', paste(features, collapse = ' + '))),
  distribution = 'poisson',
  data = mtpl_be,
  n.trees = 50,
  interaction.depth = 3,
  shrinkage = 0.1
)

# Prediction function returning the average response prediction for newdata.
gbm_fun <- function(object, newdata) {
  mean(predict(object, newdata, n.trees = object$n.trees, type = 'response'))
}

# Tune lambda with 5-fold cross-validation, stratified on claims and exposure,
# using the Poisson deviance as the error measure on the validation folds.
gbm_fit %>%
  autotune(
    data = mtpl_be,
    vars = c('ageph', 'bm', 'coverage', 'fuel', 'sex', 'fleet', 'use'),
    target = 'nclaims',
    hcut = 0.75,
    pred_fun = gbm_fun,
    lambdas = as.vector(outer(seq(1, 10, 1), 10^(-6:-2))),
    nfolds = 5,
    strat_vars = c('nclaims', 'expo'),
    glm_par = alist(family = poisson(link = 'log'), offset = log(expo)),
    err_fun = poi_dev,
    ncores = -1
  )
```