Title: | A Robust Boosting Algorithm |
---|---|
Description: | An implementation of robust boosting algorithms for regression in R. This includes the RRBoost method proposed in the paper "Robust Boosting for Regression Problems" (Ju X and Salibian-Barrera M. 2020) <doi:10.1016/j.csda.2020.107065> (to appear in Computational Statistics & Data Analysis). It also implements the boosting algorithms compared in the simulation section of the paper: L2Boost, LADBoost, MBoost (Friedman, J. H. (2001) <doi:10.1214/aos/1013203451>) and Robloss (Lutz et al. (2008) <doi:10.1016/j.csda.2007.11.006>). |
Authors: | Xiaomeng Ju [aut, cre], Matias Salibian-Barrera [aut] |
Maintainer: | Xiaomeng Ju <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2 |
Built: | 2025-02-17 03:28:59 UTC |
Source: | https://github.com/xmengju/rrboost |
Airfoil self-noise data from NASA aerodynamic and acoustic tests of airfoil blade sections.
data(airfoil)
An object of class "data.frame".
The data come from aerodynamic and acoustic wind-tunnel tests of airfoil blade sections conducted by NASA (Brooks et al., 1989).
There are 1503 observations and 6 variables: y (scaled sound pressure level, in decibels), frequency (in Hz), angle (angle of attack, in degrees), chord_length (in meters), velocity (free-stream velocity, in m/s), and thickness (suction side displacement thickness, in meters).
The UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise;
Brooks, T. F., Pope, D. S., and Marcolini, M. A. (1989). Airfoil self-noise and prediction. NASA Reference Publication-1218, document id: 9890016302.
data(airfoil)
This function implements the RRBoost robust boosting algorithm for regression, as well as other robust and non-robust boosting algorithms for regression.
Boost(
  x_train, y_train, x_val, y_val, x_test, y_test,
  type = "RRBoost", error = c("rmse", "aad"), niter = 200,
  y_init = "LADTree", max_depth = 1, tree_init_provided = NULL,
  control = Boost.control()
)
x_train |
predictor matrix for training data (matrix/dataframe) |
y_train |
response vector for training data (vector/dataframe) |
x_val |
predictor matrix for validation data (matrix/dataframe) |
y_val |
response vector for validation data (vector/dataframe) |
x_test |
predictor matrix for test data (matrix/dataframe, optional, required when make_prediction = TRUE in control) |
y_test |
response vector for test data (vector/dataframe, optional, required when make_prediction = TRUE in control) |
type |
type of the boosting method: "L2Boost", "LADBoost", "MBoost", "Robloss", "SBoost", "RRBoost" (character string) |
error |
a character string (or vector of character strings) indicating the type of error metrics to be evaluated on the test set. Valid options are: "rmse" (root mean squared error), "aad" (average absolute deviation), and "trmse" (trimmed root mean squared error) |
niter |
number of boosting iterations (numeric; for RRBoost, the total across both of its stages) |
y_init |
a string indicating the initial estimator to be used. Valid options are: "median" or "LADTree" (character string) |
max_depth |
the maximum depth of the tree learners (numeric) |
tree_init_provided |
an optional pre-fitted initial tree (an rpart object) |
control |
a named list of control parameters, as returned by Boost.control |
This function implements a robust boosting algorithm for regression (RRBoost).
It also includes the following robust and non-robust boosting algorithms
for regression: L2Boost, LADBoost, MBoost, Robloss, and SBoost. This function
uses the functions available in the rpart
package to construct binary regression trees.
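The error metrics accepted by the error argument ("rmse", "aad", "trmse") are standard summaries of the residuals. A minimal base-R sketch of them (illustrative only, not the package's internal code; trmse here uses the trim_prop variant described under Boost.control):

```r
# Illustrative definitions of the three error metrics, applied to residuals r.
rmse <- function(r) sqrt(mean(r^2))          # root mean squared error
aad  <- function(r) mean(abs(r))             # average absolute deviation
# Trimmed RMSE: drop residuals whose absolute value reaches the
# (1 - trim_prop) quantile of |r|, then take the RMSE of the rest.
trmse <- function(r, trim_prop = 0.1) {
  keep <- abs(r) < quantile(abs(r), 1 - trim_prop)
  sqrt(mean(r[keep]^2))
}

r <- c(-1, 0.5, 0.2, -0.3, 10)   # one gross outlier
rmse(r)    # dominated by the outlier
aad(r)     # less sensitive to it
trmse(r)   # ignores it entirely
```

The trimmed and absolute-deviation metrics are far less affected by a single outlier than the RMSE, which is why they are useful for judging robust fits.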
A list with the following components:
type |
which boosting algorithm was run. One of: "L2Boost", "LADBoost", "MBoost", "Robloss", "SBoost", "RRBoost" (character string) |
control |
the list of control parameters used |
niter |
number of iterations of the boosting algorithm (for RRBoost, the total across both of its stages) |
error |
if make_prediction = TRUE in control, a vector of errors evaluated on the test set (one entry per metric in error) |
tree_init |
if y_init = "LADTree", the fitted initial tree (an rpart object) |
tree_list |
if save_tree = TRUE in control, a list of the trees fitted at each boosting iteration |
f_train_init |
a vector of initial function estimates for the training data |
alpha |
a vector of base learners' coefficients |
early_stop_idx |
early stopping iteration |
when_init |
if type = "RRBoost", the iteration at which the algorithm switches from the SBoost stage to the second stage |
loss_train |
a vector of training loss values (one per iteration) |
loss_val |
a vector of validation loss values (one per iteration) |
err_val |
a vector of validation aad errors (one per iteration) |
err_train |
a vector of training aad errors (one per iteration) |
err_test |
a matrix of test errors before and at the early stopping iteration (returned if make_prediction = TRUE in control); the matrix dimension is the early stopping iteration by the number of error types (matching the error argument) |
f_train |
a matrix of training function estimates at all iterations (returned if save_f = TRUE in control); each column corresponds to the fitted values of the predictor at each iteration |
f_val |
a matrix of validation function estimates at all iterations (returned if save_f = TRUE in control); each column corresponds to the fitted values of the predictor at each iteration |
f_test |
a matrix of test function estimates before and at the early stopping iteration (returned if save_f = TRUE and make_prediction = TRUE in control); each column corresponds to the fitted values of the predictor at each iteration |
var_select |
a vector of variable selection indicators (one per explanatory variable; 1 if the variable was selected by at least one of the base learners, and 0 otherwise) |
var_importance |
a vector of permutation variable importance scores (one per explanatory variable, and returned if cal_imp = TRUE in control) |
Xiaomeng Ju, [email protected]
Boost.validation, Boost.control.
data(airfoil)
n <- nrow(airfoil)
n0 <- floor( 0.2 * n )
set.seed(123)
idx_test <- sample(n, n0)
idx_train <- sample((1:n)[-idx_test], floor( 0.6 * n ) )
idx_val <- (1:n)[ -c(idx_test, idx_train) ]
xx <- airfoil[, -6]
yy <- airfoil$y
xtrain <- xx[ idx_train, ]
ytrain <- yy[ idx_train ]
xval <- xx[ idx_val, ]
yval <- yy[ idx_val ]
xtest <- xx[ idx_test, ]
ytest <- yy[ idx_test ]
model_RRBoost_LADTree <- Boost(x_train = xtrain, y_train = ytrain,
    x_val = xval, y_val = yval, x_test = xtest, y_test = ytest,
    type = "RRBoost", error = "rmse", y_init = "LADTree",
    max_depth = 1, niter = 10, ## to keep the running time low
    control = Boost.control(max_depth_init = 2, min_leaf_size_init = 20,
        make_prediction = TRUE, cal_imp = FALSE))
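For intuition about what Boost iterates, here is a hedged toy sketch in base R of the plain L2Boost variant (hypothetical helper names, not the package's implementation): depth-one "stumps" fitted to the current residuals, a shrinkage factor, and an early stopping index chosen on the validation set.

```r
# Toy L2Boost: exhaustive single-split stumps fit to residuals (illustrative).
fit_stump <- function(x, r) {
  cuts <- sort(unique(x)); best <- list(sse = Inf)
  for (cpt in cuts[-length(cuts)]) {
    left <- x <= cpt
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse  <- sum((r - pred)^2)
    if (sse < best$sse)
      best <- list(sse = sse, cut = cpt,
                   left = mean(r[left]), right = mean(r[!left]))
  }
  best
}
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

l2boost <- function(x, y, xv, yv, niter = 50, shrinkage = 0.1) {
  f  <- rep(mean(y), length(y))      # current fit on training data
  fv <- rep(mean(y), length(yv))     # current fit on validation data
  val_loss <- numeric(niter)
  for (t in seq_len(niter)) {
    s  <- fit_stump(x, y - f)                    # fit stump to residuals
    f  <- f  + shrinkage * predict_stump(s, x)   # shrunken update
    fv <- fv + shrinkage * predict_stump(s, xv)
    val_loss[t] <- mean((yv - fv)^2)
  }
  list(val_loss = val_loss, early_stop_idx = which.min(val_loss))
}

set.seed(1)
x  <- runif(200); y  <- sin(2 * pi * x)  + rnorm(200, sd = 0.2)
xv <- runif(100); yv <- sin(2 * pi * xv) + rnorm(100, sd = 0.2)
fit <- l2boost(x, y, xv, yv)
fit$early_stop_idx    # iteration with the smallest validation loss
```

RRBoost replaces the squared-error residual step with robust (M- and S-type) analogues, but shrinkage and validation-based early stopping play the same roles there.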
Tuning and control parameters for the RRBoost robust boosting algorithm, including the initial fit.
Boost.control(
  n_init = 100, eff_m = 0.95, bb = 0.5, trim_prop = NULL, trim_c = 3,
  max_depth_init = 3, min_leaf_size_init = 10, cal_imp = TRUE,
  save_f = FALSE, make_prediction = TRUE, save_tree = FALSE,
  precision = 4, shrinkage = 1, trace = FALSE
)
n_init |
number of iterations for the SBoost step of RRBoost (numeric, defaults to 100) |
eff_m |
scalar between 0 and 1 indicating the efficiency (measured in a linear model with Gaussian errors) of Tukey's loss function used in the 2nd stage of RRBoost. |
bb |
breakdown point of the M-scale estimator used in the SBoost step (numeric) |
trim_prop |
trimming proportion if 'trmse' is used as the performance metric (numeric). 'trmse' calculates the root mean squared error of the residuals r with |r| < quantile(|r|, 1 - trim_prop) (e.g. trim_prop = 0.1 ignores the 10% of observations with the largest absolute residuals and calculates the RMSE of the rest). If both trim_prop and trim_c are specified, trim_prop takes precedence. |
trim_c |
the trimming constant if 'trmse' is used as the performance metric (numeric, defaults to 3). 'trmse' calculates the root mean squared error of the residuals r that lie between median(r) - trim_c * mad(r) and median(r) + trim_c * mad(r). If both trim_prop and trim_c are specified, trim_prop takes precedence. |
max_depth_init |
the maximum depth of the initial LADTree (numeric, defaults to 3) |
min_leaf_size_init |
the minimum number of observations per node of the initial LADTree (numeric, defaults to 10) |
cal_imp |
logical indicating whether to calculate variable importance (defaults to TRUE) |
save_f |
logical indicating whether to save the function estimates at all iterations (defaults to FALSE) |
make_prediction |
logical indicating whether to make predictions using x_test (defaults to TRUE) |
save_tree |
logical indicating whether to save the trees at all iterations (defaults to FALSE) |
precision |
number of significant digits to keep when using validation error to calculate early stopping time (numeric, defaults to 4) |
shrinkage |
shrinkage parameter in boosting (numeric, defaults to 1 which corresponds to no shrinkage) |
trace |
logical indicating whether to print progress information: the number of completed iterations and, for RRBoost, the completed combinations of LADTree hyperparameters (defaults to FALSE) |
Various tuning and control parameters for the RRBoost robust boosting algorithm implemented in the function Boost, including options for the initial fit.
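The trim_c rule described above can be sketched in a few lines of base R (illustrative only; trmse_c is a hypothetical name, not a package function):

```r
# Trimmed RMSE with the trim_c rule: keep residuals r that fall between
# median(r) - trim_c * mad(r) and median(r) + trim_c * mad(r).
trmse_c <- function(r, trim_c = 3) {
  m <- median(r); s <- mad(r)
  keep <- (r > m - trim_c * s) & (r < m + trim_c * s)
  sqrt(mean(r[keep]^2))
}

r <- c(-1, 1, 0.5, -0.5, 0, 50)  # five typical residuals plus one outlier
trmse_c(r)                       # the outlier falls outside the band and is dropped
```

Because the band is built from the median and the MAD rather than the mean and standard deviation, a single extreme residual cannot widen it enough to include itself.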
A list of all input parameters
Xiaomeng Ju, [email protected]
data(airfoil)
n <- nrow(airfoil)
n0 <- floor( 0.2 * n )
set.seed(123)
idx_test <- sample(n, n0)
idx_train <- sample((1:n)[-idx_test], floor( 0.6 * n ) )
idx_val <- (1:n)[ -c(idx_test, idx_train) ]
xx <- airfoil[, -6]
yy <- airfoil$y
xtrain <- xx[ idx_train, ]
ytrain <- yy[ idx_train ]
xval <- xx[ idx_val, ]
yval <- yy[ idx_val ]
xtest <- xx[ idx_test, ]
ytest <- yy[ idx_test ]
my.control <- Boost.control(max_depth_init = 2, min_leaf_size_init = 20,
    make_prediction = TRUE, cal_imp = FALSE)
model_RRBoost_LADTree <- Boost(x_train = xtrain, y_train = ytrain,
    x_val = xval, y_val = yval, x_test = xtest, y_test = ytest,
    type = "RRBoost", error = "rmse", y_init = "LADTree",
    max_depth = 1, niter = 10, ## to keep the running time low
    control = my.control)
A function to fit RRBoost (see also Boost) where the initialization parameters are chosen based on their performance on the validation set.
Boost.validation(
  x_train, y_train, x_val, y_val, x_test, y_test,
  type = "RRBoost", error = c("rmse", "aad"), niter = 1000, max_depth = 1,
  y_init = "LADTree", max_depth_init_set = c(1, 2, 3, 4),
  min_leaf_size_init_set = c(10, 20, 30), control = Boost.control()
)
x_train |
predictor matrix for training data (matrix/dataframe) |
y_train |
response vector for training data (vector/dataframe) |
x_val |
predictor matrix for validation data (matrix/dataframe) |
y_val |
response vector for validation data (vector/dataframe) |
x_test |
predictor matrix for test data (matrix/dataframe, optional, required when make_prediction = TRUE in control) |
y_test |
response vector for test data (vector/dataframe, optional, required when make_prediction = TRUE in control) |
type |
type of the boosting method: "L2Boost", "LADBoost", "MBoost", "Robloss", "SBoost", "RRBoost" (character string) |
error |
a character string (or vector of character strings) indicating the type of error metrics to be evaluated on the test set. Valid options are: "rmse" (root mean squared error), "aad" (average absolute deviation), and "trmse" (trimmed root mean squared error) |
niter |
number of iterations (numeric; for RRBoost, the total across both of its stages) |
max_depth |
the maximum depth of the tree learners (numeric) |
y_init |
a string indicating the initial estimator to be used. Valid options are: "median" or "LADTree" (character string) |
max_depth_init_set |
a vector of candidate values for the maximum depth of the initial LADTree, from which the algorithm chooses |
min_leaf_size_init_set |
a vector of candidate values for the minimum number of observations per node of the initial LADTree, from which the algorithm chooses |
control |
a named list of control parameters, as returned by Boost.control |
This function runs the RRBoost algorithm (see Boost) on different combinations of the parameters for the initial fit, and chooses the optimal combination based on performance on the validation set.
A list with components
model |
an object returned by Boost that is trained with the selected initialization parameters |
param |
a vector of the selected initialization parameters ((0, 0) if the selected initialization is the median of the training responses) |
Xiaomeng Ju, [email protected]
## Not run: 
data(airfoil)
n <- nrow(airfoil)
n0 <- floor( 0.2 * n )
set.seed(123)
idx_test <- sample(n, n0)
idx_train <- sample((1:n)[-idx_test], floor( 0.6 * n ) )
idx_val <- (1:n)[ -c(idx_test, idx_train) ]
xx <- airfoil[, -6]
yy <- airfoil$y
xtrain <- xx[ idx_train, ]
ytrain <- yy[ idx_train ]
xval <- xx[ idx_val, ]
yval <- yy[ idx_val ]
xtest <- xx[ idx_test, ]
ytest <- yy[ idx_test ]
model_RRBoost_cv_LADTree <- Boost.validation(x_train = xtrain,
    y_train = ytrain, x_val = xval, y_val = yval,
    x_test = xtest, y_test = ytest, type = "RRBoost",
    error = "rmse", y_init = "LADTree", max_depth = 1, niter = 1000,
    max_depth_init_set = 1:5,
    min_leaf_size_init_set = c(10, 20, 30),
    control = Boost.control(make_prediction = TRUE, cal_imp = TRUE))
## End(Not run)
This function calculates variable importance scores for a previously computed RRBoost fit.
cal_imp_func(model, x_val, y_val, trace = FALSE)
model |
an object returned by Boost |
x_val |
predictor matrix for validation data (matrix/dataframe) |
y_val |
response vector for validation data (vector/dataframe) |
trace |
logical indicating whether to print the name of the variable currently being evaluated, for monitoring progress (defaults to FALSE) |
This function computes permutation variable importance scores given an object returned by Boost and a validation data set.
a vector of permutation variable importance scores (one per explanatory variable)
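The idea behind permutation importance can be illustrated generically (a hedged base-R sketch using a plain lm fit, not RRBoost's internal code; perm_importance is a hypothetical name): permute one predictor at a time in the validation set and record how much the validation error increases.

```r
# Generic permutation importance: shuffle each validation column in turn and
# measure the resulting increase in validation error (illustrative sketch).
perm_importance <- function(model, x_val, y_val,
                            err = function(y, f) mean(abs(y - f))) {
  base_err <- err(y_val, predict(model, newdata = x_val))
  sapply(names(x_val), function(v) {
    x_perm <- x_val
    x_perm[[v]] <- sample(x_perm[[v]])   # break this variable's link with y
    err(y_val, predict(model, newdata = x_perm)) - base_err
  })
}

set.seed(123)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- 3 * d$x1 + rnorm(200, sd = 0.1)   # only x1 is informative
fit <- lm(y ~ x1 + x2, data = d)
imp <- perm_importance(fit, d[, c("x1", "x2")], d$y)
imp   # x1's score should dwarf x2's
```

A large score means the model's validation predictions degrade badly when that variable's values are scrambled, i.e. the fit relies on it.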
Xiaomeng Ju, [email protected]
## Not run: 
data(airfoil)
n <- nrow(airfoil)
n0 <- floor( 0.2 * n )
set.seed(123)
idx_test <- sample(n, n0)
idx_train <- sample((1:n)[-idx_test], floor( 0.6 * n ) )
idx_val <- (1:n)[ -c(idx_test, idx_train) ]
xx <- airfoil[, -6]
yy <- airfoil$y
xtrain <- xx[ idx_train, ]
ytrain <- yy[ idx_train ]
xval <- xx[ idx_val, ]
yval <- yy[ idx_val ]
xtest <- xx[ idx_test, ]
ytest <- yy[ idx_test ]
model <- Boost(x_train = xtrain, y_train = ytrain,
    x_val = xval, y_val = yval, type = "RRBoost", error = "rmse",
    y_init = "LADTree", max_depth = 1, niter = 1000,
    control = Boost.control(max_depth_init = 2, min_leaf_size_init = 20,
        save_tree = TRUE, make_prediction = FALSE, cal_imp = FALSE))
var_importance <- cal_imp_func(model, x_val = xval, y_val = yval)
## End(Not run)
A function to make predictions and calculate the test error given an object returned by Boost and a test data set.
cal_predict(model, x_test, y_test)
model |
an object returned by Boost |
x_test |
predictor matrix for test data (matrix/dataframe) |
y_test |
response vector for test data (vector/dataframe) |
A list with the following components:
f_t_test |
predicted values at the early stopping iteration, computed with model on x_test |
err_test |
a matrix of test errors before and at the early stopping iteration (returned if make_prediction = TRUE in control); the matrix dimension is the early stopping iteration by the number of error types (matching the error argument) |
f_test |
a matrix of test function estimates at all iterations (returned if save_f = TRUE in control) |
value |
a vector of test errors evaluated at the early stopping iteration |
Xiaomeng Ju, [email protected]
## Not run: 
data(airfoil)
n <- nrow(airfoil)
n0 <- floor( 0.2 * n )
set.seed(123)
idx_test <- sample(n, n0)
idx_train <- sample((1:n)[-idx_test], floor( 0.6 * n ) )
idx_val <- (1:n)[ -c(idx_test, idx_train) ]
xx <- airfoil[, -6]
yy <- airfoil$y
xtrain <- xx[ idx_train, ]
ytrain <- yy[ idx_train ]
xval <- xx[ idx_val, ]
yval <- yy[ idx_val ]
xtest <- xx[ idx_test, ]
ytest <- yy[ idx_test ]
model <- Boost(x_train = xtrain, y_train = ytrain,
    x_val = xval, y_val = yval, type = "RRBoost", error = "rmse",
    y_init = "LADTree", max_depth = 1, niter = 1000,
    control = Boost.control(max_depth_init = 2, min_leaf_size_init = 20,
        save_tree = TRUE, make_prediction = FALSE, cal_imp = FALSE))
prediction <- cal_predict(model, x_test = xtest, y_test = ytest)
## End(Not run)