robustHD: An R package for robust regression with high-dimensional data

Summary
In regression analysis with high-dimensional data, variable selection is an important step to (i) overcome computational problems, (ii) improve prediction performance by variance reduction, and (iii) increase interpretability of the resulting models due to the smaller number of variables. However, robust methods are necessary to prevent outlying data points from distorting the results. The add-on package robustHD (Alfons, 2021) for the statistical computing environment R (R Core Team, 2021) provides functionality for robust linear regression and model selection with high-dimensional data. More specifically, the implemented functionality includes robust least angle regression (Khan et al., 2007), robust groupwise least angle regression (Alfons et al., 2016), as well as sparse least trimmed squares regression (Alfons et al., 2013). The latter can be seen as a trimmed version of the popular lasso regression estimator (Tibshirani, 1996). Selecting the optimal model can be done via cross-validation or an information criterion, and various plots are available to illustrate model selection and to evaluate the final model estimates. Furthermore, the package includes functionality for pre-processing such as robust standardization and winsorization. Finally, robustHD follows a clear object-oriented design and takes advantage of C++ code and parallel computing to reduce computing time.

Statement of need
While Khan et al. (2007) did provide an R script for robust least angle regression (although the link in their paper appears to be broken), the implementation in robustHD utilizes C++ code for computational efficiency and provides the convenience of an R package. Moreover, code for robust groupwise least angle regression and sparse least trimmed squares regression is not available elsewhere. Package robustHD therefore provides researchers with access to several popular methods for robust regression and variable selection with high-dimensional data. It has been used in many benchmarking studies in the statistical literature (e.g., Chang et al., 2018; Cohen Freue et al., 2019; Kurnaz et al., 2018b; Smucler & Yohai, 2017), as well as in empirical research (e.g., Antczak-Orlewska et al., 2021; Stadlbauer et al., 2020).

Example: Robust groupwise least angle regression
Robust least angle regression (Khan et al., 2007) and robust groupwise least angle regression (Alfons et al., 2016) follow a hybrid model selection strategy: first obtain a sequence of important candidate predictors, then fit submodels along that sequence via robust regressions. Here, data on cars featured in the popular television show Top Gear are used to illustrate this functionality.
The response variable is fuel consumption in miles per gallon (MPG), with all remaining variables used as candidate predictors. Information on the car model is first removed from the data set, and the car price is log-transformed. In addition, only observations with complete information are used in this illustrative example.
# load package and data
library("robustHD")
data("TopGear")
# keep complete observations and remove information on car model
keep <- complete.cases(TopGear)
TopGear <- TopGear[keep, -(1:3)]
# log-transform price
TopGear$Price <- log(TopGear$Price)

As the Top Gear data set contains several categorical variables, robust groupwise least angle regression is used. Through the formula interface, function rgrplars() by default takes each categorical variable (factor) as a group of dummy variables, while all remaining variables are taken individually. However, the group assignment can be defined by the user through the argument assign. The maximum number of candidate predictor groups to be sequenced is determined by the argument sMax. Furthermore, with crit = "BIC", the optimal submodel along the sequence is selected via the Bayesian information criterion (BIC). Note that each submodel along the sequence is fitted using a robust regression estimator with a non-deterministic algorithm, hence the seed of the random number generator is supplied for reproducibility.
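A call following this description could look as below. This is a minimal sketch: the value of sMax and the seed are chosen purely for illustration and are not prescribed by the package.

# fit robust groupwise least angle regression via the formula interface;
# sMax = 15 and the seed are illustrative choices
fit_lars <- rgrplars(MPG ~ ., data = TopGear, sMax = 15,
                     crit = "BIC", seed = 20210507)

The returned model fit can then be examined with standard methods such as coef() and predict(), which robustHD provides for its model objects.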

Example: Sparse least trimmed squares regression
The well-known NCI-60 cancer cell panel (Reinhold et al., 2012) is used to illustrate the functionality for sparse least trimmed squares regression. The protein expressions for a specific protein are selected as the response variable, and the gene expressions of the 100 genes that have the highest (robustly estimated) correlations with the response variable are screened as candidate predictors.
# load package and data
library("robustHD")
data("nci60")  # contains matrices 'protein' and 'gene'
# define response variable
y <- protein[, 92]
# screen most correlated predictor variables
correlations <- apply(gene, 2, corHuber, y)
keep <- partialOrder(abs(correlations), 100, decreasing = TRUE)
X <- gene[, keep]

Sparse least trimmed squares is a regularized estimator of the linear regression model, whose results depend on a non-negative regularization parameter (see Alfons et al., 2013). In general, a larger value of this regularization parameter yields more regression coefficients being set to zero, which can be seen as a form of variable selection.
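For concreteness, the objective function of sparse least trimmed squares, as defined in Alfons et al. (2013), can be written in LaTeX notation as

\hat{\beta} = \operatorname{argmin}_{\beta} \sum_{i=1}^{h} \bigl( r^2(\beta) \bigr)_{i:n} + h \lambda \sum_{j=1}^{p} |\beta_j|,

where r_i(\beta) = y_i - x_i^\top \beta are the residuals, (r^2(\beta))_{1:n} \le \cdots \le (r^2(\beta))_{n:n} denote the order statistics of the squared residuals, h \le n is the size of the subset over which the squared residuals are summed, and \lambda \ge 0 is the regularization parameter.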
For convenience, sparseLTS() can internally estimate the smallest value of the regularization parameter that sets all coefficients to zero. With mode = "fraction", the values supplied via the argument lambda are then taken as fractions of this estimated value (i.e., they are multiplied with the internally estimated value). In this example, the optimal value of the regularization parameter is selected by estimating the prediction error (crit = "PE") via 5-fold cross-validation with one replication (splits = foldControl(K = 5, R = 1)). The default prediction loss function is the root trimmed mean squared prediction error. Finally, the seed of the random number generator is supplied for reproducibility.
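A minimal sketch of such a call is given below; the grid of fractions and the seed are illustrative choices rather than recommended values.

# fractions of the smallest lambda that sets all coefficients to zero;
# the grid and the seed are illustrative choices
frac <- seq(0.2, 0.05, by = -0.05)
fit_lts <- sparseLTS(X, y, lambda = frac, mode = "fraction",
                     crit = "PE", splits = foldControl(K = 5, R = 1),
                     seed = 20210507)

Note that foldControl(), which defines the cross-validation folds, comes from package perry, on which robustHD builds.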

Related software
Package enetLTS (Kurnaz et al., 2018a) provides robust elastic-net-regularized estimators based on trimming for the linear and logistic regression models. Package pense (Kepplinger et al., 2021) contains implementations of robust S- and MM-type estimators with elastic net regularization for linear regression.