wbacon: Weighted BACON algorithms for multivariate outlier nomination (detection) and robust linear regression

Outlier nomination (detection) and robust regression are computationally hard problems. This is all the more true when the number of variables and observations grow rapidly. Among all candidate methods, the two BACON (blocked adaptive computationally efficient outlier nominators) algorithms of Billor et al. (2000) have favorable computational characteristics as they require only a few model evaluations irrespective of the sample size. This makes them popular algorithms for multivariate outlier nomination/detection and robust linear regression (at the time of writing Google Scholar reports more than 500 citations of the Billor et al. (2000) paper).


The BACON algorithms
Technically, the BACON algorithms consist of the application of series of simple statistical estimation methods such as coordinate-wise means/medians, covariance matrix, Mahalanobis distances, or least squares regression on subsets of the data. The algorithms start from an initial small subset of non-outlying ("good") data and keep adding those observations to the subset whose distances (or discrepancies in the case of the regression algorithm) are smaller than a predefined threshold value. The algorithms terminate if the subset cannot be increased further. The observations not in the final subset are nominated as outliers. We follow Billor et al. (2000) and use the term "nomination" of outliers instead of "detection" to emphasize that the algorithms should not go beyond nominating observations as potential outliers. It is left to the analyst to finally label outlying observations as such.
The BACON algorithm for multivariate outlier nomination can be initialized in two ways: version "V1" or "V2" (see Billor et al., 2000). In version V2, the algorithm is started from the coordinate-wise median. As a consequence, the resulting estimators of location and scatter are robust, i.e., the breakdown point 1 is approximately 40% (Billor et al., 2000), but the estimators are not affine equivariant estimators of the population location and scatter. However, Billor et al. (2000) show that the estimators are nearly affine equivariant.
The initialization by version V1 yields estimators that are affine equivariant by design because the algorithm is started from the coordinate-wise mean, but the estimators have a very low breakdown point.
A naive implementation of the BACON algorithms would call the (simple) estimation methods iteratively on a sequence of growing subsets of the data without bothering too much with reusing or updating existing blocks of data. This leads to an excessively large number of copy/modify operations and (unnecessary) recomputations. Overall, we would end up with a computationally inefficient implementation. For small data sets, the inefficiencies would likely go unnoticed. With large amounts of data, however, the situation is quite different.

Statement of need
The two BACON algorithms are available from the package robustX (Maechler, Stahel, et al., 2021) for the R statistical software. The BACON algorithm for multivariate outlier nomination has been extended to weighted problems (in the context of survey sampling) and incomplete data by Béguin & Hulliger (2008). The extended method is available from the R package modi (Hulliger & Sterchi, 2020). Both implementations are not explicitly targeted at large data sets. Our package fills this gap.

What the package offers
The implementation of the wbacon package is targeted at medium to large data sets and is mainly implemented in the C language. The code depends heavily on the subroutines in the libraries BLAS (Blackford et al., 2002) and LAPACK (Anderson et al., 1999). If computation time is of great importance, we recommend replacing the reference implementation of the BLAS library that ships with R by a version that has been adapted to the user's hardware (see e.g., OpenBLAS). The non-BLAS components of wbacon use multiple threads (if the compiler supports OpenMP; see OpenMP Architecture Review Board (2018)) for the computations over the p variables/columns because the computational time complexity is dominated by p. For instance, the time complexity of the BACON algorithm for multivariate outlier nomination is dominated by the complexity of the covariance matrix computation, which is of order O(np 2 ), where n denotes the number of observations. The major improvements of wbacon (in terms of computation time) over the naive implementation, however, are achieved by using partial sorting (in place of a full sort), reusing computed estimates, and employing an up-/downdating scheme for the Cholesky and QR factorizations. The computational costs of the up-/downdating schemes are far less than recomputing the entire decomposition repeatedly.

Diagnostic tools
The BACON algorithms assume that the underlying model is an appropriate description of the non-outlying observations. More precisely (Billor et al., 2000), • the outlier nomination method assumes that the "good" data have (roughly) an elliptically contoured distribution (this includes the Gaussian distribution as a special case); • the regression method assumes that the non-outlying ("good") data are described by a linear (homoscedastic) regression model and that the independent variables (having removed the regression intercept/constant, if there is a constant) follow (roughly) an elliptically contoured distribution.
We recommend that the users examine the data structure of the "good" observations to verify that the assumptions hold. The following quote from the authors of the BACON algorithms should be noted.
"Although the algorithms will often do something reasonable even when these assumptions are violated, it is hard to say what the results mean." (Billor et al., 2000, p. 289) The wbacon package provides the analyst with tools to identify potentially outlying observations. For multivariate outlier nomination, the package implements several diagnostic plots. Worth mentioning is the graph which plots the robust (Mahalanobis) distances against the univariate projection of the data that maximizes the separation criterion of Qiu & Joe (2006). This kind of diagnostic graph attempts to separate outlying from non-outlying observations as much as possible; see Willems et al. (2009). It is particularly helpful when the outliers are clustered or show patterns. For robust linear regression, the package offers the standard plotting methods that are available for objects of the class lm. In addition, it implements the plot of the robust distances of the (non-constant) design variables against the standardized residuals. This diagnostic plot been proposed by Rousseeuw & Zomeren (1990). All plotting methods can be displayed as hexagonally binned scatter plots, using the functionality of the hexbin (Carr et al., 2021) package. This option is recommended for large data sets.

Illustration
In this section, we illustrate the use of the BACON algorithm for robust linear regression. Our data are on education expenditures in 50 US states in 1975 (Chatterjee & Hadi, 2006, Chap. 5.7). The data can be loaded from the robustbase (Maechler, Rousseeuw, et al., 2021) package.
To ensure a high breakdown point, version = "V2" should not be changed to "V1" unless you have good reasons to do so. The main "turning knob" to tune the regression algorithm is alpha, which defines the (1 − α) quantile t α,ν of the Student t-distribution with ν degrees of freedom. In fact, the quantile t α/(2r+2),r−p is used as the cutoff value (see Billor et al., 2000), where r and p denote, respectively, the number observations in the set of "good" observations and the number of variables. All observations whose discrepancies (defined as the scaled residuals on the set of "good" observations and the scaled prediction error on the set of "bad" observations) are smaller (in absolute value) than the cutoff value are selected into the subset of "good" data [see document methods.pdf in the folder inst/doc of the source package]. By choosing larger values for alpha (e.g., 0.2), more observations are selected (ceteris paribus) into the subset of "good" data (and vice versa).
The parameter collect specifies the size of the initial subset, which is defined as m = p · collect. It should be chosen such that m is considerably smaller than the number of observations n. Otherwise we are at risk of selecting too many "bad" observations into the initial subset, which will eventually bias the regression estimates.
The instance reg is an object of the class wbaconlm. The printed output of wBACON_reg() is identical with the one of the stats::lm() function. In addition, we are told the size of the subset on which the final regression has been computed. The observations not in the subset are considered potential outliers (here 1 out of 50 observations). The summary() method can be used to summarize the estimated model. The methods coef(), vcov(), and predict() work exactly the same as their lm() counterparts. This is also true for the first three plot() types, that is • which = 1: Residuals vs Fitted, • which = 2: Normal Q-Q, • which = 3: Scale-Location.
The plot types 4:6 of plot.lm() are not implemented for objects of the class wbaconlm because it is not sensible to study the standard regression influence diagnostics in the presence of outliers in the model's design space (leverage observations). Instead, type four (which = 4) plots the robust Mahalanobis distances with respect to the non-constant design variables against the standardized residual. This plot has been proposed by Rousseeuw & Zomeren (1990). This plot method is also available in the package robustbase (Maechler, Rousseeuw, et al., 2021) for robust regression estimators of the class lmrob.
See vignette to learn more about the package.

Benchmarking
We compare our implementation with robustX::BACON() in terms of computational time. First, we consider estimating a robust linear regression for a Gaussian mixture distribution, where a proportion of 1 − ϵ of the observations on the p independent variables is generated by the Gaussian model, while a proportion of ϵ (the outliers) is generated by a shifted Gaussian distribution. For the outlying observations (i.e., ϵ proportion of the data), the response variable is generated by a regression coefficient which is 10 times larger than the coefficient of the non-outlying observations. We choose ϵ = 0.05; see Appendix for more details on the setup. The number of variables (p) and the number of observations (n) are varied.
For the regression exercise, our setup is intentionally limited to single-threaded computations (n_threads = 1; no OpenMP parallelization support). It is clear that when the number of variables is large, the parallelized computations are (usually) much faster. Table 1 shows the ratio of average computation time of the two implementation for some configurations of n and p. A ratio > 1.0 (< 1.0) implies that wBACON_reg() is faster (slower) than robustX:: BACON(). The average ratio refers to computation time averaged over repeated benchmarks using the R package microbenchmark (Mersmann et al., 2019). It is evident from the results in Table 1 that wBACON_reg() is considerably faster than its competitor, e.g., wBACON_reg() is on average 4.4 times faster for the setup p = 5 and n = 100. More importantly, the differences become larger as we increase n or p. The differences in computation time are mainly due to the fact that wBACON_reg() updates the regression estimates as the subset of non-outlying observations grows, while robustX::BACON() recomputes the estimates at each iteration. When thread-level parallelism is enabled in wBACON_reg(), the differences become even larger (for p ≥ 20 and n ≥ 10 3 ).  For very small data sets (e.g., n = 10 3 and p = 5), wBACON() is slower because parallelization leads to computation overhead that dominates computation time. Clearly, it would be more efficient to specify only 1 or 2 threads for such small data sets. However, the differences in computation time are hardly noticeable to the user (0.08 vs. 0.14 seconds). For larger data sets (in terms of number of variables and observations), wBACON() outperforms robustX:: BACON(); see Table 2. For instance, wBACON() is 13.7 times faster for the setup p = 200 and n = 10 6 . The differences in computation time between the two implementations become larger as we increase n or p.