LinRegOutliers: A Julia package for detecting outliers in linear regression

Summary
LinRegOutliers is a Julia package that implements a number of outlier detection algorithms for linear regression. The package also implements robust covariance matrix estimation and graphing functions which can be used to visualize the regression residuals and distances between observations, with many possible metrics (e.g., the Euclidean or Mahalanobis distances with either given or estimated covariance matrices). Our package implements many algorithms and diagnostics for model fitting with outliers under a single interface, which allows users to quickly try many different methods with reasonable default settings, while also providing a good starting framework for researchers who may want to extend the package with novel methods.

State of the field
In linear regression, we are given a number of data points (say, $n$), where each data point is represented by a vector $x_i$ with $p$ entries, and a dependent variable that corresponds to each of these data points, represented by the scalar $y_i$, for $i = 1, 2, \ldots, n$. We then seek to find a linear model which best describes the data (up to some error term, $\epsilon_i$):

$$y_i = \sum_{j=1}^{p} (x_i)_j \beta_j + \epsilon_i, \qquad i = 1, \ldots, n,$$

where $\beta_1, \ldots, \beta_p$ are the $p$ unknown parameters. We will assume that the $\epsilon_i$ are independent and identically distributed (i.i.d.) error terms with zero mean. Note that, if $(x_i)_1 = 1$ for all $i = 1, \ldots, n$, this is equivalent to having an intercept term given by $\beta_1$.
We can write this more conveniently by letting $X$ be the design matrix of size $n \times p$, whose $i$th row is given by the vector $x_i$ (where $(x_i)_1 = 1$ if the model has an intercept), while $y$ is an $n$-vector of observations, whose entries are $y_i$, and similarly for $\epsilon$:

$$y = X\beta + \epsilon.$$

The usual approach to finding an estimate for $\beta$, which we call $\hat{\beta}$, is the Ordinary Least Squares (OLS) estimator given by

$$\hat{\beta} = (X^T X)^{-1} X^T y,$$

which is efficient and has good statistical properties when the error terms are all of roughly the same magnitude (i.e., there are no outliers). On the other hand, the OLS estimator is very sensitive to outliers: even if a single observation lies far from the regression hyperplane, OLS will often fail to find a good estimate for the parameters $\beta$.
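To make this sensitivity concrete, the following small base-Julia sketch (not part of the package; the data are synthetic) fits OLS via the normal equations on clean data and then again after corrupting a single observation:

```julia
using LinearAlgebra, Random

Random.seed!(1)
n = 20
x = collect(1.0:n)
X = [ones(n) x]                  # design matrix with an intercept column
beta_true = [2.0, 0.5]
y = X * beta_true .+ 0.1 .* randn(n)

# OLS estimator: solves the normal equations (X'X) beta = X'y
ols(X, y) = (X' * X) \ (X' * y)

beta_clean = ols(X, y)           # close to the true [2.0, 0.5]

y_corrupt = copy(y)
y_corrupt[end] = 100.0           # a single gross outlier in the response
beta_corrupt = ols(X, y_corrupt) # both estimates pulled far from the truth

println(beta_clean)
println(beta_corrupt)
```

A single corrupted response is enough to move both the intercept and slope estimates substantially, which is the failure mode the package's robust methods are designed to avoid.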
To solve this problem, a number of methods have been developed in the literature. These methods can be roughly placed in one or more of the following five categories: diagnostics, direct methods, robust methods, multivariate methods, and visual methods. Diagnostics are methods which attempt to find points that significantly affect the fit of a model (often, such points can be labeled as outliers). Diagnostics can then be used to initialize direct methods, which fit a (usually non-robust) model to a subset of points suspected to be clear of outliers; remaining points that are not outliers with respect to this fit are iteratively added to the subset until all points outside it are deemed outliers. Robust methods, on the other hand, find a best-fit model by approximately minimizing a loss function that is resistant to outliers. Some of the proposed methods are also multivariate methods, which estimate robust location and scale measures of multivariate data. Finally, visual methods graphically present the statistics obtained from the methods in the other categories.
As an example, the method mveltsplot constructs a 2-D plot using robust distances and scaled residuals obtained from mve and lts, which are a multivariate method and a robust regression method, respectively. Many direct and robust regression methods select an initial basic (or "clean") subset of observations using the results of diagnostics and multivariate methods; this is why methods that are not directly related to regression are also included in the package.
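A hedged sketch of what this looks like in practice: the createRegressionSetting interface, the bundled phones dataset, and the exact call signatures below follow the package's documented style but should be treated as assumptions rather than verified output.

```julia
using LinRegOutliers

# Wrap a formula and a dataset into a regression setting that is shared
# by all of the package's methods (assumed interface).
setting = createRegressionSetting(@formula(calls ~ year), phones)

# Robust regression via Least Trimmed Squares; the result includes
# detected outlier indices among other quantities.
result = lts(setting)

# Visual diagnostic combining MVE robust distances with LTS scaled
# residuals, as described above.
mveltsplot(setting)
```

The shared setting object is what lets users swap one detection method for another without restating the model or data.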

Statement of need
In practice, many of the proposed methods have reasonable performance and yield similar results for most datasets, but they can differ widely in specific circumstances, for example in their masking and swamping ratios. Additionally, some of the methods are relatively complicated and, where canonical implementations are available, they are often out of date or only found in the specific languages of the authors' choice, making it difficult for researchers to compare the performance of these algorithms on their datasets.
We have reimplemented many of the algorithms available in the literature in Julia (Bezanson et al., 2017), an open-source, high-performance programming language designed primarily for scientific computing. Our package, LinRegOutliers, is a comprehensive and simple-to-use Julia package that includes many of the algorithms in the literature for detecting outliers in linear regression. The implemented methods for diagnostics, direct methods, robust methods, multivariate methods, and visual diagnostics are shown in Table 1, Table 2, Table 3, Table 4, and Table 5, respectively.

Table 1: Diagnostics (excerpt).

Algorithm               Citation                  Function
COVRATIO                (Belsley et al., 2005)    covratio
DFBETA                  (Belsley et al., 2005)    dfbeta
DFFIT                   (Belsley et al., 2005)    dffit
Mahalanobis Distances   (Mahalanobis, 1930)       mahalanobisSquaredMatrix
Cook Distances          (Cook, 1977)              cooks
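A hedged sketch of how such diagnostics might be invoked under a single interface; the createRegressionSetting object, the bundled phones dataset, and the call signatures are assumptions based on the function names listed above, and the 4/n cutoff is a conventional rule of thumb rather than something mandated by the package.

```julia
using LinRegOutliers

# Assumed shared interface: one setting object reused across methods.
setting = createRegressionSetting(@formula(calls ~ year), phones)

# Classical single-case deletion diagnostics; large values flag
# influential observations.
d = cooks(setting)    # Cook distances, one per observation
f = dffit(setting)    # DFFIT values, one per observation

# Observations whose Cook distance exceeds the conventional 4/n cutoff.
n = length(d)
suspects = findall(>(4 / n), d)
```

Because every diagnostic consumes the same setting object, switching from one diagnostic to another is a one-line change, which is the comparison workflow the package aims to support.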