cvCovEst : Cross-validated covariance matrix estimator selection and evaluation in R

Summary
Covariance matrices play fundamental roles in myriad statistical procedures. When the observations in a dataset far outnumber the features, asymptotic theory and empirical evidence have demonstrated the sample covariance matrix to be the optimal estimator of this parameter. This assertion does not hold when the number of observations is commensurate with or smaller than the number of features. Consequently, statisticians have derived many novel covariance matrix estimators for the high-dimensional regime, often relying on additional assumptions about the parameter's structural characteristics (e.g., sparsity). While these estimators have greatly improved the ability to estimate covariance matrices in high-dimensional settings, objectively selecting the best estimator from among the many possible candidates remains a largely unaddressed challenge. The cvCovEst package addresses this methodological gap through its implementation of a cross-validated framework for covariance matrix estimator selection. This data-adaptive procedure's selections are asymptotically optimal under minimal assumptions; in fact, they are equivalent to the selections that would be made if given full knowledge of the true data-generating processes (i.e., an oracle selector) (van der Laan & Dudoit, 2003).

Statement of Need
When the number of observations in a dataset far exceeds the number of features, the estimator of choice for the covariance matrix is the sample covariance matrix, which is efficient under minimal regularity assumptions on the data-generating distribution. In high-dimensional regimes, however, its performance is unsatisfactory: the sample covariance matrix is highly variable and produces estimates with diverging condition numbers and over-dispersed eigenvalues (Johnstone, 2001). Analyses relying on this demonstrably poor estimator may therefore be negatively impacted.
As high-dimensional data have become widespread, researchers have derived many novel covariance matrix estimators to remedy the sample covariance matrix's shortcomings. These estimators come in many flavors, though most are constructed by regularizing the sample covariance matrix. Comprehensive reviews are provided by Fan et al. (2016) and Pourahmadi (2013), and these estimators are implemented across a diversity of R packages: CovTools (Lee & You, 2019), CVTuningCov (Wang, 2014), and nlshrink (Ramprasad, 2016), to name but a few.
This variety brings with it many challenges. Identifying an "optimal" estimator from among a collection of candidates can prove a daunting task, one whose objectivity is often compromised by the data analyst's decisions. Though data-driven approaches for selecting an optimal estimator from among estimators belonging to certain (limited) classes have been derived, the question of selecting from a diverse collection of candidate procedures remains unaddressed.

cvCovEst Framework
The solution provided by cvCovEst is a general, cross-validation-based, estimator-agnostic framework for covariance matrix estimator selection. The asymptotic optimality of its selections is guaranteed under a few non-restrictive assumptions by extending the seminal work of van der Laan & Dudoit (2003), Dudoit & van der Laan (2005), and van der Vaart et al. (2006) on data-adaptive estimator selection to high-dimensional covariance matrix estimation. Here, optimality means that the selected estimator's risk difference is asymptotically equivalent to that of the estimator that would have been chosen were the underlying data-generating distribution completely known.
The cvCovEst software package implements this framework for the R language and environment for statistical computing (R Core Team, 2021). Included is a collection of covariance matrix estimators spanning the work of many researchers (Table 1). They may be employed independently of the cross-validation procedure. cvCovEst also provides a variety of plotting and summary functions. These diagnostic tools allow users to gauge the algorithm's performance, diagnose issues that might arise during estimation, and build intuition about the many estimators' behaviors. Additionally, users have options to increase the cross-validation method's computational efficiency via parallel computation. Parallelization relies on the suite of future packages (Bengtsson, 2020) by way of the origami package (Coyle & Hejazi, 2018).

Table 1: Covariance matrix estimators implemented in cvCovEst.

| Estimator | Implementation | Description |
|---|---|---|
| Hard Thresholding (Bickel & Levina, 2008b) | thresholdingEst() | Applies a hard thresholding operator to the entries of the sample covariance matrix. |
| SCAD Thresholding (Fan & Li, 2001; Rothman et al., 2009) | scadEst() | Applies the SCAD thresholding operator to the entries of the sample covariance matrix. |
| Adaptive LASSO (Rothman et al., 2009) | adaptiveLassoEst() | Applies the adaptive LASSO thresholding operator to the entries of the sample covariance matrix. |
| Banding (Bickel & Levina, 2008a) | bandingEst() | Replaces the sample covariance matrix's off-diagonal bands by zeros. |
| Tapering (Cai et al., 2010) | taperingEst() | Tapers the sample covariance matrix's off-diagonal bands, eventually replacing them by zeros. |
| Optimal Linear Shrinkage (Ledoit & Wolf, 2004) | linearShrinkLWEst() | Asymptotically optimal shrinkage of the sample covariance matrix towards the identity. |
| Linear Shrinkage (Ledoit & Wolf, 2004) | linearShrinkEst() | Shrinkage of the sample covariance matrix towards the identity, with the amount of shrinkage controlled by a hyperparameter. |
| Dense Linear Shrinkage (Schäfer & Strimmer, 2005) | denseLinearShrinkEst() | Asymptotically optimal shrinkage of the sample covariance matrix towards a dense matrix whose diagonal elements are the mean of the sample covariance matrix's diagonal elements and whose off-diagonal elements are the mean of its off-diagonal elements. |
| Nonlinear Shrinkage (Ledoit & Wolf, 2020) | nlShrinkLWEst() | Analytical estimator for the nonlinear shrinkage of the sample covariance matrix. |
| POET (Fan et al., 2013) | poetEst() | An estimator based on latent variable estimation and thresholding. |
| Robust POET (Fan et al., 2018) | robustPoetEst() | A robust (and more computationally taxing) take on the POET estimator. |
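Since these estimators may be used on their own, outside the cross-validated selection procedure, a minimal sketch of standalone usage might look as follows (the simulated data matrix and the gamma value are illustrative assumptions):

```r
library(cvCovEst)

set.seed(123)
# hypothetical high-dimensional data: n = 50 observations, p = 100 features
dat <- matrix(rnorm(50 * 100), nrow = 50, ncol = 100)

# asymptotically optimal linear shrinkage towards the identity
lw_est <- linearShrinkLWEst(dat = dat)

# hard thresholding with a user-chosen threshold hyperparameter
thresh_est <- thresholdingEst(dat = dat, gamma = 0.2)

# both return p x p covariance matrix estimates
dim(lw_est)
```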

Examples
We briefly showcase cvCovEst's functionality through a toy example and an application to single-cell transcriptomic data.

Toy Dataset Example
Multivariate normal data are simulated using a covariance matrix with a Toeplitz structure and then fed to the cvCovEst function. A summary of the cross-validated estimation procedure is provided via the plot method.
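The workflow above can be sketched as follows; the dimensions, the AR(1)-type Toeplitz covariance, and the candidate estimator library are illustrative assumptions rather than the paper's exact simulation settings:

```r
library(MASS)      # for mvrnorm()
library(cvCovEst)

set.seed(1584)
p <- 50
sigma <- toeplitz(0.5^(0:(p - 1)))                     # Toeplitz covariance
dat <- mvrnorm(n = 30, mu = rep(0, p), Sigma = sigma)  # n = 30 < p = 50

cv_results <- cvCovEst(
  dat = dat,
  estimators = c(linearShrinkEst, thresholdingEst, sampleCovEst),
  estimator_params = list(
    linearShrinkEst = list(alpha = seq(0.1, 0.9, by = 0.2)),
    thresholdingEst = list(gamma = seq(0.1, 0.9, by = 0.2))
  ),
  cv_scheme = "v_fold",
  v = 5
)

# the selected estimator's fit is returned in the output object
best_est <- cv_results$estimate
# the package's summary and plot methods can then be applied to
# cv_results to inspect the cross-validated selection procedure
```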

Single Cell Transcriptomic Data
Single-cell transcriptome sequencing (scRNA-seq) measures the gene expression profiles of individual cells within a given population, permitting the identification of rare cell types and the study of developmental trajectories. The datasets resulting from these experiments are typically high-dimensional: expression data for hundreds or thousands of cells are collected for tens of thousands of genes. A critical step in most analytic workflows is therefore that of dimension reduction. In addition to facilitating visualization, this reduction is thought to have a denoising effect. That is, the effects of uninteresting biological variation are typically mitigated in these lower-dimensional embeddings.
A popular method for the dimensionality reduction of scRNA-seq data is uniform manifold approximation and projection (UMAP) (McInnes et al., 2018), which is capable of capturing non-linear relationships between features and is typically applied to the dataset's leading principal components. Since these principal components (PCs) are derived from the sample covariance matrix, however, they are likely to be poor estimates of the true PCs when the number of genes exceeds the number of cells. Instead, the cvCovEst estimate could be used to compute the initial dimensionality reduction.
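One way this substitution might look in practice is sketched below; the simulated expression matrix, the minimal candidate library, and the number of retained components are assumptions, and uwot is mentioned only as one possible UMAP implementation:

```r
library(cvCovEst)

set.seed(42)
# hypothetical log-normalized expression matrix: 100 cells x 500 genes
expr <- matrix(rnorm(100 * 500), nrow = 100, ncol = 500)

cv_out <- cvCovEst(
  dat = expr,
  estimators = c(linearShrinkLWEst, sampleCovEst),
  cv_scheme = "v_fold",
  v = 5
)

# leading eigenvectors of the cross-validated estimate play the role of
# the loadings from standard PCA
eig <- eigen(cv_out$estimate, symmetric = TRUE)
pcs <- scale(expr, center = TRUE, scale = FALSE) %*% eig$vectors[, 1:20]

# these PCs could then be embedded in two dimensions with a UMAP
# implementation, e.g. uwot::umap(pcs)
```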
Indeed, we find that the two-dimensional UMAP embedding resulting from the cvCovEst-based approach improves upon that of the standard PCA-based approach when applied to the 1,000 most variable genes of 285 cells from the mouse visual cortex (Tasic et al., 2016). Fewer rare cells are misclustered, yielding a 47% improvement in average silhouette width. For further discussion, see Boileau et al. (2021).

Availability
A stable release of the cvCovEst package is freely available via the Comprehensive R Archive Network. Its development version can be found on GitHub. Documentation and examples are contained in each version's manual pages, vignette, and pkgdown (Wickham & Hesselberth, 2020) website at https://philboileau.github.io/cvCovEst.