scPCA: A toolbox for sparse contrastive principal component analysis in R

Data pre-processing and exploratory data analysis are crucial steps in the data science lifecycle, often relying on dimensionality reduction techniques to extract pertinent signal. As the collection of large and complex datasets becomes the norm, the need for methods that can successfully glean pertinent information from among increasingly intricate technical artifacts is greater than ever. What’s more, many of the most historically reliable and commonly used methods have demonstrably poor performance, or even fail outright, in reducing the dimensionality of large and noisy datasets in a stable, interpretable, and relevant manner.


Summary
Data pre-processing and exploratory data analysis are crucial steps in the data science lifecycle, often relying on dimensionality reduction techniques to extract pertinent signal. As the collection of large and complex datasets becomes the norm, the need for methods that can successfully glean pertinent information from among increasingly intricate technical artifacts is greater than ever. What's more, many of the most historically reliable and commonly used methods have demonstrably poor performance, or even fail outright, in reducing the dimensionality of large and noisy datasets in a stable, interpretable, and relevant manner.
Principal component analysis (PCA) is one such method. Although popular for its interpretable results and ease of implementation, PCA's performance on high-dimensional data often leaves much to be desired. Its performance has been characterized as unstable in such settings (Johnstone & Lu, 2009), and it has been shown to often emphasize unwanted variation (e.g., batch effects) in lieu of the signal of interest.
Consequently, modifications of PCA have been developed to remedy these issues. Namely, sparse PCA (SPCA) (Zou, Hastie, & Tibshirani, 2006) was created to increase the stability and interpretability of the principal component loadings in high dimensions, while constrastive PCA (cPCA) (Abid, Zhang, Bagaria, & Zou, 2018) leverages control data to adjust for unwanted effects and capture relevant information.
Although SPCA and cPCA have proven useful in resolving individual shortcomings of PCA, neither is capable of tackling the issues of stability and relevance simultaneously. The scPCA R package implements sparse constrastive PCA (scPCA) (Boileau, Hejazi, & Dudoit, 2019), a combination of these methods, drawing on cPCA to remove unwanted effects and on SPCA to sparsify the principal component loadings. In both simulation studies and data analysis, Boileau et al. (2019) provided practical demonstrations of scPCA's ability to extract stable, interpretable, and uncontaminated signal from high-dimensional biological data. Indeed, scPCA was found to produce more informative and interpretable embeddings than linear (e.g. PCA, cPCA) and non-linear dimensionality reduction methods (e.g. UMAP (McInnes, Healy, & Melville, 2018), t-SNE (van der Maaten & Hinton, 2008)) commonly used to explore high-dimensional biological data. Such demonstrations included the re-analysis of several publicly available protein expression, microarray gene expression, and single-cell transcriptome sequencing datasets.
As the scPCA software package was specially designed for use in disentangling biological signal from technical noise in high-throughput sequencing data, a free and open-source software implementation has been made available via the Bioconductor Project (Gentleman, Carey, Huber, Irizarry, & Dudoit, 2006;Gentleman et al., 2004;Huber et al., 2015) for the R language and environment for statistical computing (R Core Team, 2020). The scPCA package also implements cPCA, previously unavailable in the R language, in two flavors: (1) the semiautomated version of Abid et al. (2018) and (2) the automated version formulated by Boileau et al. (2019). In order to interface seamlessly with data structures common in computational biology, the scPCA package integrates fully with the SingleCellExperiment container class (Lun & Risso, 2019), using the class to store the cPCA and scPCA representations generated via the reducedDims accessor method. Finally, to facilitate parallel computation, the scPCA package contains parallelized versions of each of its core subroutines, making use of the infrastructure provided by the BiocParallel package. In order to effectively use parallelization, one need only set parallel = TRUE in a call to the scPCA package, after having registered a particular parallelization backend, as per the BiocParallel documentation.