PyUoI: The Union of Intersections Framework in Python

1 Redwood Center for Theoretical Neuroscience, University of California, Berkeley, Berkeley, California, USA
2 Department of Physics, University of California, Berkeley, Berkeley, California, USA
3 Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
4 Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
5 Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, California, USA
DOI: 10.21105/joss.01799


Summary
The increasing size and complexity of scientific data require statistical analysis methods that scale and produce models that are both interpretable and predictive. Interpretability implies that one can interpret the output of the model in terms of the processes generating the data (Murdoch, Singh, Kumbier, Abbasi-Asl, & Yu, 2019). This typically requires identification of a small number of features in the actual data and accurate estimation of their contributions (Bickel et al., 2006). Meanwhile, achieving predictive power requires optimizing some statistical measure of performance, such as precision or mean squared error. Across inference procedures, there is often a trade-off between interpretability and predictive power. The impact of this trade-off is particularly acute for scientific applications, where the output of the model is used to provide insight into the underlying physical processes that generated the data.
We recently introduced Union of Intersections (UoI), a flexible, modular, and scalable framework designed to enhance both the identification of features (model selection) and the estimation of the contributions of these features (model estimation) (Bouchard et al., 2017). UoI-based methods leverage stochastic data resampling and a range of sparsity-inducing regularization parameters to build families of potential feature sets robust to perturbations of the data, and then average nearly unbiased parameter estimates of selected features to maximize predictive accuracy. Models inferred through the UoI framework are characterized by their usage of fewer parameters with little or no loss in predictive accuracy, and reduced bias relative to benchmark approaches.
PyUoI is a Python package containing implementations of a variety of UoI-based algorithms, encompassing regression, classification, and dimensionality reduction. To facilitate its usage, PyUoI's API is structured similarly to that of scikit-learn, a commonly used Python machine learning library (Buitinck et al., 2013; Pedregosa et al., 2011).
The UoI framework operates by fitting many models across resamples of the dataset and across a set of regularization parameters. Since these fits can be performed in parallel, the UoI framework is naturally scalable. PyUoI is equipped with mpi4py functionality to parallelize model fitting on large datasets (Dalcín, Paz, & Storti, 2005).
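The independence of the fits can be illustrated with a small sketch. It is not PyUoI's mpi4py code: a standard-library thread pool stands in for MPI ranks, and the toy dataset, the function name fit_one, and the closed-form one-feature lasso (using the 1/(2n) scaling of the squared error) are all assumptions made for illustration.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy one-feature dataset (illustrative only): y = 2 * x with no noise.
x = [i / 10 for i in range(1, 21)]
y = [2.0 * xi for xi in x]

def fit_one(task):
    """Fit a one-feature lasso on bootstrap resample k with penalty lam."""
    k, lam = task
    rng = random.Random(k)  # seeded by resample index, so each task is self-contained
    idx = [rng.randrange(len(x)) for _ in x]  # bootstrap: sample with replacement
    rho = sum(x[i] * y[i] for i in idx) / len(idx)
    z = sum(x[i] ** 2 for i in idx) / len(idx)
    # Closed-form one-dimensional lasso solution (soft-thresholding).
    beta = math.copysign(max(abs(rho) - lam, 0.0), rho) / z
    return k, lam, beta

lambdas = [0.0, 0.05, 0.5]
tasks = list(product(range(4), lambdas))  # 4 resamples x 3 penalties = 12 fits

# No task depends on another, so the tasks can be mapped over any worker pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_one, tasks))

for k, lam, beta in results:
    print(k, lam, round(beta, 3))
```

Each (resample, penalty) pair is one unit of work, so the same task list could in principle be distributed across processes or nodes instead of threads.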

Background
The Union of Intersections is not a single method or algorithm, but a flexible statistical framework into which other algorithms can be inserted. In this section, we briefly describe UoI Lasso, the UoI implementation of lasso penalized regression. UoI Lasso is similar in structure to the UoI versions of other generalized linear models (logistic and Poisson). We refer the reader to existing literature on the UoI variants of column subset selection and non-negative matrix factorization (Bouchard et al., 2017; Ubaru, Wu, & Bouchard, 2017).
Linear regression consists of estimating parameters $\beta \in \mathbb{R}^{p \times 1}$ that map a $p$-dimensional vector of features $x \in \mathbb{R}^{p \times 1}$ to the observation variable $y \in \mathbb{R}$, when the $N$ samples are corrupted by i.i.d. Gaussian noise:
$$y = \beta^T x + \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ for each sample. When the true $\beta$ is thought to be sparse (i.e., some subset of the $\beta$ are exactly zero), an estimate of $\beta$ can be found by solving a constrained optimization problem of the form
$$\hat{\beta} = \underset{\beta \in \mathbb{R}^{p}}{\operatorname{argmin}} \; \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \beta^T x_i\right)^2 + \lambda |\beta|_1,$$
where $|\beta|_1$ is the $\ell_1$-norm of the parameters and $i$ indexes data samples. The $\ell_1$-norm is a convenient penalty because it will tend to force parameters to be set exactly equal to zero, performing feature selection (Tibshirani, 1994). Typically, $\lambda$, the degree to which feature sparsity is enforced, is unknown and must be determined through cross-validation or a penalized score function across a set of hyperparameters $\{\lambda_j\}_{j=1}^{k}$.
The key mathematical idea underlying UoI is to perform model selection through intersection (compressive) operations and model estimation through union (expansive) operations, in that order. This separation of parameter selection and estimation provides selection profiles that are more robust and parameter estimates that have less bias. This can be contrasted with a typical Lasso fit, wherein parameter selection and estimation are performed simultaneously. The Lasso procedure can lead to selection profiles that are not robust to data resampling and estimates that are biased by the penalty on $\beta$. For UoI Lasso, the procedure is as follows (see Algorithm 1 for a more detailed pseudocode):
• Model Selection: For each $\lambda_j$ in the Lasso path, generate estimates on $N_S$ resamples of the data (Line 2). The support $S_j$ (i.e., the set of non-zero parameters) for $\lambda_j$ consists of the features that persist in all model fits across the resamples (i.e., through an intersection) (Line 7).
• Model Estimation: For each support $S_j$, perform Ordinary Least Squares (OLS) on $N_E$ resamples of the data. The final model is obtained by averaging (i.e., taking the union) across the supports chosen according to some model selection criteria for each resample (Lines 15-16). The model selection criteria can be prediction quality on held-out data or penalized likelihood methods (e.g., AIC or BIC).
Thus, the selection module ensures that, for each $\lambda_j$, only features that are stable to perturbations in the data (resamples) are allowed in the support $S_j$. This provides a family of resample-stable model supports with varying levels of sparsity due to $\lambda_j$ that can be used in estimation. Then, the estimation module ensures that the most predictive supports per resample are averaged together in the final model. The estimation module uses OLS rather than Lasso to provide parameter estimates with low bias. The degree of feature compression via intersections (quantified by $N_S$) and the degree of feature expansion via unions (quantified by $N_E$) can be balanced to maximize prediction accuracy for the response variable $y$.
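Algorithm 1 UoI Lasso
Input: $X \in \mathbb{R}^{N \times p}$ design matrix
       $y \in \mathbb{R}^{N \times 1}$ response variable
       Regularization strengths $\{\lambda_j\}_{j=1}^{q}$
       Number of resamples $N_S$ and $N_E$
       Loss function $L(\beta; X, y)$
Model Selection
1: for $k = 1$ to $N_S$ do
2:     Generate resample $X^k, y^k$
3:     for $j = 1$ to $q$ do
4:         $\hat{\beta}^{jk} \leftarrow$ Lasso regression (penalty $\lambda_j$) of $y^k$ on $X^k$
5:         $S_j^k \leftarrow \{i\}$ where $\hat{\beta}_i^{jk} \neq 0$
6: for $j = 1$ to $q$ do
7:     $S_j \leftarrow \bigcap_{k=1}^{N_S} S_j^k$
Model Estimation
8: for $k = 1$ to $N_E$ do
9:     Generate resample
10:     $(X_T^k, y_T^k), (X_E^k, y_E^k) \leftarrow$ training and evaluation
11:     for $j = 1$ to $q$ do
12:         $\hat{\beta}_{S_j}^{k} \leftarrow$ OLS regression of $y_T^k$ on $X_T^k$, restricted to the features in $S_j$
13:         $\ell_j^k \leftarrow L(\hat{\beta}_{S_j}^{k}; X_E^k, y_E^k)$
14:     $\hat{\beta}^{k} \leftarrow \hat{\beta}_{S_{j^*}}^{k}$ with $j^* = \operatorname{argmin}_j \ell_j^k$
15: $\hat{\beta}^* \leftarrow$ average of $\hat{\beta}^{k}$ over $k = 1, \ldots, N_E$
16: return $\hat{\beta}^*$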
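The two modules of Algorithm 1 can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not PyUoI's implementation: the function names (lasso_cd, split, uoi_lasso) are hypothetical, resamples are random 80% subsamples without replacement, the Lasso objective uses the 1/(2n) scaling of the squared error, no intercept is fit, and the OLS step is obtained by running the same coordinate descent with a zero penalty on the selected columns.

```python
import random

def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/(2n)) * sum((y - X b)^2) + lam * sum(|b|)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b; equals y while b is all zeros
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            # Soft-thresholding: coefficients with |rho| <= lam become exactly zero.
            new = max(abs(rho) - lam, 0.0) / col_sq[j]
            if rho < 0:
                new = -new
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def split(n, frac, rng):
    """Randomly split range(n) into a training part and a held-out part."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(frac * n)
    return idx[:cut], idx[cut:]

def uoi_lasso(X, y, lambdas, n_sel=5, n_est=5, frac=0.8, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Model selection: for each penalty, keep features that survive every resample.
    supports = [set(range(p)) for _ in lambdas]
    for _ in range(n_sel):
        train, _held = split(n, frac, rng)
        Xk, yk = [X[i] for i in train], [y[i] for i in train]
        for j, lam in enumerate(lambdas):
            beta = lasso_cd(Xk, yk, lam)
            supports[j] &= {i for i, b in enumerate(beta) if b != 0.0}
    # Model estimation: per resample, refit each support without a penalty
    # (lam = 0 gives least squares) and keep the best one on held-out data.
    estimates = []
    for _ in range(n_est):
        train, held = split(n, frac, rng)
        best_loss, best_beta = None, None
        for support in supports:
            cols = sorted(support)
            beta = [0.0] * p
            if cols:
                Xs = [[X[i][c] for c in cols] for i in train]
                fit = lasso_cd(Xs, [y[i] for i in train], 0.0)
                for c, b in zip(cols, fit):
                    beta[c] = b
            loss = sum((y[i] - sum(X[i][c] * beta[c] for c in cols)) ** 2
                       for i in held) / len(held)
            if best_loss is None or loss < best_loss:
                best_loss, best_beta = loss, beta
        estimates.append(best_beta)
    # Union step: average the per-resample winners.
    return supports, [sum(b[c] for b in estimates) / n_est for c in range(p)]

# Synthetic example (assumed data): two informative features out of five.
rng = random.Random(1)
true_beta = [3.0, 0.0, -2.0, 0.0, 0.0]
X = [[rng.gauss(0, 1) for _ in true_beta] for _ in range(60)]
y = [sum(b * v for b, v in zip(true_beta, row)) + rng.gauss(0, 0.1) for row in X]
supports, beta_hat = uoi_lasso(X, y, lambdas=[0.01, 0.1, 1.0, 2.5])
print(supports)
print([round(b, 2) for b in beta_hat])
```

Because the refit carries no penalty, the averaged coefficients are not shrunk toward zero, while the intersection step controls which features are allowed into each candidate support.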

Features
PyUoI is split up into two modules, with the following UoI algorithms:
• linear_model (generalized linear models)
  - Lasso penalized linear regression, UoI Lasso.
The generalized linear models we have implemented include the most commonly used models in a variety of scientific disciplines, particularly in the fields of neuroscience and genomics.
Extensions to other generalized linear models (e.g., negative binomial regression, gamma regression, etc.) are left as future work. However, given the inheritance structure of the PyUoI framework, these extensions should be straightforward for the interested user.
Similar to scikit-learn, each UoI algorithm has its own Python class. Instantiations of these classes are created with specific hyperparameters and are fit to user-provided datasets.
The hyperparameters allow the user to fine-tune the number of resamples, fraction of data in each resample, and the model selection criteria used in the estimation module (in Algorithm 1, test set accuracy is used, but the Akaike and Bayesian Information Criteria are also available (Akaike, 1998; Schwarz, 1978)).
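As a rough illustration of how the information criteria differ from held-out accuracy, the sketch below scores a Gaussian linear model from its residual sum of squares. The function name and the example numbers are hypothetical, additive constants common to all models are dropped, and PyUoI's own scoring code may differ in its details.

```python
import math

def gaussian_information_criteria(rss, n_samples, n_params):
    """AIC and BIC of a Gaussian linear model, up to a shared additive constant."""
    # -2 * (maximized log-likelihood), with model-independent constants dropped.
    neg2_log_lik = n_samples * math.log(rss / n_samples)
    aic = neg2_log_lik + 2 * n_params
    bic = neg2_log_lik + n_params * math.log(n_samples)
    return aic, bic

# Two candidate supports fit to the same 100 samples (made-up numbers):
# the larger support lowers the residual sum of squares only slightly.
aic_small, bic_small = gaussian_information_criteria(rss=50.0, n_samples=100, n_params=3)
aic_large, bic_large = gaussian_information_criteria(rss=49.5, n_samples=100, n_params=6)
print(aic_small < aic_large, bic_small < bic_large)
```

Lower values are better for both criteria. BIC charges log(n) per parameter rather than 2, so it favors sparser supports once the number of samples exceeds about 7.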
Additionally, UoI is agnostic to the specific solver used for a given model. That is, the UoI framework operates on fits obtained from performing the optimization for a specified model (such as the lasso optimization problem for linear regression). In the case of PyUoI, the generalized linear models come equipped with a coordinate descent solver (from scikit-learn), a built-in Orthant-Wise Limited-memory Quasi-Newton solver (Gong & Ye, 2015), and the pycasso solver (Ge, 2019). The choice of solver is left to the user as a hyperparameter.
If a different solver is desired, PyUoI could be extended by the user to utilize this solver in a straightforward manner.

Applications
We have used PyUoI largely in the realm of neuroscience and genomics (Bouchard et al., 2017; Ubaru et al., 2017). A few applications include:
• Interpretable functional connectivity networks from neural populations in the visual, auditory, and motor cortices of various animal models;
• Sparse decoding of behavioral activity from spiking neural activity;
• Parts-based decomposition of electrocorticography recordings in rat auditory cortex that reflect functional cortical organization;
• Extraction of characteristic single nucleotide polymorphisms for the prediction of phenotypes in mice.
However, the algorithms implemented in PyUoI are broadly applicable to problems where enforcing sparsity at minimal cost in bias is desired.