Groupyr: Sparse Group Lasso in Python

Summary
For high-dimensional supervised learning, it is often beneficial to use domain-specific knowledge to improve the performance of statistical learning models. When the problem contains covariates which form groups, researchers can include this grouping information to find parsimonious representations of the relationship between covariates and targets. These groups may arise artificially, as from the polynomial expansion of a smaller feature space, or naturally, as from the anatomical grouping of different brain regions or the geographical grouping of different cities. When the number of features is large compared to the number of observations, one seeks a subset of the features which is sparse at both the group and global level.
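
The sparse group lasso is a penalized regression technique suited to this setting: it combines the lasso penalty, which encourages sparsity across all coefficients, with the group lasso penalty, which encourages entire groups to drop out of the model. As a sketch, assuming a squared-error loss with the scaling shown (the loss normalization is an assumption; the penalty structure follows from the hyperparameter definitions given in this section), the coefficient estimate for a target vector y and a feature matrix X can be written as

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2} \left\lVert y - \sum_{\ell=1}^{G} X^{(\ell)} \beta^{(\ell)} \right\rVert_2^2
  + (1 - \lambda)\, \alpha \sum_{\ell=1}^{G} \sqrt{p_\ell}\, \left\lVert \beta^{(\ell)} \right\rVert_2
  + \lambda\, \alpha \left\lVert \beta \right\rVert_1
```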

where G is the total number of groups, X^(ℓ) is the submatrix of X with columns belonging to group ℓ, β^(ℓ) is the coefficient vector of group ℓ, and p_ℓ is the length of β^(ℓ). The model hyperparameter λ controls the combination of the group lasso and the lasso, with λ = 0 giving the group lasso fit and λ = 1 yielding the lasso fit. The hyperparameter α controls the overall strength of the regularization.
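
To make the roles of λ and α concrete, the following minimal sketch evaluates the penalty for a small coefficient vector. It assumes the penalty takes the standard form (1 − λ) α Σ_ℓ √p_ℓ ‖β^(ℓ)‖₂ + λ α ‖β‖₁; the function name `sgl_penalty` and the toy values are hypothetical and are not part of groupyr.

```python
import numpy as np


def sgl_penalty(beta, groups, alpha, lam):
    """Sparse group lasso penalty (assumed standard form) for `beta`.

    `groups` is a list of integer index arrays, one per group.
    """
    beta = np.asarray(beta, dtype=float)
    # Group lasso part: sqrt(p_l) * ||beta^(l)||_2, summed over groups.
    group_term = sum(
        np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups
    )
    # Lasso part: ||beta||_1.
    l1_term = np.abs(beta).sum()
    return (1.0 - lam) * alpha * group_term + lam * alpha * l1_term


# Toy example: five coefficients in three groups of sizes 2, 2 and 1.
beta = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]

lasso_only = sgl_penalty(beta, groups, alpha=1.0, lam=1.0)  # pure lasso
group_only = sgl_penalty(beta, groups, alpha=1.0, lam=0.0)  # pure group lasso
mixed = sgl_penalty(beta, groups, alpha=2.0, lam=0.5)       # equal mix, scaled by alpha
```

With λ = 1 only the L1 norm contributes, with λ = 0 only the weighted group norms contribute, and α rescales the whole penalty.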

Statement of need
Groupyr is a Python library that implements the sparse group lasso as scikit-learn-compatible estimators (Buitinck et al., 2013; Pedregosa et al., 2011). It satisfies the need for grouped penalized regression models that can be used interoperably in researchers' real-world scikit-learn workflows. Some pre-existing Python libraries come close to satisfying this need. Lightning (Blondel & Pedregosa, 2016) is a Python library for large-scale linear classification and regression. It supports many solvers with a combination of the L1 and L2 penalties. However, it does not allow the user to specify groups of covariates (see, for example, this GitHub issue). Another Python package, group_lasso (Moe, 2020), is a well-designed and well-documented implementation of the sparse group lasso. It meets the basic API requirements of scikit-learn-compatible estimators. However, we found that our implementation in groupyr, which relies on the copt optimization library (Pedregosa, 2020), was between two and ten times faster for the problem sizes that we encounter in our research (see the repository's examples directory for a performance comparison). Additionally, we needed estimators with built-in cross-validation support using both grid search and sequential model-based optimization strategies. For example, the speed and cross-validation enhancements were crucial to using groupyr in AFQ-Insight, a neuroinformatics research library (Richie-Halford et al., 2019).

Usage
Groupyr is available on the Python Package Index (PyPI) and can be installed with

    pip install groupyr

Groupyr is compatible with the scikit-learn API and its estimators offer the same instantiate, fit, predict workflow that will be familiar to scikit-learn users. See the online documentation for a detailed description of the API and examples in both classification and regression settings. Here, we describe only the key differences necessary for scikit-learn users to get started with groupyr.
LogisticRegressionCV estimators, users must specify the group assignments for the columns of the feature matrix X. This is done during estimator instantiation using the groups parameter, which accepts a list of NumPy arrays, where the i-th array specifies the feature indices of the i-th group. If no grouping information is provided, the default behavior assigns all features to one group.
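
As an illustration of the format described above, the following sketch builds a groups list with NumPy alone. The feature count, the group sizes, and the assumption that groups are contiguous and non-overlapping are all hypothetical choices made for the example; only the list-of-index-arrays layout comes from the description of the groups parameter.

```python
import numpy as np

n_features = 10

# Hypothetical layout: three contiguous groups of sizes 3, 3 and 4.
group_sizes = [3, 3, 4]
boundaries = np.cumsum([0] + group_sizes)
groups = [
    np.arange(start, stop)
    for start, stop in zip(boundaries[:-1], boundaries[1:])
]

# Sanity check (assuming groups should not overlap): every feature index
# appears in exactly one group.
all_indices = np.concatenate(groups)
is_partition = np.array_equal(np.sort(all_indices), np.arange(n_features))

# A single group holding every feature, mirroring the default behavior
# when no grouping information is provided.
single_group = [np.arange(n_features)]
```

A list such as `groups` would then be passed as the groups argument when instantiating an estimator; see the online documentation for the estimator classes and their remaining parameters.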
Groupyr also offers cross-validation estimators that automatically select the best values of the hyperparameters α and λ using either an exhaustive grid search (with tuning_strategy="grid") or sequential model-based optimization (SMBO) using the scikit-optimize library (with tuning_strategy="bayes"). For the grid search strategy, our implementation is more efficient than using the base estimator with scikit-learn's GridSearchCV because it makes use of warm-starting, where the model is fit along a predefined regularization path and the solution from the previous fit is used as the initial guess for the current hyperparameter value. The randomness associated with SMBO complicates the use of a warm-start strategy; it can be difficult to determine which of the previously attempted hyperparameter combinations should provide the initial guess for the current evaluation. However, even without warm-starting, we find that the SMBO strategy usually outperforms grid search because far fewer evaluations are needed to arrive at the optimal hyperparameters. We provide examples of both strategies (grid search for a classification example and SMBO for a regression example) in the online documentation.
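
The warm-starting idea can be sketched with a small NumPy solver. This is not groupyr's implementation (which relies on copt); it is a plain proximal gradient loop written for illustration, assuming a squared-error loss and the standard sparse group lasso penalty. The names `sgl_prox` and `fit_sgl`, the synthetic data, and the α path are all hypothetical.

```python
import numpy as np


def sgl_prox(v, groups, step, alpha, lam):
    """Proximal operator of the (assumed) sparse group lasso penalty."""
    # Elementwise soft-thresholding handles the lasso term...
    out = np.sign(v) * np.maximum(np.abs(v) - step * lam * alpha, 0.0)
    # ...followed by groupwise shrinkage for the group lasso term.
    for idx in groups:
        norm = np.linalg.norm(out[idx])
        thresh = step * (1.0 - lam) * alpha * np.sqrt(len(idx))
        out[idx] = 0.0 if norm <= thresh else out[idx] * (1.0 - thresh / norm)
    return out


def fit_sgl(X, y, groups, alpha, lam, beta0=None, tol=1e-8, max_iter=5000):
    """Proximal gradient descent; returns (beta, number of iterations)."""
    beta = np.zeros(X.shape[1]) if beta0 is None else beta0.copy()
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for it in range(1, max_iter + 1):
        grad = X.T @ (X @ beta - y)
        new = sgl_prox(beta - step * grad, groups, step, alpha, lam)
        if np.max(np.abs(new - beta)) < tol:
            return new, it
        beta = new
    return beta, max_iter


# Synthetic problem: 12 features in three groups, signal only in the first.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 12))
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
true_beta = np.zeros(12)
true_beta[:2] = [2.0, -1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(60)

alphas = np.geomspace(50.0, 0.5, 10)  # path of decreasing regularization
warm_iters = cold_iters = 0
beta = None
for alpha in alphas:
    # Warm start: reuse the previous solution as the initial guess.
    beta, n_warm = fit_sgl(X, y, groups, alpha, lam=0.5, beta0=beta)
    # Cold start: begin from zeros at every value of alpha.
    cold, n_cold = fit_sgl(X, y, groups, alpha, lam=0.5)
    warm_iters += n_warm
    cold_iters += n_cold

print(f"warm-start iterations: {warm_iters}, cold-start iterations: {cold_iters}")
```

Both runs solve the same sequence of problems, so their solutions should agree to within the solver tolerance; comparing the two iteration totals shows how much work the warm start saves on a given problem. The same reuse is harder to arrange under SMBO because consecutive hyperparameter candidates need not be neighbors on a path.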