c-lasso - a Python package for constrained sparse and robust regression and classification

We introduce c-lasso, a Python package that enables sparse and robust linear regression and classification with linear equality constraints. The underlying statistical forward model is assumed to be of the following form: \[ y = X \beta + \sigma \epsilon \qquad \textrm{subject to} \qquad C\beta=0 \] Here, $X \in \mathbb{R}^{n\times d}$ is a given design matrix and the vector $y \in \mathbb{R}^{n}$ is a continuous or binary response vector. The matrix $C$ is a general constraint matrix. The vector $\beta \in \mathbb{R}^{d}$ contains the unknown coefficients and $\sigma$ an unknown scale. Prominent use cases are (sparse) log-contrast regression with compositional data $X$, requiring the constraint $1_d^T \beta = 0$ (Aitchison and Bacon-Shone 1984), and the Generalized Lasso, which is a special case of the described problem (see, e.g., (James, Paulson, and Rusmevichientong 2020), Example 3). The c-lasso package provides estimators for inferring unknown coefficients and scale (i.e., perspective M-estimators (Combettes and Müller 2020a)) of the form \[ \min_{\beta \in \mathbb{R}^d, \sigma \in \mathbb{R}_{0}} f\left(X\beta - y,{\sigma} \right) + \lambda \left\lVert \beta\right\rVert_1 \qquad \textrm{subject to} \qquad C\beta = 0 \] for several convex loss functions $f(\cdot,\cdot)$. This includes the constrained Lasso, the constrained scaled Lasso, and sparse Huber M-estimators with linear equality constraints.


Statement of need
Currently, there is no Python package available that can solve these ubiquitous statistical estimation problems in a fast and efficient manner. c-lasso provides algorithmic strategies, including path and proximal splitting algorithms, to solve the underlying convex optimization problems with provable convergence guarantees. The c-lasso package is intended to fill the gap between popular Python tools such as scikit-learn, which cannot solve these constrained problems, and general-purpose optimization solvers such as cvxpy, which do not scale well for these problems and/or are inaccurate. c-lasso can solve the estimation problems at a single regularization level, across an entire regularization path, and includes three model selection strategies for determining the regularization parameter: a theoretically-derived fixed regularization, k-fold cross-validation, and stability selection. We show several use cases of the package, including an application to sparse log-contrast regression for compositional microbiome data, and highlight the seamless integration into R via reticulate.

Functionalities

Installation and problem instantiation
c-lasso is available on pip and can be installed in the shell via:

```bash
pip install c-lasso
```

c-lasso is a stand-alone package and not yet compatible with the scikit-learn API. The central object in the c-lasso package is the instantiation of a c-lasso problem.
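As a concrete illustration of the required inputs, data for a log-contrast regression problem could be assembled as follows; this is a minimal sketch with synthetic data (not part of the package examples), where the zero-sum constraint is expressed as a single row of ones:

```python
import numpy as np

# Illustrative dimensions and synthetic data (assumption, for demonstration only)
n, d = 50, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))  # design matrix
y = rng.standard_normal(n)       # continuous response

# Zero-sum constraint 1_d^T beta = 0 as a one-row constraint matrix
C = np.ones((1, d))
```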
```python
# Import the main class of the package
from classo import classo_problem

# Define a c-lasso problem instance with default settings,
# given data X, y, and constraints C.
problem = classo_problem(X, y, C)
```

We next describe what type of problem instances are available and how to solve them.

Statistical problem formulations
Depending on the type of data and the prior assumptions on the noise ϵ and the model parameters, c-lasso allows for different estimation problem formulations. More specifically, the package can solve the following four regression-type and two classification-type formulations:

R1
Standard constrained Lasso regression: \[ \min_{\beta \in \mathbb{R}^d} \left\lVert X\beta - y\right\rVert^2_2 + \lambda \left\lVert \beta\right\rVert_1 \qquad \textrm{subject to} \qquad C\beta = 0 \] This is the standard sparse regression problem with least-squares loss, an $\ell_1$ penalty, and linear equality constraints on the $\beta$ vector.

R2
Constrained sparse Huber regression: \[ \min_{\beta \in \mathbb{R}^d} h_{\rho}\left(X\beta - y\right) + \lambda \left\lVert \beta\right\rVert_1 \qquad \textrm{subject to} \qquad C\beta = 0 \] This regression problem uses the Huber loss $h_{\rho}$ as objective function for robust model fitting with an $\ell_1$ penalty and linear equality constraints on the $\beta$ vector. The default parameter $\rho$ is set to 1.345 (Huber, 1981).

R3
Constrained scaled Lasso regression: \[ \min_{\beta \in \mathbb{R}^d, \sigma \in \mathbb{R}_{0}} \frac{\left\lVert X\beta - y\right\rVert^2_2}{\sigma} + \frac{n}{2} \sigma + \lambda \left\lVert \beta\right\rVert_1 \qquad \textrm{subject to} \qquad C\beta = 0 \] This formulation is the default problem formulation in c-lasso. It is similar to R1 but allows for joint estimation of the (constrained) $\beta$ vector and the standard deviation $\sigma$ in a concomitant fashion (Combettes & Müller, 2020a, 2020b).

R4
Constrained sparse Huber regression with concomitant scale estimation: \[ \min_{\beta \in \mathbb{R}^d, \sigma \in \mathbb{R}_{0}} \left( h_{\rho}\left( \frac{X\beta - y}{\sigma} \right) + n \right) \sigma + \lambda \left\lVert \beta\right\rVert_1 \qquad \textrm{subject to} \qquad C\beta = 0 \] This formulation combines R2 and R3, allowing robust joint estimation of the (constrained) $\beta$ vector and the scale $\sigma$ in a concomitant fashion (Combettes & Müller, 2020a, 2020b). It can be selected in c-lasso as follows:

```python
# Formulation R4
problem.formulation.huber = True
problem.formulation.concomitant = True
problem.formulation.classification = False
```

C1
Constrained sparse classification with Square Hinge loss: \[ \min_{\beta \in \mathbb{R}^d} \sum_{i=1}^{n} l\left(y_i x_i^T \beta\right) + \lambda \left\lVert \beta\right\rVert_1 \qquad \textrm{subject to} \qquad C\beta = 0 \] where $x_i$ denotes the $i$-th row of $X$, $y_i \in \{-1, 1\}$, and $l(\cdot)$ is defined for $r \in \mathbb{R}$ as $l(r) = \max(1-r, 0)^2$. This formulation is similar to R1 but adapted for classification tasks using the Square Hinge loss with (constrained) sparse $\beta$ vector estimation (Lee & Lin, 2013).

C2
Constrained sparse classification with Huberized Square Hinge loss: This formulation is similar to C1 but uses the Huberized Square Hinge loss $l_{\rho}$ for robust classification with (constrained) sparse $\beta$ vector estimation (Rosset & Zhu, 2007), where $l_{\rho}$ is defined for $r \in \mathbb{R}$ as \[ l_{\rho}(r) = \begin{cases} \max(1-r, 0)^2 & \textrm{if } r \geq \rho \\ (1-\rho)(1+\rho-2r) & \textrm{otherwise.} \end{cases} \] This formulation can be selected in c-lasso as follows:

```python
# Formulation C2
problem.formulation.huber = True
problem.formulation.concomitant = False
problem.formulation.classification = True
```
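For completeness, the remaining regression formulations map onto the same three flags (R3 is the package default, with concomitant estimation enabled); the following sketch spells this out:

```python
# Formulation R1: least-squares loss, no concomitant scale estimation
problem.formulation.huber = False
problem.formulation.concomitant = False
problem.formulation.classification = False

# Formulation R2: Huber loss, no concomitant scale estimation
problem.formulation.huber = True
problem.formulation.concomitant = False
problem.formulation.classification = False
```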

Optimization schemes
The problem formulations R1-C2 require different algorithmic strategies for efficiently solving the underlying optimization problems. The c-lasso package implements four published algorithms with provable convergence guarantees. The package also includes novel algorithmic extensions to solve Huber-type problems using the mean-shift formulation (Mishra & Müller, 2019). The following algorithmic schemes are implemented:
• Path algorithms (Path-Alg): This algorithm follows the proposal in (Gaines et al., 2018; Jeon et al., 2020) and uses the fact that the solution path along λ is piecewise-affine (Rosset & Zhu, 2007). We also provide a novel efficient procedure that allows deriving the solution for the concomitant problem R3 along the path with little computational overhead.
• Douglas-Rachford-type splitting method (DR): This algorithm can solve all regression problems R1-R4. It is based on Douglas-Rachford splitting in a higher-dimensional product space and makes use of the proximity operators of the perspective of the LS objective (Combettes & Müller, 2020a, 2020b). The Huber problem with concomitant scale R4 is reformulated as a scaled Lasso problem with a mean shift vector (Mishra & Müller, 2019) and thus solved in (n + d) dimensions.
• Projected primal-dual splitting method (P-PDS): This algorithm is derived from (Briceño-Arias & López Rivera, 2019) and belongs to the class of proximal splitting algorithms, extending the classical Forward-Backward (FB) (aka proximal gradient descent) algorithm to handle an additional linear equality constraint via projection. In the absence of a linear constraint, the method reduces to FB.
• Projection-free primal-dual splitting method (PF-PDS): This algorithm is a special case of an algorithm proposed in (Combettes & Pesquet, 2012) (Eq. 4.5) and also belongs to the class of proximal splitting algorithms. The algorithm does not require projection operators which may be beneficial when C has a more complex structure. In the absence of a linear constraint, the method reduces to the Forward-Backward-Forward scheme.
[Table: available optimization schemes and their recommended use for each problem formulation; DR is the only option for R4.]

The following Python snippet shows how to select a specific algorithm:

```python
problem.numerical_method = "Path-Alg"
# Alternative options: "DR", "P-PDS", and "PF-PDS"
```

Computation modes and model selection
The c-lasso package provides several computation modes and model selection schemes for tuning the regularization parameter.
• Fixed Lambda: This setting lets the user choose a fixed parameter λ or a proportion l ∈ [0, 1] such that λ = l × λ_max. The default value is a scale-dependent tuning parameter that has been derived in (Shi et al., 2016) and applied in (Combettes & Müller, 2020b).
• Path Computation: This setting allows the computation of a solution path for λ parameters in an interval [λ_min, λ_max]. The solution path is computed via the Path-Alg scheme or via warm-starts for other optimization schemes.
• Cross Validation: This setting allows the selection of the regularization parameter λ via k-fold cross validation for λ ∈ [λ_min, λ_max]. Both the Minimum Mean Squared Error (or Deviance) (MSE) and the "One-Standard-Error rule" (1SE) are available (Hastie et al., 2009).
• Stability Selection: This setting allows the selection of variables that are stable across subsamples of the data (Lin et al., 2014; Meinshausen & Bühlmann, 2010; Combettes & Müller, 2020b). It is the default model selection setting in c-lasso.
The Python syntax to use a specific computation mode and model selection is exemplified below:

```python
# Example of how to perform path computation and cross-validation:
problem.model_selection.LAMfixed = False
problem.model_selection.PATH = True
problem.model_selection.CV = True
problem.model_selection.StabSel = False

# Example of how to add stability selection to the problem instance:
problem.model_selection.StabSel = True
```

Each model selection procedure has additional meta-parameters that are described in the Documentation.
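As an illustration of such meta-parameters, the fixed-λ mode can be configured to interpret λ as a fraction of λ_max; a short sketch (attribute names assumed to follow the package documentation):

```python
# Fixed-lambda mode with a rescaled regularization parameter
problem.model_selection.LAMfixed = True
# Interpret lam as the fraction l in lambda = l * lambda_max
problem.model_selection.LAMfixedparameters.rescaled_lam = True
problem.model_selection.LAMfixedparameters.lam = 0.1
```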

Numerical benchmarks
To evaluate the optimization accuracy and running time of the different algorithms available in c-lasso, we provide micro-benchmark experiments which also include cvxpy, an open-source convex optimization package, for baseline comparison. All experiments have been computed using Python 3.9.1 on a MacBook Air with a 1.8 GHz Intel Core i5 processor and 8 GB 1600 MHz DDR3 memory, operating on macOS High Sierra. Figure 1 summarizes the results for the Path-Alg, DR, and P-PDS algorithms solving the regression formulation R1 for different sample sizes n and problem dimensions d on synthetic data (using c-lasso's data generator). We observe that c-lasso's algorithms are faster and more accurate than the cvxpy baseline. For instance, for d = 500 features and n = 500 samples, the Path-Alg algorithm is about 70 times faster than cvxpy. The complete reproducible micro-benchmark is available here.

Computational examples

Toy example using synthetic data
We illustrate the workflow of the c-lasso package on synthetic data using the built-in routine random_data, which enables the generation of test problem instances with normally distributed data X, sparse coefficient vectors β, and constraints C ∈ R^{k×d}.
Here, we use a problem instance with n = 100, d = 100, a β with five non-zero components, σ = 0.5, and a zero-sum constraint.
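This workflow can be sketched as follows; the random_data call mirrors the package's README example, and the exact argument order and seed are assumptions for illustration:

```python
from classo import classo_problem, random_data

# Generate a test problem: n samples, d features, 5 non-zero
# coefficients, k = 1 zero-sum constraint, noise scale sigma = 0.5
n, d, d_nonzero, k, sigma = 100, 100, 5, 1, 0.5
(X, C, y), sol = random_data(n, d, d_nonzero, k, sigma,
                             zerosum=True, seed=123)

# Instantiate and solve with default settings
problem = classo_problem(X, y, C)
problem.solve()

# Print a summary of the computed solutions
print(problem.solution)
```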
c-lasso allows standard visualization of the computed solutions, e.g., coefficient plots at a fixed λ, the solution path, the stability selection profile at the selected λ, and the stability selection profile across the entire path.

Log-contrast regression on gut microbiome data
We next illustrate the application of c-lasso on the COMBO microbiome dataset (Combettes & Müller, 2020b; Lin et al., 2014; Shi et al., 2016). Here, the task is to predict the Body Mass Index (BMI) of n = 96 participants from d = 45 relative abundances of bacterial genera, and absolute calorie and fat intake measurements. The code snippet for this example is available in the README.md and the example notebook.

Figure: Stability selection profiles using formulation R3 (left) and R4 (right) on the COMBO dataset, reproducing Figure 5a in (Combettes & Müller, 2020b).
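A condensed sketch of the estimation setup (data loading omitted; X is assumed to hold the log-transformed relative abundances together with the intake covariates, and C to encode the zero-sum constraint over the genera):

```python
from classo import classo_problem

# Stability selection with the concomitant formulation R3
problem = classo_problem(X, y, C)
problem.formulation.concomitant = True   # formulation R3
problem.formulation.huber = False        # set to True for formulation R4
problem.model_selection.StabSel = True
problem.solve()
print(problem.solution)
```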

Calling c-lasso in R
The c-lasso package also integrates with R via the R package reticulate. We refer to reticulate's manual for technical details about connecting Python environments and R. A successful use case of c-lasso is available in the R package trac (Bien et al., 2020), enabling tree-structured aggregation of predictors when features are rare.
The code snippet below shows how c-lasso is called in R to perform regression at a fixed λ = 0.1 λ_max. In R, X and C need to be of matrix type, and y of array type.
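A sketch of such a call via reticulate (the model_selection attribute names are assumed to mirror the Python API shown above):

```r
library(reticulate)

# Import the classo Python module
classo <- import("classo")

# Instantiate the problem; X and C are matrices, y is an array
problem <- classo$classo_problem(X = X, C = C, y = y)

# Select regression at a fixed lambda = 0.1 * lambda_max
problem$model_selection$PATH <- FALSE
problem$model_selection$CV <- FALSE
problem$model_selection$StabSel <- FALSE
problem$model_selection$LAMfixed <- TRUE
problem$model_selection$LAMfixedparameters$rescaled_lam <- TRUE
problem$model_selection$LAMfixedparameters$lam <- 0.1

problem$solve()
```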