SISSO++: A C++ Implementation of the Sure-Independence Screening and Sparsifying Operator Approach

Summary
The sure independence screening and sparsifying operator (SISSO) approach (Ouyang et al., 2018) is an artificial-intelligence algorithm that combines symbolic regression with compressed sensing. As a symbolic regression method, SISSO identifies the mathematical functions, i.e. the descriptors, that best predict the target property of a data set. Its compressed sensing aspect allows it to find sparse linear models from tens to thousands of data points. SISSO was introduced for both regression and classification tasks. In practice, SISSO first constructs a large and exhaustive feature space of trillions of potential descriptors by taking a set of user-provided primary features as a dataframe and iteratively applying a set of unary and binary operators, e.g. addition, multiplication, exponentiation, and squaring, according to a user-defined specification. From this exhaustive pool of candidate descriptors, the ones most correlated with the target property are identified via sure-independence screening, and from this subset the low-dimensional linear models with the lowest error are found via an ℓ0 regularization.
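The three stages described above (feature construction, sure-independence screening, and ℓ0 selection) can be illustrated with a minimal NumPy sketch. This is not the SISSO++ API: the function names `build_features`, `sis`, `l0_fit`, and `sisso` are hypothetical, only a single round of a few operators is applied, screening on the residual of the previous model is assumed, and the ℓ0 step is written as a brute-force search over all feature subsets of the requested size.

```python
import itertools
import numpy as np

def build_features(primary, names):
    """Apply one round of unary and binary operators to the primary features."""
    feats = {n: primary[:, i] for i, n in enumerate(names)}
    for n in names:                                   # unary operators
        feats[f"({n})^2"] = feats[n] ** 2
        feats[f"exp({n})"] = np.exp(feats[n])
    for a, b in itertools.combinations(names, 2):     # binary operators
        feats[f"({a}+{b})"] = feats[a] + feats[b]
        feats[f"({a}*{b})"] = feats[a] * feats[b]
    return feats

def sis(feats, target, k, exclude=()):
    """Return the k feature names with the largest |Pearson correlation| to target."""
    t = target - target.mean()
    scores = {}
    for name, col in feats.items():
        if name in exclude:
            continue
        c = col - col.mean()
        denom = np.linalg.norm(c) * np.linalg.norm(t)
        scores[name] = abs(c @ t) / denom if denom > 0 else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:k]

def l0_fit(feats, selected, target, dim):
    """Brute-force l0 step: least-squares fit of every dim-sized subset, keep the best."""
    best = None
    for combo in itertools.combinations(selected, dim):
        design = np.column_stack([feats[n] for n in combo] + [np.ones(len(target))])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        rmse = float(np.sqrt(np.mean(residual ** 2)))
        if best is None or rmse < best[0]:
            best = (rmse, combo, coef, residual)
    return best

def sisso(primary, names, target, max_dim=2, n_sis=5):
    """Alternate screening (on the current residual) and l0 selection up to max_dim."""
    feats = build_features(primary, names)
    selected, residual, best = [], target, None
    for dim in range(1, max_dim + 1):
        selected += sis(feats, residual, n_sis, exclude=selected)
        best = l0_fit(feats, selected, target, dim)
        residual = best[3]
    rmse, combo, coef, _ = best
    return rmse, combo, coef

# Toy data (an assumption for illustration): the target depends on the product a*b.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(40, 2))
y = 2.0 * X[:, 0] * X[:, 1] + 3.0
rmse, combo, coef = sisso(X, ["a", "b"], y, max_dim=1)
print(combo, coef, rmse)
```

The brute-force subset search is what makes the ℓ0 step expensive, which is why the screening step first reduces the candidate pool to a small subspace.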
Because symbolic regression generates an interpretable equation, it has become an increasingly popular concept across scientific disciplines (Neumann et al., 2020; Udrescu & Tegmark, 2020; Wang et al., 2019). A particular advantage of these approaches is their capability to model complex phenomena using relatively simple descriptors. SISSO has been used successfully to model, explore, and predict important material properties, including the stability of different phases (Bartel et al., 2018; Schleder et al., 2020); catalytic activity and reactivity (Andersen et al., 2019; Han et al., 2021; W. Xu et al., 2021); and glass transition temperatures (Pilania et al., 2019). Beyond regression problems, SISSO has also been used successfully to classify materials into different crystal prototypes, to determine whether a material crystallizes in its ground state as a perovskite (Bartel et al., 2019), and to determine whether a material is a topological insulator.
The SISSO++ package is an open-source (Apache-2.0 licence), modular, and extensible C++ implementation of the SISSO method with Python bindings. Specifically, SISSO++ applies this methodology to regression, log regression, and classification problems. Additionally, the library includes multiple Python functions to facilitate postprocessing, analysis, and visualization of the resulting models.

Statement of need
The main goal of the SISSO++ package is to provide a user-friendly, easily extendable version of the SISSO method for the scientific community. While both a FORTRAN (Ouyang, n.d.) and a Matlab (Gasper, n.d.) implementation of SISSO exist, their lack of native Python interfaces led to the development of multiple separate Python wrappers (Waroquiers, n.d.; C. Xu, n.d.). This package aims to rectify this situation by providing an implementation that has native Python bindings and can be used both in a massively parallel environment for discovering the descriptors and on personal computing devices for analyzing and visualizing the results. For this reason, all computationally intensive tasks are written in C++ and support parallelization via MPI and OpenMP. Additionally, the Python bindings allow one to easily incorporate the methods into computational workflows and postprocess results. They also enable the integration of SISSO into existing machine-learning frameworks, e.g. scikit-learn (Pedregosa et al., 2011), via submodules. The code is designed in a modular fashion, which simplifies the process of extending it for other applications. Finally, the project's extensive documentation and tutorials provide a good access point for new users of the method.

Features
The following features are implemented in SISSO++:
• A C++ library for using SISSO to find analytical models for a given problem
• Python bindings to interface with the C++ objects in a Python environment
• Postprocessing tools for visualizing models and analyzing results using Matplotlib (Hunter, 2007)
• The ability to solve n-dimensional classification problems using a combination of convex-hull overlap calculations and a linear support vector machine solver
• The ability to include non-linear parameters within features (e.g. exp(αx) and ln(x + β))
• A scikit-learn interface
• Complete API documentation defining all functions of the code
• Tutorials and Quick-Start Guides describing the basic functionality of the code
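To make the convex-hull overlap idea for classification concrete, the following sketch counts the data points that fall inside the overlap region of two classes for a one-dimensional descriptor, where each class's convex hull reduces to an interval. This is an illustrative assumption rather than the SISSO++ implementation: the function name `overlap_count_1d` is hypothetical, and the n-dimensional case requires a point-in-hull test instead of the simple interval comparison used here.

```python
import numpy as np

def overlap_count_1d(values, labels):
    """Count points lying in the overlap of the class intervals (1D convex hulls)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    count = 0
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            va, vb = values[labels == a], values[labels == b]
            lo = max(va.min(), vb.min())   # lower edge of the shared interval
            hi = min(va.max(), vb.max())   # upper edge of the shared interval
            if lo <= hi:                   # the two intervals intersect
                in_pair = (labels == a) | (labels == b)
                count += int(np.sum(in_pair & (values >= lo) & (values <= hi)))
    return count

# A descriptor that separates the classes perfectly gives zero overlap;
# fewer points in the overlap region indicates a better classifier.
separated = overlap_count_1d([0.0, 1.0, 5.0, 6.0], [0, 0, 1, 1])
mixed = overlap_count_1d([0.0, 1.0, 2.0, 3.0, 2.5, 4.0, 5.0], [0, 0, 0, 0, 1, 1, 1])
print(separated, mixed)
```

A count of this kind can rank candidate descriptors, with a linear support vector machine then supplying the decision boundary for the selected descriptor.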

Code Dependencies
The following libraries are used by SISSO++:
• Boost Serialization, MPI, System, and Filesystem are used for MPI communication and file management
• NLopt (Johnson, n.d.) is used to optimize the non-linear bias and scale parameters within features
• The CLP library from Coin-OR (Forrest et al., n.d.) is used to find the number of points in the convex hull overlap region for classification problems
• LIBSVM (Chang & Lin, 2011) is used to find the linear-SVM model for classification problems
• pybind11 (Jakob et al., 2017) is used to create the Python bindings
• Scikit-learn (Buitinck et al., 2013; Pedregosa et al., 2011) is used to update the SVM model for classification problems within Python
• NumPy (Harris et al., 2020) and pandas (McKinney, 2010; team, 2020) are used to represent the data structures within Python and perform array operations