rFBP: Replicated Focusing Belief Propagation algorithm

Summary
The rFBP project implements a scikit-learn compatible binary classifier based on fully connected neural networks. Its learning algorithm, Replicated Focusing Belief Propagation (rFBP), converges quickly and is robust (less prone to brittle overfitting) on ill-posed datasets, i.e., datasets with very few samples compared to the number of features. The current implementation works only with binary features, such as one-hot encodings of categorical data.
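The binary-feature requirement can be met for categorical data through one-hot encoding. The following is a minimal sketch in plain Python; the function name `one_hot_encode` and the toy data are illustrative assumptions and are not part of the rFBP library.

```python
def one_hot_encode(rows):
    """Turn rows of categorical values into rows of 0/1 indicator features."""
    n_columns = len(rows[0])
    # Collect the sorted set of categories observed in each column.
    categories = [sorted({row[j] for row in rows}) for j in range(n_columns)]
    encoded = []
    for row in rows:
        bits = []
        for j, value in enumerate(row):
            # One indicator per category: 1 where the value matches, else 0.
            bits.extend(1 if value == cat else 0 for cat in categories[j])
        encoded.append(bits)
    return encoded, categories


# Hypothetical toy data: two categorical columns, three samples.
samples = [
    ["A", "high"],
    ["C", "low"],
    ["G", "low"],
]
X, categories = one_hot_encode(samples)
```

Each row of `X` contains exactly one active bit per original column, so a dataset with many categorical variables quickly yields far more features than samples, which is the ill-posed regime described above.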
This library has already been used successfully to predict source attribution from GWAS (Genome-Wide Association Studies) data. That study aimed to predict the animal origin of an infectious bacterial disease within the H2020 European project COMPARE (Grant agreement ID: 643476). A full description of the pipeline used in the study is available in the abstract and slides provided in the publications folder of the project.
[Figure: Algorithm application on real data.]

Statement of need
The learning problem under ill-posed conditions can be tackled through statistical mechanics models combined with the so-called Large Deviation Theory (Baldassi & Braunstein, 2015; Baldassi et al., 2016b; Monasson & Zecchina, 1995a, 1995b; Parisi, 2007). In general, the learning problem can be split into two sub-parts: the classification problem and the generalization problem. The first aims to completely store a set of patterns, i.e., a known ensemble of input-output associations (perfect learning) (Baldassi et al., 2016a; Krauth & Mézard, 1989). The second consists of computing a discriminant function, based on a set of features of the input, which guarantees a unique association for each pattern.
From a statistical point of view, many neural network models have been proposed, and spin-glass models have emerged as the most promising ones. Starting from a balanced distribution of the system, generally based on the Boltzmann distribution, and under proper conditions, it can be proven that the classification problem is an NP-complete computational problem (Blum & Rivest, 1992). A wide range of heuristic solutions to this type of problem has been proposed (Baldassi, Braunstein, Brunel, & Zecchina, 2007; Braunstein & Zecchina, 2006; Huang & Kabashima, 2014).
In this project we present one of these algorithms, developed by Baldassi et al. (2016a) and called Replicated Focusing Belief Propagation (rFBP). The rFBP algorithm is a learning algorithm developed to justify the learning process of a binary neural network framework. The model is based on a spin-glass distribution of neurons placed on a fully connected neural network architecture. Each neuron is thus identified by a spin, so each weight can take only binary values (-1 and 1). The learning rule that controls the weight updates is given by the Belief Propagation method.
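To make the binary-weight constraint concrete, the sketch below computes the output of a small fully connected network whose weights take only the values -1 and +1. It assumes a single hidden layer whose units vote by majority (a committee machine); that architecture, the function names, and the numbers are assumptions made for illustration. The sketch covers only the prediction step, not the Belief Propagation learning rule.

```python
def sign(x):
    """Return +1 for non-negative input and -1 otherwise."""
    return 1 if x >= 0 else -1


def predict(weights, pattern):
    """Output of a fully connected committee machine with binary weights.

    weights: one list of -1/+1 values per hidden unit (assumed layout).
    pattern: list of -1/+1 input values.
    """
    # Every hidden unit sees the whole input (fully connected).
    hidden = [sign(sum(w * x for w, x in zip(unit, pattern))) for unit in weights]
    # The network output is the majority vote of the hidden units.
    return sign(sum(hidden))


# Hypothetical weights: 3 hidden units, 5 inputs, entries in {-1, +1}.
weights = [
    [+1, -1, +1, +1, -1],
    [-1, -1, +1, -1, +1],
    [+1, +1, -1, +1, +1],
]
pattern = [+1, -1, -1, +1, +1]
label = predict(weights, pattern)
```

Using an odd number of inputs and an odd number of hidden units, as here, avoids ties in both the per-unit sums and the final vote.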
A first implementation of the algorithm was proposed in the original paper (Baldassi et al., 2016a), together with an open-source GitHub repository. The original version of the algorithm was written in Julia. Julia is certainly an efficient programming language, but it is not among the tools of choice of most machine learning developers. To broaden the scope and use of the method, a C++ implementation was developed together with a Cython wrapper for Python users. The C++ implementation offers better computational performance than the Julia one, and the Python version improves usability. This implementation is optimized for parallel computing and comes with a custom C++ library called Scorer, which can compute a large number of statistical measurements based on a hierarchical graph scheme. With this optimized implementation and its scikit-learn compatibility, we aim to encourage researchers to approach these alternative algorithms and to use them more frequently in real-world contexts.
Like the Julia implementation, the C++ one provides the entire rFBP framework in a single library callable via a command line interface. The library makes extensive use of templates to perform dynamic specialization of the methods between the two magnetization versions of the algorithm. The main object categories needed by the algorithm are wrapped in handy C++ objects that are also easy to use from the Python interface.
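In practice, scikit-learn compatibility means that a classifier exposes the `fit`/`predict` convention. The sketch below shows that convention with a hypothetical stand-in estimator, `ClippedHebbClassifier`, which learns binary weights with a clipped Hebbian rule. This rule is a deliberately simple substitute and is not the rFBP algorithm, and the class is not part of the rFBP package.

```python
class ClippedHebbClassifier:
    """Hypothetical stand-in estimator; NOT the rFBP learning rule."""

    def fit(self, X, y):
        n_features = len(X[0])
        # Hebbian sum per feature, then clip to a binary weight in {-1, +1}.
        sums = [
            sum(label * row[j] for row, label in zip(X, y))
            for j in range(n_features)
        ]
        self.weights_ = [1 if s >= 0 else -1 for s in sums]
        return self  # scikit-learn convention: fit returns the estimator

    def predict(self, X):
        # Sign of the weighted sum, one label in {-1, +1} per sample.
        return [
            1 if sum(w * x for w, x in zip(self.weights_, row)) >= 0 else -1
            for row in X
        ]


# Hypothetical toy data with -1/+1 features and labels.
X_train = [
    [+1, +1, -1],
    [+1, -1, -1],
    [-1, -1, +1],
    [-1, +1, +1],
]
y_train = [+1, +1, -1, -1]
model = ClippedHebbClassifier().fit(X_train, y_train)
predictions = model.predict(X_train)
```

Because `fit` returns the estimator itself and learned attributes carry a trailing underscore, an object written this way follows the conventions that scikit-learn tooling expects from an estimator.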
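The text does not detail the two magnetization versions. As a hedged illustration, assume they differ only in how a magnetization is stored: either directly as a value m in [-1, 1], or as a field h with m = tanh(h). Generic code can then be written once against a shared interface, which is the role templates play in C++; Python duck typing stands in for them here. The class names `PlainMag` and `TanhMag` are hypothetical.

```python
import math


class PlainMag:
    """Stores the magnetization m in [-1, 1] directly (hypothetical name)."""

    def __init__(self, value):
        self.value = value

    def combine(self, other):
        # Product of two independent binary beliefs, written on magnetizations.
        return PlainMag((self.value + other.value) / (1.0 + self.value * other.value))

    def magnetization(self):
        return self.value


class TanhMag:
    """Stores the field h with m = tanh(h) (hypothetical name)."""

    def __init__(self, field):
        self.field = field

    def combine(self, other):
        # In the field representation the same combination is a plain sum.
        return TanhMag(self.field + other.field)

    def magnetization(self):
        return math.tanh(self.field)


def combine_all(mags):
    """Generic routine: works with either representation unchanged."""
    result = mags[0]
    for mag in mags[1:]:
        result = result.combine(mag)
    return result


values = [0.2, -0.5, 0.7]
plain = combine_all([PlainMag(v) for v in values])
tanh_based = combine_all([TanhMag(math.atanh(v)) for v in values])
```

Both representations yield the same magnetization up to floating-point error, since tanh(a + b) = (tanh a + tanh b) / (1 + tanh a tanh b). The field form replaces a division with an addition, which tends to behave better numerically when magnetizations approach -1 or 1.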