LINFA: a Python library for variational inference with normalizing flow and annealing

Variational inference is an increasingly popular method in statistics and machine learning for approximating probability distributions. We developed LINFA (Library for Inference with Normalizing Flow and Annealing), a Python library for variational inference to accommodate computationally expensive models and difficult-to-sample distributions with dependent parameters. We discuss the theoretical background, capabilities, and performance of LINFA in various benchmarks. LINFA is publicly available on GitHub at https://github.com/desResLab/LINFA.


Statement of need
Generating samples from a posterior distribution is a fundamental task in Bayesian inference. The development of sampling-based algorithms from the Markov chain Monte Carlo family (Gelfand & Smith, 1990; Geman & Geman, 1984; Hastings, 1970; Metropolis et al., 1953) has made solving Bayesian inverse problems accessible to a wide audience of both researchers and practitioners. However, the number of samples required by these approaches is typically significant, and the convergence of Markov chains to their stationary distribution can be slow, especially in high dimensions. Additionally, satisfactory convergence may not be easy to quantify, even if a number of metrics have been proposed in the literature over the years. More recent paradigms have been proposed in the context of variational inference (Wainwright et al., 2008), where an optimization problem is formulated to determine the optimal member of a parametric family of distributions approximating a target posterior density. In addition, flexible approaches that parametrize variational distributions through a composition of transformations (closely related to the concept of transport maps; see, e.g., Villani & others (2009)) have gained popularity under the name of normalizing flows (Dinh et al., 2016; Kingma et al., 2016; Kobyzev et al., 2020; Papamakarios et al., 2021; Rezende & Mohamed, 2015). The combination of variational inference and normalizing flows has received significant recent interest in the context of general algorithms for solving inverse problems (El Moselhy & Marzouk, 2012; Rezende & Mohamed, 2015).
However, cases where the computational cost of evaluating the underlying probability distribution is significant occur quite often in engineering and applied sciences, for example when such evaluation requires the solution of an ordinary or partial differential equation. In such cases, inference can easily become intractable. Additionally, strong and nonlinear dependence between model parameters may result in difficult-to-sample posterior distributions characterized by features at multiple scales or by multiple modes. The LINFA library is specifically designed for cases where the model evaluation is computationally expensive. In such cases, the construction of an adaptively trained surrogate model is key to reducing the computational cost of inference (Wang et al., 2022). In addition, LINFA provides an adaptive annealing scheduler, where temperature increments are automatically determined based on the available variational approximant of the posterior distribution. Thus, adaptive annealing makes it easier to sample from complicated densities (Cobian et al., 2023).

Capabilities
LINFA is designed as a general inference engine and allows the user to define custom input transformations, computational models, surrogates, and likelihood functions.
1. User-defined input parameter transformations - Input transformations may reduce the complexity of inference and surrogate model construction in situations where the ranges of the input variables differ substantially or when the input parameters are bounded.
A number of pre-defined univariate transformations are provided, i.e., identity, tanh, linear, and exp. These transformations are independently defined for each input variable using four parameters $(a, b, c, d)$, providing a nonlinear transformation between the normalized interval $[a, b]$ and the physical interval $[c, d]$. Additional transformations can be defined by implementing the following member functions.
• forward - It evaluates the transformation from the normalized to the physical space.
One transformation needs to be defined for each input dimension. For example, a list of lists can define a hyperbolic tangent transformation for the first two variables and an exponential transformation for the third.
• compute_log_jacob_func - This is the log Jacobian of the transformation, which needs to be included in the computation of the log posterior density to account for the associated change in volume.
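As a concrete illustration, the sketch below implements the two member functions above for a hypothetical tanh transformation. The specific $(a, b, c, d)$ rescaling used here is an assumption for illustration and may differ from LINFA's built-in 'tanh' definition.

```python
import numpy as np

# Minimal sketch of a custom input transformation; the tanh rescaling
# convention below is an illustrative assumption, not LINFA's exact formula.
class TanhTransform:
    def __init__(self, a, b, c, d):
        self.a, self.b, self.c, self.d = a, b, c, d

    def forward(self, t):
        # map normalized t in [a, b] to the physical interval [c, d]
        u, ua, ub = np.tanh(t), np.tanh(self.a), np.tanh(self.b)
        return self.c + (self.d - self.c) * (u - ua) / (ub - ua)

    def compute_log_jacob_func(self, t):
        # log |dy/dt| = log[(d - c)/(tanh(b) - tanh(a))] + log(1 - tanh(t)^2)
        ua, ub = np.tanh(self.a), np.tanh(self.b)
        return np.log((self.d - self.c) / (ub - ua)) + np.log(1.0 - np.tanh(t) ** 2)
```

The log Jacobian returned by compute_log_jacob_func is exactly the term added to the log posterior density to account for the change in volume.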
2. User-defined computational models - LINFA can accommodate any type of model, from analytically defined posteriors with the gradient computed through automatic differentiation, to legacy computational solvers for which the solution gradient is neither available nor easy to compute. New models are created by implementing the methods below.
• genDataFile - This is a pre-processing function used to generate synthetic observations. It computes the model output corresponding to the default parameter values (usually defined as part of the model) and adds noise with a user-specified distribution. Observations are stored in a file and are typically assigned to model.data so they are available for computing the log posterior.
• solve_t - This function solves the model for multiple values of the physical input parameters specified in a matrix format (with one sample for each row and one column for each input parameter dimension).
3. User-defined surrogate models - For computational models that are too expensive for online inference, LINFA provides functionalities to create, train, and fine-tune a surrogate model. The Surrogate class implements the following functionalities:
• A new surrogate model can be created using the Surrogate constructor.
• A pre-grid is defined as an a priori selected point cloud created inside the hyperrectangle defined by limits.The pre-grid can be either of type 'tensor' (tensor product grid) where the grid order (number of points in each dimension) is defined through the argument gridnum, or of type 'sobol' (using low-discrepancy quasirandom Sobol' sequences, see Sobol' (1967)), in which case the variable gridnum defines the total number of samples.
• Surrogate model Input/Output. The two functions surrogate_save() and surrogate_load() are provided to save a snapshot of a given surrogate or to read it from a file.
• The pre_train() function is provided to perform an initial training of the surrogate model on the pre-grid. In addition, the update() function is also available to re-train the model once additional training examples are available.
• The forward() function evaluates the surrogate model at multiple input realizations. If a transformation is defined, the surrogate should always be specified in the normalized domain, with limits defined in terms of the normalized intervals (i.e., $[a, b]$).
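A 'tensor' pre-grid of the kind described above can be sketched in a few lines. This is a standalone illustration: `limits` and `gridnum` mirror the arguments described in the text, but the function itself is not LINFA's implementation.

```python
import numpy as np

def pre_grid_tensor(limits, gridnum):
    """Tensor-product pre-grid sketch: `limits` is a list of (lo, hi) pairs,
    one per input dimension, with `gridnum` points along each axis."""
    axes = [np.linspace(lo, hi, gridnum) for lo, hi in limits]
    # full tensor product, one row per grid point
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack([m.ravel() for m in mesh], axis=1)
```

For a two-dimensional problem with gridnum=4, this yields the 4 × 4 = 16 pre-grid model solutions used in the first numerical benchmark below.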
4. User-defined likelihood -A user-defined likelihood function can be defined by passing the parameters, the model, the surrogate and a coordinate transformation using log_density(x, model, surrogate, transformation), and then assigning it as a member function of the experiment class using: exp.model_logdensity = lambda x: log_density(x, model, surr, transf).
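A minimal sketch of such a user-defined likelihood is shown below, assuming a Gaussian error model. ToyModel is a hypothetical stand-in for a LINFA computational model, and the attributes used (data, sigma, solve_t) follow the conventions described in the text, not an exact LINFA interface.

```python
import numpy as np

def log_density(x, model, surrogate, transf):
    """Gaussian log-likelihood sketch in the signature described above."""
    z = transf(x)  # normalized -> physical inputs
    f = surrogate(z) if surrogate is not None else model.solve_t(z)
    resid = f - model.data  # model.data holds the observations
    return -0.5 * np.sum(resid ** 2) / model.sigma ** 2

class ToyModel:
    # hypothetical stand-in for a LINFA computational model
    def __init__(self, data, sigma):
        self.data, self.sigma = data, sigma
    def solve_t(self, z):
        return z ** 2

model = ToyModel(data=np.array([4.0]), sigma=0.5)
# assigned as in the text: exp.model_logdensity = lambda x: log_density(...)
model_logdensity = lambda x: log_density(x, model, None, transf=lambda x: x)
```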
5. Linear and adaptive annealing schedulers - LINFA provides two annealing schedulers by default. The first is the 'Linear' scheduler with constant increments. The second is the 'AdaAnn' adaptive scheduler (Cobian et al., 2023) with hyperparameters reported in Table 7. For the AdaAnn scheduler, the user can also specify a different number of parameter updates to be performed at the initial temperature $t_0$, at the final temperature $t = 1$, and at any intermediate temperature $t_0 < t < 1$. Finally, the batch size (the number of samples used to evaluate the expectations in the loss function) can also be differentiated for $t = 1$ and $t < 1$.
6. User-defined hyperparameters -A complete list of hyperparameters with a description of their functionality can be found in the Appendix.

Related software modules and packages for variational inference
Other Python modules and packages were found to provide an implementation of variational inference with a number of additional features.An incomplete list of these packages is reported below.
• Online notebooks (see this example) which implement variational inference from scratch in PyTorch.
LINFA is based on normalizing flow transformations and can therefore infer nonlinear parameter dependence. It also provides the ability to adaptively train a surrogate model (NoFAS), which significantly reduces the computational cost of inference for the parameters of expensive computational models. Finally, LINFA provides an adaptive annealing algorithm (AdaAnn) which autonomously selects the appropriate annealing steps based on the current approximation of the posterior distribution.

Numerical benchmarks
We tested LINFA on multiple problems. These include inference on unimodal and multi-modal posterior distributions specified in closed form, ordinary differential models and dynamical systems with gradients directly computed through automatic differentiation in PyTorch, identifiable and non-identifiable physics-based models with fixed and adaptive surrogates, and high-dimensional statistical models. Some of the above tests are included with the library and systematically tested using GitHub Actions. A detailed discussion of these test cases is provided in the Appendix. To run a test, type python -m unittest linfa.linfa_test_suite.NAME_example, where NAME is the name of the test case: trivial, highdim, rc, rcr, adaann, or rcr_nofas_adaann.

Conclusion and Future Work
In this paper, we have introduced the LINFA library for variational inference, briefly discussed the relevant background and its capabilities, and reported its performance on a number of test cases. Some interesting directions for future work are mentioned below.
Future versions will support user-defined privacy-preserving synthetic data generation and variational inference through differentially private gradient descent algorithms. This will allow the user to perform inference tasks while preserving a pre-defined privacy budget, as discussed in Su et al. (2023). LINFA will also be extended to handle multiple models. This will open new possibilities to solve inverse problems combining variational inference and multi-fidelity surrogates (see, e.g., Siahkoohi et al. (2021)). In addition, for inverse problems with significant dependence among the parameters, it is often possible to simplify the inference task by operating on manifolds of reduced dimensionality (Brennan et al., 2020). New modules for dimensionality reduction will be developed and integrated with the LINFA library. Finally, the ELBO loss typically used in variational inference has known limitations, some of which are related to its close connection with the KL divergence. Future versions of LINFA will provide the option to use alternative losses.
Background theory
A normalizing flow (NF) is a nonlinear transformation designed to map an easy-to-sample distribution $q_0(z_0)$ into a close approximation $q_K(z_K)$ of a desired target posterior density $p(z|x)$. This transformation can be determined by composing $K$ bijections and evaluating the transformed density through the change of variable formula (see Villani & others (2009)).
In the context of variational inference, we seek to determine an optimal set of parameters $\lambda \in \Lambda$ so that $q_K(z_K) \approx p(z|x)$.
For computational convenience, normalizing flow transformations are selected to be easily invertible and to have a Jacobian determinant that can be computed with a cost growing linearly with the problem dimensionality. Approaches in the literature include RealNVP (Dinh et al., 2016), GLOW (Kingma & Dhariwal, 2018), and autoregressive transformations such as MAF (Papamakarios et al., 2017) and IAF (Kingma et al., 2016). Detailed reviews of a wide range of flow formulations can be found in Kobyzev et al. (2020) and Papamakarios et al. (2021).
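The change-of-variable formula mentioned above can be sketched in a few lines: $\log q_K(z_K) = \log q_0(z_0) - \sum_k \log|\det J_k|$, accumulated over the composed bijections. This is a standalone illustration, not LINFA code.

```python
import numpy as np

def push_forward_logpdf(z0, log_q0, bijections):
    """Each bijection returns (z_next, log|det J|) evaluated at its input."""
    z, logq = z0, log_q0(z0)
    for f in bijections:
        z, ladj = f(z)
        logq -= ladj  # change-of-variable formula
    return z, logq

# example: one affine bijection z -> 2 z + 1 applied to a standard normal base
log_q0 = lambda z: -0.5 * (z ** 2 + np.log(2 * np.pi))
affine = lambda z: (2.0 * z + 1.0, np.log(2.0))
zK, logqK = push_forward_logpdf(0.0, log_q0, [affine])
```

For this affine example the pushed-forward density is exactly $\mathcal{N}(1, 2^2)$, which provides a simple correctness check.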

MAF and RealNVP
LINFA implements two widely used normalizing flow formulations, MAF (Papamakarios et al., 2017) and RealNVP (Dinh et al., 2016). MAF belongs to the family of autoregressive flows and builds on the MADE masked autoencoder (Germain et al., 2015). In a MADE autoencoder the network connectivities are multiplied by Boolean masks so the input-output relation maintains a lower triangular structure, making the computation of the Jacobian determinant particularly simple. MAF transformations are then composed of multiple MADE layers, possibly interleaved by batch normalization layers (Ioffe & Szegedy, 2015), typically used to add stability during training and increase network accuracy (Papamakarios et al., 2017).
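The masking idea can be illustrated with a small sketch for a one-hidden-layer network. The degrees are assigned deterministically here for simplicity (MADE draws hidden degrees at random), so this is an illustration of the triangular-structure argument, not LINFA's MAF implementation.

```python
import numpy as np

def made_masks(d_in, d_hidden):
    """Boolean masks for a one-hidden-layer MADE-style network (sketch):
    inputs/outputs get degrees 1..d, hidden units cycle over degrees 1..d-1,
    and a connection is kept only if it preserves the autoregressive order."""
    deg_in = np.arange(1, d_in + 1)
    deg_h = (np.arange(d_hidden) % (d_in - 1)) + 1
    M1 = deg_h[:, None] >= deg_in[None, :]  # input -> hidden mask
    M2 = deg_in[:, None] > deg_h[None, :]   # hidden -> output mask
    return M1, M2

M1, M2 = made_masks(4, 8)
# composed connectivity: output i can only depend on inputs j < i, so the
# Jacobian of the induced map is (strictly) lower triangular
C = M2.astype(int) @ M1.astype(int)
```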

Normalizing flow with adaptive surrogate (NoFAS)
LINFA is designed to accommodate black-box models $f: \mathcal{Z} \to \mathcal{X}$ between the random inputs $z = (z_1, z_2, \dots, z_d)^T \in \mathcal{Z}$ and the outputs $x = (x_1, x_2, \dots, x_m)^T \in \mathcal{X}$, and assumes $n$ observations $\mathcal{D} = \{x_i\}_{i=1}^{n} \subset \mathcal{X}$ to be available. Our goal is to infer $z$ and to quantify its uncertainty given $\mathcal{D}$. We embrace a variational Bayesian paradigm and sample from the posterior distribution $p(z|\mathcal{D}) \propto \ell_{\mathcal{D}}(z, f)\, p(z)$, with prior $p(z)$, via normalizing flows. This requires the evaluation of the gradient of the ELBO (1) with respect to the NF parameters $\lambda$, replacing $p(z, \mathcal{D})$ with $p(z|\mathcal{D})\, p(\mathcal{D}) = \ell_{\mathcal{D}}(z, f)\, p(z)$, and approximating the expectations with their MC estimates. However, the likelihood function needs to be evaluated at every MC realization, which can be costly if the model $f(z)$ is computationally expensive. In addition, automatic differentiation through a legacy (e.g., physics-based) solver may be impractical or time-consuming, or require the development of an adjoint solver.
Our solution is to replace the model $f$ with a computationally inexpensive surrogate $\hat{f}: \mathcal{Z} \times W \to \mathcal{X}$ parameterized by the weights $w \in W$, whose derivatives can be obtained at a relatively low computational cost. However, intrinsic bias in the selected surrogate formulation, a limited number of training examples, and locally optimal $w$ can all compromise the accuracy of $\hat{f}$.
To resolve these issues, LINFA implements NoFAS, which updates the surrogate model adaptively by smartly weighting the samples of $z$ from the NF thanks to a memory-aware loss function. Once a newly updated surrogate is obtained, the likelihood function is updated, leading to a new posterior distribution that is approximated by VI-NF, producing, in turn, new samples for the next surrogate model update, and so on. Additional details can be found in Wang et al. (2022).
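The memory-aware idea can be sketched as a weighted combination of the surrogate fit on freshly generated NF samples and on earlier training points. The specific weighting below is an illustrative assumption, not the exact formulation of Wang et al. (2022).

```python
import numpy as np

def memory_aware_loss(pred_new, true_new, pred_old, true_old, beta=0.5):
    """Sketch of a memory-aware surrogate loss: the fit on new samples from
    the current NF approximation is combined with a down-weighted fit on
    earlier training points, so the surrogate does not forget regions of the
    parameter space visited previously. The weighting `beta` is assumed."""
    mse_new = np.mean((pred_new - true_new) ** 2)
    mse_old = np.mean((pred_old - true_old) ** 2)
    return mse_new + beta * mse_old
```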

Adaptive Annealing
Annealing is a technique to parametrically smooth a target density to improve sampling efficiency and accuracy during inference. In the discrete case, this is achieved by incrementing an inverse temperature $t_k$ and setting $p_k(z, \mathcal{D}) = p^{t_k}(z, \mathcal{D})$, for $k = 0, \dots, K$, where $0 < t_0 < \cdots < t_K \leq 1$. Exponentiation produces a smooth unimodal distribution for a sufficiently small $t_0$, recovering the target density as $t_k$ approaches 1. In other words, annealing provides a continuous deformation from an easy-to-approximate unimodal distribution to the desired target density.
A linear annealing scheduler with fixed temperature increments is often used in practice (see, e.g., Rezende & Mohamed (2015)), where $t_j = t_0 + j\,(1 - t_0)/K$ for $j = 0, \dots, K$, i.e., with constant increments $\epsilon = (1 - t_0)/K$. Intuitively, small temperature changes are desirable to carefully explore the parameter space at the beginning of the annealing process, whereas larger changes can be taken as $t_k$ increases, after annealing has helped to capture the important features of the target distribution (e.g., locating all the relevant modes).
The AdaAnn scheduler determines the increment $\epsilon_k$ that approximately produces a pre-defined change in the KL divergence between two distributions annealed at $t_k$ and $t_{k+1} = t_k + \epsilon_k$, respectively. Setting the KL divergence equal to a constant $\tau^2/2$, where $\tau$ is referred to as the KL tolerance, the step size becomes

$$\epsilon_k = \tau \Big/ \sqrt{\mathbb{V}_{p^{t_k}}\left[\log p(z, \mathcal{D})\right]}. \quad (2)$$

The denominator is large when the support of the annealed distribution $p^{t_k}(z, \mathcal{D})$ is wider than the support of the target $p(z, \mathcal{D})$, and progressively reduces with increasing $t_k$. Further details on the derivation of the expression for $\epsilon_k$ can be found in Cobian et al. (2023).
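Both schedulers can be sketched as follows. The AdaAnn step uses a Monte Carlo estimate of the variance appearing in the denominator of equation (2); this is a standalone illustration, not LINFA's implementation.

```python
import numpy as np

def linear_schedule(t0, K):
    """Linear scheduler: constant increments eps = (1 - t0)/K."""
    return [t0 + j * (1.0 - t0) / K for j in range(K + 1)]

def adaann_step(tau, log_p_samples):
    """AdaAnn step sketch: eps_k = tau / sqrt(Var[log p(z, D)]), with the
    variance estimated from samples of the current annealed distribution."""
    return tau / np.std(np.asarray(log_p_samples))
```

Note how a wide spread of log-posterior values (early in annealing) yields a small step, while a concentrated spread (near $t_k = 1$) yields a larger one, matching the intuition above.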
Simple two-dimensional map (NoFAS)
Observations $x$ are generated by perturbing the true model output with multiplicative Gaussian noise, i.e., $x = x^* + x^* \odot \epsilon_0$ (3), where $\epsilon_0 \sim \mathcal{N}(0, \sigma^2)$ and $\odot$ is the Hadamard product. We set the true model parameters at $z^* = (3, 5)^T$, with output $x^* = f(z^*) = (7.99, -2.59)^T$, and simulate 50 sets of observations from (3). The likelihood of $z$ given $x$ is assumed Gaussian, and we adopt a noninformative uniform prior $p(z)$. We allocate a budget of 4 × 4 = 16 model solutions to the pre-grid and use the rest to adaptively calibrate the surrogate $\hat{f}$ using 2 samples every 1000 normalizing flow iterations.
Results in terms of loss profile, variational approximation, and posterior predictive distribution are shown in Figure 1.

Two-element Windkessel Model
The two-element Windkessel model (often referred to as the RC model) is the simplest representation of the human systemic circulation and requires two parameters: a resistance $R \in [100, 1500]$ Barye·s/ml and a capacitance $C \in [1\times 10^{-5}, 1\times 10^{-2}]$ ml/Barye. We provide a periodic time history of the aortic flow (see Wang et al. (2022) for additional details) and use the RC model to predict the time history of the proximal pressure $P_p(t)$, specifically its maximum, minimum, and average values over a typical heart cycle, while assuming the distal pressure $P_d(t)$ to be constant in time and equal to 55 mmHg. In our experiment, we set the true resistance and capacitance to $z^*_1 = R^* = 1000$ Barye·s/ml and $z^*_2 = C^* = 5\times 10^{-5}$ ml/Barye, and determine $P_p(t)$ from an RK4 numerical solution of the governing algebraic-differential system, where $Q_{in}$ is the flow entering the RC system and $Q_d$ is the distal flow. Synthetic observations are generated by adding Gaussian noise to the true model solution $x^* = (x^*_1, x^*_2, x^*_3) = (P_{p,\min}, P_{p,\max}, P_{p,\mathrm{avg}}) = (78.28, 101.12, 85.75)$, i.e., $x$ follows a multivariate Gaussian distribution with mean $x^*$ and a diagonal covariance matrix with entries $0.05\, x^*_i$, where $i = 1, 2, 3$ corresponds to the minimum, maximum, and average pressures, respectively. The aim is to quantify the uncertainty in the RC model parameters given 50 repeated pressure measurements. We imposed a non-informative prior on $R$ and $C$. Results are shown in Figure 3.
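A minimal forward solve of the RC model can be sketched as below. The half-sine inflow waveform and the simplified units are assumptions for illustration (the paper uses a measured periodic aortic flow), but the RK4 stepping mirrors the integration strategy described above.

```python
import numpy as np

def rc_model(R, C, Pd=55.0, T=1.0, n_steps=1000, n_cycles=10):
    """RC Windkessel sketch: integrate C dPp/dt = Q_in - (Pp - Pd)/R with
    RK4 over several cycles; only the last (near-periodic) cycle is kept to
    compute the min/max/average proximal pressure outputs."""
    def q_in(t):  # assumed half-sine inflow waveform
        return 100.0 * max(np.sin(2.0 * np.pi * t / T), 0.0)
    def dP(t, P):
        return (q_in(t) - (P - Pd) / R) / C
    dt, t, P = T / n_steps, 0.0, Pd
    hist = []
    for i in range(n_steps * n_cycles):
        k1 = dP(t, P)
        k2 = dP(t + dt / 2, P + dt * k1 / 2)
        k3 = dP(t + dt / 2, P + dt * k2 / 2)
        k4 = dP(t + dt, P + dt * k3)
        P, t = P + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6, t + dt
        if i >= n_steps * (n_cycles - 1):
            hist.append(P)
    hist = np.array(hist)
    return hist.min(), hist.max(), hist.mean()
```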

Three-element Windkessel Circulatory Model (NoFAS + AdaAnn)
The three-parameter Windkessel or RCR model is characterized by proximal and distal resistance parameters $R_p, R_d \in [100, 1500]$ Barye·s/ml and one capacitance parameter $C \in [1\times 10^{-5}, 1\times 10^{-2}]$ ml/Barye. This model is not identifiable. The average distal pressure is only affected by the total system resistance, i.e., the sum $R_p + R_d$, leading to a negative correlation between these two parameters. Thus, an increment in the proximal resistance is compensated by a reduction in the distal resistance (so the average distal pressure remains the same) which, in turn, reduces the friction encountered by the flow exiting the capacitor. An increase in the value of $C$ is finally needed to restore the average, minimum, and maximum pressure. This leads to a positive correlation between $C$ and $R_d$.
The output consists of the maximum, minimum, and average values of the proximal pressure $P_p(t)$, i.e., $(P_{p,\min}, P_{p,\max}, P_{p,\mathrm{avg}})$, over one heart cycle. The true parameters are $z^*_1 = R_p^* = 1000$ Barye·s/ml, $z^*_2 = R_d^* = 1000$ Barye·s/ml, and $z^*_3 = C^* = 5\times 10^{-5}$ ml/Barye. The proximal pressure is computed from the solution of an algebraic-differential system analogous to the RC case (see Wang et al. (2022) for the governing equations). This example also demonstrates how NoFAS can be combined with annealing for improved convergence. The results in Figure 4 are generated using the AdaAnn adaptive annealing scheduler with initial inverse temperature $t_0 = 0.05$, KL tolerance $\tau = 0.01$, and a batch size of 100 samples. The number of parameter updates is set to 500, 5000, and 5 for $t_0$, $t_1$, and $t_0 < t < t_1$, respectively, and 1000 Monte Carlo realizations are used to evaluate the denominator in equation (2). The posterior samples capture well the nonlinear correlation among the parameters and generate a fairly accurate posterior predictive distribution that overlaps with the observations. Additional details can be found in Wang et al. (2022) and Cobian et al. (2023).

Friedman 1 model (AdaAnn)
We consider a modified version of the Friedman 1 dataset (Friedman, 1991) to examine the performance of our adaptive annealing scheduler in a high-dimensional context. According to the original model in Friedman (1991), the data are generated as

$$y_i = 10\,\sin(\pi\, x_{i1} x_{i2}) + 20\,(x_{i3} - 0.5)^2 + 10\, x_{i4} + 5\, x_{i5} + \epsilon_i, \quad (5)$$

where $\epsilon_i \sim \mathcal{N}(0, 1)$. We made a slight modification to the model in (5) as

$$y_i = \beta_1\,\sin(\pi\, x_{i1} x_{i2}) + \beta_2^2\,(x_{i3} - \beta_3)^2 + \beta_4\, x_{i4} + \beta_5\, x_{i5} + \sum_{j=6}^{10} \beta_j\, x_{ij} + \epsilon_i, \quad (6)$$

and set the true parameter combination to $\beta = (\beta_1, \dots, \beta_{10}) = (10, \pm\sqrt{20}, 0.5, 10, 5, 0, 0, 0, 0, 0)$. Note that both (5) and (6) contain linear, nonlinear, and interaction terms of the input variables $x_1$ to $x_{10}$, five of which ($x_6$ to $x_{10}$) are irrelevant to $y$. Each $x_i$ is drawn independently from $\mathcal{U}(0, 1)$. We used the R package tgp (Gramacy, 2007) to generate a Friedman 1 dataset with a sample size of $n = 1000$. We impose a non-informative uniform prior $p(\beta)$ and, unlike the original model, we now expect a bimodal posterior distribution of $\beta$. Results in terms of marginal statistics and their convergence for the mode with positive $\beta_2$ are illustrated in Table 1 and Figure 5.
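Data generation under the modified model can be sketched as below; the concrete form of equation (6) used here is an assumption consistent with the stated true parameters. The sign symmetry in $\beta_2$ (which enters squared) is exactly what produces the bimodal posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(size=(n, 10))  # ten inputs, each drawn from U(0, 1)

def friedman_modified(X, b):
    # assumed form of the modified model: beta_2 enters squared, so
    # beta_2 = +sqrt(20) and beta_2 = -sqrt(20) give identical outputs
    return (b[0] * np.sin(np.pi * X[:, 0] * X[:, 1])
            + b[1] ** 2 * (X[:, 2] - b[2]) ** 2
            + b[3] * X[:, 3] + b[4] * X[:, 4]
            + X[:, 5:] @ b[5:])

beta_pos = np.array([10.0, np.sqrt(20.0), 0.5, 10.0, 5.0, 0, 0, 0, 0, 0])
beta_neg = beta_pos.copy()
beta_neg[1] = -beta_neg[1]
y = friedman_modified(X, beta_pos) + rng.normal(size=n)
```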

Hyperparameters in LINFA
This section contains the list of all hyperparameters in the library, their default values, and a description of the functionalities they control. General hyperparameters are listed in Table 6, those related to the optimization process in Table 5, and those related to the output folder and files in Table 2. Hyperparameters for the proposed NoFAS and AdaAnn approaches are listed in Table 3 and Table 7, respectively. Finally, a hyperparameter used to select the hardware device is described in Table 4.

Figure 1 :
Figure 1: Results from the simple two-dimensional map. Loss profile (left), posterior samples (center), and posterior predictive distribution (right).

Figure 2 :
Figure 2: Results from the high-dimensional example. The top row contains the loss profile (left) and samples from the posterior predictive distribution plus the available observations (right). Samples from the posterior distribution are shown in the bottom row.

Figure 3 :
Figure 3: Results from the RC model. Loss profile (left), posterior samples for $R$ and $C$ (center), and the posterior predictive distribution for $P_{p,\min}$ and $P_{p,\max}$ (right; $P_{p,\mathrm{avg}}$ not shown).

Figure 4 :
Figure 4: Results from the RCR model. The top row contains the loss profile (left) and samples from the posterior predictive distribution plus the available observations (right). Samples from the posterior distribution are shown in the bottom row.

Figure 5 :
Figure 5: Loss profile (left) and posterior marginal statistics (right) for positive mode in the modified Friedman test case.
… samples at $t = 1$.
T_0 (int): Number of initial parameter updates at $t_0$.
T (int): Number of parameter updates after each temperature update; during such updates the temperature is kept fixed.
… (int): Number of Monte Carlo samples used to evaluate the denominator in equation (2).
Given a likelihood function $\ell(x)$ (informed by the distribution of the error $\epsilon$) and a prior $p(z)$, a NF-based approximation $q_K(z)$ of the posterior distribution $p(z|x)$ can be computed by maximizing the lower bound to the log marginal likelihood $\log p(x)$ (the so-called evidence lower bound or ELBO), or, equivalently, by minimizing a free energy bound (see, e.g., Rezende & Mohamed (2015)).

Table 1 :
Posterior mean and standard deviation for positive mode in the modified Friedman test case.

Table 2 :
Output parameters
seed (int): Seed for the random number generator.

Table 3 :
Surrogate model parameters (NoFAS)
n_sample (int): Batch size used when saving results to disk (i.e., once every save_interval iterations).
budget (int): Maximum allowable number of true model evaluations.

Table 5 :
Optimizer and learning rate parameters

Table 6 :
General parameters
… : save results from the normalizing flow iterations. Saved results include posterior samples, the loss profile, samples from the posterior predictive distribution, observations, and marginal statistics.

Table 7 :
Parameters for the adaptive annealing scheduler (AdaAnn)
scheduler (string): Type of annealing scheduler ('AdaAnn' or 'fixed', default 'AdaAnn').
tol (float): KL tolerance. It is kept constant during inference and used in the numerator of equation (2).