Variational Bayesian inference for linear and logistic regression

This article describes the model, derivation, and implementation of variational Bayesian inference for linear and logistic regression, both with and without automatic relevance determination. It serves a dual purpose: it acts as a tutorial on the derivation of variational Bayesian inference for simple models, and it documents and provides brief examples for the MATLAB/Octave functions that implement this inference. These functions are freely available online.


Introduction
Linear and logistic regression are essential workhorses of statistical analysis, whose Bayesian treatment has received much recent attention (Gelman et al., 2013; Bishop, 2006; Murphy, 2012; Hastie et al., 2011). Bayesian treatments allow specifying a-priori uncertainty and inferring a-posteriori uncertainty about regression coefficients explicitly and hierarchically, by, for example, specifying how uncertain we are a-priori that these coefficients are small. However, Bayesian inference in such hierarchical models quickly becomes intractable, such that recent effort has focused on approximate inference, like Markov chain Monte Carlo methods (Gilks et al., 1995) or variational Bayesian approximation (Beal, 2003; Bishop, 2006; Murphy, 2012).
Here, we describe such a variational treatment and implementation of Bayesian hierarchical models for both linear and logistic regression. Even though neither the statistical models nor their Bayesian approximation are particularly novel, the article provides a tutorial-style introduction to the derivation of their algorithms, together with a MATLAB/Octave implementation of these algorithms. As such, it bridges the gap between the theory and practice of derivation and implementation.
The presentation of the variational inference derivation is closely aligned with that of Bishop (2006), but with essential differences. Specifically, both models include a variant with automatic relevance determination (ARD), which consists of assigning an individual hyper-prior to each regression coefficient separately. These hyper-priors are adjusted to eventually prune irrelevant coefficients (Wipf and Nagarajan, 2007) without the need for a separate validation set, unlike comparable sparsity-inducing methods like the Lasso (Tibshirani, 1996). Bishop (2006) describes ARD only in the context of type-II maximum likelihood (MacKay, 1992; Neal, 1996; Tipping, 2001), where its (hyper-)parameters are tuned by maximizing the marginal likelihood (or model evidence). Here, instead, we apply the full Bayesian treatment, and find the ARD hyper-posteriors by variational Bayesian inference.
The model underlying linear regression is closely related to that of Griffin and Brown (2010), where the authors analyze the influence of prior choice on regression coefficient shrinkage. They promote priors in the form of scale mixtures of zero-mean normals, which allow for larger differences between regression coefficients than would be possible under a normal prior. The authors proceed by suggesting a normal-gamma prior and analyze its shrinkage properties in detail. The prior used in this work is also a member of the scale mixtures of normals, and thus shares their advantageous properties. However, instead of a normal-gamma prior, this work uses a normal inverse-gamma prior in combination with another inverse-gamma hyper-prior, as its conjugacy to the likelihood is advantageous for variational Bayesian inference. The same analysis performed by Griffin and Brown (2010) for the normal-gamma case should be amenable to the normal inverse-gamma case, but has yet to be performed.
The article is structured as follows. It first describes linear regression, followed by logistic regression. For each of these, it first introduces the generative model, followed by the derivation of the variational Bayesian approximation. After that, it introduces the ARD variants, together with the changes to variational inference they require. This is followed by a detailed description of the MATLAB/Octave functions that implement this inference, and a set of examples that demonstrate their use.
All functions are implemented in the VBLinLogit library, which can be found at https://github.com/DrugowitschLab/VBLinLogit. The line numbers in this paper refer to v0.3 of this library.

Linear regression
This section describes inference in a model performing linear regression. The model is similar to that in Bishop (2006) in that it assumes a hyper-prior α on the regression coefficients w. However, rather than just inferring the posterior over w, as done in Bishop (2006), it additionally puts an inverse-gamma prior on the variance τ⁻¹ and infers the posterior of w and τ jointly. Furthermore, Bishop (2006) uses type-II maximum likelihood to handle automatic relevance determination for linear regression. Here, we use the variational Bayesian approximation instead.
2.1. The model. The model assumes a linear relation between D-dimensional inputs x and outputs y, corrupted by constant-variance Gaussian noise, which determines the data likelihood. The prior on w and τ is a conjugate normal inverse-gamma, parametrized by α. As in Griffin and Brown (2010), this prior is a member of the scale mixtures of normals. In this prior, τ appears as τ⁻¹ in the variance of the zero-mean normal on w. Due to the gamma prior on τ, this τ⁻¹ is inverse-gamma with shape a_0, scale b_0, and moments E[τ⁻¹] = b_0/(a_0 − 1) for a_0 > 1 and VAR[τ⁻¹] = b_0²/((a_0 − 1)²(a_0 − 2)) for a_0 > 2. The hyper-parameter α is assigned a gamma hyper-prior (with parameters c_0 and d_0), with moments of α⁻¹ analogous to those of τ⁻¹. Due to the hyper-prior, there is no analytic solution for the posteriors, and variational Bayesian inference will be applied.
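To make the hierarchical structure explicit, the generative model can be summarized as follows. This is a reconstruction consistent with the description above; the exact parametrization of the gamma densities in the original equations is an assumption here.

% Reconstructed hierarchical model for Bayesian linear regression (assumed form).
\begin{align}
  p(y \mid x, w, \tau) &= \mathcal{N}\!\left(y \mid w^\top x,\ \tau^{-1}\right), \\
  p(w, \tau \mid \alpha) &= \mathcal{N}\!\left(w \mid 0,\ (\tau \alpha)^{-1} I\right)
      \mathrm{Gam}\!\left(\tau \mid a_0, b_0\right), \\
  p(\alpha) &= \mathrm{Gam}\!\left(\alpha \mid c_0, d_0\right).
\end{align}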
2.2. Variational Bayesian inference. The variational posteriors are found by maximizing the variational bound L(Q), which is a lower bound on the log model evidence ln P(D). To maximize this bound, we assume that the variational distribution Q(w, τ, α), which approximates the posterior P(w, τ, α|D), factorizes into Q(w, τ, α) = Q(w, τ)Q(α). Using standard results from variational Bayesian inference (Beal, 2003; Bishop, 2006), the variational posterior for w, τ that maximizes the variational bound L(Q) while holding Q(α) fixed is a normal inverse-gamma, Q*(w, τ), with parameters w_N, V_N, a_N, and b_N. The variational posterior for α is a gamma, Q*(α), with parameters c_N and d_N. The required expectations are evaluated with respect to these variational posteriors. The variational bound itself consists of the expectations of the log-joint density under Q and the entropy of Q. This bound is maximized by iterating over the updates for V_N, w_N, a_N, b_N, c_N, and d_N until L(Q) reaches a plateau.

2.3. Predictive density.
The predictive density is evaluated by approximating the posterior P(w, τ|D) by its variational counterpart Q(w, τ), where standard results for convolving Gaussians with other Gaussians and gamma distributions were used (Bishop, 2006; Murphy, 2012). The resulting distribution is a Student's t distribution with mean w_Nᵀx, precision (1 + xᵀV_N x)⁻¹ a_N/b_N, and 2a_N degrees of freedom. The resulting predictive variance is (1 + xᵀV_N x) b_N/(a_N − 1) for a_N > 1.

2.4. Using automatic relevance determination. Automatic relevance determination (ARD) determines the relevance of each element of the input for predicting the output by assigning a separate shrinkage prior to each element of the weight vector, which is in turn adjusted by a hyper-prior. While the data likelihood remains unchanged, the prior on w, τ is modified such that the single α is replaced by the vector α = (α_1, ..., α_D), which forms the diagonal of the matrix A. All of the α_i's are independent, such that the hyper-prior factorizes across them. Variational Bayesian inference is performed as before, resulting in variational posteriors of the same functional form, but with per-element parameters and expectations, and a correspondingly modified variational bound. The predictive distribution remains unchanged, as the hyper-prior enters the predictive density only through the variational posterior Q(w, τ).

2.5. Implementation. 2.5.1. Variational posterior parameters without ARD. The function vb_linear_fit.m computes the variational posterior parameters without ARD, and has syntax

[w, V, invV, logdetV, an, bn, E_a, L] = vb_linear_fit(X, y [, a0, b0, c0, d0])

where X is the N × D input matrix with x_n as its rows, and y is a column vector containing all y_n's. The optional parameters a0, b0, c0, and d0 specify the prior parameters a_0, b_0, c_0, and d_0, respectively. If not given, they default to a_0 = 10⁻², b_0 = 10⁻⁴, c_0 = 10⁻², and d_0 = 10⁻⁴. For these values, the mean of τ⁻¹ is undefined, but its mode is at b_0/(a_0 + 1) ≈ 10⁻⁴, implying that the a-priori most likely variance of the prior on w is small. The variance of τ⁻¹ is also undefined for a_0 ≤ 2, but the related variance of τ is VAR(τ) = a_0/b_0² = 10⁶. Thus, even though the default prior on τ implies some shrinkage of w → 0, this shrinkage is weak due to the prior's large width. Furthermore, the update Eq. (9) of a_N reveals that a_0 can be interpreted as half the a-priori number of observations. Thus, the prior has the weight of 2 × 10⁻² observations, and loses its influence after very few observations. The same applies to the prior on α. A minimal call is sketched below.
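As a minimal usage sketch (the placeholder data below are not taken from the paper; the prior arguments are optional and spelled out only to make the defaults explicit):

% Minimal usage sketch for vb_linear_fit; X and y are placeholder data.
X = randn(50, 4);                      % 50 observations of 4-dimensional inputs
y = X * [1; 2; 3; 5] + randn(50, 1);   % outputs from an arbitrary linear model
[w, V, invV, logdetV, an, bn, E_a, L] = vb_linear_fit(X, y);
% equivalent call with the default prior parameters made explicit:
[w, V, invV, logdetV, an, bn, E_a, L] = vb_linear_fit(X, y, 1e-2, 1e-4, 1e-2, 1e-4);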
The returned values, w, V, an, and bn correspond to the normal inverse-gamma parameters w_N, V_N, a_N, and b_N, respectively, of the variational posterior Q*(w, τ), Eq. (6). The variational parameters for Q*(α) are summarized by the returned E_a = E_α(α). The function additionally returns invV = V_N⁻¹ and logdetV = ln |V_N|, such that, if required, these values do not need to be re-computed. The returned L is the variational bound L(Q), Eq. (22), evaluated at the returned parameters.
After initializing the required data structures, initializing the parameters, and precomputing some constants, the function finds the variational posterior parameters by updating Q*(w, τ) (lines 65-73) and Q*(α) (lines 75-77). After each update, it evaluates the parameter-dependent components of the variational bound L(Q) in lines 79-82. This is repeated until either the change in L(Q) between two consecutive iterations drops below 0.001%, or the number of iterations exceeds 500.
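The following sketch illustrates the overall structure of this loop, using X and y as in the earlier sketch. The update equations shown here follow one standard derivation for this model and are assumptions rather than the library's verbatim code; the computation of L(Q) is omitted.

% Sketch of the coordinate-ascent loop (derivation-based assumptions, not
% the library's verbatim code); the variational-bound computation is omitted.
[N, D] = size(X);
a0 = 1e-2; b0 = 1e-4; c0 = 1e-2; d0 = 1e-4;    % default prior parameters
an = a0 + N / 2;  cn = c0 + D / 2;             % fixed across iterations
E_alpha = c0 / d0;                             % initial E(alpha)
XtX = X' * X;  Xty = X' * y;  yty = y' * y;    % precomputed constants
for iter = 1:500
    % update Q*(w, tau)
    invV = E_alpha * eye(D) + XtX;
    V = inv(invV);
    w_N = V * Xty;
    bn = b0 + 0.5 * (yty - w_N' * invV * w_N);
    % update Q*(alpha)
    dn = d0 + 0.5 * ((an / bn) * (w_N' * w_N) + trace(V));
    E_alpha = cn / dn;
    % here, evaluate L(Q) and stop once it changes by less than 0.001%
end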
To update V_N⁻¹, w_N, b_N, and L(Q), the function vectorizes the sums over n as matrix operations on X and y. The rest of the code closely follows the update equations derived further above.
2.5.2. Variational posterior parameters with ARD. The function vb_linear_fit_ard.m computes the variational parameters with ARD, and has syntax

[w, V, invV, logdetV, an, bn, E_a, L] = vb_linear_fit_ard(X, y [, a0, b0, c0, d0])

where the arguments X and y, and the optional prior and hyper-prior parameters a0, b0, c0, d0, have the same structure as for vb_linear_fit.m. If not given, the prior / hyper-prior parameters default, as for vb_linear_fit.m, to a_0 = 10⁻², b_0 = 10⁻⁴, c_0 = 10⁻², and d_0 = 10⁻⁴. The return values w, V, invV, logdetV, an, and bn again correspond to the parameters of the variational posterior Q*(w, τ), and L to the variational bound L(Q). The only difference to vb_linear_fit.m is that E_a is a vector with D elements, returning the diagonal elements of E_α(A).
The structure of vb_linear_fit_ard.m is similar to that of vb_linear_fit.m, updating the parameters of Q*(w, τ) and Q*(α) in lines 67-75 and lines 77-79, respectively, and computing the variational bound L(Q) in lines 81-84. This is again repeated until either L(Q) changes by less than 0.001% between two consecutive iterations, or the number of iterations exceeds 500.

2.5.3. Predictive density parameters. For a given set of variational posterior parameters, the function vb_linear_pred.m computes the parameters of the predictive density, Eq. (23), and has syntax

[mu, lambda, nu] = vb_linear_pred(X, w, V, an, bn)

Here, X is the M × D matrix with x_m as its rows. The other arguments correspond to the variational posterior parameters returned by vb_linear_fit.m or vb_linear_fit_ard.m. The function returns the vectors mu and lambda with M elements, and the scalar nu. For each x_m, the mth elements of mu and lambda, together with nu, specify the mean, precision, and degrees of freedom of the predictive Student's t distribution for y_m (Eq. (23)).

2.6. Examples.

Estimation of regression coefficients. Assume inputs and outputs generated from a known linear model with additive Gaussian noise, as in the sketch below.
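A minimal data-generation sketch for this example (the number of observations and the noise variance are assumptions; the example script in the library may use different values):

% Data generation sketch (sizes and noise level are assumptions).
w_true = [1; 2; 3; 5];                 % generative regression coefficients
N = 30;  D = numel(w_true);            % assumed number of training observations
X = randn(N, D);                       % inputs, one observation x_n per row
y = X * w_true + randn(N, 1);          % outputs with additive unit-variance noise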
The mean regression coefficient estimates are found by

>> vb_linear_fit(X, y)
ans =
    0.9269    2.0220    2.9915    4.9798

Due to the additive noise, these estimates are close to, but do not exactly match, the generative coefficients w = (1, 2, 3, 5)ᵀ. In a larger, 100-dimensional example (Fig. 1), the maximum likelihood estimate clearly over-fits the training data, as reflected by a small training set error and a large test set error. Variational Bayesian regression also shows signs of over-fitting, but to a lesser extent than maximum likelihood. Visualizing the fit of the regression coefficients results in Fig. 1A. As can be seen, the better fit of variational Bayesian inference is also reflected in a better estimate of the regression coefficients. Visualizing the test set predictions in the same way results in Fig. 1B. As for the regression coefficients, the variational Bayesian output predictions are on average closer to the true values than those of maximum likelihood.

In a second example, the majority of input dimensions are uninformative. As in the previous example, the maximum likelihood estimator shows clear signs of over-fitting, as reflected in the large test set error. Variational Bayesian inference does not suffer from over-fitting to the same extent, as is illustrated by the significantly smaller test set error. When used with ARD, this test set error shrinks further, which indicates that ARD is better able to identify and ignore uninformative input dimensions.
A look at the regression coefficients in Fig. 2A (plotted as in the previous example) confirms this property. It illustrates that maximum likelihood was unable to detect the relevant dimensions, whereas variational Bayesian inference without ARD applied overly strong shrinkage to all dimensions. ARD, in contrast, determined the amount of shrinkage for each input dimension separately, and was this way able to selectively suppress a subset of these dimensions. This is also reflected in the model predictions in Fig. 2B, which inference without ARD underestimates due to the overly strong shrinkage of its regression coefficient estimates. With ARD, in contrast, the amount of bias due to shrinkage is reduced. The code for this example is available in vb_linear_example_sparse.m.

The final linear regression example uses the variational bound to identify the order of the data-generating polynomial. The data are generated by

>> gen_X = @(x, d) bsxfun(@power, x, 0:(d-1));
>> X = gen_X(x, D);
>> y = X * w + randn(N, 1);
>> y_test = gen_X(x_test, D) * w;

In the above, D − 1 determines the order of the generative polynomial (which in this case is of 2nd order), D_ML specifies the corresponding dimensionality assumed by the maximum likelihood estimate, and Ds is the range of dimensionalities tested by model selection. We only generate N = 10 training data points to make identifying the correct polynomial order difficult.
Model selection is performed by computing the training-data variational bound for a set of polynomial orders, and picking the order that maximizes this bound (see the sketch below). As can be seen, model selection correctly identified the order of the generative polynomial. Plotting the variational bound for all tested polynomial orders (Fig. 3A) shows that this selection was unambiguous.
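A sketch of this selection procedure, reusing gen_X from the data-generation code above (the range of tested dimensionalities and the variable names are assumptions):

% Model selection sketch: fit each candidate dimensionality and keep the
% one that maximizes the variational bound L(Q).
Ds = 2:8;                                % candidate dimensionalities (order + 1); assumed range
L_D = nan(size(Ds));
for i = 1:length(Ds)
    [~, ~, ~, ~, ~, ~, ~, L_D(i)] = vb_linear_fit(gen_X(x, Ds(i)), y);
end
[~, i_best] = max(L_D);
D_best = Ds(i_best);                     % dimensionality of the selected model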
We test model predictions on the previously generated test set, both for variational Bayesian inference with the inferred polynomial order and for maximum likelihood with D_ML, by comparing their mean squared errors. The wrong model (D_ML = 6 rather than 3) caused the maximum likelihood estimate to perform badly on the test set, as illustrated by its large mean squared error. Plotting the predictions of both models reveals that the data under-constrain the maximum likelihood estimate, such that its output predictions deviate noticeably from the true outputs for larger inputs (Fig. 3B). For the variational Bayesian model, in contrast, the true, generative polynomial remains within the model prediction's 95% credible intervals. The code for this example is available in vb_linear_example_modelsel.m.

Logistic regression
This section describes how to perform variational Bayesian inference for logistic regression with a hyper-prior on w. The basic generative model is the same as that used in Bishop (2006). In addition to this, we here provide a variant that performs automatic relevance determination.
3.1. The model. The data y are, dependent on the D-dimensional input x, assumed to be of either class y = −1 or y = 1. The log-likelihood ratio ln(P(y = 1|x, w)/P(y = −1|x, w)) is assumed to be linear in x, such that the conditional likelihood for y = 1 is given by the sigmoid of wᵀx. Equally, P(y = −1|x, w) = 1 − P(y = 1|x, w) = 1/(1 + exp(wᵀx)). Given some data D = {X, Y}, where X = {x_1, ..., x_N} and Y = {y_1, ..., y_N} are N input/output pairs, the aim is to find the posterior P(w|D), given some prior P(w). Unfortunately, the sigmoid data likelihood does not admit a conjugate-exponential prior. Therefore, approximations need to be applied to find an analytic expression for the posterior.
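Written out, the class likelihood described above is the logistic sigmoid of the linear predictor (reconstructed from the surrounding text; the notation may differ slightly from the original equations):

% Reconstruction of the class likelihood from the surrounding text.
\begin{align}
  P(y = 1 \mid x, w) &= \sigma\!\left(w^\top x\right)
    = \frac{1}{1 + \exp\!\left(-w^\top x\right)},
  &
  P(y \mid x, w) &= \sigma\!\left(y\, w^\top x\right), \quad y \in \{-1, 1\}.
\end{align}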
The approximation that will be used is quadratic in w in the exponential, such that the conjugate Gaussian prior can be used. This prior is parametrized by the hyper-parameter α, which is itself modeled by a conjugate gamma distribution with parameters a_0 and b_0. This hyper-prior implies the a-priori variance α⁻¹ of the zero-mean w to be inverse-gamma, with moments E[α⁻¹] = b_0/(a_0 − 1) for a_0 > 1 and VAR[α⁻¹] = b_0²/((a_0 − 1)²(a_0 − 2)) for a_0 > 2.

3.2. Variational Bayesian inference. The variational posteriors are found by maximizing the variational bound L(Q), where the variational distribution Q(w, α), approximating the posterior P(w, α|D), is assumed to factor into Q(w, α) = Q(w)Q(α). This approximation leads to analytic posterior expressions if the model structure is conjugate-exponential. The data likelihood does not admit a conjugate prior in the exponential family and will therefore be approximated by the bound h(w, ξ), which is a tight lower bound on the sigmoid with one additional parameter ξ per datum (Jaakkola and Jordan, 2000). Applying this bound, the data log-likelihood is lower-bounded with one local variational parameter ξ_n per datum. This results in the new variational bound L(Q, ξ), which is a lower bound on the original variational bound, that is, L(Q, ξ) ≤ L(Q).
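For reference, the sigmoid bound of Jaakkola and Jordan (2000) referred to above has the following standard form (quoted from the literature; the paper's own notation for h(w, ξ) may differ):

% Standard sigmoid lower bound of Jaakkola and Jordan (2000).
\begin{align}
  \sigma(z) \;\ge\; \sigma(\xi)\,
    \exp\!\left\{ \frac{z - \xi}{2} - \lambda(\xi)\left(z^2 - \xi^2\right) \right\},
  \qquad
  \lambda(\xi) = \frac{1}{4\xi}\tanh\!\left(\frac{\xi}{2}\right),
\end{align}

here presumably applied with z = y_n wᵀx_n, giving one ξ_n per datum.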
The variational posteriors are evaluated by standard variational methods for factorized distributions. The variational posterior for w is given by ln Q*(w) = ln h(w, ξ) + E_α(ln P(w|α)) + const., which is Gaussian with parameters w_N and V_N. The variational posterior for α results from ln Q*(α) = E_w(ln P(w|α)) + ln P(α) + const., which is a gamma with parameters a_N and b_N. The required expectations are evaluated with respect to the variational posteriors. The variational bound itself is given by L(Q, ξ) = E_w(ln h(w, ξ)) + E_{w,α}(ln P(w|α)) + E_α(ln P(α)) − E_w(ln Q(w)) − E_α(ln Q(α)), whose expanded form involves the digamma function ψ(·). This bound is to be maximized in order to find the variational posteriors for w and α. The expressions that maximize this bound with respect to Q(w) and Q(α), while keeping all other parameters fixed, are given by Q*(w) and Q*(α), respectively. To find the local variational parameters ξ_n that maximize L(Q, ξ), its derivative with respect to ξ_n is set to zero (see Bishop (2006)), resulting in the standard update ξ_n² = x_nᵀ(V_N + w_N w_Nᵀ)x_n. The variational bound is maximized by iterating over the update equations for w_N, V_N, a_N, b_N, and ξ until L(Q, ξ) reaches a plateau. A lower bound on the marginal data log-likelihood ln P(D) is given by the variational bound itself, as ln P(D) ≥ L(Q) ≥ L(Q, ξ).

3.3. Predictive density.
In order to obtain the predictive density, the posterior P(w|D) is approximated by the variational posterior Q(w), and the sigmoid is lower-bounded by the above bound, such that P(y = 1|x, D) = ∫ P(y = 1|x, w)P(w|D) dw can be approximated analytically. The integral is solved by noting that the lower bound is exponentially quadratic in w, such that the Gaussian integral can be completed, yielding an expression in terms of w̃, Ṽ, and ξ. The bound parameter ξ that maximizes ln P(y = 1|x, D) is again found by setting the corresponding derivative to zero. Thus, the predictive density is found by iterating over the updates for w̃, Ṽ, and ξ until ln P(y = 1|x, D) reaches a plateau. The hyper-prior P(α) does not need to be considered, as it does not appear in the variational posterior Q(w).

3.4. Using automatic relevance determination. To use automatic relevance determination (ARD), each element of w is assigned a separate zero-mean Gaussian prior, whose precisions are collected in the diagonal matrix A with the vector α = (α_1, ..., α_D) along its diagonal. The conjugate hyper-prior P(α) assigns an independent gamma prior to each α_i. Note that α_i determines the precision (inverse variance) of the ith element of w. A low precision makes the prior uninformative, whereas a high precision tells us that the associated element of w is most likely zero, and the associated input element is therefore irrelevant for predicting y. Thus, such a prior structure automatically determines the relevance of each element of the input for predicting the class of the output. Using the same variational Bayesian inference as before, the variational posteriors take the same functional form as before, with parameters that now involve the individual α_i's (where w_i is the ith element of w), and correspondingly modified expectations in the variational bound. As the variational posterior Q*(w) is independent of the hyper-parameters, the predictive density is evaluated as before.
3.5. Implementation. The scripts that compute the variational posterior parameters without and with ARD are vb_logit_fit.m and vb_logit_fit_ard.m, respectively. vb_logit_fit_iter.m is a slower version of vb_logit_fit.m that features a slightly simplified generative model without the hyper-prior, and iterates over updating each ξ_n separately rather than updating them for all x_n's simultaneously. vb_logit_pred.m computes the predictive density parameters for a set of input vectors. vb_logit_pred_iter.m does so as well, but again iterates over updating the ξ_n's rather than updating them all at the same time.
3.5.1. Variational posterior parameters without ARD, iterative implementation. The function vb_logit_fit_iter.m deviates from the generative model given by Eqs. (38)-(41) in that it does not use a hyper-prior on α. Instead, it uses the conjugate Gaussian prior P(w) = N(w | 0, D⁻¹ I) (Eq. (81)), where D is the input dimensionality. This prior was chosen to increase shrinkage with the number of input dimensions. The function is called by

[w, V, invV, logdetV] = vb_logit_fit_iter(X, y)

where X is the N × D input matrix with x_n as its nth row, and y is the output column vector with N elements that are either −1 or 1. The returned w and V specify the parameters w_N and V_N of the variational posterior, Eq. (48). The function additionally returns invV = V_N⁻¹ and logdetV = ln |V_N|, such that, if required, these values do not need to be re-computed.
The function computes these parameters incrementally by adding the observations x_n, y_n one by one, while optimizing ξ_n for each of these observations separately. Let V_j and w_j denote the parameters of Q*(w) after j observations have been processed. Starting with w_0 = 0, V_0 = D⁻¹I, V_0⁻¹ = DI, and ln |V_0| = −D ln D according to the prior, V_j follows an incremental rank-one update, and the incremental update of w_j is slightly more complex but follows from the same quantities. The script avoids taking the inverse of V by updating V⁻¹ and V in parallel, where the latter is based on an application of the Sherman-Morrison formula to the V⁻¹ update. ln |V_j| can be updated in a similar way, based on the matrix determinant lemma. The function initializes V_0 and w_0 in lines 41-44, and then updates their values by iterating over the x_j's for j = 1, ..., N, starting in line 48. For each j, it first updates V and w under the assumption that ξ_n = 0 and λ(ξ_n) = 1/8, leading to a simplified initial step. Then, in lines 66-90, it alternates between updating ξ_j, and V_j(ξ_j) and w_j(ξ_j), while monitoring how these updates change the variational bound L(ξ_j). The updates are performed until the variational bound changes by less than 0.001% between two consecutive updates of all parameters, or the number of iterations exceeds 500. The variational bound itself is, without the hyper-prior, given by the correspondingly simplified expression.
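The following sketch illustrates the rank-one updates described above, in their standard Jaakkola-Jordan form; variable names and the initialization values are illustrative, not the library's verbatim code.

% Rank-one update sketch for a single observation x_j with bound parameter xi_j.
D = 3;                                   % input dimensionality (illustrative)
V = eye(D) / D;  invV = D * eye(D);      % prior values V_0 and V_0^{-1}
logdetV = -D * log(D);                   % ln|V_0|
x_j = randn(D, 1);  xi_j = 1;            % one observation and its xi (illustrative)
lam = tanh(xi_j / 2) / (4 * xi_j);       % lambda(xi_j); its xi -> 0 limit is 1/8
c = 2 * lam;                             % precision increment contributed by x_j
Vx = V * x_j;                            % V_{j-1} x_j
denom = 1 + c * (x_j' * Vx);
invV = invV + c * (x_j * x_j');          % V_j^{-1}: direct rank-one update
V = V - (c / denom) * (Vx * Vx');        % V_j via the Sherman-Morrison formula
logdetV = logdetV - log(denom);          % ln|V_j| via the matrix determinant lemma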
The implementation of vb_logit_fit_ard.m differs from that of vb_logit_fit.m only by the variables bn and E_a, holding the b_Ni's and E_α(α_i)'s, now being vectors rather than scalars. Their updates and use are adjusted accordingly, in line with what has been described above.
3.5.4. Predictive density parameters. Two scripts are available to compute the predictive probability P(y = 1|x, D), namely vb_logit_pred_iter.m and vb_logit_pred.m. Their only difference is that the latter is a vectorized form of the former. Both are called by

out = vb_logit_pred(X, w, V, invV)

(and analogously for vb_logit_pred_iter.m), where X is the M × D matrix with the input vector x_m as its mth row. The variational posterior parameters w, V, and invV are those returned by either of the estimation functions discussed above. The returned vector out with M elements contains the estimated P(y_m = 1|x_m, D) as its mth element.
In terms of implementation, let us first consider vb_logit_pred_iter.m. This function iterates over all given x_m, and optimizes ξ_m for each of these separately. In order to do so, it employs a simplification of ln(|Ṽ|/|V_N|), which appears in ln P(y = 1|x, D). From the expression for Ṽ⁻¹ and the matrix determinant lemma, it can be shown that this log-ratio can be computed without explicitly evaluating either determinant. Additionally, the Sherman-Morrison formula can be applied to avoid inverting Ṽ⁻¹, by using a rank-one update of V_N instead.
The script starts with ξ_m = 0 for each x_m in lines 59-60, using some initial simplifications based on λ(ξ) = 1/8, as discussed previously. It then iterates over updating ξ, Ṽ, and w̃ in lines 66-90 until the variational bound either changes by less than 0.001% between two consecutive iterations, or the number of iterations exceeds 500. The variational bound is computed as a simplified version of ln P(y = 1|x, D), omitting all terms that are independent of ξ_m.
The vectorized script vb_logit_pred.m follows exactly the same principles, but optimizes ξ for all x at the same time, by maximizing a sum of the individual variational bounds.

3.6. Examples. In the first example, a classifier is trained on a low-dimensional training set (Fig. 4). The different x_n's are generated such that roughly half of all outputs are 1 and the other half are −1. This is achieved by setting the first element, x_n1, of each x_n to one, and by drawing the second, x_n2 ~ U(−2.5, 2.5), from a uniform distribution over the range [−2.5, 2.5]. The third element is then set to U(−2.5, 2.5) − (w_1 + x_n2 w_2)/w_3, such that, with 50% chance, w_1 + x_n2 w_2 + x_n3 w_3 > 0. The outputs are drawn from {−1, 1} according to their probabilities of being 1, as in the sketch below.
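A sketch of this data-generation procedure (the sample size and the generative coefficients w are assumptions; see the library's example scripts for the actual code):

% Data generation sketch for the low-dimensional classification example.
N = 100;  w = [1; 2; -1];                        % assumed sample size and coefficients
X = ones(N, 3);                                  % first element of each x_n is 1
X(:, 2) = 5 * rand(N, 1) - 2.5;                  % x_n2 ~ U(-2.5, 2.5)
X(:, 3) = (5 * rand(N, 1) - 2.5) ...
          - (w(1) + X(:, 2) * w(2)) / w(3);      % x_n3, so that w'x_n > 0 with 50% chance
p1 = 1 ./ (1 + exp(-X * w));                     % P(y_n = 1 | x_n, w)
y = 2 * (rand(N, 1) < p1) - 1;                   % draw labels from {-1, 1}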
To illustrate the use of shrinkage priors on all or individual dimensions, let us consider an example in which the majority of input dimensions are uninformative. We generate data with D = 1000 input dimensions, of which only D_eff = 100 are informative, and compare the test-set loss of the different classifiers. As can be seen, Fisher's linear discriminant over-fits the training set and thus features a higher test set loss than both variational methods with hyper-prior. To understand the worse performance of the hyper-prior-free variational method (VB iter), it is instructive to investigate its coefficient estimates. As seen in Fig. 5A (purple diamonds), its prior on w with covariance D⁻¹I, together with the large input dimensionality, D = 1000, and the low number of informative inputs, D_eff = 100, caused the coefficient estimates to be close to zero, such that it severely underestimated these coefficients. This is also reflected in its predictive class probabilities being close to 0.5 (Fig. 5B, purple diamonds). The variational method without ARD but with an adjustable degree of global shrinkage fares slightly better, but still under-estimates the informative coefficients due to the large number of uninformative inputs (Fig. 5A, red circles). This is again reflected in its predictive class probabilities being biased towards 0.5 (Fig. 5B, red circles). Finally, variational Bayesian logistic regression with ARD shows little sign of shrinkage of the informative input dimensions (Fig. 5A, green squares), which is also reflected in its wider spread of predictive class probabilities (Fig. 5B, green squares). Note that none of the methods performs perfectly, due to the task's considerable difficulty. The code for this example is available in vb_logit_example_highdim.m.
3.6.3. Model selection by maximizing the variational bound. As for the linear case, the variational bound L(Q, ξ) can be used as a proxy for the model log-evidence, ln P(D), to choose between models of different complexity. We assume that a 1-dimensional scalar input is mapped into a polynomial of order k, which is in turn used to produce the class labels. Given a set of observations of inputs and class labels, the task is to identify this order k and to train the associated classifier. Assuming k = 2, we generate the data accordingly (see vb_logit_example_modelsel.m). Here, Ds denotes the different orders to be tested, and the test data consist of 300 inputs spanning the range from −5 to 5. The training data consist of 50 observations drawn uniformly at random from the same range. A sketch of the selection procedure follows below.
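A sketch of the model-selection loop for this example. The feature map gen_X, the assumption that vb_logit_fit returns the variational bound as its last output value, and the variable names are all assumptions here, not quotes from the text.

% Model selection sketch for logistic regression over polynomial orders k.
gen_X = @(x, k) bsxfun(@power, x(:), 0:k);   % map scalar inputs to polynomial features
Ds = 1:5;                                    % candidate orders to test (assumed range)
L_k = nan(size(Ds));
for i = 1:length(Ds)
    % assumes the variational bound is the last return value of vb_logit_fit
    [~, ~, ~, ~, ~, L_k(i)] = vb_logit_fit(gen_X(x, Ds(i)), y);
end
[~, i_best] = max(L_k);
k_best = Ds(i_best);                         % order that maximizes the bound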
For further testing, we train a Bayesian classifier with the identified polynomial order and compare its test-set loss to that of Fisher's linear discriminant. The slightly higher test-set loss for Fisher's linear discriminant is not surprising, given that its higher model complexity is bound to lead to over-fitting the training set data. However, as illustrated in Fig. 6B, this over-fitting does not seem particularly severe in this case, as both the Bayesian estimate and the linear discriminant's estimate closely follow the true class probabilities. The code for this example is available in vb_logit_example_modelsel.m.

Figure 1. Coefficient estimates and output predictions for the 100-dimensional linear regression example. A) True coefficients vs. coefficient estimates (mean ± 95% CIs) for variational Bayesian inference (red) and maximum likelihood (blue). The estimates of both inference approaches are horizontally shifted such that their CIs can be distinguished. B) True outputs vs. test set output predictions (mean ± 95% CIs when available) for variational Bayesian inference (red) and maximum likelihood (blue). For both coefficients and output predictions, the variational Bayesian estimates are on average closer to the true values.

Figure 3. Variational bound and output prediction for identifying the order of the data-generating polynomial by Bayesian model selection. A) The variational bound, L(Q), for different generative models, as indexed by the order of the assumed data-generating polynomial. This bound lower-bounds the log model evidence, ln P(D), and shows a clear peak at order 2. The dashed line indicates the order of the true data-generating polynomial. B) Predicted outputs over inputs for the maximum likelihood estimate (blue) and variational Bayesian inference (dashed red). The black curve shows the noise-free data-generating polynomial, and the black crosses are the 10 data points based upon which the regression is performed. The shaded area indicates the 95% CIs for the output prediction of the Bayesian model.

Figure 4. Estimated coefficients and separating hyperplanes for classifiers trained on a low-dimensional training set. A) True coefficients vs. their estimates (mean ± 95% CIs where available) for variational Bayesian logistic regression with hyper-prior on w (red), variational Bayesian logistic regression without such a hyper-prior (purple), and Fisher's linear discriminant (blue). B) Separating hyperplanes (defined by w_1 + x_2 w_2 + x_3 w_3 = 0) for the three classifiers (same colors) and the generative separating hyperplane (black). The shown observations (circles / crosses) represent the training data, with symbol type indicating class membership and symbol color indicating class probability (blue barely visible, as hidden by the red line).

Figure 5. Coefficient estimates and output predictions for high-dimensional classification with few informative dimensions. A) True coefficients vs. coefficient estimates (mean ± 95% CIs where available) for Fisher's linear discriminant (blue) and variational Bayesian logistic regression without ARD and with (red) or without hyper-priors (purple), and with ARD and hyper-priors on w (green). The figure illustrates the overly strong shrinkage of the methods without ARD (red, purple) due to the large number of uninformative dimensions. B) True class probability vs. estimated class probability for 200 observations of the training set. The colors are the same as in panel A, not showing Fisher's linear discriminant. The red shaded areas are areas in which mis-classification occurs. The strong coefficient shrinkage of the methods without ARD is reflected by their predictive class probabilities being strongly biased towards 0.5.

Figure 6. Variational bound and loss for models of different complexities, and model predictions. A) Variational bound (black curve, left axis) and training/test set loss (solid/dashed red curve, right axis) for different assumed polynomial orders of the input. The black dashed line shows the true polynomial order. B) True class probability (black curve) vs. predicted class probability of variational Bayesian logistic regression (red dashed curve, using best-fit k = 2 from A) and Fisher's linear discriminant (blue curve, assuming k = 5) over the scalar input x. The blue crosses and orange circles show the provided 50 training samples. They are scattered away from 0 and 1 for better visibility.