
Bayesian Updating

Bayesian updating is a method of statistical inference where Bayes' theorem is used to update the probability distributions of model parameters based on prior beliefs and available data.

Bayes' Theorem

Bayes' theorem is defined as

$$P(\theta \mid Y) = \frac{P(Y \mid \theta)\,P(\theta)}{P(Y)},$$

where P(θ) is the prior distribution, describing the prior belief about θ. P(Y|θ) is the likelihood function, evaluating how well the data Y support our belief; it is a function of θ, not of Y. P(θ|Y) is called the posterior distribution and expresses the updated belief given the data Y. The term P(Y) is often called the marginal likelihood or the evidence. It can be calculated as the integral of the likelihood multiplied by the prior distribution over the sample space of θ

$$P(Y) = \int P(Y \mid \theta)\,P(\theta)\,\mathrm{d}\theta.$$

This term serves as a normalizing constant for the posterior probability. However, since it can be difficult or even impractical to calculate, it is often disregarded. Instead, only the product of likelihood and prior is used, as it is proportional to the posterior probability

$$P(\theta \mid Y) \propto P(Y \mid \theta)\,P(\theta).$$

Based on this relationship, the posterior distribution can be approximated without calculating P(Y) using a variety of sampling methods. Classic approaches such as rejection sampling can be inefficient, especially in multivariate cases, due to high rejection rates. Instead, Metropolis et al. proposed the use of Markov chains to increase efficiency [20].
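To see why rejection-based approaches become wasteful, consider a minimal rejection sampler for an unnormalized posterior. This is an illustrative sketch only; the toy Beta-Binomial target, the bound M, and the function names are assumptions, not part of the package.

```julia
using Distributions

# Toy rejection sampler for an unnormalized posterior p(θ|Y) ∝ P(Y|θ)P(θ).
# The Beta-Binomial target and the bound M are illustrative assumptions.
unnormalized(θ) = pdf(Binomial(10, θ), 7) * pdf(Beta(1, 1), θ)
M = 0.3  # must bound the unnormalized density on [0, 1]

function rejection_sample(n)
    samples = Float64[]
    while length(samples) < n
        θ = rand()                                          # proposal: Uniform(0, 1)
        rand() * M <= unnormalized(θ) && push!(samples, θ)  # accept or reject
    end
    return samples
end
```

Even in this one-dimensional toy problem roughly two thirds of all proposals are rejected, and the rejection rate typically grows rapidly with the dimension of θ.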

Markov Chain Monte Carlo

Markov chains are sequences of random variables, where each variable depends only on the previous one. In a discrete space Ω, the series of random variables {X_1, X_2, …, X_t} is called a Markov chain if

$$p(X_t = x_t \mid X_{t-1} = x_{t-1}, \dots, X_1 = x_1) = p(X_t = x_t \mid X_{t-1} = x_{t-1}).$$

A Markov chain is called ergodic or irreducible when it is possible to reach each state from every other state with positive probability. Markov chains that are ergodic and time-homogeneous, i.e. the transition probability between states does not depend on time, have a unique stationary distribution π such that

$$\pi(y) = \sum_{x \in \Omega} P(y \mid x)\,\pi(x).$$
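As a small concrete check of this fixed-point property, consider a two-state chain; the transition matrix below is an illustrative example, not taken from the package.

```julia
# Two-state chain illustrating the fixed-point property π(y) = Σₓ P(y|x) π(x).
# Column j of P holds the transition probabilities out of state j,
# so each column sums to one and the stationary π satisfies P * π = π.
P = [0.9 0.5;
     0.1 0.5]

π_stat = P^1000 * [0.5, 0.5]  # repeated transitions from a uniform start
P * π_stat ≈ π_stat           # true: π_stat ≈ [5/6, 1/6] is left unchanged by P
```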

The goal of Markov chain Monte Carlo (MCMC) sampling methods is to construct a Markov chain whose stationary distribution is equal to the posterior distribution of Bayes' theorem. Samples generated from such a Markov chain are then equivalent to random samples from the desired distribution. The original MCMC method is the Metropolis algorithm, which was later generalized into the Metropolis-Hastings (MH) algorithm.

Metropolis-Hastings

The Metropolis-Hastings algorithm was published in 1970 by W. K. Hastings [21]. The MH algorithm is a random-walk algorithm that provides a selection criterion for choosing the next sample θ_{i+1} in a Markov chain. This is done through a so-called proposal distribution q(θ*|θ_i), which is well known and relatively easy to sample from. Usually, symmetric proposal distributions centred at θ_i are used, for example Normal and Uniform distributions. A candidate sample θ* is sampled from the proposal distribution and accepted with probability α

$$\alpha = \min\left[1, \frac{P(\theta^* \mid Y)}{P(\theta_i \mid Y)} \frac{q(\theta_i \mid \theta^*)}{q(\theta^* \mid \theta_i)}\right].$$

Substituting the posterior with Bayes' theorem yields

$$\alpha = \min\left[1, \frac{P(Y \mid \theta^*)\,P(\theta^*)/P(Y)}{P(Y \mid \theta_i)\,P(\theta_i)/P(Y)} \frac{q(\theta_i \mid \theta^*)}{q(\theta^* \mid \theta_i)}\right].$$

Note how the normalization constant P(Y) cancels out. When the proposal is symmetric, i.e. q(θ_i|θ*) = q(θ*|θ_i), the acceptance probability further simplifies to

$$\alpha = \min\left[1, \frac{P(\theta^* \mid Y)}{P(\theta_i \mid Y)}\right].$$

In practice, a random number r ∼ U(0,1) is sampled, and the candidate sample θ* is accepted if α ≥ r:

$$\theta_{i+1} = \begin{cases} \theta^* & \text{if } \alpha \geq r, \\ \theta_i & \text{otherwise.} \end{cases}$$
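The accept/reject rule above condenses into a few lines of code. The following is a minimal, self-contained sketch of a random-walk MH sampler with a symmetric Normal proposal; the names `metropolis` and `logtarget` are illustrative and not part of UncertaintyQuantification.jl, which provides its own implementation below.

```julia
using Distributions

# Minimal random-walk Metropolis sketch with a symmetric Normal proposal.
# `metropolis` and `logtarget` are illustrative names, not package API.
function metropolis(logtarget, θ0, n; step=0.2)
    θ = θ0
    samples = Vector{typeof(θ0)}(undef, n)
    for i in 1:n
        θ_cand = θ + rand(Normal(0, step))        # propose a candidate
        logα = logtarget(θ_cand) - logtarget(θ)   # log of the acceptance ratio
        if log(rand()) <= min(0, logα)            # accept with probability α
            θ = θ_cand
        end
        samples[i] = θ
    end
    return samples
end
```

Working with the log-target means the ratio becomes a difference, which avoids numerical underflow for small densities.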

As an example, consider a synthetic data sequence Y given by the outcomes of 100 Bernoulli trials with unknown success probability p (here p = 0.8).

```julia
using UncertaintyQuantification, DataFrames, Distributions

n = 100
Y = rand(n) .<= 0.8
```

The likelihood function, which (like a Model) must accept a DataFrame, follows a Binomial distribution and returns the likelihood for each row of the DataFrame as a vector. The prior is chosen as a Beta distribution with α = β = 1 (uniform on [0, 1]). It is often beneficial to use the log-likelihood and log-prior for numerical reasons.

```julia
function loglikelihood(df)
    return [sum(logpdf.(Binomial.(n, df_i.p), sum(Y))) for df_i in eachrow(df)]
end

logprior = df -> logpdf.(Beta(1, 1), df.p)
```

UncertaintyQuantification.jl implements a variant of the MH algorithm known as single-component Metropolis-Hastings, where the proposal and acceptance steps are performed independently for each dimension. To run the algorithm, we must first define the SingleComponentMetropolisHastings object, which requires a UnivariateDistribution as the proposal, a NamedTuple x0 defining the starting point of the Markov chain, the number of samples, and the number of burn-in samples. The burn-in samples are used to start the chain but are later discarded.

```julia
proposal = Normal(0, 0.2)
x0 = (; p = 0.5)
n_samples = 4000
burnin = 500

mh = SingleComponentMetropolisHastings(proposal, x0, n_samples, burnin)
```

SingleComponentMetropolisHastings(Normal{Float64}(μ=0.0, σ=0.2), (p = 0.5,), 4000, 500, true)

The final optional argument islog defaults to true and can be omitted when passing the log-likelihood and log-prior. When set to false, the algorithm applies the logarithm to both the likelihood and the prior internally. Finally, the algorithm is executed using the bayesianupdating function, which returns the samples and the average acceptance rate.

```julia
mh_samples, α = bayesianupdating(logprior, loglikelihood, mh)
```

The following figure shows a histogram of the samples returned by the Metropolis-Hastings algorithm. For comparison, we also plot the analytical posterior distribution obtained using conjugate priors [22].
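For this conjugate Beta-Binomial model the posterior is available in closed form, which is what the comparison uses: with s = ∑ᵢ Yᵢ observed successes in n trials and a Beta(α, β) prior,

$$P(p \mid Y) = \mathrm{Beta}(\alpha + s,\; \beta + n - s) = \mathrm{Beta}(1 + s,\; 1 + n - s),$$

so with roughly 80 successes out of 100 trials the posterior concentrates around p = 0.8.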

As a second example, we will attempt to sample from a bimodal target distribution in two dimensions. The prior is uniform over [-2, 2] in each dimension and the likelihood is a mixture of two Gaussian functions centred at [-0.5, -0.5] and [0.5, 0.5]. The standard deviations of both Gaussians are identical and, if small enough, effectively disconnect the two modes.

```julia
prior = Uniform(-2, 2)
logprior = df -> logpdf.(prior, df.x) .+ logpdf.(prior, df.y)

N1 = MvNormal([-0.5, -0.5], 0.1)
N2 = MvNormal([0.5, 0.5], 0.1)

loglikelihood =
    df -> log.([0.5 * pdf(N1, collect(x)) + 0.5 * pdf(N2, collect(x)) for x in eachrow(df)])

n = 2000
burnin = 500

x0 = (; x=0.0, y=0.0)

proposal = Normal()

mh = SingleComponentMetropolisHastings(proposal, x0, n, burnin)

mh_samples, α = bayesianupdating(logprior, loglikelihood, mh)
```

The scatter plot (with the target contour also shown) clearly shows that the MH algorithm has converged to only one of the two peaks of the bimodal target. This is a known weakness of the MH algorithm. However, a number of alternative MCMC methods aim to solve this problem. One of these methods, Transitional Markov Chain Monte Carlo [23], is presented next.

Transitional Markov Chain Monte Carlo

The Transitional Markov Chain Monte Carlo (TMCMC) method [23] is an extension of the MH algorithm where, instead of directly sampling a complex posterior distribution, samples are obtained from a series of simpler transitional distributions. The samples are obtained from independent single-step Markov chains. The transitional distributions are defined as

$$P_j \propto P(Y \mid \theta)^{\beta_j}\,P(\theta),$$

where j ∈ {1, …, m} is the number of the transition step and β_j is a tempering parameter with 0 = β_0 < β_1 < ⋯ < β_m = 1. This enables a slow transition from the prior to the posterior distribution. An important part of the TMCMC algorithm is that the tempering parameter has to be selected so as to ensure the transition is smooth and gradual. The algorithm's authors suggest choosing the parameter such that a coefficient of variation of 100% is maintained in the likelihood ratios P(Y|θ_i)^(β_j − β_{j−1}). At each level j, the starting points for the independent Markov chains are randomly sampled (with replacement) from the current set of samples using the statistical weights

$$w(\theta_i) = \frac{P(Y \mid \theta_i)^{\beta_j - \beta_{j-1}}}{\sum_{i=1}^{N} P(Y \mid \theta_i)^{\beta_j - \beta_{j-1}}}.$$

The complete TMCMC algorithm can be summarized as

  1. Set j = 0 and β_j = 0. Sample θ_i ∼ P(θ).

  2. Set j = j + 1.

  3. Compute the next tempering parameter β_j.

  4. Determine the weights w(θ_i).

  5. Generate a single-step Markov chain for each θ_i.

  6. Repeat steps (2) to (5) until (and including) β_j = 1.
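The following is a rough sketch of step (3), choosing β_j by bisection so that the coefficient of variation of the weights stays near the suggested 100%. It is illustrative only; the function name and the fixed iteration count are assumptions, not the implementation in UncertaintyQuantification.jl.

```julia
using Statistics

# Sketch: pick the next tempering parameter β_j by bisection such that the
# coefficient of variation of w_i = P(Y|θ_i)^(β_j - β_{j-1}) is near 100%.
function next_beta(loglik, β_prev; target_cov=1.0)
    lo, hi = β_prev, 1.0
    for _ in 1:50
        β = (lo + hi) / 2
        # weights up to a constant factor; subtracting the maximum
        # log-likelihood avoids overflow and does not change the CoV
        w = exp.((β - β_prev) .* (loglik .- maximum(loglik)))
        if std(w) / mean(w) > target_cov
            hi = β   # weights too degenerate: take a smaller step
        else
            lo = β   # transition still smooth: β can be increased
        end
    end
    return hi
end
```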

We now return to the bimodal example, this time using the TMCMC algorithm. In order to apply a different MCMC algorithm, we only need to construct a TransitionalMarkovChainMonteCarlo object and pass it to the bayesianupdating method. The definitions of prior and likelihood remain the same. In contrast to SingleComponentMetropolisHastings, the log evidence is returned instead of the acceptance rate.

```julia
tmcmc = TransitionalMarkovChainMonteCarlo(RandomVariable.(Uniform(-2, 2), [:x, :y]), n, burnin)

tmcmc_samples, S = bayesianupdating(logprior, loglikelihood, tmcmc)
```

The resulting scatter plot shows how TMCMC is able to sample both peaks of the bimodal target distribution. The standard implementation of TMCMC uses a multivariate Gaussian proposal distribution centred at each θ_i, with covariance matrix Σ estimated from the current weighted samples and scaled by a factor β², where β defaults to 0.2 as suggested by the authors and can optionally be passed to the constructor as a fourth argument. Applying different MCMC algorithms nested inside TMCMC gives rise to variants of the algorithm. For example, it is possible to use the previously introduced SingleComponentMetropolisHastings, resulting in SingleComponentTransitionalMarkovChainMonteCarlo.

Note

SingleComponentTransitionalMarkovChainMonteCarlo is currently not available but planned for implementation.

For convenience, the prior can be automatically constructed from the random variables passed to TransitionalMarkovChainMonteCarlo.

```julia
tmcmc = TransitionalMarkovChainMonteCarlo(RandomVariable.(Uniform(-2, 2), [:x, :y]), n, burnin)

tmcmc_samples, S = bayesianupdating(loglikelihood, tmcmc)
```

Maximum likelihood and maximum a posteriori estimates

Instead of using sample-based methods, it is also possible to calculate the local maxima of the likelihood P(Y|θ) or the posterior P(θ|Y). While this does not directly give an estimate of the full parameter distribution, it allows for the estimation of the most probable parameter values. Calculating the maximum (or maxima, depending on the use case) of the likelihood is referred to as the maximum likelihood estimate (MLE); using the full (unnormalized) posterior gives the maximum a posteriori (MAP) estimate. Note that, technically, the MLE is not a Bayesian estimate as the prior distribution is ignored. More formally,

$$\theta_{\text{MLE}} = \arg\max_\theta P(Y \mid \theta), \qquad \theta_{\text{MAP}} = \arg\max_\theta P(Y \mid \theta)\,P(\theta).$$

Generally, the MAP estimate can be considered a regularized version of the MLE, since the prior distribution is used to constrain the estimates. Both estimates are found with optimization schemes and thus come with the usual drawbacks, including convergence to local maxima; both methods are therefore sensitive to initial conditions. Further note that both estimates calculate not the mean but the mode of the respective distribution, i.e. for distributions with high variance these estimates do not provide much information about the parameter distribution.
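To make the regularizing effect concrete, consider the standard one-dimensional example (a textbook result, not specific to the package) of data y_1, …, y_n ∼ N(θ, σ²) with prior θ ∼ N(μ_0, τ²). Then

$$\theta_{\text{MLE}} = \bar{y}, \qquad \theta_{\text{MAP}} = \frac{\mu_0/\tau^2 + n\bar{y}/\sigma^2}{1/\tau^2 + n/\sigma^2},$$

i.e. the MAP estimate is a precision-weighted average that pulls the sample mean toward the prior mean μ_0; as n grows, the data dominate and the two estimates coincide.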

The implementation in UncertaintyQuantification.jl is analogous to the sampling methods. A MaximumAPosterioriBayesian or a MaximumLikelihoodBayesian object is created that takes the important settings as input: the prior, the optimization method to use, and the starting points for the optimization. For convenience, multiple starting points can be specified. The optimization method is given as a string so that the Optim.jl package does not need to be included in the script. Finally, the bayesianupdating function is used again with the known syntax, i.e. a likelihood, a vector of UQModels, and the desired method object are given, and a DataFrame is returned. The log-densities are given in the DataFrame as the variable :maxval. Below is an example of the implementation.

```julia
μ = 0
σ = 0.2

prior = RandomVariable.(Normal(μ, σ), [:x, :y])

N1 = MvNormal([-0.5, -0.5], 0.1)
N2 = MvNormal([0.5, 0.5], 0.1)

loglikelihood =
    df -> log.([0.5 * pdf(N1, collect(x)) + 0.5 * pdf(N2, collect(x)) for x in eachrow(df)])

priorFunction =
    df -> prod.([pdf(Normal(μ, σ), collect(x)) for x in eachrow(df)])

x0 = [[0.1, 0.1], [-0.1, -0.1]]

MAP = MaximumAPosterioriBayesian(prior, "LBFGS", x0)
MLE = MaximumLikelihoodBayesian(prior, "LBFGS", x0)

mapestimate = bayesianupdating(loglikelihood, UQModel[], MAP)
mlestimate = bayesianupdating(loglikelihood, UQModel[], MLE)

# `xs` and `ys` are plotting grid coordinates; `likelihood_eval` and
# `prior_eval` are the densities evaluated on that grid (shown as contours)
contour!(xs, ys, likelihood_eval .* prior_eval / 0.05; lim=[-2, 2], c=:black)
scatter!(mapestimate.x, mapestimate.y; lim=[-2, 2], label="MAP estimate", c=:black)
scatter!(mlestimate.x, mlestimate.y; lim=[-2, 2], label="ML estimate", c=:red)

# dummy series to create legend entries for the contour colors
plot!([0, 0], [0, 0]; c=:red, label="Likelihood")
plot!([0, 0], [0, 0]; c=:blue, label="Prior")
plot!([0, 0], [0, 0]; c=:black, label="Posterior")
```

The figure shows the (bimodal) likelihood in red and the prior distribution in blue. The difference between MAP and MLE is clearly visible: the MLE coincides directly with the maxima of the likelihood, while the MAP is shifted in the direction of the prior mean.

Bayesian calibration of computer simulations

UncertaintyQuantification.jl allows complex computer models to be included in the Bayesian updating procedure, if, for example, one wishes to infer unknown model parameters from experimental data of model outputs. Several models can be evaluated in order to compute the likelihood function by passing a vector of UQModels to the bayesianupdating method. These are executed before the likelihood is evaluated, and the model outputs are available in the DataFrame inside the likelihood function. There is no restriction on the type of UQModel used; for example, it is possible to use an ExternalModel and call an external solver.
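As a hypothetical sketch of how a model enters the procedure (the model, its output name :w, the data, and the noise level are all illustrative assumptions; the call signature with a model vector mirrors the MAP/MLE example above):

```julia
# Hypothetical sketch: infer a parameter :k from noisy measurements of a
# model output :w. Model, data and noise level are illustrative assumptions.
displacement = Model(df -> 1.0 ./ df.k, :w)  # toy response w = 1/k
data = [0.11, 0.095, 0.102]                  # assumed measurements

loglikelihood = df -> [sum(logpdf.(Normal(w, 0.01), data)) for w in df.w]

prior = RandomVariable(Uniform(5, 15), :k)
tmcmc = TransitionalMarkovChainMonteCarlo([prior], 1000, 200)

samples, S = bayesianupdating(loglikelihood, UQModel[displacement], tmcmc)
```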

For a complete example refer to the Inverse eigenvalue problem.