I watched this ML Summer School 2009 @ Cambridge intro talk by Iain Murray. It's an excellent 2h27m introduction. Slides are also provided.

MCMC = Markov Chain Monte Carlo

- a class of randomized methods that construct a trajectory through a (possibly) high-dimensional state space that satisfies the Markov property, i.e. the next state depends only on the current state, and can be used to draw (approximate) samples from probability distributions

It's a random walker that explores the state space such that the visited states are distributed according to a certain distribution. The trajectory of this random walker is simulated by a Markov chain.

*[source: Murray, MLSS 2009 slides]*

The objective of MCMC algorithms is to asymptotically generate states x according to a desired distribution P(x), and they use a stochastic process to achieve this.

This simple intuition is also underlined in the following YouTube video:

Monte Carlo algorithms are randomized algorithms. They can be used to yield approximations where analytical computation is not tractable.

Consider e.g. computing PI. Its computation is actually tractable, but this is just an example to show that we could also compute it with a Monte Carlo algorithm.

The area of a unit circle is PI*r^2 = PI (since r=1). So in one quarter it is PI/4.

We could now randomly sample points (x,y) in the unit square [0,1]^2 and check whether each point (x,y) is within that quarter circle, i.e. whether x^2 + y^2 <= 1. In that case we increase a counter A. Furthermore we count how many samples B we have drawn in total.

Then the ratio of “samples inside circle” / “total nr of samples” = A/B ≈ PI/4 ⇒ 4*A/B ≈ PI.

Thus we can approximate PI by using a Monte Carlo algorithm. Great!
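The scheme above can be sketched in a few lines of Python (a toy illustration, not code from the talk; the function name and seeding are my own choices):

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate PI by sampling uniform points in the unit square [0,1]^2
    and counting how many fall inside the quarter unit circle."""
    rng = random.Random(seed)
    inside = 0                                  # counter A
    for _ in range(n_samples):                  # B samples in total
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:                # point lies inside the quarter circle
            inside += 1
    return 4.0 * inside / n_samples             # 4 * A / B ≈ PI

print(estimate_pi(100_000))
```

With 100,000 samples the estimate is typically within a few hundredths of PI; the error shrinks like 1/sqrt(B).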

*[source: Murray, MLSS 2009 slides]*

If I give you a probability density function (pdf) f(x), how can you give me samples drawn from that pdf?

Rejection sampling is a simple algorithm that gives you samples from that pdf! Cool

Wikipedia explains the basic idea quite well:

Rejection sampling (in German: “Verwerfungsmethode”) is based on the observation that to sample a random variable one can sample uniformly from the region under the graph of its density function.

To visualize this motivation, imagine graphing the distribution of a random variable onto a large rectangular board and throwing darts at it. Assume that the darts are uniformly distributed around the board. Now take off (reject) all of the darts that are outside the area under the curve. The remaining darts will be distributed uniformly within the area under the curve, and the x-positions of these darts will be distributed according to the random variable's density. This is because there is the most room for the darts to land where the curve is highest and thus the probability density is greatest.

So the algorithm has 3 steps to draw a sample from the pdf:

- sample a point (an x-position) from the proposal distribution
- draw a vertical line at this x-position, up to the curve of the (scaled) proposal distribution M·q(x), which must lie above the desired pdf everywhere
- sample uniformly along this line (i.e. uniformly from 0 to M·q(x)), and reject the point if it lies outside the desired distribution (i.e. if the sampled height is greater than the value of the desired pdf at this x-position)
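The three steps above can be sketched in Python. This is a hedged toy version: the target density 6x(1-x) (a Beta(2,2) density), the uniform proposal on [0,1], and the envelope height m are my own illustrative choices, not from the slides:

```python
import random

def rejection_sample(target_pdf, n, m, rng):
    """Draw n samples from target_pdf on [0,1] using a uniform proposal.
    m must be an upper bound on target_pdf over [0,1] (the envelope height)."""
    samples = []
    while len(samples) < n:
        x = rng.random()             # step 1: sample an x-position from the proposal
        u = rng.random() * m         # steps 2+3: uniform height along the vertical line
        if u <= target_pdf(x):       # accept only darts under the target curve
            samples.append(x)
    return samples

# toy target: Beta(2,2) density p(x) = 6 x (1-x), whose maximum is 1.5
rng = random.Random(0)
xs = rejection_sample(lambda x: 6 * x * (1 - x), 10_000, 1.5, rng)
print(sum(xs) / len(xs))  # the Beta(2,2) mean is 0.5
```

The acceptance rate here is (area under target)/(area of envelope) = 1/1.5 ≈ 67%; in high dimensions this rate collapses, which is exactly the disadvantage discussed below.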

But there is a big disadvantage of this simple approach:

In high dimensions, the pdf may be >0 only in some very restricted area. So you will generate a lot of rejections before a useful sample is generated! This makes the algorithm inefficient and impractical (curse of dimensionality).

Solution: in high dimensions it is necessary to use a different approach, typically a Markov chain Monte Carlo method such as Metropolis sampling or Gibbs sampling. Both of these more sophisticated methods still build on this basic idea, though.

*[source: Murray, MLSS 2009 slides]*

If it is hard to get samples x_l from the true pdf p(), you can use a trick known as importance sampling: sample from another pdf q() and re-weight these samples x_l by p(x_l)/q(x_l):
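A minimal sketch of this re-weighting in Python, assuming a toy setup where the true pdf is p = N(0,1) and the easier pdf is q = N(0,2); both distributions and the estimated quantity E_p[x^2] are my own illustrative choices:

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, n, rng):
    """Estimate E_p[f(X)] for p = N(0,1) using samples from q = N(0,2),
    re-weighting each sample x_l by w_l = p(x_l) / q(x_l)."""
    total, wsum = 0.0, 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                        # sample from the easy pdf q
        w = normal_pdf(x, 0, 1) / normal_pdf(x, 0, 2)  # importance weight p/q
        total += w * f(x)
        wsum += w
    return total / wsum   # self-normalized importance sampling estimate

rng = random.Random(1)
print(importance_estimate(lambda x: x * x, 50_000, rng))  # true value: E_p[x^2] = 1
```

Self-normalizing by the weight sum (rather than dividing by n) also works when p is only known up to a constant.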

It's a Monte Carlo based search / optimization algorithm. Let's say you want to maximize some energy E that depends on your parameters θ, i.e. find argmax_θ E(θ).

Then the Metropolis algorithm starts with some initial parameters, and in each step a new proposal state x is computed for the current location θ_t in parameter space.

But we accept that new proposal x as the new location θ_{t+1}=x only if …

- it is better, i.e. E(x) >= E(θ_t), i.e. ΔE = E(x) - E(θ_t) >= 0, or
- it is worse, i.e. ΔE < 0, AND rand() < exp(ΔE/(kT)) (since ΔE < 0 here, this acceptance probability is below 1)

Otherwise we set θ_{t+1} = θ_t.

T is called the temperature, k is called the Boltzmann constant.

The main point here is that we also accept (energy) states that are worse than the previously visited one, with some probability depending on how much worse this new energy state is.

Thereby it is similar to the hill climbing algorithm. But the difference is that the hill climber only moves to 'better' states, i.e. will never accept worse states, and is thereby not able to pass through small local optima while traversing the search / state space.

Making the temperature T time-dependent, i.e. T(t), is called simulated annealing. It can be used to start the search / optimization by considering a large part of the search space and then gradually focusing the search as simulated time increases.
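The acceptance rule plus a time-dependent temperature can be sketched as follows (a toy Python version with kT folded into a single temperature value; the cooling schedule, step size and test energy function are my own illustrative choices):

```python
import math, random

def anneal(energy, theta0, n_steps, rng, t0=1.0, cooling=0.999, step=0.5):
    """Metropolis search maximizing energy(theta), with a time-dependent
    temperature T(t) = t0 * cooling**t (simulated annealing)."""
    theta, best = theta0, theta0
    temp = t0
    for _ in range(n_steps):
        x = theta + rng.gauss(0.0, step)      # propose a nearby state
        d_e = energy(x) - energy(theta)       # ΔE = E(x) - E(θ_t)
        # accept if better (ΔE >= 0), or if worse with probability exp(ΔE / T)
        if d_e >= 0 or rng.random() < math.exp(d_e / temp):
            theta = x
        if energy(theta) > energy(best):
            best = theta
        temp *= cooling                       # cool down as simulated time passes
    return best

# toy energy with small local optima; its global maximum is at theta = 0
rng = random.Random(2)
e = lambda th: -th * th + 0.5 * math.cos(5 * th)
print(anneal(e, theta0=4.0, n_steps=5000, rng=rng))
```

At high temperature almost every move is accepted (broad exploration); as T shrinks the walker behaves more and more like a pure hill climber.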

Iain Murray explains the Metropolis algorithm in his talk using the example of doing linear regression:

*[source: Murray, MLSS 2009 slides]*

It generates a random walk using a proposal density and a method for rejecting proposed moves.

What's the relation to the algorithm above?

Well, the Metropolis algorithm is a special case of the Metropolis-Hastings algorithm. So this one here is more general.

The difference is that Metropolis-Hastings (MH) allows a proposal distribution Q(x'|x), which proposes a new state x' given the current state x, that need not be symmetric.

And where do I get Q from?

Well, it is mostly arbitrary! For the plain Metropolis algorithm it only has to be symmetric, i.e. Q(x'|x)=Q(x|x'). (Note: symmetry of Q is not the same as 'detailed balance'; detailed balance is the condition P(x)·T(x→x') = P(x')·T(x'→x) on the resulting transition probabilities, and the Metropolis acceptance rule is constructed so that it holds.) The acceptance probability is then:

r = min(1, p*(x') / p*(x))

If Q is not symmetric (asymmetric), we need the so-called Hastings correction for the acceptance ratio:

r = min(1, alpha) with alpha = p*(x')·q(x|x') / ( p*(x)·q(x'|x) )

also shown in the following slide:

*[source: Murray, MLSS 2009 slides]*

The correction is needed to compensate for the possibility that the proposal distribution itself might favor certain states.

Iain Murray also shows in his talk how simple this algorithm is; his slides contain the whole Octave/Matlab code of this algorithm:

*[source: Murray, MLSS 2009 slides]*
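The slide code itself is not reproduced here; as a rough stand-in, a minimal Python sketch of the Metropolis algorithm (symmetric Gaussian proposal, unnormalized target p*) might look like this. The log-space formulation and the toy Gaussian target are my own choices, not Murray's code:

```python
import math, random

def metropolis(log_pstar, x0, n, step, rng):
    """Metropolis sampler: symmetric Gaussian proposal Q(x'|x),
    acceptance ratio r = min(1, p*(x') / p*(x)) on the unnormalized target p*."""
    x = x0
    chain = []
    for _ in range(n):
        x_new = x + rng.gauss(0.0, step)          # symmetric proposal
        log_r = log_pstar(x_new) - log_pstar(x)   # log of p*(x') / p*(x)
        if math.log(rng.random()) < log_r:        # accept with probability min(1, r)
            x = x_new
        chain.append(x)                           # on rejection, repeat the old state
    return chain

# toy target: unnormalized standard normal, log p*(x) = -x^2/2
rng = random.Random(3)
chain = metropolis(lambda x: -0.5 * x * x, x0=5.0, n=20_000, step=1.0, rng=rng)
samples = chain[2000:]  # discard the burn-in portion
print(sum(samples) / len(samples))
```

Working with log p* avoids numerical underflow and, since only the ratio matters, the normalization constant of P never needs to be known.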

And what's the benefit from this algorithm?

We can use it to sample from arbitrary distributions P! So, the random walker will give us samples distributed according to P! Great

Why? The intuition behind this is described very well here:

This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio a indicates how probable the new proposed sample is with respect to the current sample, according to the distribution P(x). If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of P(x)), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the more the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of P(x), while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works, and returns samples that follow the desired distribution P(x).

*[source: Wikipedia]*

The influence of different step sizes (here: step size = sigma of the Gaussian) in the proposal distribution Q is visualized in the following demo slide by Iain Murray:

*[source: Murray, MLSS 2009 slides]*

In the Metropolis-Hastings algorithm we reject a lot of local moves. The Gibbs sampling algorithm is a variant with no rejections! Great

*[source: Murray, MLSS 2009 slides]*

The basic idea of Gibbs sampling is to sample each variable in turn, conditioned on the current values of all other variables in the distribution. Doing this for all d variables of a d-dimensional state vector x yields the new state vector x'.

Other names for this method in physics: “Glauber dynamics”, “heat bath”.

Software for Gibbs sampling:

- BUGS = “Bayesian inference Using Gibbs Sampling”
- JAGS = “Just Another Gibbs Sampler”

Disadvantages of Gibbs sampling:

- can be quite slow: we sequentially draw each new argument conditioned on all the others

Here is a good example showing how Gibbs sampling works:

Suppose we want to simulate the random vector (X, Y, Z) having joint density f(x, y, z). Suppose we can find and simulate from the conditional distributions of X | (Y, Z), Y | (X, Z) and Z | (X, Y). To simulate X, Y, Z, we start with initial values x_0, y_0, z_0. These could be sampled from an arbitrary distribution or just arbitrary numbers. We first update X from X | (Y = y_0, Z = z_0), getting x_1. We then update Y from Y | (X = x_1, Z = z_0), getting y_1, and then update Z from Z | (X = x_1, Y = y_1). This finishes the first cycle. Notice that we always update from the most recent values for the random variables. We can continue iterating this chain for as many cycles as we need.

It is easily seen that this is a (time-homogeneous) Markov chain cycle by cycle and that f(x, y, z) is a stationary distribution for this Markov chain. Therefore, if the conditions of the ergodic theorem are satisfied, then f(x, y, z) is a limiting distribution of this chain. Therefore if we run the chain a long time (“burn it in”), the observations we get will have this distribution.
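The cycle described above can be sketched for a two-variable case where the conditionals are known in closed form: a bivariate standard normal with correlation rho, for which X | Y=y ~ N(rho·y, 1-rho^2) and vice versa. This example is my own illustration, not from the quoted text:

```python
import math, random

def gibbs_bivariate_normal(rho, n, rng, x0=0.0, y0=0.0):
    """Gibbs sampling for a bivariate standard normal with correlation rho.
    Each cycle updates X from X | Y and then Y from the most recent X."""
    x, y = x0, y0
    s = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    chain = []
    for _ in range(n):
        x = rng.gauss(rho * y, s)    # update X from X | Y = y
        y = rng.gauss(rho * x, s)    # update Y from Y | X = x (newest value)
        chain.append((x, y))
    return chain

rng = random.Random(4)
chain = gibbs_bivariate_normal(rho=0.8, n=20_000, rng=rng)
xs = [x for x, _ in chain[1000:]]    # discard burn-in
ys = [y for _, y in chain[1000:]]
corr = sum(a * b for a, b in zip(xs, ys)) / len(xs)  # means ~0, variances ~1
print(corr)  # should be close to rho = 0.8
```

Note that every conditional draw is accepted; there is no rejection step as in Metropolis-Hastings.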

Well, it is said that the samples from the random walker are distributed according to the probability distribution P. But that is only true for the 'later' samples of the chain!

The Markov chain is started from an arbitrary initial value x_0 and the algorithm is run for many iterations until this initial state is “forgotten”. These starting samples have to be discarded and the corresponding phase of this stochastic process (MCMC is a stochastic process) is known as “burn-in”. The remaining set of accepted values of x represent a sample from the distribution P(x).

The following example shows 3 MCMC chains used for searching the minimum of the 3D Rosenbrock test function (which is non-convex). Note that only the red points are used at the end. The other points are discarded, since they stem from the “burn-in” period, where we start at a random point in the 3D state space, which need not have anything to do with the underlying pdf.

We then say that the Markov chain “has entered its stationary distribution” or “has mixed”, i.e. it has entered the phase of the red points. In practice, for simplicity, often just some proportion of the initial samples (e.g. 25%) is discarded.

*[source: Wikipedia]*

One requirement on the proposal distribution Q:

- Q(x)>0 whenever P(x)>0, so that Q does not “ignore” any states that have nonzero probability under P

Contra #1: Samples are correlated.

The samples are correlated. Even though over the long term they do correctly follow P(x), a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that if we want a set of independent samples, we have to throw away the majority of samples and only take every nth sample, for some value of n (typically determined by examining the auto-correlation between adjacent samples). Auto-correlation can be reduced by increasing the jumping width (the average size of a jump, which is related to the variance of the jumping distribution), but this will also increase the likelihood of rejection of the proposed jump. Too large or too small a jumping size will lead to a slow-mixing Markov chain, i.e. a highly correlated set of samples, so that a very large number of samples will be needed to get a reasonable estimate of any desired property of the distribution.

[source: Wikipedia]
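Taking every nth sample ("thinning") and measuring the auto-correlation can be sketched like this; a toy AR(1) process stands in for correlated MCMC output, and all parameters are my own illustrative choices:

```python
import random

def autocorrelation(xs, lag):
    """Lag-k sample auto-correlation of a chain."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

# toy AR(1) chain mimicking correlated MCMC output: x_t = 0.9 x_{t-1} + noise
rng = random.Random(5)
x, chain = 0.0, []
for _ in range(50_000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    chain.append(x)

thinned = chain[::10]  # keep only every 10th sample
print(autocorrelation(chain, 1), autocorrelation(thinned, 1))
```

For this chain the raw lag-1 auto-correlation is about 0.9, while the thinned chain's drops to roughly 0.9^10 ≈ 0.35, so the thinned samples are much closer to independent.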

Contra #2: “Burn-in” period is needed

Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary, where an initial number of samples (e.g. the first 1,000 or so) are thrown away.

[source: Wikipedia]

Pros:

- easy to implement
- very flexible method

- a stationary distribution π for a Markov chain is a distribution over the states such that if we start the Markov chain in π, we stay in π

irreducible (German: “unzerlegbar”, “nicht verminderbar”)

We say that the Markov chain is irreducible if we can get from any state to any other states (possibly in several steps).

We say that a Markov chain has period k > 1 if it can only return to its present state X_t at times t + k, t + 2k, … We say a Markov chain is aperiodic if it does not have period k for any k > 1.

We say that the Markov chain is recurrent if we are sure to come back to any state and transient otherwise. We say that a recurrent state is positive recurrent if the expected time till we return is finite. We say it is null recurrent if this time is infinite.

A state i is said to be ergodic if it is aperiodic and positive recurrent. If all states in an irreducible Markov chain are ergodic, then the chain is said to be ergodic.

Ergodic theorem: A Markov chain which is aperiodic, irreducible and positive recurrent has a limiting distribution.

1. aperiodic: guarantees that the Markov chains we consider have some chance of staying where they are.

2. irreducible: Often, the Markov chains have bad "mixing", which means that they are nearly reducible, i.e., that there are portions of the state space between which transitions rarely occur. Although the Markov chain would converge in this situation, the convergence is very slow. Often modifications are made to the Markov chain to improve mixing.

3. positive recurrence: is often hard to establish for Markov chains. One nice situation: if a Markov chain has only a finite number of states and is irreducible, then it must be positive recurrent.

Time-homogeneous Markov chains are processes where

P(X_{t+1} = x' | X_t = x) = P(X_t = x' | X_{t-1} = x)

i.e. the transition probabilities are independent of time / do not change as time goes by.
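The notion of a stationary distribution for a time-homogeneous chain can be illustrated numerically: starting from any initial distribution and repeatedly applying the same transition matrix P, the distribution converges to a π with πP = π. The 2-state matrix below is my own toy example:

```python
def step(dist, P):
    """One Markov-chain step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# a 2-state transition matrix (rows sum to 1); time-homogeneous means the
# same P is applied at every step
P = [[0.9, 0.1],
     [0.5, 0.5]]

dist = [1.0, 0.0]          # start entirely in state 0
for _ in range(100):
    dist = step(dist, P)
print(dist)                # converges to the stationary distribution
```

For this P the stationary distribution can be solved by hand from πP = π: π = (5/6, 1/6), and the iteration above converges to exactly these values.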

It means: sampled states from the Markov chain x_t, x_{t-1}, x_{t-2}, … are distributed as if drawn from P(X).

- “compared with MCMC methods, particle filters estimate the distribution of only one of the latent variables at a time, rather than attempting to estimate them all at once, and produce a set of weighted samples, rather than a (usually much larger) set of unweighted samples” (source: Wikipedia)

- this paper cites another paper where it was found that …

It has been found that PSO can outperform MCMC in certain situations, in particular when there are a large number of local maxima present and/or the dimensionality of the search space is very high

public/mcmc.txt · Last modified: 2012/11/26 11:34 (external edit)