Metropolis–Hastings algorithm

In statistics and statistical physics, the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution from which direct sampling is difficult. This sequence can be used to approximate the distribution (e.g. to generate a histogram) or to compute an integral (e.g. an expected value). Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, there are usually other methods (e.g. adaptive rejection sampling) that can directly return independent samples from the distribution, and these are free from the problem of autocorrelated samples that is inherent in MCMC methods.
History
The algorithm was named after Nicholas Metropolis, who authored the 1953 paper Equation of State Calculations by Fast Computing Machines together with Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller and Edward Teller. This paper proposed the algorithm for the case of symmetrical proposal distributions, and W. K. Hastings extended it to the more general case in 1970.[1]
Some controversy exists with regard to credit for development of the algorithm. Metropolis, who was familiar with the computational aspects of the method, had coined the term "Monte Carlo" in an earlier paper with Stanisław Ulam, and led the group in the Theoretical Division that designed and built the MANIAC I computer used in the experiments in 1952. However, prior to 2003, there was no detailed account of the algorithm's development. Shortly before his death, Marshall Rosenbluth attended a 2003 conference at LANL marking the 50th anniversary of the 1953 publication, where he described the algorithm and its development in a presentation titled "Genesis of the Monte Carlo Algorithm for Statistical Mechanics".[2] Further historical clarification is made by Gubernatis in a 2005 journal article[3] recounting the conference. Rosenbluth made it clear that he and his wife Arianna did the work, and that Metropolis played no role in the development other than providing computer time.
This contradicts an account by Edward Teller, who states in his memoirs that the five authors of the 1953 paper worked together for "days (and nights)".[4] In contrast, the detailed account by Rosenbluth credits Teller with a crucial but early suggestion to "take advantage of statistical mechanics and take ensemble averages instead of following detailed kinematics". This, says Rosenbluth, started him thinking about the generalized Monte Carlo approach, a topic which he says he had discussed often with John von Neumann. Arianna recounted (to Gubernatis in 2003) that Augusta Teller started the computer work, but that Arianna herself took it over and wrote the code from scratch. In an oral history recorded shortly before his death,[5] Rosenbluth again credits Teller with posing the original problem, himself with solving it, and Arianna with programming the computer. In terms of reputation, there is little reason to question Rosenbluth's account. In a biographical memoir of Rosenbluth, Freeman Dyson writes:
Many times I came to Rosenbluth, asking him a question [...] and receiving an answer in two minutes. Then it would usually take me a week of hard work to understand in detail why Rosenbluth's answer was right. He had an amazing ability to see through a complicated physical situation and reach the right answer by physical arguments. Enrico Fermi was the only other physicist I have known who was equal to Rosenbluth in his intuitive grasp of physics.[6]
Intuition
For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below.
Metropolis algorithm (symmetric proposal distribution)
Let $f(x)$ be a function that is proportional to the desired probability density $P(x)$ (the target distribution).
Initialization: Choose an arbitrary point $x_0$ to be the first sample, and choose an arbitrary probability density $g(x \mid y)$ (sometimes written $Q(x \mid y)$) that suggests a candidate for the next sample value $x$, given the previous sample value $y$. For the Metropolis algorithm, $g$ must be symmetric; in other words, it must satisfy $g(x \mid y) = g(y \mid x)$. A usual choice is to let $g(x \mid y)$ be a Gaussian distribution centered at $y$, so that points closer to $y$ are more likely to be visited next, making the sequence of samples into a random walk. The function $g$ is referred to as the proposal density or jumping distribution.
For each iteration t:
Generate: generate a candidate $x'$ for the next sample by picking from the distribution $g(x' \mid x_t)$.
Calculate: calculate the acceptance ratio $\alpha = f(x')/f(x_t)$, which will be used to decide whether to accept or reject the candidate. Because $f$ is proportional to the density of $P$, we have that $\alpha = f(x')/f(x_t) = P(x')/P(x_t)$.
Accept or reject: generate a uniform random number $u \in [0, 1]$. If $u \le \alpha$, accept the candidate by setting $x_{t+1} = x'$. If $u > \alpha$, reject the candidate and set $x_{t+1} = x_t$ instead.
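To make these steps concrete, here is a minimal sketch in Python. It is illustrative, not part of the original description: the target f is a hypothetical unnormalized standard-normal kernel, the proposal is a Gaussian random walk, and names such as metropolis and step are placeholders.

```python
import numpy as np

def f(x):
    # Hypothetical unnormalized target: proportional to a standard normal density.
    return np.exp(-0.5 * x**2)

def metropolis(n_samples, x0=0.0, step=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(n_samples)
    x = x0
    for t in range(n_samples):
        x_new = x + rng.normal(scale=step)  # symmetric proposal g(x' | x_t)
        alpha = f(x_new) / f(x)             # acceptance ratio f(x') / f(x_t)
        if rng.uniform() <= alpha:          # accept with probability min(1, alpha)
            x = x_new
        samples[t] = x                      # on rejection, the old state repeats
    return samples

samples = metropolis(10_000)
```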
Compared with an algorithm like adaptive rejection sampling[7] that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages:
The samples are correlated. Even though over the long term they do correctly follow , a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that if we want a set of independent samples, we have to throw away the majority of samples and only take every nth sample, for some value of n (typically determined by examining the autocorrelation between adjacent samples). Autocorrelation can be reduced by increasing the jumping width (the average size of a jump, which is related to the variance of the jumping distribution), but this will also increase the likelihood of rejection of the proposed jump. Too large or too small a jumping size will lead to a slow-mixing Markov chain, i.e. a highly correlated set of samples, so that a very large number of samples will be needed to get a reasonable estimate of any desired property of the distribution.
Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary,[8] where an initial number of samples (e.g. the first 1,000 or so) are thrown away.
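Both issues can be illustrated with the metropolis sketch above; the burn-in length and thinning interval below are arbitrary values chosen for the example, not recommendations:

```python
# Start the chain at a deliberately poor point far in the tail, discard a
# burn-in segment, and keep every 10th draw to reduce autocorrelation.
chain = metropolis(50_000, x0=10.0)
burn_in, thin = 1_000, 10      # illustrative values, tuned in practice by
kept = chain[burn_in::thin]    # inspecting trace plots and autocorrelations
```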
On the other hand, most simple rejection sampling methods suffer from the "curse of dimensionality", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from hierarchical Bayesian models and other high-dimensional statistical models used nowadays in many disciplines.
In multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the right jumping distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as Gibbs sampling, involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. This is especially applicable when the multivariate distribution is composed of a set of individual random variables in which each variable is conditioned on only a small number of other variables, as is the case in most typical hierarchical models. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are the adaptive rejection sampling methods,[7][9][10][11] the adaptive rejection Metropolis sampling algorithm[12] or its improvements,[13][14] a simple one-dimensional Metropolis–Hastings step, or slice sampling.
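As a sketch of the component-wise idea, here is a one-dimensional Metropolis step applied coordinate by coordinate (sometimes called Metropolis-within-Gibbs). The target is assumed to be supplied as a log-density log_f, a common device to avoid numerical underflow; all names and the example target are illustrative:

```python
import numpy as np

def metropolis_within_gibbs(log_f, x0, n_samples, step=1.0, rng=None):
    """Update one coordinate at a time with a 1-D random-walk Metropolis step,
    holding the remaining coordinates fixed."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    out = np.empty((n_samples, x.size))
    for t in range(n_samples):
        for i in range(x.size):                    # sweep over coordinates
            proposal = x.copy()
            proposal[i] += rng.normal(scale=step)  # perturb coordinate i only
            # Accept/reject in log space to avoid numerical underflow.
            if np.log(rng.uniform()) <= log_f(proposal) - log_f(x):
                x = proposal
        out[t] = x
    return out

# Example target: a hypothetical 2-D Gaussian with correlated coordinates.
cov_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
log_f = lambda x: -0.5 * x @ cov_inv @ x
draws = metropolis_within_gibbs(log_f, x0=[0.0, 0.0], n_samples=5_000)
```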
Formal derivation
The Metropolis–Hastings algorithm works by constructing a Markov process whose stationary distribution is the target distribution $P(x)$. This requires two conditions:
existence of stationary distribution: there must exist a stationary distribution $\pi(x)$. A sufficient but not necessary condition is detailed balance, which requires that each transition $x \to x'$ is reversible: for every pair of states $x, x'$, the probability of being in state $x$ and transitioning to state $x'$ must be equal to the probability of being in state $x'$ and transitioning to state $x$; that is, $\pi(x) P(x' \mid x) = \pi(x') P(x \mid x')$.
uniqueness of stationary distribution: the stationary distribution must be unique. This is guaranteed by ergodicity of the Markov process, which requires that every state must (1) be aperiodic—the system does not return to the same state at fixed intervals; and (2) be positive recurrent—the expected number of steps for returning to the same state is finite.
Metropolis–Hastings involves designing such a Markov process by choosing its transition probabilities so that the stationary distribution is $\pi(x) = P(x)$. The derivation starts from the condition of detailed balance:
$$P(x' \mid x)\, P(x) = P(x \mid x')\, P(x'),$$
which is re-written as
$$\frac{P(x' \mid x)}{P(x \mid x')} = \frac{P(x')}{P(x)}.$$
The approach is to separate the transition into two sub-steps: the proposal and the acceptance-rejection. The proposal distribution $g(x' \mid x)$ is the conditional probability of proposing a state $x'$ given the current state $x$, and the acceptance probability $A(x', x)$ is the probability of accepting the proposed state $x'$, so that the transition probability can be written as the product
$$P(x' \mid x) = g(x' \mid x)\, A(x', x).$$
Inserting this relation in the previous equation, we have
$$\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)}.$$
The next step in the derivation is to choose an acceptance probability that fulfils the condition above. One common choice is the Metropolis choice:
$$A(x', x) = \min\!\left(1,\; \frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)}\right).$$
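As a quick check (not spelled out in the text above) that this choice fulfils the condition: assume without loss of generality that $P(x')\, g(x \mid x') \le P(x)\, g(x' \mid x)$ (the opposite case is symmetric). Then $A(x', x) = \frac{P(x')\, g(x \mid x')}{P(x)\, g(x' \mid x)}$ while $A(x, x') = 1$, so
$$\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)},$$
which is exactly the condition derived above.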
The Metropolis–Hastings algorithm thus consists of the following:
Initialise: pick an initial state $x_0$, and set $t = 0$.
Iterate:
Generate: randomly generate a candidate state $x'$ according to $g(x' \mid x_t)$;
Calculate: calculate the acceptance probability $A(x', x_t) = \min\!\left(1,\; \frac{P(x')}{P(x_t)}\, \frac{g(x_t \mid x')}{g(x' \mid x_t)}\right)$;
Accept or reject: generate a uniform random number $u \in [0, 1]$; if $u \le A(x', x_t)$, accept the new state and set $x_{t+1} = x'$; if $u > A(x', x_t)$, reject the new state and copy the old state forward, setting $x_{t+1} = x_t$;
Increment: set $t = t + 1$.
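The general algorithm can be sketched in Python as follows; the names metropolis_hastings, f, g, and sample_g are illustrative assumptions: f is any function proportional to the target density, g(a, b) evaluates the proposal density of a given b, and sample_g(b, rng) draws a proposal from it.

```python
import numpy as np

def metropolis_hastings(f, g, sample_g, x0, n_samples, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    samples = []
    for t in range(n_samples):
        x_new = sample_g(x, rng)  # propose x' ~ g(. | x_t)
        # A(x', x_t) = min(1, [f(x') g(x_t | x')] / [f(x_t) g(x' | x_t)])
        a = (f(x_new) * g(x, x_new)) / (f(x) * g(x_new, x))
        if rng.uniform() <= a:
            x = x_new             # accept the candidate
        samples.append(x)         # on rejection the old state is copied forward
    return np.array(samples)

# Illustrative use: exponential target with an asymmetric log-normal proposal,
# where the g(x_t | x') / g(x' | x_t) correction actually matters.
from scipy.stats import lognorm
f = lambda x: np.exp(-x) if x > 0 else 0.0
g = lambda a, b: lognorm.pdf(a, s=0.5, scale=b)           # density of a given b
sample_g = lambda b, rng: b * np.exp(0.5 * rng.normal())  # log-normal around b
draws = metropolis_hastings(f, g, sample_g, x0=1.0, n_samples=10_000)
```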
Use in numerical integration
A common use of the Metropolis–Hastings algorithm is to compute an integral of the form
$$\int_\Omega A(x)\, P(x)\, dx,$$
where $A(x)$ is an arbitrary function of interest. For example, consider a statistic $E(x)$ and its probability distribution $P(E)$, which is a marginal distribution. Suppose that the goal is to estimate $P(E)$ for $E$ on the tail of $P(E)$. Formally, $P(E)$ can be written as
$$P(E) = \int_\Omega P(E \mid x)\, P(x)\, dx = \int_\Omega \delta\big(E - E(x)\big)\, P(x)\, dx,$$
and, thus, estimating $P(E)$ amounts to estimating the expected value of an indicator function of the tail event.
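In code, such an integral is estimated by averaging the function of interest over the chain. A sketch under the same illustrative setup as before (standard-normal target via the metropolis function above, with $A(x) = x^2$, whose exact expectation under that target is 1):

```python
# Estimate the integral of A(x) P(x) dx as the sample mean of A over the
# (burned-in, thinned) chain, reusing the metropolis sketch above.
samples = metropolis(100_000)[1_000::10]
estimate = np.mean(samples**2)   # A(x) = x**2; exact value is 1
```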
Step-by-step instructions
Suppose that the most recent value sampled is $x_t$. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state $x'$ with probability density $g(x' \mid x_t)$ and calculate a value
$$a = a_1 a_2,$$
where
$$a_1 = \frac{P(x')}{P(x_t)}$$
is the probability ratio between the proposed sample $x'$ and the previous sample $x_t$, and
$$a_2 = \frac{g(x_t \mid x')}{g(x' \mid x_t)}$$
is the ratio of the proposal density in the two directions (from $x_t$ to $x'$ and conversely); this ratio is equal to 1 if the proposal density is symmetric. The new state $x_{t+1}$ is then chosen as follows: if $a \ge 1$, set $x_{t+1} = x'$; otherwise, set $x_{t+1} = x'$ with probability $a$ and $x_{t+1} = x_t$ with probability $1 - a$.
See also
Detailed balance
Genetic algorithms
Gibbs sampling
Mean field particle methods
Metropolis-adjusted Langevin algorithm
Metropolis light transport
Multiple-try Metropolis
Parallel tempering
Preconditioned Crank–Nicolson algorithm
Sequential Monte Carlo
Simulated annealing