This article is mainly translated from: MCMC: The Metropolis-Hastings Sampler

In the previous article, we discussed how the Metropolis sampling algorithm uses a Markov chain to sample from a complex, or unnormalized, target probability distribution. The Metropolis algorithm first proposes a new state \(x^*\) for the Markov chain based on the previous state \(x^{(t-1)}\); the proposal is drawn from a proposal distribution \(q(x^*|x^{(t-1)})\). The algorithm then accepts or rejects \(x^*\) based on the value of the target distribution at \(x^*\).

A limiting condition of the Metropolis sampling method is that the proposal distribution \(q(x^*|x^{(t-1)})\) must be symmetric. This limitation stems from the use of a Markov chain for sampling: a necessary condition for drawing samples from the chain's stationary distribution is that at any time \(t\), the transition probability of \(x^{(t-1)}\rightarrow x^{(t)}\) must equal that of \(x^{(t)}\rightarrow x^{(t-1)}\), a condition known as **reversibility** or **detailed balance**. However, a symmetric proposal distribution is unsuitable for many problems, for example when we want to sample from a distribution defined on \([0,\infty)\).

To enable the use of asymmetric proposal distributions, the Metropolis-Hastings algorithm introduces an additional correction factor \(c\), defined from the proposal distribution:

\(c = \frac{q(x^{(t-1)}|x^*)}{q(x^*|x^{(t-1)})}\)

The correction factor adjusts the transition operator so that the transition probability of \(x^{(t-1)}\rightarrow x^{(t)}\) equals that of \(x^{(t)}\rightarrow x^{(t-1)}\), regardless of the proposal distribution.

The Metropolis-Hastings algorithm is implemented essentially the same way as the Metropolis sampler, except that the correction factor is used when computing the acceptance probability \(\alpha\). To draw \(M\) samples, the Metropolis-Hastings algorithm proceeds as follows:

1. Set \(t = 0\)

2. Generate an initial state \(x^{(0)} \sim \pi^{(0)}\)

3. Repeat until \(t = M\):

Set \(t = t + 1\)

Generate a proposal state \(x^*\) from \(q(x|x^{(t-1)})\)

Calculate the proposal correction factor \(c = \frac{q(x^{(t-1)}|x^*)}{q(x^*|x^{(t-1)})}\)

Calculate the acceptance probability \(\alpha = \min\left(1, \frac{p(x^*)}{p(x^{(t-1)})} \times c\right)\)

Draw a random number \(u\) from \(\mathrm{Unif}(0,1)\)

If \(u \leq \alpha\), accept the proposal state \(x^*\) and set \(x^{(t)} = x^*\)

Else, set \(x^{(t)} = x^{(t-1)}\)
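The steps above can be sketched in generic Python. The target \(p\) and proposal \(q\) here are toy stand-ins chosen only for illustration (both exponential, so every proposal happens to be accepted), not distributions from the article:

```python
import math
import random

def metropolis_hastings(p, q_sample, q_pdf, x0, m):
    """Draw m samples from unnormalized target p using proposal q."""
    x = x0
    samples = []
    for _ in range(m):
        x_star = q_sample(x)                        # propose from q(x|x_prev)
        c = q_pdf(x, x_star) / q_pdf(x_star, x)     # correction factor
        alpha = min(1.0, p(x_star) / p(x) * c)      # acceptance probability
        if random.random() <= alpha:
            x = x_star                              # accept the proposal
        samples.append(x)                           # on reject, keep old state
    return samples

random.seed(0)
# Toy target: unnormalized Exp(1) density on (0, inf)
p = lambda v: math.exp(-v) if v > 0 else 0.0
# Independent Exp(1) proposal: ignores the conditioning argument
q_sample = lambda x_prev: random.expovariate(1.0)
q_pdf = lambda v, given: math.exp(-v) if v > 0 else 0.0
samples = metropolis_hastings(p, q_sample, q_pdf, x0=1.0, m=5000)
```

With this particular pairing the correction factor exactly cancels the density ratio, so the sampler reduces to drawing independent exponentials; the point is only to show where \(c\) and \(\alpha\) enter the loop.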

The Metropolis-Hastings algorithm is often regarded as a generalization of the Metropolis algorithm: if the proposal distribution is symmetric, the correction factor equals 1 and the Metropolis sampler is recovered.

Translator's note:

The article above only gives the algorithm without explaining the underlying principle, so here is my own rough understanding:

If we want to use a Markov chain to sample from a target distribution \(\pi(x)\), we need to ensure that the stationary distribution of the Markov chain is \(\pi(x)\). Define the transition operator of the Markov chain as \(T(a,b)\), the transition probability of \(a\rightarrow b\). Then, by the detailed balance condition for the stationary distribution, we want to achieve:

\(\pi(a)\cdot T(a,b) = \pi(b)\cdot T(b,a)\)

As in the Metropolis algorithm, we introduce a proposal transition operator \(Q(a,b)\) and define a new acceptance probability:

\(\alpha = \min\left(1, \frac{Q(b,a)}{Q(a,b)}\cdot\frac{\pi(b)}{\pi(a)}\right)\)

Then the transition probability of \(a\rightarrow b\) can be computed as follows:

\(\pi(a)\cdot T(a,b) = \pi(a) Q(a,b)\cdot\alpha\)

\(= \pi(a) Q(a,b)\cdot\min\left(1, \frac{Q(b,a)}{Q(a,b)}\cdot\frac{\pi(b)}{\pi(a)}\right)\)

\(= \min\left(\pi(a) Q(a,b),\ \pi(b) Q(b,a)\right)\)

\(= \pi(b) Q(b,a)\cdot\min\left(\frac{\pi(a) Q(a,b)}{\pi(b) Q(b,a)}, 1\right)\)

\(= \pi(b)\cdot T(b,a)\)

Indeed, detailed balance is achieved without requiring \(Q(a,b) = Q(b,a)\), the symmetry condition demanded by the earlier Metropolis algorithm.
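This derivation can be checked numerically on a small discrete state space. Below is a Python sketch with an arbitrary 3-state target \(\pi\) and an asymmetric proposal matrix \(Q\), both made up for illustration:

```python
# Check pi(a) * T(a,b) == pi(b) * T(b,a) for the MH transition operator.
pi = [0.2, 0.3, 0.5]              # toy target distribution
Q = [[0.1, 0.6, 0.3],             # toy proposal: rows sum to 1, Q != Q transpose
     [0.2, 0.5, 0.3],
     [0.4, 0.4, 0.2]]

def T(a, b):
    """Off-diagonal Metropolis-Hastings transition probability a -> b."""
    alpha = min(1.0, (Q[b][a] / Q[a][b]) * (pi[b] / pi[a]))
    return Q[a][b] * alpha

violations = [abs(pi[a] * T(a, b) - pi[b] * T(b, a))
              for a in range(3) for b in range(3) if a != b]
print(max(violations) < 1e-12)  # True: detailed balance holds
```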

The above shows why the algorithm satisfies detailed balance; now let us think about it from another angle:

We want the stationary distribution of the Markov chain to be the target distribution \(\pi(x)\), which means \(\pi(x)\) must satisfy detailed balance. In general, however, the proposal transition operator \(Q(a,b)\) does not satisfy detailed balance (otherwise there would be no need to compute an acceptance rate), that is:

\(\pi(a)\cdot Q(a,b) \neq \pi(b)\cdot Q(b,a)\)

Thus a correction factor, the acceptance probability \(\alpha\), is needed so that:

\(\pi(a)\cdot Q(a,b)\cdot\alpha(a,b) = \pi(b)\cdot Q(b,a)\cdot\alpha(b,a)\)

How should \(\alpha\) be chosen so that the equation holds? The simplest choice is:

\(\alpha(a,b) = \pi(b)\cdot Q(b,a)\), and

\(\alpha(b,a) = \pi(a)\cdot Q(a,b)\)

Here \(\alpha\) is called the acceptance probability because, after sampling state \(b\) from the proposal distribution \(Q(a,b)\), we accept the sample with probability \(\alpha(a,b)\). In this way the proposal distribution and the correction factor together form a Markov chain that converges to the target distribution.

However, the problem is that \(\alpha(a,b)\) is likely to be small, which would leave the Markov chain stuck in place during sampling: one proposal is rejected, then another, then another... and each step of the chain just repeats the previous sample point. In this case, we can scale both sides of the detailed balance equation, \(\alpha(a,b)\) and \(\alpha(b,a)\), by the same factor until one of them reaches 1. This greatly increases the acceptance rate without breaking detailed balance. The resulting acceptance rate \(\tilde{\alpha}(a,b)\) can therefore be expressed as:

\(\tilde{\alpha}(a,b) = \min\left(\frac{\pi(b) Q(b,a)}{\pi(a) Q(a,b)}, 1\right)\)

The meaning of the above equation is:

1. If \(\pi(b) Q(b,a) > \pi(a) Q(a,b)\), then under proportional scaling \(\alpha(a,b)\) reaches 1 before \(\alpha(b,a)\) does, so \(\min\left(\frac{\pi(b) Q(b,a)}{\pi(a) Q(a,b)}, 1\right) = 1\); the acceptance rate is 1, and we accept the transition \(a\rightarrow b\).

2. If \(\pi(b) Q(b,a) < \pi(a) Q(a,b)\), then \(\alpha(b,a)\) reaches 1 before \(\alpha(a,b)\) does, so \(\min\left(\frac{\pi(b) Q(b,a)}{\pi(a) Q(a,b)}, 1\right) = \frac{\pi(b) Q(b,a)}{\pi(a) Q(a,b)}\), and we accept the transition \(a\rightarrow b\) at this acceptance rate.
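A quick numeric illustration of the two cases, using made-up values for \(\pi\) and \(Q\):

```python
# Case study: pi(b)*Q(b,a) = 0.18 exceeds pi(a)*Q(a,b) = 0.10, so the
# forward move a -> b is always accepted while the reverse is throttled.
pi_a, pi_b = 0.2, 0.6      # arbitrary target values
Q_ab, Q_ba = 0.5, 0.3      # arbitrary asymmetric proposal probabilities

alpha_ab = min(pi_b * Q_ba / (pi_a * Q_ab), 1.0)   # move a -> b
alpha_ba = min(pi_a * Q_ab / (pi_b * Q_ba), 1.0)   # move b -> a

print(alpha_ab)             # 1.0 (case 1: acceptance rate is 1)
print(round(alpha_ba, 3))   # 0.556 (case 2: acceptance rate below 1)
```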

With this transformation, the Metropolis algorithm becomes the Metropolis-Hastings algorithm, and we no longer have to look for a strictly symmetric proposal transition operator.

Example: Sampling from a Bayesian posterior with an improper prior

In many applications, including regression and density estimation, we usually need to determine the parameters \(\theta\) of a hypothesized model \(p(y|\theta)\) so that the model best fits the observed data \(y\). The model function \(p(y|\theta)\) is commonly referred to as the likelihood function. In Bayesian methods, there is usually an explicit prior distribution \(p(\theta)\) over the model parameters that constrains their possible values.

The parameters of the model are determined via the posterior distribution \(p(\theta|y)\), a probability distribution over all possible parameter values given the observations. The posterior can be obtained from Bayes' theorem:

\(p(\theta|y) = \frac{p(y|\theta) p(\theta)}{p(y)}\)

where \(p(y)\) is a normalizing constant that is often difficult to compute directly, because it requires integrating over all possible values of the parameters.

If we use the following model (likelihood function):

\(p(y|\theta) = \mathrm{Gamma}(y; A, B)\), where

\(\mathrm{Gamma}(y; A, B) = \frac{B^A}{\Gamma(A)} y^{A-1} e^{-By}\)

\(\Gamma(\cdot)\) is the gamma function. The parameters of the model are therefore:

\(\theta = [A, B]\)

Parameter \(A\) controls the shape of the distribution, and parameter \(B\) controls its scale. The likelihood surface for \(B=1\) and \(A\) ranging from 0 to 5 is shown below.

Likelihood surface and conditional probability \(p(y|A=2, B=1)\) in green

The conditional probability distribution \(p(y|A=2, B=1)\) is drawn in green on the likelihood surface; in MATLAB it can be plotted with:

plot(0:.1:10, gampdf(0:.1:10,2,1)); % Gamma(2,1)
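The same density can also be evaluated directly in Python (a sketch using math.gamma for \(\Gamma(A)\), useful if MATLAB is not at hand):

```python
import math

def gamma_pdf(y, A, B):
    """Gamma(y; A, B) = B^A / Gamma(A) * y^(A-1) * exp(-B*y)."""
    return (B ** A / math.gamma(A)) * y ** (A - 1) * math.exp(-B * y)

# For A=2, B=1 the density reduces to y * exp(-y):
print(round(gamma_pdf(1.0, 2, 1), 4))  # 0.3679, i.e. exp(-1)

# Sanity check: the density should integrate to about 1 over [0, 50].
dy = 0.001
total = sum(gamma_pdf(i * dy, 2, 1) for i in range(1, 50001)) * dy
```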

Now, let's assume that the model's parameters have the following prior probabilities:

\(p(B=1) = 1\)

and

\(p(A) = \sin(\pi A)^2\)

The first prior states that \(B\) takes only one value (namely 1), so we can treat it as a constant. The second (unconventional) prior states that the probability of \(A\) varies as a sine function. (Note that both priors are called **improper priors** because they do not integrate to 1.) Since \(B\) is a constant, we only need to estimate the value of \(A\).

It turns out that although the normalizing constant \(p(y)\) may be difficult to compute, we can still use the Metropolis-Hastings algorithm to sample from \(p(A|y)\) without knowing \(p(y)\). In particular, we can ignore the normalizing constant \(p(y)\) and sample directly from the unnormalized posterior:

\(p(A|y) \propto p(y|A) p(A)\)
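Concretely, the unnormalized posterior can be coded straight from the formulas above (a Python sketch with \(B\) fixed at 1, as the prior dictates; the function names are mine):

```python
import math

B = 1.0  # fixed by the prior p(B=1) = 1

def likelihood(y, A):
    """Gamma likelihood p(y|A, B=1)."""
    return (B ** A / math.gamma(A)) * y ** (A - 1) * math.exp(-B * y)

def prior(A):
    """Improper prior p(A) = sin(pi*A)^2."""
    return math.sin(math.pi * A) ** 2

def posterior_unnorm(A, y):
    """p(A|y) up to the unknown normalizing constant p(y)."""
    return likelihood(y, A) * prior(A)

# The sine-squared prior (nearly) vanishes at integer A, so the
# unnormalized posterior does too:
print(posterior_unnorm(2.0, 1.5) < 1e-20)   # True
print(posterior_unnorm(2.5, 1.5) > 0.1)     # True
```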

The posterior surface for \(y\) ranging from 0 to 10 is shown below. The blue line on the right of the image indicates the prior over parameter \(A\), \(p(A)\). Suppose we have one data point \(y=1.5\) and want to estimate the posterior \(p(A|y=1.5)\) with the Metropolis-Hastings algorithm. The magenta curve in the figure indicates this target distribution.

Posterior surface, prior distribution (blue), and target distribution (magenta)

With a symmetric proposal distribution such as the normal distribution, sampling from \(p(A|y=1.5)\) would be very inefficient, because the posterior is defined only over the positive real numbers. An asymmetric proposal distribution with the same support allows the algorithm to converge to the posterior much better. One distribution defined on the positive reals is the exponential distribution:

\(q(A) = \mathrm{Exp}(\mu) = \mu e^{-\mu A}\),

This distribution has a single parameter \(\mu\), which controls the location and scale of the density. The target posterior and the proposal distribution (with \(\mu = 5\)) are shown below.

Target posterior p(A|y) and proposal distribution q(A)
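The proposal can be sketched in Python. Note a quirk worth flagging: the formula above writes the density with rate \(\mu\), while MATLAB's exppdf/exprnd in the code below treat \(\mu\) as the mean; this sketch follows the MATLAB convention, since that is what the sampler actually runs:

```python
import math
import random

MU = 5.0  # proposal parameter, used here as the mean (MATLAB exppdf/exprnd convention)

def q_pdf(a, mu=MU):
    """Exponential proposal density on (0, inf) with mean mu."""
    return (1.0 / mu) * math.exp(-a / mu) if a > 0 else 0.0

def q_sample(mu=MU):
    """Draw one proposal; random.expovariate takes the rate 1/mu."""
    return random.expovariate(1.0 / mu)

random.seed(1)
draws = [q_sample() for _ in range(20000)]
print(all(a > 0 for a in draws))  # True: proposals share the posterior's support
```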

We can see that the proposal distribution covers the posterior well. Running the Metropolis-Hastings sampler in the MATLAB code block at the bottom of this article produces the following Markov chain trajectory and sampling result:

Metropolis-Hastings Markov chain and samples

Incidentally, notice that in this example the proposal distribution does not depend on the previous sample, only on the parameter \(\mu\) (see the sampler loop in the MATLAB code below). Each proposal state \(x^*\) is therefore drawn completely independently of the previous state. This is an example of an **independence sampler**, a special case of the Metropolis-Hastings algorithm. Independence samplers are known for their extreme behavior: they perform either very well or very poorly, depending on the choice of proposal distribution and how well it covers the target. Finding such a proposal distribution is often difficult in practice.

Here's the code for the Metropolis-hastings sampler.

% METROPOLIS-HASTINGS BAYESIAN POSTERIOR
rand('seed',12345)

% PRIOR OVER SCALE PARAMETERS
B = 1;

% DEFINE LIKELIHOOD
likelihood = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y))','y','A','B');

% CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE
yy = linspace(0,10,100);
AA = linspace(0.1,5,100);
likeSurf = zeros(numel(yy),numel(AA));
for iA = 1:numel(AA); likeSurf(:,iA) = likelihood(yy(:),AA(iA),B); end;
figure; surf(likeSurf); ylabel('p(y|A)'); xlabel('A'); colormap hot

% DISPLAY CONDITIONAL AT A = 2
hold on;
ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3);
xlim([0 100]); ylim([0 100]); axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend(ly,'p(y|A=2)','Location','Northeast');
hold off; title('p(y|A)');

% DEFINE PRIOR OVER SHAPE PARAMETERS
prior = inline('sin(pi*A).^2','A');

% DEFINE THE POSTERIOR
p = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2','y','A','B');

% CALCULATE AND DISPLAY THE POSTERIOR SURFACE
postSurf = zeros(size(likeSurf));
for iA = 1:numel(AA); postSurf(:,iA) = p(yy(:),AA(iA),B); end;
figure; surf(postSurf); ylabel('y'); xlabel('A'); colormap hot

% DISPLAY THE PRIOR
hold on;
pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3);

% SAMPLE FROM p(A | y = 1.5)
y = 1.5;
target = postSurf(16,:);

% DISPLAY POSTERIOR
psA = plot3(1:100,ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3);
xlim([0 100]); ylim([0 100]); axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast');
hold off; title('p(A|y)');

% INITIALIZE THE METROPOLIS-HASTINGS SAMPLER
% DEFINE PROPOSAL DENSITY
q = inline('exppdf(x,mu)','x','mu');

% MEAN FOR PROPOSAL DENSITY
mu = 5;

% DISPLAY TARGET AND PROPOSAL
figure; hold on;
th = plot(AA,target,'m','linewidth',2);
qh = plot(AA,q(AA,mu),'k','linewidth',2);
legend([th,qh],{'Target, p(A)','Proposal, q(A)'});
xlabel('A');

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
minn = 0.1; maxx = 5;

% INITIALIZE SAMPLER
x = zeros(1,nSamples);
x(1) = mu;
t = 1;

% RUN METROPOLIS-HASTINGS SAMPLER
while t < nSamples
    t = t+1;

    % SAMPLE FROM PROPOSAL
    xStar = exprnd(mu);

    % CORRECTION FACTOR
    c = q(x(t-1),mu)/q(xStar,mu);

    % CALCULATE THE (CORRECTED) ACCEPTANCE RATIO
    alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]);

    % ACCEPT OR REJECT?
    u = rand;
    if u < alpha
        x(t) = xStar;
    else
        x(t) = x(t-1);
    end
end

% DISPLAY MARKOV CHAIN
figure;
subplot(211);
stairs(x(1:t),1:t,'k');
hold on;
hb = plot([0 maxx/2],[burnIn burnIn],'g--','linewidth',2);
ylabel('t'); xlabel('samples, A');
set(gca,'YDir','reverse');
ylim([0 t]); axis tight; xlim([0 maxx]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 100;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end),sampleBins);
bar(sampleBins,counts/sum(counts),'k');
xlabel('samples, A'); ylabel('p(A | y)');
title('Samples');
xlim([0 10])

% OVERLAY TARGET DISTRIBUTION
hold on;
plot(AA,target/sum(target),'m-','LineWidth',2);
legend('Sampled Distribution','Target Posterior')
axis tight
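For readers without MATLAB, the sampler above can be ported to plain Python. This is a sketch, not the article's code: it keeps the same unnormalized posterior, exponential proposal with mean 5, and correction factor, but omits all the plotting, and the variable names are mine:

```python
import math
import random

random.seed(12345)

B = 1.0     # scale parameter fixed by the prior
y = 1.5     # the single observed data point
mu = 5.0    # mean of the exponential proposal

def p(yv, A):
    """Unnormalized posterior: gamma likelihood times sin(pi*A)^2 prior."""
    return (B ** A / math.gamma(A)) * yv ** (A - 1) * math.exp(-B * yv) \
        * math.sin(math.pi * A) ** 2

def q(v):
    """Exponential proposal density with mean mu."""
    return (1.0 / mu) * math.exp(-v / mu)

n_samples, burn_in = 5000, 500
x = [mu]
while len(x) < n_samples:
    x_star = random.expovariate(1.0 / mu)      # sample from the proposal
    c = q(x[-1]) / q(x_star)                   # correction factor
    alpha = min(1.0, p(y, x_star) / p(y, x[-1]) * c)
    x.append(x_star if random.random() < alpha else x[-1])

kept = x[burn_in:]   # discard the burn-in portion of the chain
```

A histogram of `kept` should then approximate the magenta target curve shown earlier.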

Conclusion

Here we explored how the Metropolis-Hastings algorithm generalizes the Metropolis algorithm, allowing complex (unnormalized) probability distributions to be sampled with asymmetric proposal distributions. One drawback of Metropolis-Hastings is that not all proposed samples are accepted, so it wastes valuable computation. This drawback becomes more pronounced when sampling high-dimensional distributions, and it is why Gibbs sampling was introduced. In the next article we will introduce Gibbs sampling, which takes advantage of conditional probabilities and retains every proposed state in the Markov chain.
