Expectation maximization (EM) algorithm note


I first went through the derivation of the EM algorithm before the pattern recognition class, but when reading "Statistical Learning Methods" I did not have the patience to go over it more than a few times; it felt too theoretical and at the time I did not really understand it. Later, while studying LDA, I wanted to implement the EM algorithm myself and found I had forgotten it, which shows my earlier study was not careful or deep enough, so I am writing some notes now. This article is a set of notes taken after reading a few blogs and "Statistical Learning Methods"; it is only a personal record, and many passages are quoted directly.

First, preliminaries

1. Iteration

The EM algorithm itself can be understood as an iterative algorithm. A very abstract and simple description of iteration: suppose we have two equations a = f(b) and b = g(a) that need to be solved together. We can assign a random initial value to a, compute b = g(a) from it, then use that b to compute a new a = f(b), and repeat back and forth until a and b essentially stop changing.
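As a concrete toy illustration of this back-and-forth idea, here is a minimal Python sketch of such an iteration; the functions f and g are made up for the example.

```python
# Minimal sketch of the back-and-forth iteration a = f(b), b = g(a).
# f and g below are made-up examples; any pair with a stable fixed point works.
def f(b):
    return 0.5 * b + 1.0   # a = f(b)

def g(a):
    return 0.5 * a + 1.0   # b = g(a)

a = 0.0                     # arbitrary starting value for a
for _ in range(100):
    b = g(a)                # update b from the current a
    a_new = f(b)            # update a from the new b
    if abs(a_new - a) < 1e-9:   # stop when a barely changes
        a = a_new
        break
    a = a_new

print(a, b)                 # for these f and g the iteration converges to a = b = 2.0
```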

2. Hidden-variable problems

The EM algorithm is well suited to problems involving hidden (latent) variables. Here is an example cited from "Statistical Learning Methods" (a weakened version of pLSA):

eg. There are three coins, denoted A, B, C, whose probabilities of coming up heads are π, p, q respectively. First toss coin A; if it shows heads, toss coin B next, otherwise toss coin C. Record the final outcome as 1 for heads and 0 for tails. After n independent repetitions of this experiment we obtain a sequence of results y = (y1, y2, ..., yn).

Here y = (y1, y2, ..., yn)^T is called the observed variable, but there is also a variable that cannot be observed directly yet needs to be known, namely the result of tossing coin A, which can be recorded as z = (z1, z2, ..., zn)^T. The unknown parameters can be written collectively as θ = (π, p, q). With the notation above, the distribution of a single observation y is:

    P(y; θ) = π · p^y · (1 − p)^(1−y) + (1 − π) · q^y · (1 − q)^(1−y)
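As a hypothetical illustration, here is a minimal Python sketch of this model and of evaluating the observed-data log-likelihood; the parameter values are made up.

```python
import math
import random

def prob_y(y, pi, p, q):
    # P(y; theta): marginalize over the hidden result z of tossing coin A
    return pi * p**y * (1 - p)**(1 - y) + (1 - pi) * q**y * (1 - q)**(1 - y)

def simulate(n, pi, p, q):
    # Generate n observations; the toss of A (the hidden variable z) is discarded.
    ys = []
    for _ in range(n):
        z = 1 if random.random() < pi else 0      # hidden: heads of A -> toss B, tails -> toss C
        head_prob = p if z == 1 else q
        ys.append(1 if random.random() < head_prob else 0)
    return ys

ys = simulate(1000, pi=0.4, p=0.6, q=0.7)          # made-up "true" parameters
log_likelihood = sum(math.log(prob_y(y, 0.4, 0.6, 0.7)) for y in ys)
```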

Having obtained the likelihood function of y, the first thought is to estimate the parameters by maximum likelihood, so the general steps of maximum likelihood estimation (MLE) are reviewed below:

General steps for finding the maximum likelihood estimate: (1) write out the likelihood function; (2) take the logarithm of the likelihood function and simplify; (3) set the derivative to 0 to obtain the likelihood equation; (4) solve the likelihood equation; the resulting parameters are the estimates we want.
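To make the recipe concrete, here is a small sketch for the simplest case, a single coin with unknown heads probability p; the numbers are made up for illustration.

```python
import math

# Worked MLE example (single coin, assumed for illustration): observe x heads in n tosses.
n, x = 100, 37

def log_likelihood(p):
    # Steps 1-2 of the recipe: likelihood p^x (1-p)^(n-x), then take the log.
    return x * math.log(p) + (n - x) * math.log(1 - p)

# Steps 3-4: setting the derivative x/p - (n-x)/(1-p) to zero gives p_hat = x/n.
p_hat = x / n
# Numerical check: p_hat beats every other candidate on a grid.
assert all(log_likelihood(p_hat) >= log_likelihood(p) for p in [i / 100 for i in range(1, 100)])
print(p_hat)   # 0.37
```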

In fact, maximum likelihood can be thought of like this: suppose we already knew θ; with θ known, generating y is natural, and if a particular result yi appears many times, then P(yi | θ) must be relatively large. Now think in the other direction: we already know y, so the parameter value that makes the observed results most probable is the parameter we estimate.

Unfortunately, for this problem the steps above have no analytic solution, so we have to turn to the EM algorithm.

3. Jensen's inequality

Let us review some concepts from optimization theory. Let f be a function whose domain is the real numbers. If f''(x) ≥ 0 for all real x, then f is a convex function. When x is a vector, f is convex if its Hessian matrix H is positive semi-definite (H ⪰ 0). If f''(x) > 0 (or H ≻ 0), then f is strictly convex.

Jensen's inequality can be stated as follows:

If f is a convex function and X is a random variable, then E[f(X)] ≥ f(E[X]).

In particular, if f is strictly convex, then E[f(X)] = f(E[X]) holds if and only if X = E[X] with probability 1, that is, X is a constant.

This is clearer with a diagram:

In the figure, the solid curve f is the convex function, and X is a random variable that takes the value a with probability 0.5 and the value b with probability 0.5 (just like a coin toss). The expected value of X is the midpoint of a and b, and the figure shows that f(E[X]) lies below E[f(X)], the midpoint of f(a) and f(b).

f is a (strictly) concave function if and only if −f is a (strictly) convex function.

When Jensen's inequality is applied to a concave function, the direction of the inequality is reversed: E[f(X)] ≤ f(E[X]).
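A quick numerical illustration of both directions in Python (a sanity check, not a proof):

```python
import math
import random

# Samples of a random variable X.
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
mean_x = sum(xs) / len(xs)

# Convex f(x) = x^2:  E[f(X)] >= f(E[X])
print(sum(x * x for x in xs) / len(xs) >= mean_x * mean_x)              # True

# Concave f(x) = log(x) on positive values: the inequality is reversed.
pos = [abs(x) + 0.1 for x in xs]
mean_pos = sum(pos) / len(pos)
print(sum(math.log(x) for x in pos) / len(pos) <= math.log(mean_pos))   # True
```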

Second, EM algorithm

In the example above there is a hidden variable that we cannot observe directly, namely the result of tossing coin A. But if we knew, for every final outcome, what the toss of A had been, we could easily use maximum likelihood (in this example simple intuition also works) to obtain estimates of p and q.

eg.
1. Suppose the tosses of coin A came up heads x times and tails n − x times (under this hypothesis the estimate of π is x/n). Then, by counting heads and tails among the x final results that followed a head of A, we get an estimate of p, and similarly for q.
2. After obtaining values for p and q, we can turn the question around: how do we know the earlier hypothesis about A was right? With p and q known, the likelihood function can be maximized again, giving a new estimate of π.
3. Under the new value of π, we can make new estimates of p and q. If this back-and-forth finally converges, we obtain an estimate of the parameters θ. (A code sketch of this alternating procedure is given below.)
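Below is a minimal Python sketch of this idea for the three-coin model. Note that it uses the soft, posterior-weighted version of the counting described above (which is what the EM updates for this model look like), the initial guesses are arbitrary, and this is an illustrative sketch rather than code from the book.

```python
def em_three_coins(ys, pi, p, q, n_iter=100):
    # EM for the three-coin model; (pi, p, q) are initial guesses.
    for _ in range(n_iter):
        # E-step: posterior probability that coin A came up heads for each observation y_i.
        mu = []
        for y in ys:
            num = pi * p**y * (1 - p)**(1 - y)
            den = num + (1 - pi) * q**y * (1 - q)**(1 - y)
            mu.append(num / den)
        # M-step: re-estimate the parameters using the posterior weights as soft counts.
        n = len(ys)
        pi = sum(mu) / n
        p = sum(m * y for m, y in zip(mu, ys)) / sum(mu)
        q = sum((1 - m) * y for m, y in zip(mu, ys)) / sum(1 - m for m in mu)
    return pi, p, q

# e.g. with the simulated data from the earlier sketch and rough initial guesses:
# print(em_three_coins(ys, pi=0.5, p=0.5, q=0.8))
```

Note that EM only guarantees convergence to a local optimum, so for a model like this one different initial guesses can lead to different final estimates.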

The above is very abstract; the details are given below (much of the following is quoted from the blog post "(EM algorithm) The EM Algorithm", reference 2).

Suppose we are given a training set of m independent samples x^(1), ..., x^(m), and the model contains a hidden variable z. The log-likelihood of the samples is:

    ℓ(θ) = Σ_{i=1..m} log p(x^(i); θ) = Σ_{i=1..m} log Σ_{z^(i)} p(x^(i), z^(i); θ)

The first step takes the logarithm of the maximum-likelihood objective; the second step writes p(x^(i); θ) by summing the joint distribution over every possible value of the hidden class z^(i) for that sample (summing over z gives the marginal distribution of x). However, it is hard to obtain θ directly because of the hidden variable z; once z is determined, the maximization becomes easy.

EM is an effective method for solving this kind of hidden-variable optimization problem. Since ℓ(θ) cannot be maximized directly, we repeatedly construct a lower bound on it (E-step) and then optimize that lower bound (M-step). This sentence is still abstract; see below.

For each sample i, let Q_i be some distribution over that sample's hidden variable z^(i), satisfying Σ_z Q_i(z) = 1 and Q_i(z) ≥ 0. (If z is continuous, Q_i is a probability density and the sums below must be replaced by integrals.) For example, if we want to cluster the students in a class and the hidden variable z is height, then Q_i is a continuous (Gaussian) distribution; if the hidden variable is male/female, then Q_i is a Bernoulli distribution (the π in the three-coin example above can be understood in this way: for each i, z_i follows a Bernoulli(π) distribution).

From the description above we can obtain the following formulas:

    ℓ(θ) = Σ_i log p(x^(i); θ)
         = Σ_i log Σ_{z^(i)} p(x^(i), z^(i); θ)                                   (1)
         = Σ_i log Σ_{z^(i)} Q_i(z^(i)) · [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ]     (2)
         ≥ Σ_i Σ_{z^(i)} Q_i(z^(i)) · log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ]     (3)

(1) to (2) is straightforward: the numerator and denominator are both multiplied by the same function Q_i(z^(i)).

(2) to (3) uses Jensen's inequality.

Here log is a concave function (its second derivative, −1/x², is less than 0), and the inner sum Σ_z Q_i(z) · [p(x^(i), z; θ) / Q_i(z)] can be understood as an expectation of p(x^(i), z; θ)/Q_i(z) with z drawn from Q_i, so the concave form of Jensen's inequality applies. Having obtained (3), we have a lower bound on the log-likelihood ℓ(θ); if this lower bound can be made equal to ℓ(θ) at the current parameters, we can use the right-hand side of the inequality in place of ℓ(θ).

There are many possible choices of Q_i; which one is better? Suppose θ is already given; then the value of the lower bound is determined by Q_i(z^(i)) and p(x^(i), z^(i); θ) (in fact only Q_i is unknown here). With θ fixed, Jensen's inequality tells us that equality holds when the "random variable" inside the expectation is a constant, namely:

    p(x^(i), z^(i); θ) / Q_i(z^(i)) = c

Here c is a constant that does not depend on z^(i) (it does depend on x^(i), so it differs across different i, but for a fixed i it is constant). Deriving further from this equation, and using Σ_z Q_i(z) = 1, we get Σ_z p(x^(i), z; θ) = c, and therefore the following formula:

    Q_i(z^(i)) = p(x^(i), z^(i); θ) / Σ_z p(x^(i), z; θ) = p(x^(i), z^(i); θ) / p(x^(i); θ) = p(z^(i) | x^(i); θ)
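A quick numerical check, with made-up numbers and a single observation of the three-coin model, that the bound from (3) is indeed below ℓ(θ) for an arbitrary Q and becomes tight when Q is the posterior:

```python
import math

pi, p, q, y = 0.4, 0.6, 0.7, 1                      # made-up parameters and one observation

def joint(z):                                        # p(y, z; theta) for this single observation
    return (pi * p**y * (1 - p)**(1 - y)) if z == 1 else ((1 - pi) * q**y * (1 - q)**(1 - y))

marginal = joint(0) + joint(1)                       # p(y; theta)
ell = math.log(marginal)                             # the true log-likelihood term

def lower_bound(Q1):                                 # sum_z Q(z) log( p(y, z; theta) / Q(z) )
    Q = {1: Q1, 0: 1 - Q1}
    return sum(Q[z] * math.log(joint(z) / Q[z]) for z in (0, 1))

posterior = joint(1) / marginal
print(lower_bound(0.9) <= ell)                       # arbitrary Q: strict lower bound -> True
print(abs(lower_bound(posterior) - ell) < 1e-12)     # Q = posterior: the bound is tight -> True
```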

At this point, with θ fixed, we have worked out how to compute Q_i: it is simply the posterior probability p(z^(i) | x^(i); θ), which settles the question of how to choose Q_i. This step is the E-step, which establishes the lower bound on ℓ(θ). The following M-step then, with Q_i fixed, adjusts θ so as to raise that lower bound as far as possible. The general EM algorithm is therefore:

Repeat until convergence {

    (E-step) For each i, compute

        Q_i(z^(i)) := p(z^(i) | x^(i); θ)

    (M-step) Compute

        θ := argmax_θ Σ_i Σ_{z^(i)} Q_i(z^(i)) · log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ]

}
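As one concrete, hedged instance of this generic loop (not an example from the original note), here is a small Python sketch of EM for a two-component one-dimensional Gaussian mixture. The data, the initialization, and the closed-form M-step updates are specific to this illustration.

```python
import math
import random

def em_gmm_1d(xs, n_iter=200):
    # EM for a two-component 1-D Gaussian mixture: one instance of the generic
    # E-step / M-step loop above. Initial values are rough guesses.
    w, mu1, mu2, var1, var2 = 0.5, min(xs), max(xs), 1.0, 1.0

    def normal_pdf(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    for _ in range(n_iter):
        # E-step: Q_i(z=1) = p(z=1 | x_i; theta), the posterior responsibility.
        r = []
        for x in xs:
            a = w * normal_pdf(x, mu1, var1)
            b = (1 - w) * normal_pdf(x, mu2, var2)
            r.append(a / (a + b))
        # M-step: maximize the lower bound; for a Gaussian mixture this has a
        # closed form (weighted means and variances).
        n1 = sum(r)
        n2 = len(xs) - n1
        w = n1 / len(xs)
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        var1 = sum(ri * (x - mu1) ** 2 for ri, x in zip(r, xs)) / n1 + 1e-6
        var2 = sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, xs)) / n2 + 1e-6
    return w, mu1, mu2, var1, var2

# Made-up data: a mixture of two Gaussians.
data = [random.gauss(-2, 1) for _ in range(300)] + [random.gauss(3, 1) for _ in range(300)]
print(em_gmm_1d(data))
```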

How do we prove that the EM algorithm converges? In fact, it suffices to show the following chain of inequalities:

    ℓ(θ^(t+1)) ≥ Σ_i Σ_{z^(i)} Q_i^(t)(z^(i)) · log [ p(x^(i), z^(i); θ^(t+1)) / Q_i^(t)(z^(i)) ]   (4)
               ≥ Σ_i Σ_{z^(i)} Q_i^(t)(z^(i)) · log [ p(x^(i), z^(i); θ^(t)) / Q_i^(t)(z^(i)) ]     (5)
               = ℓ(θ^(t))                                                                           (6)

This shows that ℓ(θ) increases monotonically over the iterations, so it eventually converges (in general to a local maximum of the likelihood). Specifically:

(4) holds for all parameters θ, because the lower bound with Q fixed at Q^(t) is valid everywhere; its equality condition is satisfied only at the θ for which Q was adjusted in the E-step (namely θ^(t)). At θ^(t+1) the two sides are not necessarily equal, so we can only write "≥" here, not "=".

(4) to (5) follows from the definition of the M-step: θ^(t+1) is obtained by fixing Q^(t) and adjusting θ to maximize the lower bound, so the bound evaluated at θ^(t+1) is at least as large as its value at θ^(t).

(5) to (6) holds because of the equality condition guaranteed by the preceding E-step: Q^(t) was chosen precisely so that the bound is tight at θ^(t), and there it equals ℓ(θ^(t)).

In other words, the E-step pulls the lower bound up to the same height as ℓ(θ) at the current parameter value (making them equal there); we then find that the lower bound can still be raised, so after the M-step the lower bound goes up, but it is no longer at the same height as ℓ(θ) at the new parameter value (the bound sits strictly below it). The next E-step again pulls the lower bound up to touch ℓ(θ) at this new point, and the process repeats until the lower bound (and with it ℓ(θ)) cannot be raised any further, i.e. until convergence.
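A small numerical sanity check of this monotonicity, using the same three-coin updates as the earlier sketch (repeated here so the snippet is self-contained; the toy data and starting point are made up):

```python
import math

def loglik(ys, pi, p, q):
    # Observed-data log-likelihood of the three-coin model.
    return sum(math.log(pi * p**y * (1 - p)**(1 - y)
                        + (1 - pi) * q**y * (1 - q)**(1 - y)) for y in ys)

def em_step(ys, pi, p, q):
    # One E-step + M-step of the three-coin EM (same updates as in the earlier sketch).
    mu = [pi * p**y * (1 - p)**(1 - y)
          / (pi * p**y * (1 - p)**(1 - y) + (1 - pi) * q**y * (1 - q)**(1 - y)) for y in ys]
    return (sum(mu) / len(ys),
            sum(m * y for m, y in zip(mu, ys)) / sum(mu),
            sum((1 - m) * y for m, y in zip(mu, ys)) / sum(1 - m for m in mu))

ys = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]          # made-up toy observations
theta = (0.4, 0.6, 0.7)                       # arbitrary starting point
prev = loglik(ys, *theta)
for _ in range(50):
    theta = em_step(ys, *theta)
    cur = loglik(ys, *theta)
    assert cur >= prev - 1e-10                # the log-likelihood never decreases
    prev = cur
```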

One of the referenced blogs has a very illustrative figure of this process, quoted here:

[Figure omitted: the lower-bound curve is repeatedly raised to touch ℓ(θ) at the current point and then maximized.]

If we define

    J(Q, θ) = Σ_i Σ_{z^(i)} Q_i(z^(i)) · log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ]

then from the preceding derivation we know that ℓ(θ) ≥ J(Q, θ), and EM can be regarded as coordinate ascent on J: the E-step fixes θ and maximizes J over Q, while the M-step fixes Q and maximizes J over θ.

This is the basic principle of the EM algorithm.

Reference

1. Statistical Learning Methods

2. http://www.cnblogs.com/jerrylead/archive/2011/04/06/2006936.html

3. http://blog.csdn.net/zouxy09/article/details/8537620

4. Andrew Ng Course
