EM is one of the algorithms I have always wanted to understand deeply. I first heard of it in the HMM lecture of an NLP class, where EM was used to solve the HMM parameter-estimation problem; later I ran into it again in word alignment for machine translation, and Mitchell's book also mentions that EM can be used in Bayesian networks.
What follows is mainly the full derivation of EM.
1. Jensen's Inequality
Let's review some concepts from optimization theory. Let f be a function whose domain is the real numbers. If $f''(x) \ge 0$ for all real x, then f is a convex function. When x is a vector, f is convex if its Hessian matrix H is positive semi-definite ($H \succeq 0$). If $f''(x) > 0$ (or $H \succ 0$), then f is a strictly convex function.
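As a quick numerical illustration of the vector case (my own sketch, not part of the original derivation; the matrices below are made-up examples), positive semi-definiteness of a Hessian can be checked by inspecting its eigenvalues:

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """Return True if the symmetric matrix H is positive semi-definite."""
    return np.all(np.linalg.eigvalsh(H) >= -tol)

# Hessian of f(x, y) = x^2 + y^2 (a convex function): constant and PSD.
H_convex = np.array([[2.0, 0.0],
                     [0.0, 2.0]])

# Hessian of f(x, y) = x^2 - y^2 (a saddle): not PSD.
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])

print(is_psd(H_convex))  # True  -> f is convex
print(is_psd(H_saddle))  # False -> f is not convex
```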
Jensen's inequality is stated as follows:
If f is a convex function and X is a random variable, then
$$E[f(X)] \ge f(E[X])$$
In particular, if f is strictly convex, then equality holds if and only if $P(X = E[X]) = 1$, that is, X is a constant.
Here we abbreviate $f(E[X])$ as $f(EX)$.
A picture makes this clear:
In the figure, the solid curve f is a convex function and X is a random variable that takes the value a with probability 0.5 and the value b with probability 0.5 (just like a coin toss). The expected value of X is then the midpoint of a and b, and the figure shows that $E[f(X)] \ge f(E[X])$ holds.
f is (strictly) concave if and only if $-f$ is (strictly) convex.
When Jensen's inequality is applied to a concave function, the direction of the inequality is reversed: $E[f(X)] \le f(E[X])$.
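To make this concrete, here is a small sketch (my own illustration, not from the original post) that checks the inequality for the coin-toss random variable described above, using the convex function $f(x) = x^2$ and the concave function $\log$:

```python
import numpy as np

# X takes value a with probability 0.5 and b with probability 0.5,
# as in the coin-toss picture above.
a, b, p = 1.0, 4.0, 0.5
values = np.array([a, b])
probs = np.array([p, 1 - p])

EX = np.sum(probs * values)                # E[X]

# Convex case: f(x) = x^2, so E[f(X)] >= f(E[X]).
E_fX = np.sum(probs * values**2)
print(E_fX, EX**2)        # 8.5 >= 6.25

# Concave case: f(x) = log(x), so E[f(X)] <= f(E[X]).
E_logX = np.sum(probs * np.log(values))
print(E_logX, np.log(EX)) # 0.693... <= 0.916...
```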
2. EM algorithm
Given independent training samples $\{x^{(1)}, \dots, x^{(m)}\}$, we want to find each sample's latent class z such that $p(x, z)$ is maximized. The log-likelihood is:
$$\ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$
The first step takes the logarithm of the likelihood; the second step sums the joint probability over every possible class z of each sample. Maximizing over $\theta$ directly is generally difficult because of the hidden variable z, but once z is determined, the maximization becomes easy.
EM is an effective method for optimization problems with latent variables. Since $\ell(\theta)$ cannot be maximized directly, we repeatedly construct a lower bound on $\ell$ (the E-step) and then optimize that lower bound (the M-step). This sounds abstract; the details follow.
For each sample i, let $Q_i$ denote some distribution over that sample's latent variable z, satisfying $\sum_{z} Q_i(z) = 1$ and $Q_i(z) \ge 0$. (If z is continuous, $Q_i$ is a probability density function and the summations below become integrals.) For example, when clustering the students in a class, if the hidden variable z is height, then $Q_i$ is a continuous Gaussian distribution; if the hidden variable is male/female, then $Q_i$ is a Bernoulli distribution.
From the description above we can obtain the following:
$$\ell(\theta) = \sum_{i} \log p(x^{(i)}; \theta) = \sum_{i} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \qquad (1)$$
$$= \sum_{i} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \qquad (2)$$
$$\ge \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \qquad (3)$$
Going from (1) to (2) is straightforward: inside the log, multiply numerator and denominator by the same function $Q_i(z^{(i)})$. Going from (2) to (3) uses Jensen's inequality, noting that $\log(x)$ is a concave function (its second derivative $-1/x^2$ is less than 0) and that
$$\sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
is the expectation of $p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})$ with respect to $z^{(i)} \sim Q_i$ (recall the Lazy Statistician rule for expectations):
Let Y be a function of the random variable X, $Y = g(X)$, where g is a continuous function. Then:
(1) If X is a discrete random variable with distribution $P(X = x_k) = p_k$, $k = 1, 2, \dots$, and $\sum_k g(x_k) p_k$ converges absolutely, then $E[Y] = E[g(X)] = \sum_k g(x_k) p_k$.
(2) If X is a continuous random variable with probability density $f(x)$, and $\int g(x) f(x)\,dx$ converges absolutely, then $E[Y] = E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$.
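A tiny numerical check of the discrete case (my own example, not from the original post): the expectation of $Y = g(X)$ is computed directly from the distribution of X, without ever deriving the distribution of Y.

```python
import numpy as np

# Discrete X with P(X = x_k) = p_k.
x = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])

g = lambda t: t**2                  # Y = g(X)

E_Y = np.sum(g(x) * p)              # Lazy Statistician: E[g(X)] = sum_k g(x_k) p_k
print(E_Y)                          # 0.2*0 + 0.5*1 + 0.3*4 = 1.7
```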
Returning to our problem: Y corresponds to $p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})$, X corresponds to $z^{(i)}$, $Q_i(z^{(i)})$ plays the role of $p_k$, and g is the map from $z^{(i)}$ to $p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})$. This explains the expectation in (2). Applying Jensen's inequality for a concave function,
$$f(E[X]) \ge E[f(X)],$$
we obtain (3).
This process can be viewed as constructing a lower bound on $\ell(\theta)$. There are many possible choices of $Q_i$; which is better? Assuming $\theta$ is given, the value of the bound is determined by $Q_i(z^{(i)})$ and $p(x^{(i)}, z^{(i)})$. We can adjust these two probabilities to push the lower bound up so that it approaches the true value of $\ell(\theta)$. When is the adjustment good enough? When the inequality becomes an equality, the adjusted lower bound equals $\ell(\theta)$. Following this idea, we need to find the condition under which equality holds. By Jensen's inequality, equality requires the random variable to be a constant, that is:
$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$
where c is a constant that does not depend on $z^{(i)}$. Pushing this further: since $\sum_{z} Q_i(z^{(i)}) = 1$, we also have $\sum_{z} p(x^{(i)}, z^{(i)}; \theta) = c$ (adding the numerators and denominators of several equal ratios leaves the ratio unchanged; here every term for a given sample has the same ratio c). This gives the following formula:
$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
At this point we have derived that, with the other parameters $\theta$ fixed, the formula for $Q_i(z^{(i)})$ is simply the posterior probability, which settles the question of how to choose $Q_i$. This is the E-step, which establishes the lower bound on $\ell(\theta)$. The following M-step then adjusts $\theta$, with $Q_i(z^{(i)})$ given, so as to maximize that lower bound (with $Q_i$ fixed, the lower bound can still be pushed higher). The general EM algorithm therefore proceeds as follows:
Repeat until convergence {
(E-step) For each i, compute $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$.
(M-step) Compute $\theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \dfrac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$.
}
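The loop above can be sketched in code. Below is a minimal EM for a two-component 1-D Gaussian mixture (my own illustrative implementation, not code from the original post); `X` is assumed to be a 1-D numpy array of samples, and the M-step uses the standard mixture-of-Gaussians updates for $\phi$ and $\mu$ derived in section 3, plus the analogous variance update.

```python
import numpy as np

def em_gmm_1d(X, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    rng = np.random.default_rng(seed)
    m = len(X)
    # Initialize theta = (phi, mu, sigma2).
    phi = np.array([0.5, 0.5])
    mu = rng.choice(X, size=2, replace=False).astype(float)
    sigma2 = np.array([X.var(), X.var()])

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: Q_i(z = j) = p(z = j | x_i; theta), the posterior "responsibilities".
        dens = np.exp(-(X[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        joint = phi * dens                      # shape (m, 2): p(x_i, z = j; theta)
        w = joint / joint.sum(axis=1, keepdims=True)

        # M-step: maximize the lower bound over theta with Q fixed.
        Nj = w.sum(axis=0)
        phi = Nj / m
        mu = (w * X[:, None]).sum(axis=0) / Nj
        sigma2 = (w * (X[:, None] - mu) ** 2).sum(axis=0) / Nj

        # The log-likelihood is non-decreasing across iterations (see the proof that follows).
        ll = np.log(joint.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return phi, mu, sigma2

# Example: samples drawn from two Gaussians.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])
print(em_gmm_1d(X))
```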
So how exactly do we know that EM converges? Suppose $\theta^{(t)}$ and $\theta^{(t+1)}$ are the results of the t-th and (t+1)-th EM iterations. If we can prove that $\ell(\theta^{(t)}) \le \ell(\theta^{(t+1)})$, i.e. that the log-likelihood increases monotonically, then we will eventually reach a maximum of the likelihood. The proof goes as follows: after selecting $\theta^{(t)}$, the E-step gives
$$Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$
This step guarantees that, for the given $\theta^{(t)}$, the equality in Jensen's inequality holds, i.e.
$$\ell(\theta^{(t)}) = \sum_{i} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}$$
Then comes the M-step: fix $Q_i^{(t)}(z^{(i)})$, treat $\theta$ as the variable, and maximize the expression above to obtain $\theta^{(t+1)}$. After some deduction, the following chain holds:
$$\ell(\theta^{(t+1)}) \ge \sum_{i} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \qquad (4)$$
$$\ge \sum_{i} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} \qquad (5)$$
$$= \ell(\theta^{(t)}) \qquad (6)$$
To explain step (4): when $\theta^{(t+1)}$ is obtained, it only maximizes the lower bound of $\ell(\theta)$; it does not make the equality hold. The equality holds only when $\theta$ is fixed and $Q_i$ is chosen according to the E-step.
Moreover, by the bound derived earlier, the following holds for all $Q_i$ and $\theta$:
$$\ell(\theta) \ge \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
Step (5) uses the definition of the M-step: the M-step adjusts $\theta^{(t)}$ to $\theta^{(t+1)}$ precisely so as to maximize this lower bound, so (5) holds. Step (6) is the equality established above.
This proves that $\ell(\theta)$ increases monotonically. One convergence criterion is that $\ell(\theta)$ no longer changes; another is that its change falls below a small threshold.
To explain (4), (5), (6) once more: (4) holds for all parameters, and its equality condition holds only when $\theta$ is fixed and Q is adjusted by the E-step; step (4) itself only fixes Q and adjusts $\theta$, so it cannot guarantee equality. Going from (4) to (5) is the definition of the M-step, and going from (5) to (6) is the equality condition guaranteed by the preceding E-step. In other words, the E-step pulls the lower bound up to the same height as $\ell(\theta)$ at one specific point (here $\theta^{(t)}$); it then turns out the lower bound can still rise, so after the M-step the lower bound is lifted again, though not up to $\ell(\theta)$ at the new point; the next E-step then pulls the lower bound up to the same height as $\ell$ at that new point, and so on, repeating until a maximum is reached.
If we define
$$J(Q, \theta) = \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
then from the preceding derivation we know $\ell(\theta) \ge J(Q, \theta)$, and EM can be regarded as coordinate ascent on J: the E-step fixes $\theta$ and optimizes over Q, and the M-step fixes Q and optimizes over $\theta$.
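To make the coordinate-ascent picture concrete, here is a small sketch (my own illustration; the helper names `log_lik` and `lower_bound` are made up for this example) that evaluates $\ell(\theta)$ and the lower bound $J(Q, \theta)$ for a toy 1-D two-component mixture. It shows $J \le \ell$ for an arbitrary valid Q, with equality when Q is the posterior chosen by the E-step.

```python
import numpy as np

def joint(X, phi, mu, sigma2):
    """p(x_i, z = j; theta) for a 1-D two-component Gaussian mixture, shape (m, 2)."""
    dens = np.exp(-(X[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return phi * dens

def log_lik(X, theta):
    return np.log(joint(X, *theta).sum(axis=1)).sum()      # l(theta)

def lower_bound(Q, X, theta):
    pj = joint(X, *theta)
    return np.sum(Q * (np.log(pj) - np.log(Q)))             # J(Q, theta)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
theta = (np.array([0.5, 0.5]), np.array([0.0, 4.0]), np.array([1.0, 1.0]))

posterior = joint(X, *theta)
posterior /= posterior.sum(axis=1, keepdims=True)           # E-step choice of Q
uniform = np.full_like(posterior, 0.5)                      # some other valid Q

print(log_lik(X, theta))                  # l(theta)
print(lower_bound(posterior, X, theta))   # equals l(theta): the bound is tight after the E-step
print(lower_bound(uniform, X, theta))     # strictly smaller: a looser lower bound
```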
3. Revisiting the Mixture of Gaussians Model
Now that we know the essence of EM and its derivation, let's look again at the mixture of Gaussians model. The parameter formulas given earlier for the mixture of Gaussians were based on many assertions, some of them unexplained. For simplicity, in the M-step only the derivations for $\phi$ and $\mu$ are given here.
The E-step is simple; from the general EM formula we get:
$$w_j^{(i)} = Q_i(z^{(i)} = j) = P(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$
The plain-language reading is that the probability of the latent class j for each sample i is computed as a posterior probability.
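A possible vectorized sketch of this E-step for a d-dimensional mixture (my own illustration; it assumes `X` of shape (m, d) and parameter arrays `phi`, `mu`, `Sigma` as in the model above), using `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, Sigma):
    """Responsibilities w[i, j] = Q_i(z^(i) = j) = P(z^(i) = j | x^(i); phi, mu, Sigma)."""
    m, k = X.shape[0], len(phi)
    joint = np.empty((m, k))
    for j in range(k):
        # p(x^(i), z^(i) = j) = p(x^(i) | z^(i) = j) * phi_j
        joint[:, j] = multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) * phi[j]
    return joint / joint.sum(axis=1, keepdims=True)   # normalize -> posterior
```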
In the M-step, we need to maximize the lower bound after fixing $Q_i(z^{(i)})$, i.e.
$$\sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})} = \sum_{i=1}^{m} \sum_{j=1}^{k} Q_i(z^{(i)} = j) \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) \cdot \phi_j}{Q_i(z^{(i)} = j)}$$
This is what the expression looks like after expanding the k possible values of $z^{(i)}$; the unknown parameters are $\phi_j$, $\mu_j$ and $\Sigma_j$.
Fixing $\phi$ and $\Sigma$ and taking the derivative with respect to $\mu_l$ gives
$$\nabla_{\mu_l} \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) \phi_j}{w_j^{(i)}} = \sum_{i=1}^{m} w_l^{(i)} \left(\Sigma_l^{-1} x^{(i)} - \Sigma_l^{-1} \mu_l\right)$$
Setting this equal to 0 gives
$$\mu_l := \frac{\sum_{i=1}^{m} w_l^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_l^{(i)}}$$
This is exactly the $\mu$ update formula in the model presented earlier.
Next we derive the update formula for $\phi_j$. Looking again at the expression obtained above: once $\mu$ and $\Sigma$ are fixed, everything inside the log except $\phi_j$ is a constant, so the expression that actually needs to be optimized is
$$\sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j$$
Note that $\phi_j$ must also satisfy the constraint
$$\sum_{j=1}^{k} \phi_j = 1$$
We are familiar with this kind of constrained optimization problem, so we construct the Lagrangian directly:
$$\mathcal{L}(\phi) = \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j + \beta \left(\sum_{j=1}^{k} \phi_j - 1\right)$$
There is also the constraint $\phi_j \ge 0$, but it will be satisfied automatically by the formula we end up with.
Taking the derivative,
$$\frac{\partial \mathcal{L}}{\partial \phi_j} = \sum_{i=1}^{m} \frac{w_j^{(i)}}{\phi_j} + \beta$$
Setting it to 0 gives
$$\phi_j = \frac{\sum_{i=1}^{m} w_j^{(i)}}{-\beta}$$
That is, $\phi_j \propto \sum_{i} w_j^{(i)}$. Using the constraint $\sum_j \phi_j = 1$ again, we get
$$-\beta = \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} = \sum_{i=1}^{m} 1 = m$$
and so, rather neatly, $-\beta = m$. This yields the M-step update formula:
$$\phi_j := \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}$$
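Putting the two M-step updates just derived into code (a sketch under the same assumptions as the E-step snippet above; `w` is the (m, k) responsibility matrix and the function name is my own):

```python
import numpy as np

def m_step_phi_mu(X, w):
    """M-step updates for phi and mu derived above; Sigma is omitted here, as in the text."""
    Nj = w.sum(axis=0)                 # effective number of samples assigned to class j
    phi = Nj / X.shape[0]              # phi_j = (1/m) * sum_i w_j^(i)
    mu = (w.T @ X) / Nj[:, None]       # mu_j = sum_i w_j^(i) x^(i) / sum_i w_j^(i)
    return phi, mu
```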
The derivation of the $\Sigma$ update is similar, just slightly more involved since it is a matrix; the result was already given in the earlier mixture-of-Gaussians model.
4. Summary
If the samples are regarded as observed values and the latent classes as hidden variables, then the clustering problem becomes a parameter-estimation problem, except that the parameters split into the latent class variables and the other model parameters. This is like finding the extremum of a function in an x-y coordinate system when the function cannot be differentiated directly, so gradient descent does not apply. However, once one variable is fixed, the other can be solved for by differentiation, so we can use coordinate ascent: fix one variable at a time, maximize over the other, and gradually approach the extremum. In EM terms, the E-step estimates the hidden variables and the M-step estimates the other parameters, alternating until the maximum is reached. There is also the notion of "hard" versus "soft" assignment. "Soft" assignment seems more reasonable but costs more computation; "hard" assignment is more practical in some cases, such as k-means (keeping, for every sample point, a probability with respect to every center would be cumbersome).
In addition, the proof of EM's convergence is very clever: it exploits the concavity of log and the idea of constructing a lower bound, flattening the lower bound against the function and then optimizing that lower bound to approach the maximum step by step, with every iteration guaranteed to be monotone. The most delicate part is the trick of multiplying numerator and denominator by $Q_i(z)$ inside the log so that the sum becomes an expectation and Jensen's inequality can be applied; it is impressive that people thought of it.
Mitchell's machine learning book also gives an example application of EM: the heights of the students in a class are pooled together, and we are asked to cluster them into two classes. The heights can be seen as drawn from a Gaussian distribution of male heights and a Gaussian distribution of female heights. The questions are how to estimate whether each sample is male or female, and, once that assignment is determined, how to estimate the means and variances. The book gives the formulas; interested readers can consult it.
Reference: "Principles of the EM Algorithm" (reposted), http://blog.csdn.net/junnan321/article/details/8483343