EM is an algorithm I have long wanted to learn properly. The first time I heard of it was in NLP, where the EM algorithm is used to solve the HMM parameter-estimation problem; it is also used for word alignment in machine translation (MT). Mitchell's book also mentions that EM can be applied to Bayesian networks.
The entire derivation process of EM is described below.
1. Jensen's Inequality
First, review some concepts from convex optimization. Let $f$ be a function whose domain is the set of real numbers. If $f''(x) \ge 0$ for all real $x$, then $f$ is a convex function. When $x$ is a vector, $f$ is convex if its Hessian matrix $H$ is positive semi-definite ($H \succeq 0$). If $f''(x) > 0$ (or $H \succ 0$), then $f$ is strictly convex.
Jensen's inequality states the following:
If $f$ is a convex function and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$.
In particular, if $f$ is strictly convex, then $E[f(X)] = f(E[X])$ holds if and only if $P(X = E[X]) = 1$, that is, $X$ is a constant.
Here we write $f(E[X])$ as $f(EX)$ for short.
This becomes clear from a picture. In the figure, the solid curve $f$ is a convex function; $X$ is a random variable that takes the value $a$ with probability 0.5 and the value $b$ with probability 0.5 (like a coin flip), so $E[X]$ is the midpoint of $a$ and $b$. From the figure it is clear that $E[f(X)] \ge f(E[X])$.
$f$ is a (strictly) concave function if and only if $-f$ is a (strictly) convex function.
When Jensen's inequality is applied to a concave function, the direction of the inequality is reversed: $E[f(X)] \le f(E[X])$.
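As a quick numerical sanity check (not part of the original derivation), here is a small Python sketch of the coin-flip picture above; the values a, b and the two functions are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# X takes the value a with probability 0.5 and b with probability 0.5,
# exactly like the coin-flip example in the figure above.
a, b = 1.0, 4.0
samples = rng.choice([a, b], size=100_000)

f_convex = np.square   # f(x) = x^2 is convex
f_concave = np.log     # f(x) = log x is concave (for x > 0)

# Convex f: E[f(X)] >= f(E[X])
print(f_convex(samples).mean(), f_convex(samples.mean()))    # ~8.5  >= 6.25

# Concave f: the inequality flips, E[f(X)] <= f(E[X])
print(f_concave(samples).mean(), f_concave(samples.mean()))  # ~0.69 <= ~0.92
```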
2. EM Algorithm
Suppose we are given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ of $m$ independent samples. We want to find the hidden class $z$ of each sample so that $p(x, z)$ is maximized. The log-likelihood of $p(x, z)$ is:

$$\ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$
The first step takes the logarithm of the likelihood; the second step writes the marginal probability of each sample as a sum of the joint distribution over every possible class $z^{(i)}$. Maximizing $\ell(\theta)$ directly is generally hard because of the hidden variable $z$; however, once $z$ is determined (observed), the maximization becomes easy.
EM is an effective method for this kind of latent-variable optimization. Since we cannot maximize $\ell(\theta)$ directly, we repeatedly construct a lower bound on $\ell$ (the E-step) and then optimize that lower bound (the M-step). This sentence is abstract; the details follow.
For each sample $i$, let $Q_i$ denote some distribution over the hidden variable $z$ for that sample, satisfying $\sum_z Q_i(z) = 1$ and $Q_i(z) \ge 0$. (If $z$ is continuous, $Q_i$ is a probability density and the sums below become integrals.) For example, to cluster the students in a class, if the hidden variable $z$ were height, $Q_i$ would be a continuous Gaussian distribution; if the hidden variable were male/female, it would be a Bernoulli distribution.
From the above, we can derive the following:

$$\begin{aligned} \sum_i \log p(x^{(i)}; \theta) &= \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) && (1) \\ &= \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} && (2) \\ &\ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} && (3) \end{aligned}$$
Going from (1) to (2) is straightforward: inside the sum we multiply and divide by $Q_i(z^{(i)})$, which leaves the value unchanged. Going from (2) to (3) uses Jensen's inequality, noting that $\log$ is a concave function (its second derivative $-1/x^2$ is less than 0), and that
$\sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$ is exactly the expectation of $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$ under the distribution $Q_i$ (recall the "lazy statistician" rule for expectations):
If $Y = g(X)$ is a function of a random variable $X$ (with $g$ continuous), then: (1) if $X$ is a discrete random variable with distribution $P(X = x_k) = p_k$, $k = 1, 2, \ldots$, and $\sum_k g(x_k) p_k$ converges absolutely, then $E[Y] = \sum_k g(x_k) p_k$; (2) if $X$ is a continuous random variable with probability density $f(x)$, and $\int g(x) f(x)\,dx$ converges absolutely, then $E[Y] = \int g(x) f(x)\,dx$.
In our problem, $X$ corresponds to $z^{(i)}$ with distribution $Q_i$, $g$ is the mapping $z^{(i)} \mapsto \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$, and $Y = g(z^{(i)})$. This explains the expectation appearing in formula (2). Then, applying Jensen's inequality for the concave function $\log$,

$$\log\Big(E_{z^{(i)} \sim Q_i}\Big[\tfrac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\Big]\Big) \ge E_{z^{(i)} \sim Q_i}\Big[\log \tfrac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\Big],$$
we obtain (3).
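To make formulas (1)–(3) concrete, here is a minimal Python sketch using a made-up discrete joint $p(x, z)$ for a single sample (the probabilities in p_xz and the distribution Q are hypothetical, for illustration only). It checks that any choice of Q gives a lower bound, and previews the particular choice of Q discussed next:

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy discrete joint p(x, z; theta) for one sample x: three possible hidden
# values z, with made-up joint probabilities.
p_xz = np.array([0.10, 0.05, 0.25])          # p(x, z) for z = 0, 1, 2
log_px = np.log(p_xz.sum())                  # formula (1): log sum_z p(x, z)

# Any distribution Q over z gives a lower bound, by Jensen (formula (3)).
Q = rng.dirichlet(np.ones(3))
lower_bound = np.sum(Q * np.log(p_xz / Q))   # sum_z Q(z) log [p(x, z) / Q(z)]
assert lower_bound <= log_px + 1e-12

# Choosing Q as the posterior p(z | x) makes the bound tight (equality),
# which is exactly the choice derived in the next paragraphs.
Q_post = p_xz / p_xz.sum()
tight = np.sum(Q_post * np.log(p_xz / Q_post))
print(log_px, lower_bound, tight)            # tight == log_px up to rounding
```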
This process constructs a lower bound on $\ell(\theta)$. There are many possible choices of $Q_i$; which is best? Given $\theta$, the value of the bound is determined by $Q_i(z^{(i)})$ and $p(x^{(i)}, z^{(i)}; \theta)$. We can adjust these two probabilities so that the lower bound keeps rising toward the true value of $\ell(\theta)$. When do we stop adjusting? When the inequality becomes an equality, i.e. when the bound equals $\ell(\theta)$ at the current $\theta$. Following this idea, we look for the condition under which equality holds. By Jensen's inequality, equality requires the random variable inside to be a constant, that is:

$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$
where $c$ is a constant that does not depend on $z^{(i)}$. Pushing the derivation further: since $\sum_{z} Q_i(z^{(i)}) = 1$, summing the numerators and the denominators over $z^{(i)}$ gives $\sum_{z} p(x^{(i)}, z^{(i)}; \theta) = c$ (when several fractions all equal $c$, the sum of their numerators over the sum of their denominators still equals $c$). Therefore:

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
So, with the other parameters $\theta$ fixed, the optimal $Q_i(z^{(i)})$ is simply the posterior probability $p(z^{(i)} \mid x^{(i)}; \theta)$; this settles the question of how to choose $Q_i$. This is the E-step, which establishes the lower bound on $\ell(\theta)$. The following M-step then maximizes the lower bound with $Q_i$ fixed (after $Q_i$ is fixed, the bound can still be pushed higher by adjusting $\theta$). The general EM algorithm is as follows:
Repeat until convergence {

(E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$.

(M-step) Set $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$.

}
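Below is a minimal sketch of this loop for a toy model: a mixture of two biased coins (a Bernoulli mixture). The model, the parameter names pi and p, and the generated data are all assumptions for illustration, not something from the original text:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)

# Toy data: each entry is the number of heads in 10 flips of one of two
# hidden coins with biases 0.3 and 0.8 (hypothetical example).
n_flips = 10
true_p = np.array([0.3, 0.8])
z_true = rng.integers(0, 2, size=200)
heads = rng.binomial(n_flips, true_p[z_true])

# Initial guesses for theta = (pi, p).
pi = np.array([0.5, 0.5])   # mixing weights p(z = j)
p = np.array([0.4, 0.6])    # head probability of each coin

for _ in range(50):
    # E-step: Q_i(z = j) = p(z = j | x_i; theta), by Bayes' rule.
    lik = binom.pmf(heads[:, None], n_flips, p[None, :]) * pi[None, :]
    w = lik / lik.sum(axis=1, keepdims=True)        # shape (m, 2)

    # M-step: maximize the lower bound with Q fixed.
    pi = w.mean(axis=0)
    p = (w * heads[:, None]).sum(axis=0) / (w * n_flips).sum(axis=0)

print(pi, p)   # should approach the true mixing weights and coin biases
```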
How can we be sure EM converges? Suppose $\theta^{(t)}$ and $\theta^{(t+1)}$ are the parameters after iterations $t$ and $t+1$. If we can show that $\ell(\theta^{(t)}) \le \ell(\theta^{(t+1)})$, i.e. that the log-likelihood increases monotonically, then EM eventually reaches a (local) maximum of the likelihood. The proof starts from the E-step at $\theta^{(t)}$, where we choose $Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$.
This choice guarantees that, at the given $\theta^{(t)}$, Jensen's inequality holds with equality, that is:

$$\ell(\theta^{(t)}) = \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}$$
Then the M-step fixes $Q_i^{(t)}$ and treats $\theta$ as the variable to maximize over. Combining this with the derivation above:

$$\begin{aligned} \ell(\theta^{(t+1)}) &\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} && (4) \\ &\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} && (5) \\ &= \ell(\theta^{(t)}) && (6) \end{aligned}$$
Explaining (4): for an arbitrary $\theta$ (here $\theta^{(t+1)}$), the right-hand side is only a lower bound on $\ell(\theta)$, not an equality; equality would hold only if $Q_i$ were re-chosen by an E-step at that $\theta$, whereas here $Q_i^{(t)}$ is still fixed.
Also, by formula (3) derived earlier, the lower bound holds for every choice of $Q_i$ and $\theta$, which is what makes (4) valid.
Step (5) uses the definition of the M-step: $\theta^{(t+1)}$ is chosen to maximize the lower bound over $\theta$, so the bound at $\theta^{(t+1)}$ is at least its value at $\theta^{(t)}$; hence (5) holds. Step (6) follows from the equality established by the E-step, as shown above.
This proves that $\ell(\theta)$ increases monotonically. One convergence criterion is that $\ell(\theta)$ stops changing; another is that the change falls below a small threshold.
To explain (4), (5), and (6) once more: inequality (4) holds for all parameters $\theta$, whereas the equality version holds only when $Q$ is fixed to the posterior and then $\theta$ is adjusted; since step (4) keeps $Q^{(t)}$ fixed while changing $\theta$, equality cannot be guaranteed. (4) to (5) is the definition of the M-step, and (5) to (6) is the equality condition guaranteed by the E-step. In other words, the E-step pulls the lower bound up until it touches $\ell$ at the current $\theta^{(t)}$; the M-step then finds that the bound can still rise and pushes $\theta$ to $\theta^{(t+1)}$, where the bound is higher but no longer touches $\ell$. The next E-step pulls the bound up to touch $\ell$ at $\theta^{(t+1)}$, and the process repeats until a maximum is reached.
If we define

$$J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})},$$
then from the derivation above we know that $\ell(\theta) \ge J(Q, \theta)$, and EM can be viewed as coordinate ascent on $J$: the E-step fixes $\theta$ and optimizes over $Q$, and the M-step fixes $Q$ and optimizes over $\theta$.
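This coordinate-ascent view can be checked numerically. The sketch below reuses the same hypothetical two-coin mixture as before: right after each E-step the bound J(Q, theta) equals the log-likelihood, and the log-likelihood never decreases across iterations, matching (4)–(6):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)

# The same kind of hypothetical two-coin mixture as in the earlier sketch.
n_flips, m = 10, 200
heads = rng.binomial(n_flips, np.where(rng.random(m) < 0.5, 0.3, 0.8))

pi, p = np.array([0.5, 0.5]), np.array([0.4, 0.6])
prev_ll = -np.inf

for _ in range(30):
    joint = binom.pmf(heads[:, None], n_flips, p[None, :]) * pi[None, :]
    ll = np.log(joint.sum(axis=1)).sum()     # l(theta) at the current theta
    assert ll >= prev_ll - 1e-7              # monotone increase, as in (4)-(6)
    prev_ll = ll

    # E-step: fix theta, optimize Q -> the bound J(Q, theta) becomes tight.
    w = joint / joint.sum(axis=1, keepdims=True)
    J = np.sum(w * np.log(joint / w))
    assert abs(J - ll) < 1e-8                # equality right after the E-step

    # M-step: fix Q, optimize theta -> J (and hence l) can only go up.
    pi = w.mean(axis=0)
    p = (w * heads[:, None]).sum(axis=0) / (w * n_flips).sum(axis=0)

print(prev_ll)
```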
3. Review the Gaussian Mixture Model
Now that we know the essence and derivation of EM, let us look at the Gaussian mixture model again. The parameters $\phi$, $\mu$, and $\Sigma$ of the Gaussian mixture model mentioned earlier, and their update formulas, were stated under many assumptions, some of which were never justified. For simplicity, only the derivations of $\mu$ and $\phi$ are given for the M-step here.
The E-step is very simple. Following the general EM formula:

$$w_j^{(i)} = Q_i(z^{(i)} = j) = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$
In plain words, the probability that the hidden class of sample $i$ is $j$ is computed as a posterior probability, via Bayes' rule, from the current parameters.
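A minimal sketch of this E-step for a one-dimensional Gaussian mixture, assuming parameter arrays named phi, mu, and sigma (these names and shapes are illustrative, not from the original text):

```python
import numpy as np
from scipy.stats import norm

def gmm_e_step(X, phi, mu, sigma):
    """E-step of a 1-D Gaussian mixture: w[i, j] = p(z_i = j | x_i; phi, mu, sigma).

    Assumed shapes: X is (m,); phi, mu, sigma are (k,).
    """
    # Numerator of Bayes' rule: p(x_i | z_i = j; mu, sigma) * p(z_i = j; phi)
    joint = norm.pdf(X[:, None], loc=mu[None, :], scale=sigma[None, :]) * phi[None, :]
    # Normalize over j to get the posterior (the "responsibilities").
    return joint / joint.sum(axis=1, keepdims=True)
```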
In the M-step, we fix $Q_i(z^{(i)})$ (i.e. the $w_j^{(i)}$) and maximize the lower bound with respect to the parameters, that is:

$$\sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})} = \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{w_j^{(i)}}$$
Here the sum over $z^{(i)}$ has been expanded into its $k$ possible values. The unknown parameters are $\phi$, $\mu$, and $\Sigma$.
Fixing $\phi$ and $\Sigma$, take the derivative with respect to $\mu_l$:

$$\nabla_{\mu_l} \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \Big[ -\tfrac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) + \text{const} \Big] = \sum_{i=1}^{m} w_l^{(i)}\, \Sigma_l^{-1} (x^{(i)} - \mu_l)$$
Setting this derivative to 0 and solving for $\mu_l$ gives:

$$\mu_l := \frac{\sum_{i=1}^{m} w_l^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_l^{(i)}}$$
This is exactly the $\mu$ update formula seen in the earlier mixture-of-Gaussians model.
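Written with NumPy, the $\mu$ update derived above is a one-liner over the responsibility matrix w (the names and shapes are assumptions for illustration):

```python
import numpy as np

def update_mu(X, w):
    """M-step update mu_l = sum_i w[i, l] x_i / sum_i w[i, l].

    Assumed shapes: X is the data, (m, d); w is the responsibility matrix, (m, k).
    Returns the k component means as an array of shape (k, d).
    """
    return (w.T @ X) / w.sum(axis=0)[:, None]
```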
Next, derive the update formula for $\phi$. Looking again at the objective obtained above,
after $\mu$ and $\Sigma$ are fixed, the terms in the numerator that do not involve $\phi$ are constants, so the quantity to optimize reduces to:

$$\sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j$$
Note that $\phi_j$ must also satisfy the constraint $\sum_{j=1}^{k} \phi_j = 1$.
We are very familiar with this kind of constrained optimization problem; we can directly form the Lagrangian:

$$\mathcal{L}(\phi) = \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j + \beta\Big(\sum_{j=1}^{k} \phi_j - 1\Big)$$
There is also the constraint $\phi_j \ge 0$, but it turns out to be satisfied automatically by the resulting formula (since $w_j^{(i)} \ge 0$).
Taking the derivative with respect to $\phi_l$ and setting it to 0:

$$\frac{\partial \mathcal{L}}{\partial \phi_l} = \sum_{i=1}^{m} \frac{w_l^{(i)}}{\phi_l} + \beta = 0,$$

so $\phi_l = \frac{\sum_{i=1}^{m} w_l^{(i)}}{-\beta}$. Using the constraint $\sum_l \phi_l = 1$ again and summing over $l$, we obtain

$$-\beta = \sum_{l=1}^{k} \sum_{i=1}^{m} w_l^{(i)} = \sum_{i=1}^{m} \sum_{l=1}^{k} w_l^{(i)} = \sum_{i=1}^{m} 1 = m.$$
Here we used the fact that $\sum_l w_l^{(i)} = 1$ for each sample $i$ (the responsibilities form a distribution), which is what makes $-\beta$ come out so neatly as $m$.
Substituting back, we get the M-step update formula for $\phi_j$:

$$\phi_j := \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}$$
The derivation for $\Sigma$ is similar, just a bit more involved since it is a matrix; the result was already given in the earlier Gaussian mixture model section.
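Putting the E-step and the three M-step updates together, here is a minimal one-dimensional Gaussian-mixture EM sketch (scalar variances rather than full covariance matrices, and the data and parameter names are hypothetical, chosen to echo the heights example in the summary below):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Hypothetical 1-D data from two Gaussians (e.g. heights of two groups).
X = np.concatenate([rng.normal(160, 6, 300), rng.normal(175, 7, 300)])
m, k = X.size, 2

# Initialize phi, mu, sigma.
phi = np.full(k, 1.0 / k)
mu = rng.choice(X, size=k, replace=False)
sigma = np.full(k, X.std())

for _ in range(100):
    # E-step: w[i, j] = p(z_i = j | x_i; phi, mu, sigma)  (responsibilities)
    joint = norm.pdf(X[:, None], mu[None, :], sigma[None, :]) * phi[None, :]
    w = joint / joint.sum(axis=1, keepdims=True)

    # M-step: the updates derived above.
    phi = w.mean(axis=0)                                   # phi_j = (1/m) sum_i w_ij
    mu = (w * X[:, None]).sum(axis=0) / w.sum(axis=0)      # weighted mean
    var = (w * (X[:, None] - mu[None, :]) ** 2).sum(axis=0) / w.sum(axis=0)
    sigma = np.sqrt(var)

print(phi, mu, sigma)   # should be close to the generating parameters
```

As with any EM run, the result depends on initialization, so in practice one would restart from several random initializations and keep the run with the highest log-likelihood.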
4. Summary
If we regard the samples as observed values and the latent category as a hidden variable, then clustering is also a parameter-estimation problem, except that the parameters split into the hidden category variables and the other model parameters. This is like finding the extremum of a curve in the x-y coordinate system when the function cannot be differentiated directly, so gradient descent does not apply; but once one variable is fixed, the other can be solved for by differentiation. We can therefore use coordinate ascent: fix one variable at a time, solve for the extremum over the other, and gradually approach the optimum. In EM, the E-step estimates the hidden variables and the M-step estimates the other parameters, each in turn pushing the objective toward a maximum. EM also comes in "hard" and "soft" assignment versions. Soft assignment seems more principled but requires more computation; hard assignment, as in K-means, is more practical in some scenarios (maintaining a probability from every sample point to every center is expensive).
The convergence proof of EM is also elegant. It exploits the concavity of log, the idea of constructing a lower bound, making that bound tight, and then optimizing the bound so as to gradually approach the maximum, with each iteration guaranteed to increase the likelihood monotonically. The most subtle step in the proof is multiplying and dividing the numerator and denominator by $Q_i(z^{(i)})$ so that the sum over $z$ becomes an expectation to which Jensen's inequality can be applied; it is hard to imagine how our predecessors first thought of it.
Mitchell's machine learning book also gives an example of applying EM: the heights of all students in a class are mixed together and need to be clustered into two groups. The heights can be viewed as drawn from a Gaussian distribution for boys and a Gaussian distribution for girls, so the problem becomes estimating whether each sample is a boy or a girl, and then estimating the means and variances once the assignments are made. The formulas are provided there; interested readers can refer to it.