1. Introduction
The probability models discussed so far contain only observed variables (observable variables). Every variable can be observed, so given the data we can directly apply maximum likelihood estimation or Bayesian estimation. When a model contains latent variables (hidden variables), however, these estimation methods cannot simply be applied.
The Gaussian mixture discussed in "Gaussian mixture and EM algorithm" is a typical example of a model with latent variables, and that post showed how the EM algorithm is applied to the Gaussian mixture model. Here we discuss some of the underlying principles.
2. Jensen's inequality
Let $f$ be a function whose domain is the real numbers. If $f''(x) \ge 0$ for all $x \in \mathbb{R}$, then $f$ is a convex function. If the argument $x$ is a vector, then $f$ is convex when its Hessian matrix $H$ is positive semi-definite ($H \succeq 0$); this generalizes the convexity condition to vector inputs.
If $f''(x) > 0$ for all $x$, then $f$ is called a strictly convex function; the corresponding generalization to vector inputs is $H \succ 0$.
Theorem. Let $f$ be a convex function and let $X$ be a random variable. Then
$$E[f(X)] \ge f(E[X]).$$
Moreover, if $f$ is strictly convex, then $E[f(X)] = f(E[X])$ holds if and only if $X = E[X]$ with probability 1. That is, equality holds exactly when $X$ is a constant.
Note that $E$ above denotes expectation. By convention the brackets are sometimes omitted when writing the expectation of a variable, that is, $f(EX) = f(E[X])$.
The following figure illustrates the theorem:
[Figure: a convex function $f$, with $a$, $E[X]$, and $b$ marked on the x-axis and $f(a)$, $f(E[X])$, $E[f(X)]$, and $f(b)$ marked on the y-axis]
The solid line in the figure is the convex function $f$. The random variable $X$ takes the value $a$ with probability 0.5 and the value $b$ with probability 0.5, so its expectation $E[X]$ lies midway between $a$ and $b$, that is, at their mean.
On the y-axis, $E[f(X)]$ is the midpoint between $f(a)$ and $f(b)$. Because $f$ is convex, it must be that $E[f(X)] \ge f(E[X])$, as the figure shows.
Rather than memorizing the direction of the inequality, it is easier to remember this figure.
Note: $f$ is (strictly) concave if and only if $-f$ is (strictly) convex (that is, $f''(x) \le 0$ or $H \preceq 0$). Jensen's inequality still holds for a concave $f$, but with the direction of the inequality reversed:
$$E[f(X)] \le f(E[X]).$$
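As a quick numerical check of both directions of the inequality, here is a minimal sketch. The values $a = 1$, $b = 5$ and the functions $x^2$ (convex) and $\log x$ (concave) are choices made for illustration and do not come from the figure.

```python
import math

def expectation(values, probs):
    """Expected value of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))

# X takes a = 1 and b = 5 with probability 0.5 each (illustrative values).
values = [1.0, 5.0]
probs = [0.5, 0.5]

mean_x = expectation(values, probs)                        # E[X] = 3

# Convex case: f(x) = x^2, so E[f(X)] >= f(E[X]).
convex_lhs = expectation([v ** 2 for v in values], probs)  # E[X^2] = 13
convex_rhs = mean_x ** 2                                   # (E[X])^2 = 9

# Concave case: f(x) = log x, so the inequality is reversed.
concave_lhs = expectation([math.log(v) for v in values], probs)
concave_rhs = math.log(mean_x)

print(convex_lhs, ">=", convex_rhs)
print(concave_lhs, "<=", concave_rhs)
```

The gap between the two sides closes only when $X$ is constant, for example when $a = b$.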
3. EM algorithm
Suppose we have an estimation problem with a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ of $m$ independent samples. We wish to fit the parameters $\theta$ of a model $p(x, z)$ to these data, where the log-likelihood function is:
$$\ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta).$$
Here $z^{(i)}$ is the latent variable. If $z^{(i)}$ could be observed, maximum likelihood estimation would be easy, but it cannot be observed, which is why it is called a hidden variable.
In this setting, the EM algorithm gives an effective method for maximum likelihood estimation: repeatedly construct a lower bound on $\ell(\theta)$ (E-step), and then maximize that lower bound (M-step).
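For each $i$, let $Q_i$ be a distribution over the latent variable $z^{(i)}$, that is, $\sum_{z} Q_i(z) = 1$ and $Q_i(z) \ge 0$. Consider:
$$
\begin{aligned}
\sum_i \log p(x^{(i)}; \theta) &= \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \qquad (1) \\
&= \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \qquad (2) \\
&\ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \qquad (3)
\end{aligned}
$$
The step from (2) to (3) uses Jensen's inequality. Here $f(x) = \log x$ is a concave function, because $f''(x) = -1/x^2 < 0$ on its domain $x > 0$. With $z^{(i)}$ drawn from the distribution $Q_i$, the term
$$\sum_{z^{(i)}} Q_i(z^{(i)}) \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$$
is precisely the expectation of the quantity $p(x^{(i)}, z^{(i)}; \theta) / Q_i(z^{(i)})$, so Jensen's inequality gives:
$$f\left( E_{z^{(i)} \sim Q_i} \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \right) \ge E_{z^{(i)} \sim Q_i} \left[ f\left( \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right) \right].$$
This is how (3) follows from (2).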
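The bound in (3) holds for any choice of distribution over the latent variable, which can be checked numerically. The sketch below assumes a single observation and a latent variable with three values; the joint probabilities in `joint` and the helper names are hypothetical, chosen only for illustration.

```python
import math
import random

# Hypothetical values of p(x, z; theta) for one fixed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]

# log p(x; theta) = log sum_z p(x, z; theta), the left-hand side of (1).
log_likelihood = math.log(sum(joint))

def lower_bound(joint, q):
    """sum_z Q(z) * log(p(x, z) / Q(z)): the right-hand side of (3) for one sample."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q))

def random_distribution(k, rng):
    """Draw an arbitrary strictly positive distribution over k values."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
bounds = [lower_bound(joint, random_distribution(3, rng)) for _ in range(1000)]

print("log-likelihood:", log_likelihood)
print("largest bound over 1000 random Q:", max(bounds))
```

None of the randomly drawn distributions pushes the bound above the log-likelihood, which is what Jensen's inequality guarantees.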
Because of the hidden variables, maximizing $\ell(\theta)$ directly is difficult. Suppose, however, that we can make the lower bound equal to $\ell(\theta)$ at the current parameters. Then any $\theta$ that increases the lower bound also increases $\ell(\theta)$, so the natural choice is the parameter value at which the lower bound attains its maximum.
How do we make the bound tight, that is, make the inequality above hold with equality? The key is how the hidden variables are handled, as discussed next.
For any set of distributions $Q_i$, (3) gives a lower bound on the log-likelihood $\ell(\theta)$. There are many possible choices for $Q_i$; which one should we choose?
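From the discussion of Jensen's inequality above, equality holds when the random variable is a constant. So for the bound to be tight at the current $\theta$ we require
$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c,$$
where the constant $c$ does not depend on $z^{(i)}$. This is easy to achieve by choosing $Q_i$ such that:
$$Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta).$$
Because $\sum_{z} Q_i(z) = 1$, it follows that:
$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta).$$
Make sure you understand how this equation is obtained: the normalizing constant is the marginal $\sum_{z} p(x^{(i)}, z; \theta) = p(x^{(i)}; \theta)$, and dividing the joint probability by the marginal gives the conditional probability.
Therefore $Q_i$ can be set to the posterior distribution of $z^{(i)}$ given $x^{(i)}$ under the parameters $\theta$.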
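To see that the posterior really does make the bound tight, here is a minimal sketch for a single observation. The joint probabilities in `joint` are hypothetical values chosen for illustration, not taken from any model in the text.

```python
import math

# Hypothetical values of p(x, z; theta) for one fixed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]

# Marginal p(x; theta) = sum_z p(x, z; theta).
evidence = sum(joint)

# Posterior p(z | x; theta) = p(x, z; theta) / p(x; theta).
posterior = [pz / evidence for pz in joint]

# With Q set to the posterior, the ratio p(x, z) / Q(z) is the same for every z.
ratios = [pz / qz for pz, qz in zip(joint, posterior)]

# Lower bound sum_z Q(z) * log(p(x, z) / Q(z)) evaluated at Q = posterior.
bound = sum(qz * math.log(pz / qz) for pz, qz in zip(joint, posterior))

print("ratios:", ratios)
print("bound:", bound, "log-likelihood:", math.log(evidence))
```

Every ratio equals the marginal probability of the observation, so the random variable inside the logarithm is constant and the bound coincides with the log-likelihood.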
Once the distribution $Q_i$ of the hidden variable has been set in this way, the problem of maximizing the likelihood function is converted into maximizing its lower bound. Setting $Q_i$ is the E-step.
In the M-step, the parameters $\theta$ are adjusted to maximize formula (3) above.
Repeating the E-step and the M-step is the EM algorithm:
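Repeat until convergence {
    (E-step) For each $i$, set
    $$Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta).$$
    (M-step) Set
    $$\theta := \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.$$
}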
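How do we know the algorithm converges?
Let $\theta^{(t)}$ and $\theta^{(t+1)}$ be the parameters after two successive iterations. We need to prove that $\ell(\theta^{(t)}) \le \ell(\theta^{(t+1)})$.
As described above, starting from $\theta^{(t)}$ we choose the distribution $Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$, which makes Jensen's inequality hold with equality, so:
$$\ell(\theta^{(t)}) = \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}.$$
The parameter $\theta^{(t+1)}$ is obtained by maximizing the right-hand side of this equation over $\theta$, so:
$$
\begin{aligned}
\ell(\theta^{(t+1)}) &\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \qquad (4) \\
&\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} \qquad (5) \\
&= \ell(\theta^{(t)}) \qquad (6)
\end{aligned}
$$
Note that the first inequality (4) comes from:
$$\ell(\theta) \ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.$$
This holds for any $Q_i$ and $\theta$, and in particular for $Q_i = Q_i^{(t)}$ and $\theta = \theta^{(t+1)}$. Inequality (5) holds because $\theta^{(t+1)}$ is chosen by the maximization:
$$\theta^{(t+1)} = \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i^{(t)}(z^{(i)})}.$$
So the value of this expression at $\theta^{(t+1)}$ is at least its value at $\theta^{(t)}$.
Equality (6) holds because $Q_i^{(t)}$ was chosen, as discussed above, to make Jensen's inequality tight at $\theta^{(t)}$.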
Therefore, the EM algorithm makes the likelihood function converge monotonically. The description of the EM algorithm above says "repeat until convergence"; a common way to check convergence is this: if the value of the likelihood function changes very little between two successive iterations (within some tolerance), EM is progressing very slowly and the iteration can be stopped.
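The loop and the stopping rule can be sketched on a small toy problem. The model below, a mixture of two biased coins where each sample is the number of heads in `n` flips, is an assumption made purely for illustration; it is not the model treated in this post, and the data, starting values, and function names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Probability of k heads in n flips of a coin with head probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def log_likelihood(data, n, phi, p):
    """l(theta) = sum_i log sum_z p(x_i, z; theta) for the two-coin mixture."""
    return sum(
        math.log(sum(phi[j] * binom_pmf(x, n, p[j]) for j in range(2)))
        for x in data
    )

def em(data, n, phi, p, tol=1e-8, max_iter=500):
    """Run EM; stop when the log-likelihood improves by less than tol."""
    history = [log_likelihood(data, n, phi, p)]
    for _ in range(max_iter):
        # E-step: Q_i(z = j) = p(z = j | x_i; theta).
        q = []
        for x in data:
            joint = [phi[j] * binom_pmf(x, n, p[j]) for j in range(2)]
            total = sum(joint)
            q.append([v / total for v in joint])
        # M-step: closed-form maximizer of the lower bound for this toy model.
        totals = [sum(qi[j] for qi in q) for j in range(2)]
        phi = [totals[j] / len(data) for j in range(2)]
        p = [
            sum(qi[j] * x for qi, x in zip(q, data)) / (n * totals[j])
            for j in range(2)
        ]
        history.append(log_likelihood(data, n, phi, p))
        if history[-1] - history[-2] < tol:
            break
    return phi, p, history

data = [5, 9, 8, 4, 7]  # heads out of n = 10 flips (illustrative data)
phi, p, history = em(data, n=10, phi=[0.5, 0.5], p=[0.6, 0.5])
print("log-likelihood per iteration:", history)
```

The recorded values in `history` should never decrease, which mirrors the chain of inequalities (4) to (6).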
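Note: if we define
$$J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})},$$
then from the derivation above we know that $\ell(\theta) \ge J(Q, \theta)$. The EM algorithm can be viewed as coordinate ascent on the function $J$: the E-step maximizes $J$ with respect to $Q$, and the M-step maximizes it with respect to $\theta$.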
4. Mixture of Gaussians revisited
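In "Gaussian mixture and EM algorithm" we used the EM algorithm to optimize the Gaussian mixture model and fit the parameters $\phi$, $\mu$, and $\Sigma$.
E-step:
$$w_j^{(i)} = Q_i(z^{(i)} = j) = P(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma).$$
Here $Q_i(z^{(i)} = j)$ denotes the probability that $z^{(i)}$ takes the value $j$ under the distribution $Q_i$.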
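The E-step amounts to applying Bayes' rule to each sample. The sketch below assumes a one-dimensional mixture with scalar variances to keep the code short (the text treats the general multivariate case); the function names, data, and parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def e_step(xs, phi, mu, var):
    """Return w[i][j] = P(z_i = j | x_i; phi, mu, var) for a 1-D mixture."""
    w = []
    for x in xs:
        # Joint p(x_i, z_i = j) = p(x_i | z_i = j) * p(z_i = j).
        joint = [phi[j] * gaussian_pdf(x, mu[j], var[j]) for j in range(len(phi))]
        total = sum(joint)  # marginal p(x_i)
        w.append([v / total for v in joint])
    return w

xs = [-2.1, -1.9, 0.1, 2.0, 2.2]  # illustrative data
w = e_step(xs, phi=[0.5, 0.5], mu=[-2.0, 2.0], var=[1.0, 1.0])
for x, row in zip(xs, w):
    print(x, row)
```

Each row of `w` sums to 1, and points near a component mean are assigned almost entirely to that component.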
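M-step: maximize the following quantity with respect to the parameters $\phi$, $\mu$, $\Sigma$:
$$
\begin{aligned}
&\sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q_i(z^{(i)} = j) \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \phi)}{Q_i(z^{(i)} = j)} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right) \cdot \phi_j}{w_j^{(i)}}
\end{aligned}
$$
To maximize with respect to $\mu_l$, take the partial derivative of the expression above with respect to $\mu_l$:
$$
\begin{aligned}
&\nabla_{\mu_l} \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right) \cdot \phi_j}{w_j^{(i)}} \\
&= -\nabla_{\mu_l} \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \\
&= \frac{1}{2} \sum_{i=1}^{m} w_l^{(i)} \nabla_{\mu_l} \left( 2 \mu_l^T \Sigma_l^{-1} x^{(i)} - \mu_l^T \Sigma_l^{-1} \mu_l \right) \\
&= \sum_{i=1}^{m} w_l^{(i)} \left( \Sigma_l^{-1} x^{(i)} - \Sigma_l^{-1} \mu_l \right)
\end{aligned}
$$
Setting this derivative to zero and solving for $\mu_l$ gives the update rule:
$$\mu_l := \frac{\sum_{i=1}^{m} w_l^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_l^{(i)}}.$$
This is the result already obtained in "Gaussian mixture and EM algorithm".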
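Next consider how to update the parameters $\phi_j$. Keeping only the terms that depend on $\phi_j$, we find that we need only maximize:
$$\sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j.$$
Because $\phi_j = p(z^{(i)} = j; \phi)$, the $\phi_j$ sum to 1, so this is a constrained optimization problem. Following the earlier simple explanation of Lagrange duality, construct the Lagrangian:
$$\mathcal{L}(\phi) = \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j + \beta \left( \sum_{j=1}^{k} \phi_j - 1 \right),$$
where $\beta$ is the Lagrange multiplier. Taking the partial derivative with respect to $\phi_j$:
$$\frac{\partial}{\partial \phi_j} \mathcal{L}(\phi) = \sum_{i=1}^{m} \frac{w_j^{(i)}}{\phi_j} + \beta.$$
Setting the partial derivative to zero gives:
$$\phi_j = \frac{\sum_{i=1}^{m} w_j^{(i)}}{-\beta}.$$
That is, $\phi_j \propto \sum_{i=1}^{m} w_j^{(i)}$. Using the constraint $\sum_j \phi_j = 1$, we get $-\beta = \sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} = \sum_{i=1}^{m} 1 = m$ (note that $\sum_j w_j^{(i)} = 1$ because $w_j^{(i)} = Q_i(z^{(i)} = j)$).
This gives the update rule for the parameters $\phi_j$:
$$\phi_j := \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}.$$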
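The two closed-form updates derived in this section, the mixing weights as the average responsibility and the means as responsibility-weighted averages of the data, can be sketched as follows. The code assumes one-dimensional data, and the function name `m_step` and the example responsibilities are hypothetical.

```python
def m_step(xs, w):
    """Update mixing weights phi and means mu from responsibilities w[i][j]."""
    m = len(xs)
    k = len(w[0])
    # Total responsibility of each component: sum_i w[i][j].
    totals = [sum(w[i][j] for i in range(m)) for j in range(k)]
    # phi_j = (1 / m) * sum_i w[i][j]
    phi = [totals[j] / m for j in range(k)]
    # mu_j = sum_i w[i][j] * x_i / sum_i w[i][j]
    mu = [sum(w[i][j] * xs[i] for i in range(m)) / totals[j] for j in range(k)]
    return phi, mu

xs = [-2.0, -1.0, 1.0, 3.0]                             # illustrative data
w = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.0, 1.0]]    # illustrative responsibilities
phi, mu = m_step(xs, w)
print("phi:", phi)  # [0.5, 0.5]
print("mu:", mu)    # [-1.0, 1.5]
```

Because each row of `w` sums to 1, the updated weights `phi` automatically sum to 1, which is the constraint the Lagrange multiplier enforces.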
For the update rule of the parameters $\Sigma_j$, and for how the whole EM algorithm is applied to optimizing the Gaussian mixture model, see "Gaussian mixture and EM algorithm".
5. Summary
In short, when a model contains latent variables, the EM algorithm sets the distribution of the hidden variables to their posterior distribution given the observed variables, so that the likelihood function of the parameters equals its lower bound. It then increases the likelihood by maximizing that lower bound. This avoids maximizing the likelihood function directly, which is difficult because the hidden variables are unknown. The EM algorithm has two main steps. E-step: choose a suitable distribution for the latent variables (the posterior distribution given the observed variables), so that the likelihood function of the parameters equals its lower bound. M-step: maximize the lower bound of the likelihood function to fit the parameters.