1. Preface
In statistical computing, the expectation-maximization (EM) algorithm finds maximum likelihood or maximum a posteriori (MAP) estimates of the parameters of a probabilistic model that depends on unobserved latent variables. EM is often used for data clustering in machine learning and computer vision.
The EM algorithm alternates between two steps. The first is the expectation (E) step, which uses the current parameter estimates to compute the expected log-likelihood over the latent variables. The second is the maximization (M) step, which finds the parameter values that maximize the expected log-likelihood computed in the E-step. The parameter estimates found in the M-step are then used in the next E-step, and the two steps keep alternating.
Its applications in machine learning include the K-means algorithm and the Gaussian mixture model (GMM). What these have in common is that they initialize the parameters and then iteratively search for better ones.
2.1 Deriving the log-likelihood function for the EM algorithm
Suppose we have a sample set $\{x^{(1)}, \dots, x^{(m)}\}$ containing $m$ independent samples. The class $z^{(i)}$ of each sample $i$ is unknown (this is the clustering setting; compare K-means), and $z$ is called a latent variable. We need to estimate the parameters $\theta$ of the probability model $p(x, z)$, but because the model contains the latent variable $z$, it is difficult to apply maximum likelihood directly.
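In other words, our goal is to find the $\theta$ and $z$ that maximize $L(\theta)$. One might think: there is just one more unknown variable, so why not take the partial derivatives with respect to $\theta$ and $z$ separately, set them to 0, and solve as usual? In principle this is possible, but with Jensen's inequality we can sidestep differentiating with respect to both variables at once. Instead we alternate between maximizing a lower bound of the function over one variable (which may itself involve a derivative) and differentiating with respect to the other. The formulas are:

$$
\begin{aligned}
L(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta)
&= \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) && (1) \\
&= \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} && (2) \\
&\ge \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} && (3)
\end{aligned}
$$

Here $Q_i$ is a probability distribution over the values of $z^{(i)}$, and the step from (2) to (3) uses Jensen's inequality.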
2.2 Jensen's Inequality
2.2.1 Why not take partial derivatives directly
In essence we need to maximize formula (1). Recall how a marginal probability density is obtained from a joint density, and note that $z$ is also a random variable: summing the joint density on the right-hand side over every possible class $z$ of sample $i$ gives the marginal density of the random variable $x$ on the left-hand side, which is the likelihood. But (1) contains the logarithm of a sum, so its derivative has a very complicated form (picture differentiating a composite function such as $\log(f_1(x) + f_2(x) + f_3(x) + \dots)$), and it is hard to solve for the unknowns $z$ and $\theta$. Can we transform (1)? Formula (2) only multiplies the numerator and denominator by the same function, so it still contains a logarithm of a sum and is no easier to solve. Why do it, then? Look at (3): it is a sum of logarithms, which is easy to differentiate. Notice that the equals sign has become an inequality sign. Why is this change allowed? This is where Jensen's inequality shows its power, as the derivation below explains.
2.2.2 Jensen's Inequality
Let $f$ be a function whose domain is the real numbers. If $f''(x) \ge 0$ for all real $x$, then $f$ is a convex function. When $x$ is a vector, $f$ is convex if its Hessian matrix $H$ is positive semidefinite. If the inequality is strict ($f''(x) > 0$, or $H$ positive definite), then $f$ is strictly convex.
Jensen's inequality states:
- If $f$ is a convex function and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$.
- In particular, if $f$ is strictly convex, equality holds if and only if $X$ is a constant (that is, $X = E[X]$ with probability 1).
In the figure, the solid curve $f$ is a convex function and $X$ is a random variable that takes the value $a$ with probability 0.5 and the value $b$ with probability 0.5 (like a coin toss). The expected value of $X$ is the midpoint of $a$ and $b$, and the figure shows that $E[f(X)] \ge f(E[X])$.
When Jensen's inequality is applied to a concave function, the direction of the inequality is reversed: $E[f(X)] \le f(E[X])$. The step from (2) to (3) above applies it to the concave function $\log$.
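The coin-toss picture can be checked numerically. The following is a minimal sketch: the values of `a` and `b` and the choice $f(x) = x^2$ are illustrative assumptions, not taken from the figure. It also evaluates the concave function $\log$, for which the inequality flips.

```python
import math

def expectation(values, probs):
    """Expected value of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))

# X takes the value a or b with probability 0.5 each (the coin toss).
a, b = 1.0, 5.0                      # illustrative values (assumption)
values, probs = [a, b], [0.5, 0.5]

def f(x):
    return x * x                     # convex, since f''(x) = 2 > 0

e_x = expectation(values, probs)                       # E[X], midpoint of a and b
e_fx = expectation([f(v) for v in values], probs)      # E[f(X)]
f_ex = f(e_x)                                          # f(E[X])

# For the concave function log, the direction is reversed.
e_log = expectation([math.log(v) for v in values], probs)   # E[log X]
log_e = math.log(e_x)                                        # log E[X]

print(e_fx >= f_ex, e_log <= log_e)
```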
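Back to formula (2), the detailed derivation is as follows.
For a concave function $f$: $f(E[X]) \ge E[f(X)]$.
Here, for each sample $i$, with $z^{(i)}$ distributed according to $Q_i$:

$$
X = \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}, \qquad
E[X] = \sum_{z^{(i)}} Q_i(z^{(i)}) \, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}, \qquad
f(x) = \log x
$$

Then:

$$
\log \sum_{z^{(i)}} Q_i(z^{(i)}) \, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
= f(E[X]) \ge E[f(X)]
= \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
$$

Summing over $i$ gives formula (3).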
Formula (3) is now easy to differentiate, but formulas (2) and (3) are related by an inequality, so the maximum of (3) is not the maximum of (2), and it is the maximum of (2) that we want. What can we do?
Now we need a little imagination. The inequality between (2) and (3) can be written as $L(\theta) \ge J(z, Q)$, where $L(\theta)$ is the log-likelihood and $J(z, Q)$ denotes the right-hand side of (3). We can then keep maximizing the lower bound $J$ so that $L(\theta)$ keeps increasing and eventually reaches a maximum (in general a local one).
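To make the lower bound concrete, here is a small sketch for a single sample with a two-valued latent variable. The joint probabilities in `joint` are made-up numbers standing for $p(x, z; \theta)$ at a fixed $\theta$, and `lower_bound` is a hypothetical helper that evaluates $\sum_z Q(z) \log \frac{p(x, z; \theta)}{Q(z)}$. Whatever distribution $Q$ is tried, the bound stays at or below the log-likelihood.

```python
import math

def log_likelihood(joint):
    """log p(x; theta) = log of the sum over z of p(x, z; theta)."""
    return math.log(sum(joint))

def lower_bound(joint, q):
    """J = sum_z Q(z) * log(p(x, z; theta) / Q(z)); terms with Q(z) = 0 contribute 0."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q) if qz > 0)

# Assumed joint probabilities p(x, z=0) and p(x, z=1) for one observed x.
joint = [0.12, 0.28]

# A few arbitrary candidate distributions Q over z.
candidates = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]

ll = log_likelihood(joint)
bounds = [lower_bound(joint, q) for q in candidates]

for q, bound in zip(candidates, bounds):
    print(q, round(bound, 4), "<=", round(ll, 4))
```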
As the figure shows, we fix $\theta$ and adjust $Q(z)$ so that the lower bound $J(z, Q)$ rises until it equals $L(\theta)$ at this $\theta$ (green curve to blue curve). Then we fix $Q(z)$ and adjust $\theta$ so that the lower bound $J(z, Q)$ reaches its maximum ($\theta^{(t)}$ to $\theta^{(t+1)}$). Then we fix $\theta$ again and adjust $Q(z)$, and so on, until we converge to the $\theta^*$ that maximizes the likelihood $L(\theta)$. Two questions arise: when does the lower bound $J(z, Q)$ equal $L(\theta)$ at this $\theta$? And why is convergence guaranteed?
The first question first (for the second, see 2.3). Jensen's inequality says that equality holds when the random variable is a constant. Here, that means:

$$
\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c
$$

where $c$ does not depend on $z^{(i)}$. Because $Q_i$ is a probability distribution of the random variable $z^{(i)}$, we have $\sum_{z} Q_i(z) = 1$. Writing the numerator as $c \, Q_i(z^{(i)})$ and summing both sides over all values of $z^{(i)}$ shows that the sum of the numerators equals $c$, that is, $\sum_{z} p(x^{(i)}, z; \theta) = c$. Then:

$$
Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)}
= \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)}
= p(z^{(i)} \mid x^{(i)}; \theta)
$$

At this point we have derived the formula for $Q(z)$: with the parameter $\theta$ fixed, the choice that pulls the lower bound all the way up is the posterior probability. This settles how $Q(z)$ is chosen. This step is the E-step, which establishes a lower bound on $L(\theta)$ that touches $L(\theta)$ at the current $\theta$. The following M-step adjusts $\theta$ with $Q(z)$ given, to maximize the lower bound $J$ of $L(\theta)$ (with $Q(z)$ fixed, the lower bound can still be raised further).
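The equality condition can be verified on a toy case. In this sketch the numbers in `joint` are assumed values of $p(x, z; \theta)$ for one sample at a fixed $\theta$. Choosing $Q$ as the posterior makes the ratio $p(x, z; \theta) / Q(z)$ the same constant for every $z$, and the lower bound then coincides with the log-likelihood.

```python
import math

# Assumed joint probabilities p(x, z=0) and p(x, z=1) for one observed x.
joint = [0.12, 0.28]

# E-step choice: Q(z) = p(x, z) / sum_z p(x, z) = p(z | x).
evidence = sum(joint)                       # p(x; theta), the constant c
posterior = [p / evidence for p in joint]

# With this Q, the ratio p(x, z) / Q(z) equals c for every z.
ratios = [p / q for p, q in zip(joint, posterior)]

# The lower bound sum_z Q(z) * log(p(x, z) / Q(z)) now equals log p(x; theta).
bound = sum(q * math.log(p / q) for p, q in zip(joint, posterior))

print(ratios, bound, math.log(evidence))
```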
2.3 EM Algorithm Overview
The expectation-maximization algorithm is a maximum likelihood estimation method for the parameters of a probabilistic model when the data are incomplete or have missing values (that is, when there are latent variables).
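The EM algorithm proceeds as follows:
- Initialize the distribution parameters $\theta$;
- Repeat the following steps until convergence:
E-step: using the initial parameter values or the model parameters from the previous iteration, compute the posterior probability of the latent variable (in effect, the expectation of the latent variable) and use it as the current estimate of the latent variable:

$$
Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)
$$

M-step: maximize the lower bound of the likelihood function to obtain new parameter values:

$$
\theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
$$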
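The two steps can be sketched for a one-dimensional mixture of two Gaussians. This is an illustrative sketch, not the post's own implementation: the synthetic data, the function names (`e_step`, `m_step`, `fit_gmm`), the initialization at the data minimum and maximum, the fixed iteration count, and the small variance floor are all assumptions made for the example.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(xs, weights, mus, variances):
    """Posterior Q_i(z) = p(z | x_i; theta) for every sample."""
    resp = []
    for x in xs:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, mus, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(xs, resp):
    """Closed-form parameters that maximize the lower bound for fixed Q."""
    n, k = len(xs), len(resp[0])
    weights, mus, variances = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)                 # soft count of component j
        mu = sum(r[j] * x for r, x in zip(resp, xs)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nj
        weights.append(nj / n)
        mus.append(mu)
        variances.append(max(var, 1e-6))             # floor: a safeguard assumption
    return weights, mus, variances

def fit_gmm(xs, init_mus, n_iter=50):
    k = len(init_mus)
    weights, mus, variances = [1.0 / k] * k, list(init_mus), [1.0] * k
    for _ in range(n_iter):
        resp = e_step(xs, weights, mus, variances)
        weights, mus, variances = m_step(xs, resp)
    return weights, mus, variances

# Synthetic data: two clusters centered near 0 and 6 (assumption).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(6.0, 1.0) for _ in range(200)])

weights, mus, variances = fit_gmm(data, init_mus=[min(data), max(data)])
print(weights, mus, variances)
```

A fixed number of iterations keeps the sketch short; a practical version would stop when the log-likelihood stops improving.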
Next we look at the M-step and answer why the algorithm converges.
By iterating in this way, we obtain the parameter $\theta$ that maximizes the likelihood function $L(\theta)$. That brings us to the second question: does it converge?
Intuitively, because the lower bound keeps rising, the likelihood increases monotonically, so we eventually reach a maximum of the likelihood. Analytically, we obtain the following:

$$
\begin{aligned}
L(\theta^{(t+1)}) &\ge \sum_{i} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} && (4) \\
&\ge \sum_{i} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} && (5) \\
&= L(\theta^{(t)}) && (6)
\end{aligned}
$$

The referenced blog post explains how this comes about:
In step (4), when $\theta^{(t+1)}$ is obtained, we have only maximized the lower bound of $L(\theta^{(t+1)})$ without making equality hold; equality holds only when $\theta$ is fixed and $Q_i$ is obtained by the E-step.
Moreover, the inequality we derived earlier, $L(\theta) \ge \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$, holds for all $Q_i$ and $\theta$.
Step (5) uses the definition of the M-step: the M-step adjusts $\theta^{(t)}$ to $\theta^{(t+1)}$ so as to maximize the lower bound. Therefore (5) holds, and (6) is the equality established earlier by the E-step.
This proves that $L(\theta)$ increases monotonically. One convergence criterion is that $L(\theta)$ no longer changes; another is that it changes by only a small amount.
To explain (4), (5), and (6) once more: inequality (4) holds for all parameters, but its equality condition holds only when $\theta$ is fixed and $Q$ is adjusted accordingly. Step (4), however, fixes $Q$ and adjusts $\theta$, so equality cannot be guaranteed. Going from (4) to (5) is the definition of the M-step, and going from (5) to (6) is the equality condition established by the preceding E-step. In other words, the E-step pulls the lower bound up to the same height as $L(\theta)$ at a particular value (here $\theta^{(t)}$). We then find that the lower bound can still rise, so after the M-step the lower bound is pulled higher, but not to the same height as $L(\theta)$ at the new value $\theta^{(t+1)}$. The next E-step then pulls the lower bound up to the height of $L(\theta)$ at this new value, and the process repeats until a maximum is reached.
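The monotonic increase and the "small change" stopping rule can be observed on a toy model. The sketch below uses an assumed example that is not in the original post: each sample is the number of heads in `n` tosses of one of two coins picked with equal probability, the latent variable is which coin was used, and the head counts, starting values, and tolerance `tol` are made up for illustration.

```python
import math

def binom_pmf(h, n, p):
    return math.comb(n, h) * p ** h * (1 - p) ** (n - h)

def log_likelihood(heads, n, p_a, p_b):
    """L(theta) for an equal-weight mixture of two coins."""
    return sum(math.log(0.5 * binom_pmf(h, n, p_a) + 0.5 * binom_pmf(h, n, p_b))
               for h in heads)

def em_step(heads, n, p_a, p_b):
    # E-step: posterior probability that each sample came from coin A.
    resp_a = []
    for h in heads:
        joint_a = 0.5 * binom_pmf(h, n, p_a)
        joint_b = 0.5 * binom_pmf(h, n, p_b)
        resp_a.append(joint_a / (joint_a + joint_b))
    # M-step: responsibility-weighted fraction of heads for each coin.
    new_a = sum(r * h for r, h in zip(resp_a, heads)) / (n * sum(resp_a))
    new_b = (sum((1 - r) * h for r, h in zip(resp_a, heads))
             / (n * sum(1 - r for r in resp_a)))
    return new_a, new_b

heads, n = [5, 9, 8, 4, 7], 10      # assumed head counts out of 10 tosses
p_a, p_b = 0.6, 0.5                 # assumed initial parameters
tol = 1e-8                          # stop when L(theta) changes by less than this

history = [log_likelihood(heads, n, p_a, p_b)]
for _ in range(1000):
    p_a, p_b = em_step(heads, n, p_a, p_b)
    history.append(log_likelihood(heads, n, p_a, p_b))
    if history[-1] - history[-2] < tol:
        break

print(len(history) - 1, "iterations; final log-likelihood", history[-1])
```

Each entry of `history` is at least as large as the one before it (up to floating-point rounding), which is the chain (4), (5), (6) in action.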
In summary:
- Making the step from (2) to (3) hold with "=" is the "maximize the lower bound of the function over one variable" step, i.e. the E-step.
- Then, with $z$ or $Q(z)$ fixed, increasing the value of (3) is the "take the derivative with respect to one variable" step, i.e. the M-step. Compare K-means: taking the mean of each class's data as the centroid is this kind of step, maximizing a probability with the class membership of the data held fixed (one variable fixed).
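The K-means comparison can be sketched in one dimension. In this reading (an interpretation, with made-up points and starting centroids), assigning each point to its nearest centroid plays the role of a hard E-step, and recomputing each centroid as the mean of its points plays the role of the M-step. The function names are hypothetical.

```python
def assign(xs, centroids):
    """E-like step: label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in xs]

def update(xs, labels, centroids):
    """M-like step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, label in zip(xs, labels) if label == j]
        # Keep the old centroid if no point was assigned to it.
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids

def kmeans(xs, centroids, max_iter=100):
    labels = assign(xs, centroids)
    for _ in range(max_iter):
        new_centroids = update(xs, labels, centroids)
        if new_centroids == centroids:      # nothing moved: converged
            break
        centroids = new_centroids
        labels = assign(xs, centroids)
    return centroids, labels

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]     # assumed toy data
centroids, labels = kmeans(points, centroids=[0.0, 6.0])
print(centroids, labels)
```

Unlike the soft posterior $Q(z)$ used by EM, each point here belongs entirely to one cluster, which is why K-means is often described as a hard-assignment variant.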
This article draws on the following excellent blog posts and clarifies some of their logic:
http://blog.csdn.net/zouxy09/article/details/8537620
http://www.cnblogs.com/jerrylead/archive/2011/04/06/2006936.html
Some parts are still a little abstract, but I believe that with continued study and application to real scenarios they will become clearer. Let's encourage each other.