Machine Learning: A Detailed Derivation and Explanation of the EM Algorithm
I don't feel like studying anything new today, so let me reheat some leftovers and talk about the EM algorithm, known as one of the top ten algorithms in machine learning. The article contains some of my personal understanding; if there are errors or omissions, I would be grateful if readers pointed them out.
It is well known that maximum likelihood estimation is a widely used parameter estimation method. For example, suppose I have some height data for people from Northeast China, and I know that height follows a Gaussian distribution. I can then estimate the two parameters of that Gaussian, the mean and the variance, by maximizing the likelihood function. Basically every probability textbook covers this method, so I won't say more; if it's unclear, please look it up on Baidu.
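As a minimal sketch of that textbook method: for a Gaussian, the maximum likelihood estimates are simply the sample mean and the sample variance with a divisor of n. The function name `gaussian_mle` and the numbers 175 and 6 are made up for illustration; they are not real height statistics.

```python
import math
import random

def gaussian_mle(samples):
    """Return the maximum likelihood estimates (mean, variance) of a 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    # The MLE of the variance divides by n, not by n - 1.
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, variance

random.seed(0)
# Synthetic "heights" drawn from a Gaussian with invented parameters.
heights = [random.gauss(175.0, 6.0) for _ in range(5000)]
mean_hat, var_hat = gaussian_mle(heights)
print(mean_hat, math.sqrt(var_hat))
```

With a few thousand samples the estimates should land close to the parameters used to generate the data.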
However, the situation I face now is different: my data is a mixed collection of heights from Sichuan and from the Northeast, and no individual data point is labeled as coming from a "Northeasterner" or a "Sichuanese". If I drew the probability density of this data set, it would look roughly like this, a curve with two peaks:
All right, no roasting please. This is the best I can draw, and I really did try. = =
In fact, this bimodal probability density function is a model in its own right, called the Gaussian mixture model (GMM), written as:
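In the usual notation, a K-component Gaussian mixture can be written as below. The symbols φ (a single Gaussian density) and θ_k = (μ_k, σ_k²) are assumed here; only α, θ and K are named in the text.

```latex
p(x;\theta) = \sum_{k=1}^{K} \alpha_k \, \phi(x;\theta_k),
\qquad \alpha_k \ge 0, \qquad \sum_{k=1}^{K} \alpha_k = 1,
\qquad \phi(x;\theta_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}
  \exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)
```

For the Sichuan and Northeast example, K = 2.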
Adding formulas to a blog is really laborious = = This model is easy to understand: it is a weighted combination of K Gaussian models, where α is the weight of each Gaussian component and θ is its parameters. Parameter estimation for a GMM is usually done with the EM algorithm. More generally, the EM algorithm is suited to estimating probability models that contain hidden variables. What is a hidden variable? It is a variable that is not observed. In the Sichuan and Northeast example above, whether a given height comes from Sichuan or from the Northeast is a hidden variable.
Why use EM? Let's look at the problem above in detail. If we use maximum likelihood estimation, which is the simplest idea and the first one that comes to mind, then the likelihood function we need to maximize should be this:
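Written out in log form, and assuming m independent samples x_1, ..., x_m (the symbol ℓ and the indexing are notation chosen here), that likelihood is:

```latex
\ell(\theta) = \sum_{i=1}^{m} \log p(x_i;\theta)
```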
However, we do not know the expression for P(x;θ). Some readers will say: "I know it! Isn't it just the Gaussian mixture model above? It only has a few more parameters."
Think it over carefully. The θ in the GMM consists of two parts, one for Sichuan and one for the Northeast. If you want to estimate the mean height of Sichuan people and build the likelihood function directly from the GMM, people from both Sichuan and the Northeast are all taken into account, which is obviously inappropriate.
Another idea is to take the hidden variable into account. If we already knew which samples came from Sichuan and which came from the Northeast, everything would be fine. Use z=0 or z=1 to mark which population a sample comes from; z is the hidden variable, and the likelihood function to be maximized becomes:
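With the hidden variable made explicit, each sample's density is the joint density summed over the possible values of z. In the same assumed notation (m samples, log-likelihood ℓ), this reads:

```latex
\ell(\theta) = \sum_{i=1}^{m} \log p(x_i;\theta)
             = \sum_{i=1}^{m} \log \sum_{z} p(x_i, z;\theta)
```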
That is of no use, however, because the hidden variable is unknown. To estimate whether a sample comes from Sichuan or the Northeast, we need the model parameters; to estimate the model parameters, we first need to know how likely it is that a sample comes from Sichuan or the Northeast ...
Is this the chicken or the egg?
No. Our way out is to make an assumption. First, assume some model parameter θ; then the probability P(zi) that each sample comes from Sichuan or from the Northeast can be computed. Since p(xi,zi) = p(xi|zi)p(zi), where x|z=0 follows the Sichuan distribution and x|z=1 follows the Northeast distribution, the likelihood function can be written as a function of θ, and maximizing it gives a new θ. The new θ fits the data better than the original one, because it takes into account which distribution each sample comes from. With this better θ we recompute the probability that each sample comes from Sichuan or from the Northeast; computed with a better θ, these probabilities are more accurate. With more accurate information we can re-estimate θ as before, and naturally this θ will be better than the last one. And so it goes, getting better and better, until convergence (the parameters no longer change noticeably). In theory, that completes the EM algorithm.
However, things are not that simple. The idea above works in theory but not in practice, mainly because the likelihood function contains a "log of a sum" term: inside the log is a summation, and its derivative is not a pretty sight. If you push through it head-on, you face a fraction whose denominator is "the sum of two normal probability density functions" and whose numerator is "the two normal densities, each differentiated, then added". Add up m of these terms (one per sample), set the result to 0, and solve analytically for θ? You would be overrating your own maths level.
What to do? First we introduce an inequality, called Jensen's inequality, which says:
Let X be a random variable and f(X) a convex function (its second derivative is greater than or equal to 0). Then:
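In symbols, with E denoting expectation, Jensen's inequality for a convex f is:

```latex
E\left[f(X)\right] \;\ge\; f\left(E[X]\right)
```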
When f is strictly convex, equality holds if and only if X is a constant.
If f(X) is a concave function, the inequality sign is reversed.
As for this inequality, I intend neither to prove it nor to explain it; I hope you will just accept that it is correct.
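For anyone who would rather see it than take it on faith, here is a quick numerical check (not a proof). The helper name `jensen_gap` and the choice of x² and log as test functions are mine, for illustration only.

```python
import math
import random

def jensen_gap(f, xs):
    """Return E[f(X)] - f(E[X]) for the empirical distribution of xs."""
    mean_x = sum(xs) / len(xs)
    mean_fx = sum(f(x) for x in xs) / len(xs)
    return mean_fx - f(mean_x)

random.seed(0)
xs = [random.uniform(0.5, 3.0) for _ in range(10000)]

# Convex function x**2: the gap should be non-negative.
gap_convex = jensen_gap(lambda x: x * x, xs)
# Concave function log: the gap should be non-positive.
gap_concave = jensen_gap(math.log, xs)
# Constant X: the gap should vanish.
gap_constant = jensen_gap(lambda x: x * x, [2.0] * 100)

print(gap_convex, gap_concave, gap_constant)
```

The sign flips between the convex and the concave case, and the gap disappears when X is constant, which is exactly the equality condition.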
Jensen's inequality has barged in out of nowhere, and of course the point is to use it to get out of the predicament above; otherwise why bring it up. Maximizing the likelihood function directly is not feasible, so if we can find a tight lower bound of the likelihood function, optimize that instead, and guarantee that every iteration increases the overall likelihood, the effect is the same. How so? A picture makes it clear:
The drawing is not great, so please bear with me. The horizontal axis is the parameter and the vertical axis is the likelihood function. First we initialize θ1 and use it to find a tight lower bound of the likelihood function, which is the first short black curve in the figure. The values on the short black curve are never greater than the likelihood function, but equality holds at at least one point (which is why it is called a tight lower bound). Maximizing the short black curve takes us at least as high as the point where it touches the likelihood function, and the corresponding horizontal coordinate is our new θ2. Keep going like this: as long as each update of θ makes the maximum of the short black curve larger than the last one, the algorithm converges and finally reaches a maximum of the likelihood function (possibly only a local one).
To construct this short black curve, we rely on Jensen's inequality. Note that the log function here is concave, so we use the concave version of the inequality, with the sign reversed. To apply Jensen's inequality, the content inside the log must be written as a mathematical expectation. Notice that the sum inside the log is a sum over the hidden variable z, so naturally this expectation must be taken with respect to z. If Q(z) is a probability distribution over z (one for each sample), the expectation can be constructed as follows:
There are a lot of formulas and I won't type them out; here is the content straight from my PPT:
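The construction amounts to multiplying and dividing by Q(z) inside the sum. For a fixed sample x_i, and assuming Q(z) > 0 wherever the joint density is positive, it can be written as:

```latex
\sum_{z} p(x_i, z;\theta)
  = \sum_{z} Q(z)\,\frac{p(x_i, z;\theta)}{Q(z)}
  = E_{z \sim Q}\!\left[\frac{p(x_i, z;\theta)}{Q(z)}\right]
```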
So inside the log we have actually constructed a random variable Y. Y is a function of z, and Y takes the value p/Q with probability Q. That much is clear.
Having constructed the expectation, the next step is to bound it using Jensen's inequality:
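Because log is concave, Jensen's inequality gives log E[Y] ≥ E[log Y]. For a fixed sample x_i, with Y = p(x_i, z; θ)/Q(z) as the random variable, that step reads:

```latex
\log \sum_{z} Q(z)\,\frac{p(x_i, z;\theta)}{Q(z)}
  \;\ge\; \sum_{z} Q(z)\,\log \frac{p(x_i, z;\theta)}{Q(z)}
```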
With this step done, let's look at the whole expression:
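Putting the pieces together over all m samples gives the chain below. The subscript in Q_i(z) is added here to make explicit that each sample x_i has its own distribution over z; ℓ and m are the notation assumed earlier for the log-likelihood and the sample count.

```latex
\ell(\theta)
  = \sum_{i=1}^{m} \log \sum_{z} p(x_i, z;\theta)
  = \sum_{i=1}^{m} \log \sum_{z} Q_i(z)\,\frac{p(x_i, z;\theta)}{Q_i(z)}
  \;\ge\; \sum_{i=1}^{m} \sum_{z} Q_i(z)\,\log \frac{p(x_i, z;\theta)}{Q_i(z)}
```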
In other words, we have found a lower bound of the likelihood function. Can we go ahead and optimize it? Not yet: as said above, the lower bound must be tight, meaning that equality must hold at at least one point. By Jensen's inequality, equality holds when the random variable is a constant. Specifically, here that means:
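For a fixed sample x_i and the current θ, the random variable inside the log is p(x_i, z; θ)/Q(z), so the equality condition can be written as follows, where c is a constant that does not depend on z:

```latex
\frac{p(x_i, z;\theta)}{Q(z)} = c \quad \text{for every } z
```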
And because Q(z) is a probability distribution over z, we have:
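That is, the values of Q(z) must sum to one:

```latex
\sum_{z} Q(z) = 1
```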
Multiplying c over to the other side and summing over z, we find that c is the sum of P(xi,z) over z, so we finally get:
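Spelled out for a fixed sample x_i: writing the equality condition as Q(z) = p(x_i, z; θ)/c and summing both sides over z gives c = Σ_z p(x_i, z; θ), so:

```latex
Q(z) = \frac{p(x_i, z;\theta)}{\sum_{z'} p(x_i, z';\theta)}
     = \frac{p(x_i, z;\theta)}{p(x_i;\theta)}
     = p(z \mid x_i;\theta)
```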
Once we have Q(z), we are done. Q(z) is P(zi|xi), also written P(zi) above; they are the same thing, namely the probability that the i-th data point comes from zi.
So the EM algorithm emerges, and it works like this:
First, initialize the parameter θ.
(1) E-step: given the parameter θ, compute the probability that each sample belongs to zi, that is, the probability that this height comes from Sichuan or from the Northeast. This probability is Q.
(2) M-step: using the computed Q, form the lower bound of the likelihood function, which contains θ, and maximize it to obtain the new parameter θ.
Repeat (1) and (2) until convergence. As you can see, the idea is no different from the one at the beginning; it is just that maximizing the likelihood function directly is hard, so we take a roundabout route.
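The two steps can be sketched for a one-dimensional Gaussian mixture as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`normal_pdf`, `log_likelihood`, `em_gmm_1d`), the fixed iteration count, the initial values and the synthetic "heights" are all invented here, and the M-step uses the well-known closed-form updates for a GMM.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, alphas, mus, variances):
    """Sum over samples of log( sum_k alpha_k * N(x; mu_k, var_k) )."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)))
        for x in xs
    )

def em_gmm_1d(xs, alphas, mus, variances, n_iter=100):
    """Run n_iter EM iterations; return the parameters and the log-likelihood trace."""
    alphas, mus, variances = list(alphas), list(mus), list(variances)
    n, k = len(xs), len(mus)
    history = [log_likelihood(xs, alphas, mus, variances)]
    for _ in range(n_iter):
        # E-step: q[i][j] is the posterior probability that sample i
        # came from component j under the current parameters.
        q = []
        for x in xs:
            joint = [alphas[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            q.append([p / total for p in joint])
        # M-step: closed-form maximizer of the lower bound for a GMM.
        for j in range(k):
            weight = sum(q[i][j] for i in range(n))
            mus[j] = sum(q[i][j] * xs[i] for i in range(n)) / weight
            variances[j] = sum(q[i][j] * (xs[i] - mus[j]) ** 2 for i in range(n)) / weight
            alphas[j] = weight / n
        history.append(log_likelihood(xs, alphas, mus, variances))
    return alphas, mus, variances, history

random.seed(0)
# Two unlabeled groups with made-up parameters, mixed into one data set.
xs = [random.gauss(162.0, 5.0) for _ in range(600)] + \
     [random.gauss(180.0, 5.0) for _ in range(400)]
alphas, mus, variances, history = em_gmm_1d(xs, [0.5, 0.5], [155.0, 185.0], [50.0, 50.0])
print(alphas, mus, variances)
```

The `history` list records the log-likelihood after every iteration, which makes it easy to watch it climb. A real implementation would normally stop on a convergence threshold rather than a fixed count, and would guard against a component's variance collapsing toward zero.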
As for why this iteration guarantees that the likelihood function increases monotonically, that is, the convergence proof of the EM algorithm, I won't write it out here; I may add it later when I have time. It should be noted that the EM algorithm converges in general, but it is not guaranteed to converge to the global optimum; it may get stuck in a local optimum. The EM algorithm is used in Gaussian mixture models and hidden Markov models, and it is one of the famous top ten algorithms in data mining.
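For readers who want the gist of the monotonicity argument in the meantime, the usual reasoning fits in one line. This is only a sketch: the name J(Q, θ) for the lower bound and the superscript (t) for the iteration index are introduced here, and ℓ denotes the log-likelihood.

```latex
J(Q,\theta) = \sum_{i=1}^{m} \sum_{z} Q_i(z)\,\log \frac{p(x_i, z;\theta)}{Q_i(z)},
\qquad
\ell\big(\theta^{(t+1)}\big) \;\ge\; J\big(Q^{(t)},\theta^{(t+1)}\big)
  \;\ge\; J\big(Q^{(t)},\theta^{(t)}\big) \;=\; \ell\big(\theta^{(t)}\big)
```

The first inequality holds because J is a lower bound for any Q, the second because the M-step maximizes J over θ, and the final equality because the E-step chooses Q so that the bound is tight at the current θ.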
That's all~ Corrections and differing opinions are welcome in the comments. My derivation and line of thought are not exactly the same as others you will find online; opinions differ~