One, Introduction to the EM Algorithm
In EM Algorithm (I), the problem was introduced through the coin-tossing example: we wrote down the objective function of the model, noted that maximum likelihood estimation with latent variables has to be solved with the EM algorithm, and then listed the simple procedure of the EM algorithm. Of course, after all that, the core of the EM algorithm may still have left you confused; we only analyzed it briefly there. Let's first look back at the brief introduction to the EM algorithm:
Input: observed variable data $Y$, latent variable data $Z$, joint distribution $P(Y,Z|\theta)$, conditional distribution $P(Z|Y,\theta)$
Output: Model parameter $\theta$
(1) Choose an initial value $\theta^{(0)}$ for the parameter and begin iterating;
(2) E-step: let $\theta^{(i)}$ denote the estimate of the parameter $\theta$ at the $i$-th iteration; at the $(i+1)$-th iteration, the E-step computes:
$$Q(\theta,\theta^{(i)}) = E_Z\big[\log P(Y,Z|\theta)\,\big|\,\color{red}{Y,\theta^{(i)}}\big] = \sum_Z \log P(Y,Z|\theta)\,\color{red}{P(Z|Y,\theta^{(i)})} \tag{1}$$
(3) M-step: find the $\theta$ that maximizes $Q(\theta,\theta^{(i)})$, which determines the parameter estimate for the $(i+1)$-th iteration, $\theta^{(i+1)}$:
$$\theta^{(i+1)}=\arg\max_\theta Q(\theta,\theta^{(i)}) \tag{2}$$
(4) Repeat steps (2) and (3) until convergence.
The function $Q(\theta,\theta^{(i)})$ in the E-step above is the core of the EM algorithm and is called the Q function.
The Q function is the expectation of the complete-data log-likelihood $\log P(Y,Z|\theta)$ with respect to the conditional probability distribution $P(Z|Y,\theta^{(i)})$ of the unobserved data $Z$, given the observed data $Y$ and the current parameter estimate $\theta^{(i)}$.
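To make this definition concrete, here is a minimal sketch (my own, not from the original post) that evaluates $Q(\theta,\theta^{(i)})$ for a toy mixture of two biased coins: the latent $Z\in\{0,1\}$ picks a coin with probability $\pi$, $Y\in\{0,1\}$ is the outcome of one toss, and $\theta=(\pi,[p_0,p_1])$. All names and numbers below are hypothetical.

```python
import numpy as np

def log_joint(y, z, theta):
    """log P(Y=y, Z=z | theta) for a two-coin mixture:
    Z ~ Bernoulli(pi) chooses a coin, Y ~ Bernoulli(p[z])."""
    pi, p = theta
    log_pz = np.log(pi if z == 1 else 1 - pi)
    log_py_given_z = np.log(p[z] if y == 1 else 1 - p[z])
    return log_pz + log_py_given_z

def posterior(y, theta):
    """P(Z=z | Y=y, theta) for z = 0, 1, via Bayes' rule."""
    joint = np.array([np.exp(log_joint(y, z, theta)) for z in (0, 1)])
    return joint / joint.sum()

def Q(theta, theta_i, ys):
    """Formula (1): sum over z of P(z | y, theta_i) * log P(y, z | theta),
    accumulated over an i.i.d. sample ys."""
    total = 0.0
    for y in ys:
        w = posterior(y, theta_i)                 # weights from the current estimate
        total += sum(w[z] * log_joint(y, z, theta) for z in (0, 1))
    return total

ys = [1, 1, 0, 1, 0]                              # hypothetical observed tosses
theta_i = (0.5, [0.4, 0.7])                       # current estimate: (pi, [p0, p1])
print(Q((0.6, [0.3, 0.8]), theta_i, ys))          # Q evaluated at a candidate theta
```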
Let's unpack the Q function; it packs in several key ideas. First, the Q function is an expectation, which is clear enough. Second, this expectation is the expectation of a function (the complete-data log-likelihood) taken with respect to a probability distribution (the conditional distribution of the unobserved data $Z$ given the observed data and the current parameter). If "the expectation of a function with respect to a probability distribution" sounds unfamiliar, here is a short aside; readers who already understand it can skip ahead:
Knowledge Point one: conditional mathematical expectation
The expectation above, taken with respect to a conditional probability distribution, is called the conditional mathematical expectation.
First, we are already familiar with conditional probability: the probability that the event $\{Y=y_j\}$ occurs given that the event $\{X=x_i\}$ has occurred is written $P\{Y=y_j \mid X=x_i\}$.
Conditional expectation is the expected value of a real random variable with respect to a conditional probability distribution. Let $X$ and $Y$ be discrete random variables; then the conditional expectation of $X$ given the event $\{Y=y\}$ is a function of $y$, obtained by summing over the range of values of $X$:
$$E[X \mid Y=y] = \sum_i x_i\,P(X=x_i \mid Y=y) \tag{3}$$
Personally, I find it helpful to think of this as a weighted average, with the conditional probabilities as the weights.
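As a tiny numerical illustration (my own example, with made-up numbers), the conditional expectation of a discrete $X$ given $Y=y$ is exactly such a weighted average:

```python
import numpy as np

# Hypothetical joint distribution P(X=x, Y=y) on a small grid.
xs = np.array([1.0, 2.0, 3.0])
joint = np.array([[0.10, 0.20],    # rows: X = 1, 2, 3
                  [0.15, 0.25],    # cols: Y = 0, 1
                  [0.05, 0.25]])

def cond_expectation(y_col):
    """E[X | Y=y]: weight each x by P(X=x | Y=y) and sum."""
    p_x_given_y = joint[:, y_col] / joint[:, y_col].sum()   # normalize the column
    return np.sum(xs * p_x_given_y)

print(cond_expectation(0))   # E[X | Y=0]
print(cond_expectation(1))   # E[X | Y=1]
```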
Returning to the Q function and formula (1) in the E-step: the expectation is taken of $\log P(Y,Z|\theta)$ regarded as a function of $Z$, and "given $Y,\theta^{(i)}$" means that the latent variable $Z$ is distributed according to the conditional distribution $P(Z|Y,\theta^{(i)})$; with this in mind, the transformation into the red part of equation (1) is easy to understand. The log-likelihood $\log P(Y,Z|\theta)$ is the complete-data log-likelihood and contains the latent variable $Z$, so we take its conditional mathematical expectation over $Z$, weighting by the conditional probability distribution of $Z$.
After obtaining this conditional expectation over the latent variable in the E-step, we still have to find the model parameter $\theta$ that makes the value of the Q function maximal (by maximum likelihood, e.g. setting the derivative to zero). So, in the M-step, we maximize $Q(\theta,\theta^{(i)})$ to obtain $\theta^{(i+1)}$, completing one iteration $\theta^{(i)} \to \theta^{(i+1)}$. We will see later that each iteration never decreases the likelihood, so the algorithm converges to a local optimum (Section Two below sketches why). Finally, for the stopping condition one usually chooses small positive values $\epsilon_1,\epsilon_2$ and stops once $\|\theta^{(i+1)}-\theta^{(i)}\| < \epsilon_1$ or $\|Q(\theta^{(i+1)},\theta^{(i)})-Q(\theta^{(i)},\theta^{(i)})\| < \epsilon_2$ is satisfied.
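Putting the E-step, M-step, and stopping rule together, here is a minimal EM loop (my own sketch, with hypothetical data): each observation is the number of heads in $n=10$ tosses of a coin chosen at random from two biased coins, the parameters are the two head probabilities, and for simplicity the mixing probability is fixed at $1/2$ rather than estimated.

```python
import numpy as np
from scipy.stats import binom

heads = np.array([5, 9, 8, 4, 7])   # heads observed in each set of tosses (hypothetical)
n = 10                              # tosses per set

def em(p_init, eps=1e-6, max_iter=200):
    pA, pB = p_init
    for _ in range(max_iter):
        # E-step: responsibility of coin A for each set, P(Z=A | y, theta_i),
        # assuming both coins are chosen with probability 1/2.
        like_A = binom.pmf(heads, n, pA)
        like_B = binom.pmf(heads, n, pB)
        wA = like_A / (like_A + like_B)
        wB = 1.0 - wA
        # M-step: maximize Q -> weighted MLE of each coin's head probability.
        pA_new = np.sum(wA * heads) / np.sum(wA * n)
        pB_new = np.sum(wB * heads) / np.sum(wB * n)
        # Stop once the parameter change is below epsilon_1.
        if max(abs(pA_new - pA), abs(pB_new - pB)) < eps:
            return pA_new, pB_new
        pA, pB = pA_new, pB_new
    return pA, pB

print(em((0.6, 0.5)))   # converges to two distinct head probabilities
```

The E-step computes $P(Z|Y,\theta^{(i)})$, the M-step maximizes the resulting Q function in closed form, and the loop stops when $\|\theta^{(i+1)}-\theta^{(i)}\| < \epsilon_1$.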
Two, Derivation of the EM Algorithm
Why can the EM algorithm approximate the maximum likelihood estimate of the observed data? We face a probabilistic model containing latent variables, and the goal is to maximize the log-likelihood function of the observed data (the incomplete data) $Y$ with respect to the parameter $\theta$, i.e. to maximize:
$$L(\theta)=\log P(Y|\theta)=\log\sum_Z P(Y,Z|\theta)=\log\left(\sum_Z P(Y|Z,\theta)P(Z|\theta)\right)\tag{4}$$
The difficulty is that equation (4) contains the unobserved data and a logarithm of a sum (or integral).
The EM algorithm approaches the maximization of $L(\theta)$ gradually, by iteration. Suppose the estimate after the $i$-th iteration is $\theta^{(i)}$. Can a new estimate $\theta$ make $L(\theta)$ increase, i.e. $L(\theta) > L(\theta^{(i)})$, so that we gradually reach the maximum? To answer this, consider the difference between the two:
$$L(\theta)-L(\theta^{(i)})=\log\left(\sum_Z P(Y|Z,\theta)P(Z|\theta)\right)-\log P(Y|\theta^{(i)})\tag{5}$$
Formula (5) needs to be transformed, and the transformation relies on Jensen's inequality.
Knowledge Point two: Jensen's inequality
For our purposes, the needed form is: if $f$ is a concave function (here $f=\log$), then $f\left(\sum_z \lambda_z x_z\right) \ge \sum_z \lambda_z f(x_z)$ for any weights $\lambda_z \ge 0$ with $\sum_z \lambda_z = 1$, with equality when all the $x_z$ are equal.
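A quick numerical sanity check of this form (arbitrary numbers of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=4)        # arbitrary positive values
lam = rng.dirichlet(np.ones(4))          # weights: nonnegative, summing to 1

lhs = np.log(np.sum(lam * x))            # log of the weighted average
rhs = np.sum(lam * np.log(x))            # weighted average of the logs
print(lhs >= rhs)                        # True: Jensen's inequality for the concave log
```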
With a basic understanding of Jensen's inequality, we continue with formula (5). First we transform it: inside the sum of the first term we multiply and divide by $\color{blue}{P(Z|Y,\theta^{(i)})}$. For clarity, this inserted factor is marked in blue and the term $P(Y|\theta^{(i)})$ in green:
$$
\begin{align}
L(\theta)-L(\theta^{(i)}) &= \log\left(\sum_Z\left[\color{blue}{P(Z|Y,\theta^{(i)})}\,\frac{P(Y|Z,\theta)P(Z|\theta)}{\color{blue}{P(Z|Y,\theta^{(i)})}}\right]\right)-\log\color{forestgreen}{P(Y|\theta^{(i)})} \\
&\ge \sum_Z\left[\color{blue}{P(Z|Y,\theta^{(i)})}\log\left(\frac{P(Y|Z,\theta)P(Z|\theta)}{\color{blue}{P(Z|Y,\theta^{(i)})}}\right)\right]-\log\color{forestgreen}{P(Y|\theta^{(i)})} \\
&= \sum_Z\left[\color{blue}{P(Z|Y,\theta^{(i)})}\log\left(\frac{P(Y|Z,\theta)P(Z|\theta)}{\color{blue}{P(Z|Y,\theta^{(i)})}}\right)\right]-\underbrace{\color{blue}{\sum_Z P(Z|Y,\theta^{(i)})}}_{=1}\,\log\color{forestgreen}{P(Y|\theta^{(i)})} \\
&= \sum_Z\left[\color{blue}{P(Z|Y,\theta^{(i)})}\log\left(\frac{P(Y|Z,\theta)P(Z|\theta)}{\color{blue}{P(Z|Y,\theta^{(i)})}\,\color{forestgreen}{P(Y|\theta^{(i)})}}\right)\right]
\end{align}
\tag{6}$$
Here we define
$$B(\theta,\theta^{(i)}) = L(\theta^{(i)}) + \sum_Z\left[\color{blue}{P(Z|Y,\theta^{(i)})}\log\left(\frac{P(Y|Z,\theta)P(Z|\theta)}{\color{blue}{P(Z|Y,\theta^{(i)})}\,\color{forestgreen}{P(Y|\theta^{(i)})}}\right)\right] \tag{7}$$
from which we get:
$$L(\theta) \ge B(\theta,\theta^{(i)}) \tag{8}$$
That is, $B(\theta,\theta^{(i)})$ is a lower bound of $L(\theta)$. Moreover, setting $\theta=\theta^{(i)}$ in formula (7) makes the fraction inside the logarithm equal to $\frac{P(Y,Z|\theta^{(i)})}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}=1$, so the whole sum vanishes and
$$L(\theta^{(i)}) = B(\theta^{(i)},\theta^{(i)})$$
Therefore, any $\theta$ that increases $B(\theta,\theta^{(i)})$ also increases $L(\theta)$. To make $L(\theta)$ grow as much as possible, we choose the $\theta^{(i+1)}$ that maximizes $B(\theta,\theta^{(i)})$, i.e.:
$$\theta^{(i+1)}=\arg\max_\theta B(\theta,\theta^{(i)}) \tag{9}$$
Now compute $\theta^{(i+1)}$, dropping the terms that are constant with respect to $\theta$:
$$
\begin{align}
\theta^{(i+1)} &= \arg\max_\theta\left(L(\theta^{(i)}) + \sum_Z\color{blue}{P(Z|Y,\theta^{(i)})}\log\left(\frac{P(Y|Z,\theta)P(Z|\theta)}{\color{blue}{P(Z|Y,\theta^{(i)})}\,\color{forestgreen}{P(Y|\theta^{(i)})}}\right)\right) \\
&= \arg\max_\theta\left(\sum_Z\color{blue}{P(Z|Y,\theta^{(i)})}\log\big(P(Y|Z,\theta)P(Z|\theta)\big)\right) \\
&= \arg\max_\theta\left(\sum_Z\color{blue}{P(Z|Y,\theta^{(i)})}\log P(Y,Z|\theta)\right) \\
&= \arg\max_\theta Q(\theta,\theta^{(i)})
\end{align}
\tag{10}
$$
Equation (10) shows that one iteration of the EM algorithm is exactly the computation of the Q function and its maximization. In other words, the EM algorithm maximizes the log-likelihood function approximately by repeatedly constructing and maximizing a lower bound.
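A consequence worth noting explicitly (my own addition, using only inequality (8), the equality $L(\theta^{(i)})=B(\theta^{(i)},\theta^{(i)})$, and the definition (9)) is that the iteration can never decrease the log-likelihood:

$$L(\theta^{(i+1)}) \;\ge\; B(\theta^{(i+1)},\theta^{(i)}) \;\ge\; B(\theta^{(i)},\theta^{(i)}) \;=\; L(\theta^{(i)})$$

where the middle inequality holds because $\theta^{(i+1)}$ maximizes $B(\cdot,\theta^{(i)})$.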
Three, Applications of the EM Algorithm
The EM algorithm has many applications in tasks such as classification, regression, and sequence labeling. The most widely known uses are training the Gaussian mixture model (GMM) and the learning problem of hidden Markov models (HMM), among others.
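As one concrete application, here is a compact sketch (my own, with synthetic data) of EM for a one-dimensional two-component Gaussian mixture: the E-step computes the responsibilities $P(Z|Y,\theta^{(i)})$ and the M-step re-estimates the weights, means, and variances in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data drawn from two Gaussians (hypothetical example).
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial parameters theta^(0): mixing weights, means, variances.
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities gamma[k, j] = P(Z=k | y_j, theta_i).
    dens = np.stack([w[k] * normal_pdf(y, mu[k], var[k]) for k in range(2)])
    gamma = dens / dens.sum(axis=0)
    # M-step: maximize Q -> closed-form updates for each component.
    Nk = gamma.sum(axis=1)
    w = Nk / len(y)
    mu = (gamma @ y) / Nk
    var = (gamma * (y - mu[:, None]) ** 2).sum(axis=1) / Nk

print(w, mu, var)   # should recover weights near [0.6, 0.4] and means near [-2, 3]
```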
EM Algorithm (II): Derivation of the Algorithm