First, basic understanding
EM stands for expectation-maximization. The name is very literal: it simply strings together the names of the algorithm's two steps, an E-step that computes an expectation and an M-step that maximizes.
The EM algorithm is an iterative algorithm, presented by Dempster and others in 1977, for maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probabilistic models with hidden variables. Note that EM targets problems with hidden variables: it plays the same role as maximum likelihood estimation, but ordinary maximum likelihood estimation cannot directly handle a hidden variable, and that is why the EM algorithm exists. Conversely, if all the variables of the probabilistic model are observed, the model parameters can be estimated directly from the given data by maximum likelihood or Bayesian estimation. (The EM algorithm is sensitive to its initial values, and it is not guaranteed to find the global optimum.)
When I first studied this algorithm, many questions kept nagging at me and I could never quite get it. The many introductions out there each have their own understanding and focus, so I decided to write my own version. The questions I ran into were roughly the following (I will keep adding to the list); if you are stuck on similar ones, this post may be worth a look:
- Why iterate? What problem does iterating solve?
- What on earth is the E-step?
- And what is the M-step doing?
Second, introducing the problem
When introducing theory, examples are always the most approachable way in, so we start with the well-known example (the three-coin model) from Li Hang's book:
Suppose there are 3 coins, labeled A, B, C, whose probabilities of landing heads are $\pi, p, q$ respectively. We run the following experiment: first toss coin A and use its result to choose coin B or coin C (heads selects coin B, tails selects coin C); then toss the selected coin and record the result, 1 for heads and 0 for tails. Repeat the experiment independently 10 times (this generalizes to n repetitions). The results are as follows:
$$1,1,0,1,0,0,1,0,1,1$$
So here is the problem to solve: you are given only the result sequence, such as $1,1,0,1,0,0,1,0,1,1$ above, and told the rules, but you do not get to watch the coin tosses themselves, and you must estimate the probability of heads for each of the three coins. In other words: I tossed the coins like this, here are the results, now find the specific values of $\pi, p, q$. You could, for example, guess from experience that all three coins are fair, namely $\pi=0.5, p=0.5, q=0.5$. That is allowed, of course, but it makes no use of the result data (the 0/1 sequence) given above; by using the data together with some probability theory we may be able to guess more accurately. For that more accurate guess we need the EM algorithm.
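To make the experiment concrete, here is a minimal Python sketch of the tossing procedure. The function name `toss_three_coins` and the `seed` argument are my own illustrative choices, not something from the book. The point to notice is that the result of coin A is used inside the loop but never recorded, which is exactly the information we lose.

```python
import random

def toss_three_coins(n, pi, p, q, seed=0):
    """Simulate n rounds of the three-coin experiment; return only the visible results."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    observed = []
    for _ in range(n):
        # Hidden step: coin A decides which coin is tossed next. Never recorded.
        a_is_heads = rng.random() < pi
        heads_prob = p if a_is_heads else q
        # Visible step: toss the selected coin, record 1 for heads and 0 for tails.
        observed.append(1 if rng.random() < heads_prob else 0)
    return observed

ys = toss_three_coins(10, pi=0.5, p=0.5, q=0.5)
print(ys)
```

The estimation problem is the reverse direction: given only a list like `ys`, recover `pi`, `p` and `q`.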
Let us first introduce the meaning of the variables used: $y$ denotes the observed variable, that is, the observed result 0 or 1; $\theta = (\pi, p, q)$ denotes the parameters to be estimated; the random variable $z$ denotes the hidden variable, the unobserved result of tossing coin A. We know whether $y$ is 0 or 1, but we do not know whether $z$ was heads or tails. So the probability that $y$ comes out as 0 or 1 can be written as:
$$P(y|\theta)=\sum_z P(y,z|\theta)=\color{blue}{\sum_z}P(z|\theta)P(y|z,\theta)=\pi p^y(1-p)^{1-y}\color{red}{+}(1-\pi)q^y(1-q)^{1-y} \tag{1}$$
What does this expression mean? In one sentence, it is the law of total probability: given the parameters $\theta$ (the condition), sum over every possible value of the random variable $z$ to get the probability $P(y|\theta)$ that $y$ occurs. For example, take the first value of the sequence, 1 (that is, $y=1$). Then $y=1$ can arise in two ways, and its probability is the sum of the two: ① coin A lands heads, coin B is selected, and coin B lands heads; ② coin A lands tails, coin C is selected, and coin C lands heads. The toss of coin A is the part of the process we cannot see, that is, the hidden variable. Adding the two cases corresponds to the blue summation over $z$ and the red plus sign in the formula above. When $y=1$, the formula becomes:
$$P(y=1|\theta)=\sum_z P(y=1,z|\theta)=\color{blue}{\sum_z}P(z|\theta)P(y=1|z,\theta)=\pi p+(1-\pi)q \tag{2}$$
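As a quick numeric check of the total probability formula, here is a small sketch (the helper name `prob_y` and the parameter values are only illustrative assumptions). It adds the "via coin B" and "via coin C" cases for a single observation.

```python
def prob_y(y, pi, p, q):
    """P(y | theta) for one observation y in {0, 1}: the two hidden cases added up."""
    via_b = pi * (p if y == 1 else 1 - p)        # coin A heads, then coin B shows y
    via_c = (1 - pi) * (q if y == 1 else 1 - q)  # coin A tails, then coin C shows y
    return via_b + via_c

# Illustrative parameter values, chosen arbitrarily.
p_one = prob_y(1, pi=0.4, p=0.6, q=0.7)   # pi*p + (1-pi)*q
p_zero = prob_y(0, pi=0.4, p=0.6, q=0.7)
print(p_one, p_zero, p_one + p_zero)
```

The two probabilities add up to 1, as they must, since $y$ can only be 0 or 1.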
So far we have only understood the probability formula for a single observation, and a special case at that. Now let's pick up the pace ...
We write the observed data as $Y=(y_1,y_2,...,y_n)^T$ and the unobserved data as $Z=(z_1,z_2,...,z_n)^T$, so the likelihood function of the observed data can be expressed as:
$$P(Y|\theta)=\sum_Z P(Y,Z|\theta)=\sum_Z P(Z|\theta)P(Y|Z,\theta) \tag{3}$$
This is really the joint probability of $Y$, that is, $P(Y)=P(y=y_1)*P(y=y_2)*...*P(y=y_n)$, where I have deliberately written out the multiplication sign (*) (it becomes the product symbol in the formula below). Applying the total probability over $z$ to each factor $P(y=y_j)$, the formula above can be rewritten as:
$$P(Y|\theta)=\prod_{j=1}^{n}\left[\pi p^{y_j}(1-p)^{1-y_j}+(1-\pi)q^{y_j}(1-q)^{1-y_j}\right] \tag{4}$$
In other words, we now have the probability $P(Y|\theta)$ of the current observations. Since these observations did occur, and what occurs is reasonable, we look for the set of parameters $\theta$ that makes the observed results as probable as possible, namely:
$$\hat{\theta}=\arg\max_{\theta}\log P(Y|\theta) \tag{5}$$
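To see what this objective looks like in practice, here is a minimal sketch that evaluates $\log P(Y|\theta)$ for the ten observations. The function name `log_likelihood` and the second parameter set are my own illustrative choices. It only evaluates the objective at a given $\theta$; it does not maximize it.

```python
import math

Y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]  # the observed sequence from the experiment

def log_likelihood(ys, pi, p, q):
    """Log of the product over observations, i.e. a sum of per-observation logs."""
    total = 0.0
    for y in ys:
        via_b = pi * (p if y == 1 else 1 - p)
        via_c = (1 - pi) * (q if y == 1 else 1 - q)
        total += math.log(via_b + via_c)
    return total

ll_fair = log_likelihood(Y, 0.5, 0.5, 0.5)   # the "all coins are fair" guess
ll_other = log_likelihood(Y, 0.5, 0.6, 0.6)  # another candidate, picked by hand
print(ll_fair, ll_other)
```

The second candidate scores higher than the all-fair guess, so the data really does carry information that the guess from experience ignores.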
Now we have the objective; next comes how to solve it. However, this problem has no analytic solution and can only be solved iteratively. That is where the EM algorithm comes in. How does the EM algorithm obtain the parameters of this model? Let's go on and see!
In general, $Y$ denotes the data of the observed random variable and $Z$ the data of the hidden random variable. $Y$ and $Z$ together are called the complete data, and the observed data $Y$ alone is also called the incomplete data. Suppose the observed data $Y$ has probability distribution $P(Y|\theta)$, where $\theta$ is the model parameter to be estimated; then the likelihood function of the incomplete data is $P(Y|\theta)$ and the log-likelihood function is $L(\theta)=\log P(Y|\theta)$. Suppose the joint probability distribution of $Y$ and $Z$ is $P(Y,Z|\theta)$; then the log-likelihood function of the complete data is $\log P(Y,Z|\theta)$.
At this point it may still not be clear why we cannot simply maximize formula (5) directly. Is it just because $\log P(Y|\theta)$ does not stand on its own but involves a hidden variable $z$? For example, in the model-training problem for hidden Markov models (see blog post), in the seaweed example we observe the seaweed sequence $O=\{dry,damp,soggy\}$, but the weather obviously has a decisive effect on this visible state, so when we model the seaweed we take the hidden weather factor into account. Likewise, in the three-coin model, for each observed value we do not know whether coin B or coin C was tossed. We are missing information, so we cannot simply apply maximum likelihood estimation and obtain an analytic solution. That should be the innovation of EM.
A first look at the EM algorithm
The EM algorithm first chooses initial values for the parameters, $\theta^{(0)}=(\pi^{(0)},p^{(0)},q^{(0)})$. The initial values can be chosen from experience or set arbitrarily, but we know the EM algorithm is sensitive to them, and later we will see just how much they matter. After setting the initial values, the steps below iteratively update the parameter estimates until convergence. Let the parameter estimate at the $i$-th iteration be $\theta^{(i)}=(\pi^{(i)},p^{(i)},q^{(i)})$; then iteration $i+1$ goes as follows (don't worry if this is unclear for now, we will explain it bit by bit):
E-step: compute the probability, under the model parameters $\theta^{(i)}$, that the observation $y_j$ came from tossing coin B:
$$\mu_j^{(i+1)}=\frac{\pi^{(i)}(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}}{\pi^{(i)}(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}+(1-\pi^{(i)})(q^{(i)})^{y_j}(1-q^{(i)})^{1-y_j}} \tag{6}$$
M-step: compute the new estimates of the model parameters:
$$\pi^{(i+1)}=\frac{1}{n}\sum_{j=1}^{n}\mu_j^{(i+1)} \tag{7}$$
$$p^{(i+1)}=\frac{\sum_{j=1}^{n}\mu_j^{(i+1)}y_j}{\sum_{j=1}^{n}\mu_j^{(i+1)}} \tag{8}$$
$$q^{(i+1)}=\frac{\sum_{j=1}^{n}(1-\mu_j^{(i+1)})y_j}{\sum_{j=1}^{n}(1-\mu_j^{(i+1)})} \tag{9}$$
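Here is a minimal Python sketch of one E-step followed by one M-step, written directly from formulas (6) to (9). The function names `e_step` and `m_step` and the starting values (0.4, 0.6, 0.7) are illustrative assumptions; the sketch also assumes the starting values are strictly between 0 and 1 so that no denominator is zero.

```python
Y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]  # the observed sequence

def e_step(ys, pi, p, q):
    """Formula (6): for each y_j, the probability that it came from coin B."""
    mus = []
    for y in ys:
        via_b = pi * (p if y == 1 else 1 - p)
        via_c = (1 - pi) * (q if y == 1 else 1 - q)
        mus.append(via_b / (via_b + via_c))
    return mus

def m_step(ys, mus):
    """Formulas (7)-(9): new estimates of pi, p, q from the mu_j values."""
    n = len(ys)
    sum_mu = sum(mus)
    new_pi = sum_mu / n
    new_p = sum(mu * y for mu, y in zip(mus, ys)) / sum_mu
    new_q = sum((1 - mu) * y for mu, y in zip(mus, ys)) / (n - sum_mu)
    return new_pi, new_p, new_q

mus = e_step(Y, 0.4, 0.6, 0.7)   # arbitrary starting parameters
theta1 = m_step(Y, mus)
print([round(mu, 4) for mu in mus])
print(tuple(round(t, 4) for t in theta1))
```

With this starting point, each observed 1 gets a value of about 0.3636 and each observed 0 about 0.4706, and the updated parameters come out at roughly (0.4064, 0.5368, 0.6432).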
At this point I was a bit dizzy: everything was fine a moment ago, so how did we suddenly land on these iteration formulas? My state of mind was roughly this:
[Image from the internet; will be removed if it infringes]
In fact, the above only briefly introduced where the objective function comes from and how the solution begins; the EM part itself was over very quickly. Patient readers can work through the intermediate steps one at a time, or you can skip over this big step. Let me add a few details here:
In the E-step formula (6) we computed the probability that the observation came from tossing coin B. Why compute this?
- The E-step computes an expectation, namely that of the unknown variable. Here the unknown variable is which coin, B or C, was tossed at the time, and $\mu$ is its expected value. But why compute the expectation of the unknown variable?
- Why compute the expectation of the unknown variable? Because to make the observed data as probable as possible, we need to know which coin was tossed. Only then can we take the probability distribution of tossing coin B or of tossing coin C, multiply together the probabilities of the observed results (the second-step tosses are independent), and then maximize that probability (the maximum likelihood idea). Unfortunately, we don't know which coin was tossed. What to do? Then at least form an expectation: it is reasonable to estimate whether coin B or coin C was selected at the time.
- So we know what to compute and why we want an expectation, but how do we compute it? We need the model parameters to compute it, yet the model parameters are exactly what we are trying to find. If we already knew them, why would we be wasting our effort here? Awkward: a deadlock. To break it, just set the model parameters (these are the initial values used in the E-step) and use them to compute the expectation.
- We all know that guess is unlikely to be right, but we have now obtained a value for the unknown variable through a seemingly reasonable calculation. As long as that calculation is reasonable, the expected value is certainly more reasonable than the "casually" chosen model parameters. So why not plug this more reasonable value in for the unknown variable, and then maximize the likelihood function of the observed variables to compute (update) the model parameters (this starts the M-step)? Out come new model parameters. The first round of the EM algorithm is now over: we have a computed expectation and computed model parameters.
- But wait, these model parameters differ from the ones we "casually" set. For the moment, take it on faith that they are more reasonable than the "casual" ones (this can be proved later). Now that we have more reasonable model parameters, let's recompute the expected value of the unknown variable with them. It differs from the previous one too. Then recompute the model parameters ... then recompute the expected value ... and so on, over and over, until at some point neither value changes any more (the stopping condition is met). Nothing changes (neither the model parameters nor the unknown variable's expectation), everyone is happy, and we stop.
- At this point, the EM algorithm is over.
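The back-and-forth described in the list above can be sketched as a loop. This is only a minimal illustration: the names `em_step` and `run_em`, the tolerance `tol`, the iteration cap `max_iter`, and the "largest parameter change" stopping rule are my own assumptions rather than anything prescribed by the book.

```python
Y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]  # the observed sequence

def em_step(ys, theta):
    """One E-step (expectations mu_j) followed by one M-step (new pi, p, q)."""
    pi, p, q = theta
    mus = []
    for y in ys:
        via_b = pi * (p if y == 1 else 1 - p)
        via_c = (1 - pi) * (q if y == 1 else 1 - q)
        mus.append(via_b / (via_b + via_c))
    n, sum_mu = len(ys), sum(mus)
    return (
        sum_mu / n,
        sum(mu * y for mu, y in zip(mus, ys)) / sum_mu,
        sum((1 - mu) * y for mu, y in zip(mus, ys)) / (n - sum_mu),
    )

def run_em(ys, theta0, tol=1e-8, max_iter=100):
    """Repeat em_step until no parameter moves by more than tol (assumed stop rule)."""
    theta = theta0
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_theta = em_step(ys, theta)
        change = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if change < tol:
            break
    return theta, iterations

# Two arbitrary starting points, to see how much the initial values matter.
for start in [(0.5, 0.5, 0.5), (0.4, 0.6, 0.7)]:
    theta, iterations = run_em(Y, start)
    print(start, "->", tuple(round(t, 4) for t in theta), "iterations:", iterations)
```

With these ten observations both runs settle almost immediately, at roughly (0.5, 0.6, 0.6) and (0.4064, 0.5368, 0.6432) respectively. Two starting points, two different answers: this is the sensitivity to initial values mentioned earlier.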
Having got this far, I hope you have a new question: fine, I can accept that you go back and forth between the two until convergence, but isn't this just one extra hidden variable? Why bother with iteration and convergence at all? I have learned partial derivatives: take the partial derivative with respect to everything, set the derivatives to 0, and solve! If that is what you think, you are underestimating the problem. Let's explain why we have to solve it iteratively:
In fact, the objective function we want to maximize has the following form; for clarity we add brackets to show the scope of the log function:
$$L(\theta)=\log P(Y|\theta)=\log\sum_Z P(Y,Z|\theta)=\log\left(\sum_Z P(Y|Z,\theta)P(Z|\theta)\right) \tag{10}$$
As you can see, there is a sum inside the log, and that makes the derivative complicated. If you don't believe it, try to picture the derivative of $f(x)=\log(f_1(x)+f_2(x)+...+f_n(x))$. Well, that's the reason!
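To spell the comparison out, here are the two derivatives side by side (the $f_k$ are generic positive, differentiable functions, used only for illustration):

$$\frac{d}{dx}\sum_{k=1}^{n}\log f_k(x)=\sum_{k=1}^{n}\frac{f_k'(x)}{f_k(x)},\qquad \frac{d}{dx}\log\left(\sum_{k=1}^{n}f_k(x)\right)=\frac{\sum_{k=1}^{n}f_k'(x)}{\sum_{k=1}^{n}f_k(x)}$$

In the first case each term involves only its own $f_k$, so the terms can often be handled one at a time. In the second case every $f_k$ appears in the shared denominator, so setting the derivative to zero ties all the terms together. A log with a sum inside it, as in our objective, is the second case.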
After that, we will re-examine the EM algorithm; the second part on the EM algorithm continues in the next post.
EM algorithm (I): introducing the problem