EM Algorithm 1: Principle


The EM algorithm is used for maximum likelihood estimation of the parameters of probabilistic models with latent variables. What is a probabilistic model with latent variables? For example, suppose there are 3 coins, denoted A, B, and C, whose probabilities of landing heads are r, p, and q respectively. Each trial proceeds as follows: first toss coin A; if it lands heads, toss B; if it lands tails, toss C. Record heads as 1 and tails as 0. Ten independent trials produce the results: 1101001011. Given only this result, without knowing the process, how do we estimate r, p, q? That is, we can see the outcome of each trial, but we do not know whether it was produced by B or C; in other words, the result of A is unknown. This is the so-called latent variable. If the observed variable is denoted Y and the latent variable (the result of A) is denoted Z, then the likelihood function of the observed data is:

\(P(Y|\theta) = \prod_i \left[\, r p^{y_i} (1-p)^{1-y_i} + (1-r)\, q^{y_i} (1-q)^{1-y_i} \right]\)
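To make the model concrete, here is a minimal Python sketch (not from the original article) that evaluates this likelihood for the observed sequence, with arbitrary guess values for r, p, q:

```python
# Likelihood of the three-coin model:
# P(Y|theta) = prod_i [ r p^{y_i} (1-p)^{1-y_i} + (1-r) q^{y_i} (1-q)^{1-y_i} ]
def three_coin_likelihood(y, r, p, q):
    likelihood = 1.0
    for yi in y:
        via_b = r * (p if yi == 1 else 1 - p)        # coin A heads -> outcome from B
        via_c = (1 - r) * (q if yi == 1 else 1 - q)  # coin A tails -> outcome from C
        likelihood *= via_b + via_c
    return likelihood

data = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]  # the observed sequence 1101001011
print(three_coin_likelihood(data, r=0.5, p=0.6, q=0.4))  # arbitrary guess values
```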

Generalizing the above model: given observed data \(\{x_1, x_2, \ldots, x_m\}\), generated by a model with observed variable x and latent variable z and model parameter \(\theta\), we want to maximize the following log-likelihood:

\(L(\theta) = \displaystyle\sum_{i=1}^{m} \log p(x_i;\theta) = \displaystyle\sum_{i=1}^{m} \log \sum_{z_i} p(x_i, z_i;\theta)\)

It is very difficult to solve this optimization problem directly. The EM algorithm solves it iteratively, alternating between an expectation step (E-step) and a maximization step (M-step). Its main idea is to find a lower bound on the objective function, then improve that lower bound step by step to obtain a solution; the solution, however, is not necessarily the global optimum.

Let's take a look at how the lower bound is derived:

\(\displaystyle\sum_{i=1}^{m} \log p(x_i;\theta)\)

\(= \displaystyle\sum_{i=1}^{m} \log \sum_{z_i} p(x_i, z_i;\theta)\)

For each i, suppose \(Q_i\) is a probability distribution over z:

\(= \displaystyle\sum_{i=1}^{m} \log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i;\theta)}{Q_i(z_i)}\)

\(\geq \displaystyle\sum_{i=1}^{m} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i;\theta)}{Q_i(z_i)}\) ---- (eq1)

This step uses Jensen's inequality: because the log function is concave (its second derivative is negative), we have \(\log(E[X]) \geq E[\log(X)]\). Since \(Q_i\) is a probability distribution, \(\sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i;\theta)}{Q_i(z_i)}\) can be viewed as an expectation, and applying Jensen's inequality [2] gives the result above.
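As a quick numeric sanity check of \(\log(E[X]) \geq E[\log(X)]\), here is a short sketch using a few arbitrary positive values as a stand-in for X:

```python
import math

# Jensen's inequality for the concave log: log(E[X]) >= E[log(X)].
xs = [0.5, 1.0, 2.0, 4.0]  # arbitrary positive sample values, uniform weights
log_of_mean = math.log(sum(xs) / len(xs))             # log(E[X]) ~= 0.629
mean_of_log = sum(math.log(x) for x in xs) / len(xs)  # E[log X] ~= 0.347
assert log_of_mean >= mean_of_log
print(log_of_mean, ">=", mean_of_log)
```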

Now we have a lower bound, but the \(Q_i\) inside it is still unknown. How do we determine \(Q_i\)? If we already have a guess for \(\theta\), then naturally we want the lower bound at \(\theta\) to be as close as possible to the likelihood function at \(\theta\); that is, we make inequality (eq1) hold with equality at \(\theta\). Because log is strictly concave, \(\log(E[X]) = E[\log(X)]\) holds only when X is a constant. Based on this property, set

\(\frac{p(x_i, z_i;\theta)}{Q_i(z_i)} = c\)

From this we can derive

\(\frac{\sum_z p(x_i, z;\theta)}{\sum_z Q_i(z)} = c\) (this is easy to see: if \(a_1/b_1 = a_2/b_2 = a_3/b_3 = c\), then \((a_1+a_2+a_3)/(b_1+b_2+b_3) = c\))

Hence, using \(\sum_z Q_i(z) = 1\),

\(Q_i(z_i) = \frac{p(x_i, z_i;\theta)}{\sum_z p(x_i, z;\theta)}\)

\(= \frac{p(x_i, z_i;\theta)}{p(x_i;\theta)}\)

\(= p(z_i|x_i;\theta)\)

Therefore, \(Q_i\) is the posterior probability of \(z_i\) given \(x_i\) and \(\theta\).

This is the E-step. To sum up: assuming \(\theta\) is known, first derive the lower bound of the likelihood function, then find the distribution \(Q_i\) of the latent variable.

In the next M-step, since the E-step has already fixed \(Q_i\), this step finds the \(\theta\) that maximizes (eq1), i.e. the maximizer of the lower bound.

The \(\theta\) from the M-step is then fed back into the E-step, and the cycle repeats until convergence:

Repeat until convergence {

E-step: for each i, set

\(Q_i(z_i) := p(z_i|x_i;\theta)\)

M-step: set

\(\theta := \arg\max_{\theta} \displaystyle\sum_{i=1}^{m} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i;\theta)}{Q_i(z_i)}\)

}
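In code, the loop has the same shape. Below is a generic sketch of this iteration; `e_step` and `m_step` are hypothetical model-specific callables (not part of the original article), and \(\theta\) is assumed to be a tuple of numeric parameters:

```python
def em(data, theta, e_step, m_step, max_iter=100, tol=1e-8):
    """Generic EM loop: alternate E-step and M-step until theta stops moving.

    e_step(data, theta) -> Q, the posteriors p(z_i | x_i; theta)
    m_step(data, Q)     -> new theta maximizing the lower bound (eq1)
    theta               -> tuple of numeric model parameters
    """
    for _ in range(max_iter):
        Q = e_step(data, theta)      # E-step: Q_i(z_i) := p(z_i | x_i; theta)
        new_theta = m_step(data, Q)  # M-step: theta := argmax of the lower bound
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            break                    # parameters stopped moving: converged
        theta = new_theta
    return theta
```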

The following figure (from [4]) depicts the EM process more visually: the E-step raises the lower bound so that its value at \(\theta\) matches the objective function, and the M-step finds the maximum of the lower-bound function and takes it as the new \(\theta\).

If we define \(J(Q,\theta) = \displaystyle\sum_{i=1}^{m}\sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i;\theta)}{Q_i(z_i)}\),

then the EM algorithm can be regarded as coordinate ascent on the function J: the E-step maximizes J with respect to Q, and the M-step maximizes it with respect to \(\theta\).

The EM algorithm is convergent; for a proof, see [3]. However, EM is liable to fall into a local optimum and is sensitive to the initial value.

The following applies the EM algorithm to the three-coin problem from the beginning of this article.

Assume the j-th iteration has been completed, so we now have \(\theta^j = (r^j, c^j, q^j)\) (to avoid notational confusion, the heads probability p of coin B in the parameters is renamed c).

E-step:

Here we need \(P(z_i|x_i;\theta^j)\). Because z is binary, for simplicity we can directly compute the probability that \(z_i = 1\), using Bayes' rule:

(To keep the notation simple, the iteration superscript j is dropped below; remember that r, c, q are known.)

\(P(z_i=1|x_i;\theta) = \frac{p(x_i|z_i=1;\theta)\, p(z_i=1;\theta)}{p(x_i|z_i=1;\theta)\, p(z_i=1;\theta) + p(x_i|z_i=0;\theta)\, p(z_i=0;\theta)}\)

\(= \frac{r c^{x_i} (1-c)^{1-x_i}}{r c^{x_i} (1-c)^{1-x_i} + (1-r)\, q^{x_i} (1-q)^{1-x_i}}\)

Denote \(P(z_i=1|x_i;\theta)\) by \(\mu_i^{(j+1)}\), the value obtained in the (j+1)-th iteration; for readability, the superscript is again dropped below.
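A sketch of this E-step in Python (the parameter values passed in are arbitrary guesses, not from the original article):

```python
def e_step(x, r, c, q):
    """mu_i = P(z_i = 1 | x_i; theta) for the three-coin model (Bayes' rule above)."""
    mu = []
    for xi in x:
        num = r * c**xi * (1 - c)**(1 - xi)              # z_i = 1: B produced x_i
        den = num + (1 - r) * q**xi * (1 - q)**(1 - xi)  # plus z_i = 0: C produced x_i
        mu.append(num / den)
    return mu

data = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
print(e_step(data, r=0.4, c=0.6, q=0.5))  # one mu_i per observation
```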

M-step:

Now that \(P(z_i|x_i;\theta)\) is known, we solve the following optimization problem:

\(J(\theta) = \sum_i \left[ \mu_i \log \frac{p(x_i, z_i=1;\theta)}{\mu_i} + (1-\mu_i) \log \frac{p(x_i, z_i=0;\theta)}{1-\mu_i} \right]\)

\(= \sum_i \left[ \mu_i \log \frac{r c^{x_i} (1-c)^{1-x_i}}{\mu_i} + (1-\mu_i) \log \frac{(1-r)\, q^{x_i} (1-q)^{1-x_i}}{1-\mu_i} \right]\)

Setting \(\frac{\partial J(\theta)}{\partial r} = 0\),

it is easy to get \(r = \frac{1}{m}\sum_i \mu_i\).

Setting \(\frac{\partial J(\theta)}{\partial c} = 0\),

it is equally easy to get \(c = \frac{\sum_i \mu_i x_i}{\sum_i \mu_i}\).

Setting \(\frac{\partial J(\theta)}{\partial q} = 0\),

it is equally easy to get \(q = \frac{\sum_i (1-\mu_i) x_i}{\sum_i (1-\mu_i)}\). See references [1][5].
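Putting the E-step and these three closed-form updates together gives the complete EM iteration for the three-coin problem. A self-contained sketch; the starting values are arbitrary guesses:

```python
def em_three_coins(x, r, c, q, n_iter=100):
    """EM for the three-coin model; theta = (r, c, q) as defined above."""
    m = len(x)
    for _ in range(n_iter):
        # E-step: mu_i = P(z_i = 1 | x_i; theta), from the Bayes formula above
        mu = [r * c**xi * (1 - c)**(1 - xi)
              / (r * c**xi * (1 - c)**(1 - xi)
                 + (1 - r) * q**xi * (1 - q)**(1 - xi))
              for xi in x]
        # M-step: the closed-form maximizers derived above
        r = sum(mu) / m
        c = sum(mi * xi for mi, xi in zip(mu, x)) / sum(mu)
        q = sum((1 - mi) * xi for mi, xi in zip(mu, x)) / sum(1 - mi for mi in mu)
    return r, c, q

data = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
print(em_three_coins(data, r=0.4, c=0.6, q=0.5))
```

Note that with the fully symmetric start r = c = q = 0.5, every \(\mu_i\) equals 0.5 and the updates settle immediately at r = 0.5 and c = q = 0.6 (the sample mean), while other starting points give different answers, illustrating the sensitivity to initial values mentioned earlier.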

References:

[1] Hangyuan Li, *Statistical Learning Methods*.

[2] Jensen's inequality: http://www.cnblogs.com/naniJser/p/5642288.html

[3] Andrew Ng's lecture notes for the CS229 machine learning course: http://cs229.stanford.edu/notes/cs229-notes8.pdf

[4] A blog post introducing the EM algorithm: http://blog.csdn.net/zouxy09/article/details/8537620

[5] http://chenrudan.github.io/blog/2015/12/02/emexample.html
