I. Introduction to EM
EM (Expectation-Maximization) is an iterative algorithm for maximum likelihood estimation, or maximum a posteriori (MAP) estimation, of the parameters of probabilistic models that contain latent variables. It consists of two steps: the expectation step (E-step) and the maximization step (M-step).
The EM algorithm can be regarded as an algorithm for computing maximum likelihood estimates under special circumstances.
Real data often comes with awkward problems such as missing values or hidden variables. When these problems arise, the likelihood function is usually hard to maximize directly, and the EM algorithm is designed to handle exactly this situation.
The EM algorithm has many applications, the most classic being the hidden Markov model (HMM). In economics, besides HMM models that are gradually gaining attention (e.g., Yin and Zhao, 2015), other areas also involve EM algorithms; for example, Train's "Discrete Choice Methods with Simulation" gives an EM algorithm for the mixed logit model.
II. Preliminaries for the EM Algorithm
1. Maximum Likelihood Estimation
(1) Example: a classic problem -- student heights
Suppose we want to investigate the height distribution of boys and girls in our school. We find 100 boys and 100 girls on campus, 200 students in total, divided into two groups by sex, and we first sample the heights of the 100 boys. Assume that their heights follow a Gaussian distribution, but the mean $\mu$ and variance $\sigma^2$ of this distribution are unknown; these two parameters are what we want to estimate, written as $\theta = [\mu, \sigma]^T$.
Problem: we know the form of the probability distribution the samples obey, and we have some samples, but we do not know the parameters of the model.
We have two knowns: (1) the sample distribution model, and (2) the randomly drawn samples. What we need to obtain through maximum likelihood estimation is the parameters of the model.
In general: maximum likelihood estimation is a statistical method used to estimate model parameters.
(2) How to estimate
Formalizing the problem: (1) sample set $X = \{x_1, x_2, \ldots, x_N\}$ with $N = 100$; (2) probability density $p(x_i|\theta)$, the probability of drawing boy $i$ (that is, his height) from the population. Since the 100 samples are drawn independently from the same distribution, the probability of drawing these 100 boys together is the product of their individual probabilities. The probability of drawing these 100 samples from a population characterized by $p(x|\theta)$ is the joint probability of the samples in $X$, written as the likelihood function:

$$L(\theta) = \prod_{i=1}^{N} p(x_i|\theta)$$

This probability reflects how likely it is, when the parameter is $\theta$, that exactly this set of samples $X$ is drawn. We want to find the parameter $\theta$ that maximizes the likelihood function $L(\theta)$, i.e., the $\theta$ under which these 100 boys are most likely to be drawn. This maximizer is called the maximum likelihood estimate of $\theta$ and is written as

$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$
(3) General steps for finding the maximum likelihood estimate
First, write down the likelihood function: $L(\theta) = \prod_{i=1}^{N} p(x_i|\theta)$.
Second, take the logarithm of the likelihood function and simplify: $\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{N} \ln p(x_i|\theta)$.
Then, set the derivative of $\ell(\theta)$ with respect to $\theta$ to 0; this gives the likelihood equation.
Finally, solve the likelihood equation; its solution is the parameter estimate we are after.
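As a small illustration of these steps (added here, not part of the original text), the following sketch estimates the mean and standard deviation of simulated heights by numerically maximizing the Gaussian log-likelihood with scipy; the variable names and starting values are arbitrary, and the closed-form answers are printed for comparison:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated "heights" of 100 boys, drawn from a Gaussian with (to us) unknown parameters.
rng = np.random.default_rng(0)
heights = rng.normal(loc=175.0, scale=6.0, size=100)

def neg_log_likelihood(params, x):
    """Negative Gaussian log-likelihood: -sum_i ln p(x_i | mu, sigma)."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Maximize the log-likelihood by minimizing its negative.
result = minimize(neg_log_likelihood, x0=[170.0, 10.0], args=(heights,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x

print("numerical MLE:", mu_hat, sigma_hat)
print("closed form:  ", heights.mean(), heights.std())  # MLE of sigma uses the biased (1/N) std
```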
(4) Summary
In most cases we compute a result from known conditions. Maximum likelihood estimation goes the other way: we already know the result (the observed samples) and look for the condition (the parameter value) that makes this result most probable, and we use that as the estimate.
2. Jensen's Inequality
(1) Definition
Let $f$ be a function defined on the real numbers. If the second derivative of $f(x)$ is greater than or equal to 0 for all real $x$, then $f$ is convex. Jensen's inequality can then be stated as follows: if $f$ is a convex function and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$, with equality if and only if $X$ is a constant.
(2) Example
In the figure, the solid curve $f$ is a convex function and $X$ is a random variable that takes the value $a$ with probability 0.5 and the value $b$ with probability 0.5. The expected value of $X$ is therefore the midpoint of $a$ and $b$, and $E[f(X)] \ge f(E[X])$ can be read directly off the figure. When Jensen's inequality is applied to a concave function, the direction of the inequality is reversed.
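As a quick numerical sanity check (added here as an illustration), the sketch below evaluates both sides of the inequality for the convex function $f(x) = x^2$ and a two-point random variable like the one in the figure; the values of a and b are arbitrary:

```python
f = lambda x: x**2          # a convex function (f'' = 2 >= 0)
a, b = 1.0, 5.0             # X takes value a or b, each with probability 0.5

E_X = 0.5 * a + 0.5 * b             # E[X]
E_fX = 0.5 * f(a) + 0.5 * f(b)      # E[f(X)]

print("E[f(X)] =", E_fX, ">= f(E[X]) =", f(E_X))   # 13.0 >= 9.0, as Jensen's inequality predicts
```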
III. Derivation of the EM Algorithm
Suppose the parameter of interest is $\theta$, the observed data is $Y$, and the latent variable is $Z$. By the law of total probability:

$$p(Y|\theta) = \int p(Y|Z,\theta)\, p(Z|\theta)\, dZ$$

(with the integral replaced by a sum when $Z$ is discrete).
In theory, the maximum likelihood estimate can be obtained by maximizing the logarithm of this density function. The problem, however, is that integrating out $Z$ is very difficult in many cases, especially when the dimension of $Z$ grows with the sample size; computing such an integral numerically is then hopeless.
The EM algorithm gets around this problem. By Bayes' rule we have:

$$p(Z|Y,\theta) = \frac{p(Y|Z,\theta)\, p(Z|\theta)}{p(Y|\theta)}$$

The essence of the EM algorithm is to use this formula to handle the (possibly high-dimensional) latent variable $Z$.
Intuitively, the EM algorithm is a guess-and-refine process: given a guess $\theta'$, we can compute, from $\theta'$ and the observed data, the probability of the hidden variable taking each of its values. With these probabilities for $Z$ in hand, a more plausible $\theta$ is then computed.
To be precise, the EM algorithm is the following iterative process:
E-step: compute $Q(\theta|\theta_t) = E_{Z|Y,\theta_t}\big[\ln p(Y,Z|\theta)\big]$, the expectation of the complete-data log-likelihood taken under $p(Z|Y,\theta_t)$;
M-step: set $\theta_{t+1} = \arg\max_{\theta} Q(\theta|\theta_t)$, and repeat until convergence.
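To make the iteration concrete, here is a generic sketch of the loop (added as an illustration; the function and variable names are not from the original text). It simply alternates a user-supplied E-step and M-step until the parameter estimates stop changing:

```python
import numpy as np

def em(theta0, e_step, m_step, data, max_iter=200, tol=1e-8):
    """Generic EM loop: e_step returns the quantities needed by m_step
    (e.g. posterior probabilities of Z); m_step returns an updated theta."""
    theta = theta0
    for _ in range(max_iter):
        expectations = e_step(theta, data)      # E-step: posterior over the latent variables
        theta_new = m_step(expectations, data)  # M-step: maximize the expected log-likelihood
        if np.max(np.abs(np.asarray(theta_new) - np.asarray(theta))) < tol:
            return theta_new
        theta = theta_new
    return theta
```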
Train's "Discrete Choice Methods with Simulation" gives a very intuitive picture of the above process:
Here LL stands for $\ln p(Y|\theta)$ above, and ε corresponds roughly to the objective function of the iterative process just described. It can be shown that LL and ε are tangent at $\theta_t$, and that ε ≤ LL. Thus each maximization of the ε function yields a better guess of θ. Seen this way, the EM algorithm is simply one optimization algorithm for maximizing the likelihood function, whereas the classic quasi-Newton methods use derivative information to update θ directly.
The key to applying the EM algorithm is to find the h function, i.e., the conditional distribution of the latent variable given the observed data. Here is the most classic example, a mixture of two normals. Assume that each observation y comes from one of two populations, from the first with probability p and from the second with probability 1 - p, and that the two populations have means μ and -μ respectively.
We can calculate:
where $\Theta = \{\mu, p\}$.
Similarly, we can calculate:
Thus, the iterative process above can be written for this model as:
Given an initial value, the estimates are obtained by iterating these updates until they converge.
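As an illustration of this example (added here, not in the original text, and with the simplifying assumption that both components have unit variance, which the text leaves unspecified), a minimal EM sketch for a mixture with means $\mu$ and $-\mu$ and mixing probability $p$ might look like this:

```python
import numpy as np
from scipy.stats import norm

def em_two_normals(y, mu0=1.0, p0=0.5, n_iter=100):
    """EM for y ~ p * N(mu, 1) + (1 - p) * N(-mu, 1).  Unit variance is assumed."""
    mu, p = mu0, p0
    for _ in range(n_iter):
        # E-step: posterior probability that each observation came from the N(mu, 1) component.
        w1 = p * norm.pdf(y, loc=mu, scale=1.0)
        w2 = (1 - p) * norm.pdf(y, loc=-mu, scale=1.0)
        r = w1 / (w1 + w2)
        # M-step: maximize the expected complete-data log-likelihood (closed form here).
        p = r.mean()
        mu = np.sum(r * y - (1 - r) * y) / len(y)
    return mu, p

# Tiny usage example on simulated data.
rng = np.random.default_rng(1)
z = rng.random(500) < 0.7
y = np.where(z, rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500))
print(em_two_normals(y))   # roughly (2.0, 0.7)
```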
Two examples of EM
Let us look at what the EM algorithm is typically used for.
Example 1 (the three-coin model). There are 3 coins, labeled A, B, and C, whose probabilities of coming up heads are $\pi$, $p$, and $q$ respectively. The experiment is as follows: toss coin A; if A comes up heads, toss coin B, and if A comes up tails, toss coin C; record the result of the second toss as 1 for heads and 0 for tails. Repeat the experiment independently n times (here n = 10); the observations are: 1, 1, 0, 1, 0, 0, 1, 0, 1, 1. Estimate the probabilities of heads for the three coins (a sketch of the EM updates for this model is given below).
Example 2: parameter estimation of the Gaussian mixture model.
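For the three-coin model in Example 1, here is a minimal EM sketch (added as an illustration; the starting values are arbitrary, and the notation follows the description above, with $\pi$ for coin A, $p$ for coin B, and $q$ for coin C). The E-step computes, for each observation, the posterior probability that coin A came up heads (i.e., that coin B was tossed); the M-step re-estimates the three parameters from these posterior weights:

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])   # observed results of the second toss

def em_three_coins(y, pi=0.4, p=0.6, q=0.7, n_iter=100):
    """EM for the three-coin model: pi = P(coin A heads), p = P(coin B heads), q = P(coin C heads)."""
    for _ in range(n_iter):
        # E-step: posterior probability mu_j that observation j was generated by coin B.
        num = pi * p**y * (1 - p)**(1 - y)
        den = num + (1 - pi) * q**y * (1 - q)**(1 - y)
        mu = num / den
        # M-step: update the parameters using the posterior weights.
        pi = mu.mean()
        p = np.sum(mu * y) / np.sum(mu)
        q = np.sum((1 - mu) * y) / np.sum(1 - mu)
    return pi, p, q

print(em_three_coins(y))
```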
IV. Interpretation of the Algorithm
The EM algorithm is an algorithm that, when the parameters cannot be estimated directly from the observed data, introduces latent variables satisfying certain conditions to simplify the model, and then estimates the parameters iteratively.
Let X be the observed (manifest) data, z the latent variable, and θ the parameter to be estimated.
When we do maximum likelihood estimation, we need to maximize the log-likelihood function to find the maximum likelihood estimate of the parameters, so our goal is to maximize the log-likelihood function (below I write log instead of ln; don't ask why, most references just write it this way).
Equation 1: $L(\theta) = \log P(X|\theta)$
Equation 2: $L(\theta) = \log \sum_{Z} P(X, Z|\theta) = \log \sum_{Z} P(X|Z, \theta)\, P(Z|\theta)$
Note that the latent variable z we add must satisfy two conditions: (1) after adding the latent variable, the model becomes easier to work with (obviously, that is the whole point); (2) adding it does not change the marginal distribution of the observed data (this is the important one).
Equation 2 looks very concise, but, but, but (important and painful things have to be said three times), in reality it can be very complicated. For example, the log-likelihood function of a Gaussian mixture model with latent variables is:

$$\log P(X|\theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$

Maximizing this directly is still difficult, so we adopt the EM algorithm. The general idea of EM is to push the log-likelihood up step by step through iteration until it is maximized, and then solve for the parameters to be estimated.
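For Gaussian mixtures specifically, one rarely needs to code EM by hand; for example, scikit-learn's `GaussianMixture` fits this model by EM internally. A small usage sketch on simulated data (added as an illustration, not part of the original text):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated 1-D data from two Gaussian components.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(2.0, 0.5, 700)]).reshape(-1, 1)

# GaussianMixture estimates the mixture parameters with EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("weights:  ", gmm.weights_)        # estimated mixing proportions
print("means:    ", gmm.means_.ravel())
print("variances:", gmm.covariances_.ravel())
```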
A few interesting explanations
The short version: guess (E-step), reflect (M-step), repeat.
Background: the company has a number of bosses = [Boss A, Boss Liu, Boss C] and a number of attractive female employees = [Xiao Jia, Xiao Zhang, Xiao Yi]. You strongly suspect that something is going on between these bosses and these employees. To verify your conjecture scientifically, you observe carefully and record the following.
Observed data:
1) Boss A went out together with Xiao Jia and Xiao Yi;
2) Boss Liu went out together with Xiao Jia and Xiao Zhang;
3) Boss Liu went out together with Xiao Zhang and Xiao Yi;
4) Boss C went out together with Xiao Yi.
With the data collected, you start the mysterious EM calculation:
Initialization: you assume the three bosses are equally handsome and equally rich, and the three women equally beautiful, so any boss could be involved with any employee. Thus the probability that any given boss and any given employee "have something going on" is 1/3.
Then (E-step):
1) Boss A went out with Xiao Jia 1/2 × 1/3 = 1/6 times, and with Xiao Yi also 1/6 times (the so-called fractional counts);
2) Boss Liu went out with Xiao Jia and with Xiao Zhang 1/6 times each;
3) Boss Liu went out with Xiao Yi and with Xiao Zhang 1/6 times each;
4) Boss C went out with Xiao Yi 1/3 times.
In total: Boss A went out with Xiao Jia 1/6 times and with Xiao Yi 1/6 times; Boss Liu went out with Xiao Jia 1/6 times, with Xiao Yi 1/6 times, and with Xiao Zhang 1/3 times; Boss C went out with Xiao Yi 1/3 times.
With the new counts, you update your gossip (M-step):
The probability that Boss A has something going on with Xiao Jia is 1/6 / (1/6 + 1/6) = 1/2, and likewise 1/2 with Xiao Yi;
The probability for Boss Liu with Xiao Jia is 1/6 / (1/6 + 1/6 + 1/6 + 1/6) = 1/4, likewise 1/4 with Xiao Yi, and with Xiao Zhang it is (1/6 + 1/6) / (1/6 × 4) = 1/2;
The probability for Boss C with Xiao Yi is 1.
Then you recompute the counts using the latest probabilities (E-step):
1) Boss A went out with Xiao Jia 1/2 × 1/2 = 1/4 times, and with Xiao Yi also 1/4 times;
2) Boss Liu went out with Xiao Jia 1/2 × 1/4 = 1/8 times, and with Xiao Zhang 1/2 × 1/2 = 1/4 times;
3) Boss Liu went out with Xiao Yi 1/2 × 1/4 = 1/8 times, and with Xiao Zhang 1/2 × 1/2 = 1/4 times;
4) Boss C went out with Xiao Yi 1 time.
You rethink your gossip (M-step):
The probability for Boss A with Xiao Jia is 1/4 / (1/4 + 1/4) = 1/2, and likewise 1/2 with Xiao Yi;
The probability for Boss Liu with Xiao Jia is 1/8 / (1/8 + 1/4 + 1/4 + 1/8) = 1/6, likewise 1/6 with Xiao Yi, and with Xiao Zhang it is (1/4 + 1/4) / (3/4) = 2/3;
The probability for Boss C with Xiao Yi is 1.
You keep calculating and reflecting like this until, finally, you arrive at the truth. (Boss Ma says he already knew the truth.)
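The bookkeeping in this story can be reproduced in a few lines. The sketch below (added as an illustration; the names and data structures are ad hoc) alternates the fractional-count E-step and the renormalizing M-step described above and prints the probabilities after each round:

```python
from collections import defaultdict

outings = [("A",   ["Jia", "Yi"]),
           ("Liu", ["Jia", "Zhang"]),
           ("Liu", ["Zhang", "Yi"]),
           ("C",   ["Yi"])]

# Initialization: every boss-employee pair is equally suspicious.
prob = {(b, s): 1 / 3 for b, _ in outings for s in ["Jia", "Zhang", "Yi"]}

for step in range(2):
    # E-step: split each outing into fractional counts weighted by the current probabilities.
    counts = defaultdict(float)
    for boss, staff in outings:
        for s in staff:
            counts[(boss, s)] += prob[(boss, s)] / len(staff)
    # M-step: renormalize the counts per boss to get updated probabilities.
    totals = defaultdict(float)
    for (boss, s), c in counts.items():
        totals[boss] += c
    prob = {(boss, s): c / totals[boss] for (boss, s), c in counts.items()}
    print(f"after round {step + 1}: {prob}")
```

Running it prints the same probabilities as the two rounds in the walkthrough above.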
References:
1. Zhihu: https://www.zhihu.com/question/27976634
2. "Statistical Learning Methods", Hang Li
3. "Introduction and Practice of Machine Learning"
4. Machine learning algorithms: https://www.cnblogs.com/Gabby/p/5344658.html