I wish the EM algorithm were easy to explain, but it is genuinely hard to convey in plain language, because it is at once simple and complex. It is simple in its idea: just two alternating steps accomplish something powerful. It is complex in its mathematical derivation, which involves fairly involved probability formulas. If we only describe the simple part, we lose the essence of EM; if we only walk through the math, it becomes dry and hard to follow; and combining the two is not easy. So I cannot promise how well this explanation will come out, and I welcome your corrections.
In statistics, the expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood or maximum a posteriori estimates of parameters in a probabilistic model, where the model depends on unobserved latent variables. EM is often used for data clustering in machine learning and computer vision. The algorithm alternates between two steps. The first step computes an expectation (E): using the current parameter estimates, it computes the expected values of the latent variables (and hence the expected log-likelihood). The second step maximizes (M): it re-estimates the parameters by maximizing the expected log-likelihood obtained in the E step. The parameter estimates found in the M step are then used in the next E step, and the two steps keep alternating. Overall, the EM algorithm proceeds as follows:

1. Initialize the distribution parameters.
2. Repeat until convergence:
   - E step: using the current parameter estimates, estimate the expected values of the unknown (latent) variables.
   - M step: re-estimate the distribution parameters so as to maximize the likelihood of the data, given the expected values of the latent variables.

The key points here are the modeling process behind maximum likelihood estimation and the computation carried out in the M step.
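To make this alternating loop concrete, here is a minimal sketch (not from the original article) of EM for a mixture of two one-dimensional Gaussians, which matches the boys-and-girls height setting discussed below. The simulated `heights` data and the initialization choices are my own assumptions for illustration, not a definitive implementation.

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """Minimal, illustrative EM sketch for a mixture of two 1-D Gaussians."""
    # 1. Initialize the distribution parameters (means, variances, mixing weights).
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()], dtype=float)
    pi = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E step: given the current parameters, estimate the posterior probability
        # (responsibility) that each sample came from each of the two components.
        pdf = (1.0 / np.sqrt(2 * np.pi * var)) * \
              np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        resp = pi * pdf
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: re-estimate the parameters so that the expected log-likelihood
        # under the responsibilities from the E step is maximized.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)

    return mu, var, pi

# Hypothetical usage: 100 "boys" and 100 "girls" heights mixed together.
rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(172, 5, 100), rng.normal(162, 5, 100)])
print(em_two_gaussians(heights))
```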
I have digressed enough; let's get to the point. Suppose we face a problem like the following:
Suppose we need to investigate the distribution of heights of boys and girls at our school. What do we do? With so many people, we cannot ask everyone, so we have to sample. Say we grab 100 boys and 100 girls on campus, 200 people in total (that is, 200 height measurements; for convenience, whenever I say "people" below I mean their corresponding heights), and gather them in a classroom. What next? You shout: "Boys on the left, girls on the right, everyone else in the middle!" Then you start with the heights of the 100 sampled boys. Suppose their heights follow a Gaussian distribution, but the mean μ and the variance σ² of this distribution are unknown; these two parameters are exactly what we want to estimate. Write them as θ = [μ, σ]ᵀ.
In mathematical language: from the many boys (heights) at the school, we independently draw 100 samples (heights) according to the probability density p(x|θ), forming the sample set X, and we want to estimate the unknown parameter θ from this sample set X. Here we know that the probability density p(x|θ) has the form of a Gaussian distribution N(μ, σ²), whose unknown parameters are θ = [μ, σ]ᵀ. The sample set drawn is X = {x1, x2, …, xN}, where xi is the height of the i-th person drawn and N = 100 is the number of samples.
Since each sample is drawn independently from p(x|θ) — in other words, each of the 100 boys was grabbed at random, so from my point of view these boys are unrelated to one another — why, out of all the boys at school, did I happen to draw exactly these 100? What is the probability of drawing these particular 100 people? Since these boys (i.e. their heights) all follow the same Gaussian distribution p(x|θ), the probability that I draw boy A (his height) is p(xA|θ), and the probability that I draw boy B is p(xB|θ). Because they are independent, the probability that I draw both boy A and boy B is p(xA|θ) · p(xB|θ). Similarly, the probability that I draw these 100 boys together is the product of their individual probabilities. In a mathematician's words, the probability of drawing these 100 samples from a population with density p(x|θ) is the joint probability of the samples in the sample set X, given by the following formula:

L(θ) = L(x1, x2, …, xN; θ) = ∏_{i=1}^{N} p(xi; θ)
This quantity is the probability of obtaining this particular sample set X when the parameter of the probability density function is θ. Here X is known — the heights of the 100 people I drew can be measured — while θ is unknown, so the formula above is a function of θ only. This function describes how plausible it is to obtain the observed sample set under different values of the parameter θ, and it is therefore called the likelihood function of the parameter θ with respect to the sample set X. We denote it L(θ).
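To see what "a function of θ for a fixed sample" means in practice, here is a tiny sketch of my own (using simulated heights as a stand-in for the measured ones, and two arbitrary parameter guesses) that evaluates L(θ) for the same data under two different values of θ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
heights = rng.normal(172, 5, size=100)   # stand-in for the 100 measured heights

def likelihood(theta, x):
    """L(theta): joint density of the fixed sample x, viewed as a function of theta = (mu, sigma)."""
    mu, sigma = theta
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Same data, two different parameter guesses: the guess closer to the truth
# assigns the observed sample a far higher likelihood.
print(likelihood((172, 5), heights))   # comparatively large
print(likelihood((150, 5), heights))   # essentially zero (underflows)
```

Note that the raw product of 100 densities shrinks toward zero very quickly, which is one practical reason for switching to the log-likelihood introduced a little further down.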
So a new concept appears: the likelihood function. Remember our goal? We want to estimate the value of the parameter θ, given that we have already drawn this sample set X. How do we estimate it? And what is the likelihood function for? Let us first get a feel for the idea of likelihood.
Let's start with an example:
A student goes hunting with a hunter, and a hare darts out ahead of them. A single shot rings out, and the hare drops. If you had to guess, whose bullet hit it? You would reason: only one shot was fired and it hit, and the hunter's probability of hitting is generally much higher than the student's, so the shot was most likely fired by the hunter.
The inference from this example embodies the basic idea of the maximum likelihood method.
Another example: after class, groups of boys and girls head to the restrooms. Out of idle curiosity, you want to know whether more boys or more girls go to the restroom during the break, so you go and wait outside the men's and women's restrooms. After five minutes, a girl comes out. Delighted, you run over to tell me that more girls go to the restroom at break, and that if I don't believe you I can go in and count. Of course I am not foolish enough to go count, so I just ask how you know. You say: "In five minutes, a girl came out. For a girl to come out first, the probability of a girl coming out must be the largest, certainly larger than for a boy, and that probability is largest when there are more people in the women's restroom than in the men's." See, you have just used maximum likelihood estimation. You observed that a girl came out first, and asked: under what circumstances would a girl come out first? Precisely when the probability of a girl coming out is the largest; and that probability is largest when the women's restroom has more people than the men's. That is the parameter you estimated.
What conclusion do you draw from these two examples?
Back to the example of the boys' heights. Out of all the boys at school, I drew these particular 100 (heights) and not others; doesn't that suggest that, across the whole school, these 100 are the ones most likely to be drawn? And what expresses that probability? Exactly the likelihood function L(θ) above. So we just need to find a parameter θ under which the likelihood function L(θ) is maximized, meaning that drawing these 100 boys is most probable. This maximizer, called the maximum likelihood estimator of θ, is written as:

θ̂ = argmax_θ L(θ)
Since L(θ) is a product, it is often convenient for the analysis to define the log-likelihood function, which turns the product into a sum:

H(θ) = ln L(θ) = ∑_{i=1}^{N} ln p(xi; θ)
Good. Now we know that to find θ we only need to maximize the likelihood function L(θ) (or its logarithm); the maximizer is then our estimate. This brings us back to finding the maximum of a function. How do we find the maximum of a function? Take the derivative, set it to zero, and solve; the solution of that equation is the desired θ (provided, of course, that L(θ) is continuously differentiable). What if θ is a vector containing several parameters? Then take the partial derivative of L(θ) with respect to each parameter, i.e. the gradient; with n unknown parameters we get n equations, and the solutions of this system of equations are the extreme points of the likelihood function, which give us all n parameters.
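For the Gaussian height example this recipe can actually be carried out in closed form: setting the partial derivatives of the log-likelihood with respect to μ and σ² to zero yields the sample mean and the (biased) sample variance. A small sketch, again on simulated stand-in data rather than real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(172, 5, size=100)   # stand-in for the 100 measured heights

# Setting d/dmu ln L = 0 and d/d(sigma^2) ln L = 0 for a Gaussian gives:
mu_hat = heights.mean()                      # MLE of the mean: the sample mean
var_hat = ((heights - mu_hat) ** 2).mean()   # MLE of the variance (divide by N, not N-1)

print(mu_hat, np.sqrt(var_hat))
```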
You can think of maximum likelihood estimation as reverse reasoning. In most situations we compute a result from known conditions; maximum likelihood estimation goes the other way: we already know the result, and we look for the condition that makes this result most probable, taking that condition as our estimate. For instance, suppose that, all else being equal, smokers are 5 times as likely to develop lung cancer as non-smokers. If I now tell you that a certain person has lung cancer and ask whether this person smokes or not, how do you judge? You probably know nothing else about this person; the only thing you have is that smoking makes lung cancer more likely. Would you guess that this person does not smoke? I believe you are more likely to say that this person smokes. Why? Because of "maximum likelihood": I can only say the person "most likely" smokes, in the sense that "this person smokes" is the condition under which the observed result, "lung cancer", is most probable. That is maximum likelihood estimation.
That is all I will say about maximum likelihood for now. To summarize:
Maximum likelihood estimation is simply an application of probability theory within statistics, and it is one of the methods of parameter estimation. The setting is that a random sample is known to follow some probability distribution, but the specific parameters are unclear; parameter estimation observes the results of a number of trials and uses those results to infer approximate values of the parameters. The idea behind maximum likelihood estimation is this: some parameter value is known to make the observed sample most probable, and we would certainly not choose a parameter value under which the sample has small probability, so we simply take that value as the estimate of the true parameter.
A new solution method in machine learning: the EM algorithm