From Maximum Likelihood to the EM Algorithm: A Gentle Explanation


The EM algorithm is one of the top ten machine learning algorithms. Being able to hold forth on one of the ten makes you sound pretty impressive. And what makes someone impressive? Generally, that they can solve problems other people cannot. Why is an expert an expert? Because they can do what most people can't. So what problems can the EM algorithm solve? Or: what was the EM algorithm born into this world to do, and why has it attracted so much of the world's attention?

I hope to explain it in a way that is easy to understand, but EM is genuinely hard to put into plain language, because it is at once simple and complex. Simple in its idea: it accomplishes its powerful job in just two steps. Complex in that its mathematical derivation involves fairly intricate probability formulas. Talk only about the simple part and you lose the essence of the EM algorithm; talk only about the mathematical derivation and it becomes dull and dry; and combining the two is no easy thing. So I can't promise how well I'll manage to explain it, and I hope you won't hesitate to point out mistakes.

I. Maximum Likelihood

Enough rambling; let's get to the point. Suppose we face a problem like the following:

Suppose we need to survey the distribution of heights of the boys and girls at our school. What do you do? With so many people, you can't ask each one, so you sample. Say you've grabbed 100 boys and 100 girls on campus at random, 200 people in total (that is, 200 height measurements; for convenience, when I say "person" below I mean the corresponding height), all gathered in a classroom. What next? You shout: "Boys on the left, girls on the right, everyone else in the middle!" Then you start with the heights of the 100 sampled boys. Suppose their heights follow a Gaussian distribution, but the mean μ and the variance σ² of this distribution are unknown; these two parameters are exactly what we want to estimate, written θ = [μ, σ]^T.

In mathematical language: from the many boys at school (their heights), we independently draw 100 samples (heights) according to the probability density p(x|θ), forming the sample set X, and we want to estimate the unknown parameter θ from the sample set X. Here the density p(x|θ) is known to have the Gaussian form N(μ, σ²), with unknown parameter θ = [μ, σ]^T. The sample set drawn is X = {x_1, x_2, …, x_N}, where x_i is the height of the i-th person drawn and N = 100 is the number of samples.

Since every sample is drawn independently from p(x|θ), in other words, each of the 100 boys was grabbed at random and, as far as I'm concerned, these boys have nothing to do with one another, what is the probability that out of so many boys at school I drew exactly these 100? Since the boys' heights all follow the same Gaussian distribution p(x|θ), the probability of drawing boy A (height) is p(x_A|θ) and the probability of drawing boy B is p(x_B|θ); because they are independent, the probability of drawing both A and B is clearly p(x_A|θ)·p(x_B|θ). Likewise, the probability of drawing these 100 boys together is the product of their individual probabilities. In a mathematician's words: the probability of drawing these 100 samples from the population with density p(x|θ) is the joint probability of the samples in the sample set X, expressed by the following formula:

L(θ) = L(x_1, …, x_N; θ) = ∏_{i=1}^{N} p(x_i; θ)

This probability reflects how likely it is that, when the parameter equals θ, we obtain X as the observed sample set. Here X is known: the heights of the 100 people I drew have been measured. θ is unknown, so the formula above is a function of θ alone. It describes, for different values of the parameter θ, the probability of drawing the current sample set; it is therefore called the likelihood function of the parameter θ with respect to the sample set X, written L(θ).

A new concept has appeared: the likelihood function. Remember our goal? We want to estimate the parameter θ, given that we have already drawn this sample set X. How do we estimate it? And what is the likelihood function good for? Let's first get a feel for the notion of likelihood.

Let's start with an example:

A classmate goes hunting with a hunter, and a hare darts out ahead. A single shot rings out, and the hare drops. If you had to guess whose bullet hit it, you would reason: it was brought down with one shot, and the hunter's probability of hitting is generally far greater than the classmate's, so the shot was most likely fired by the hunter.

The inference from this example embodies the basic idea of the maximum likelihood method.

Another example: after class, groups of boys and girls head for their respective toilets. Bored, you wonder whether more boys or more girls go to the toilet during the break, so you go and wait by the doors of the men's and women's rooms. After squatting there for five minutes, a girl comes out first. Ecstatic, you run over to tell me that more girls go at recess, and that if I don't believe you, I can go in and count. Oh, I'm not silly enough to go counting heads; I'd end up in the headlines. I ask how you know. You say: "Within 5 minutes, it was a girl who came out first. For a girl to come out first, the probability of a girl coming out must be the largest, certainly greater than for a boy, and for that probability to be the largest, there must be more people in the women's room than in the men's." See: you've just used maximum likelihood estimation. You observed that a girl came out first, and you asked under what circumstances a girl would come out first: when the probability of a girl coming out is largest, which happens when the women's room holds more people than the men's. That circumstance is your estimated parameter.

From these two examples, what conclusion can you draw?

Back to the example of the boys' heights. Of all the boys at school, I happened to draw these 100 (heights) and not others. Doesn't that suggest that, across the whole school, these 100 people are the ones most likely to appear in a sample? And what expresses that probability? Exactly the likelihood function L(θ) above. So we only need to find a parameter θ that maximizes the likelihood function L(θ), meaning that under it, drawing these 100 boys is most probable. This maximizer, called the maximum likelihood estimator of θ, is written:

θ̂ = argmax_θ L(θ)

L(θ) is a product, so for ease of analysis one usually defines the log-likelihood function, which turns the product into a sum:

H(θ) = ln L(θ) = ∑_{i=1}^{N} ln p(x_i; θ)

Good: now we know that to find θ, we only need to maximize the likelihood function L(θ); the θ at its maximum is our estimate. So we're back to finding the maximum of a function. How? Differentiate and set the derivative to 0; the solution of that equation is θ̂ (provided, of course, that L(θ) is continuously differentiable). And if θ is a vector with several parameters? Then take the partial derivative of L(θ) with respect to every parameter, i.e., its gradient: n unknown parameters give n equations, the solutions of this system of equations are the extremum points of the likelihood function, and all n parameters are obtained.
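For concreteness, here is the worked Gaussian case (a standard derivation, filled in here rather than taken from the original post): setting the partial derivatives of the log-likelihood to zero yields the sample mean and the sample variance.

```latex
H(\mu,\sigma^2) = \ln L(\theta)
  = -\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2}

\frac{\partial H}{\partial \mu}
  = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu) = 0
  \quad\Rightarrow\quad
  \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i

\frac{\partial H}{\partial \sigma^2}
  = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i-\mu)^2 = 0
  \quad\Rightarrow\quad
  \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\hat{\mu})^2
```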

You can think of maximum likelihood estimation as reverse reasoning. In most cases we compute a result from known conditions; maximum likelihood estimation starts from a known result and seeks the condition that makes that result most probable, taking that condition as the estimate. For example, suppose that, other things being equal, smokers are 5 times as likely to get lung cancer as non-smokers. If I now tell you that someone has lung cancer and ask you whether that person smokes, how do you judge? You probably know nothing else about this person; the only thing you have is that smoking makes lung cancer more likely. Would you guess the person doesn't smoke? I believe you're more likely to say the person smokes. Why? This is "the greatest likelihood": I can only say the person "most likely" smokes, because "he smokes" is the condition under which the observed result, "lung cancer", is "most likely" to occur. That is maximum likelihood estimation.

Well, that's as far as we'll take maximum likelihood here. To summarize:

Maximum likelihood estimation is an application of probability theory in statistics, and one of the methods of parameter estimation. The setting: a random sample is known to follow some probability distribution, but the specific parameters are unclear; parameter estimation observes the results of a number of trials and uses those results to infer approximate values of the parameters. The idea behind maximum likelihood: some parameter is known to make the observed sample most probable, and we certainly wouldn't choose a parameter under which the sample has small probability, so we simply take this parameter as our estimate of the true value.

The general steps for finding the maximum likelihood estimate (a minimal numeric sketch follows the list):

(1) Write down the likelihood function;

(2) Take the logarithm of the likelihood function and simplify;

(3) Set the derivative to 0, obtaining the likelihood equation;

(4) Solve the likelihood equation; the resulting parameter is the estimate sought.
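To make these steps concrete, here is a minimal numeric sketch in Python (assuming numpy is available; the height data are made up for illustration). It fits a Gaussian by maximum likelihood: for the Gaussian, step (4) has the closed-form solution derived above, so the code just computes the sample mean and standard deviation and sanity-checks them against the log-likelihood.

```python
import numpy as np

# Hypothetical height samples (meters) -- stand-ins for the 100 boys.
rng = np.random.default_rng(0)
heights = rng.normal(loc=1.75, scale=0.06, size=100)

def log_likelihood(mu, sigma, x):
    """Gaussian log-likelihood H(theta) = sum_i ln p(x_i; mu, sigma)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Steps (3)-(4): setting the gradient to zero gives closed-form answers,
# the sample mean and the (biased) sample standard deviation.
mu_hat = heights.mean()
sigma_hat = np.sqrt(np.mean((heights - mu_hat)**2))

print(f"MLE: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")

# Sanity check: nudging a parameter away from the MLE lowers the likelihood.
assert log_likelihood(mu_hat, sigma_hat, heights) > \
       log_likelihood(mu_hat + 0.01, sigma_hat, heights)
```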

II. The EM Algorithm

Now, back to the height-distribution problem above. Having drawn the heights of the 100 boys and knowing that heights follow a Gaussian distribution, we can obtain the corresponding Gaussian parameters θ = [μ, σ]^T by maximizing the likelihood function. The height distribution of the girls at our school can be obtained the same way.

Back to the example itself. What if there were no "Boys on the left, girls on the right, everyone else in the middle!" step? Or suppose that among the 200 people I've drawn, some boys and girls fell for each other at first sight and are now entangled, and we aren't cruel enough to pull them apart. Now that the 200 people are mixed together, when you point at any one of them (a height), I cannot tell whether that person (height) is a boy's (height) or a girl's (height). In other words, for each of the 200 people drawn, you don't know whether that height was drawn from the boys' height distribution or from the girls'. In mathematical language: for every sample drawn, we do not know which distribution it was drawn from.

At this point, for every sampled person there are two things to guess or estimate: first, is this person male or female? And second, what are the parameters of the boys' and the girls' height Gaussians?

Only when we know which people belong to the same Gaussian distribution can we estimate that distribution's parameters, as with the maximum likelihood of part I. But now the two Gaussian distributions are mixed together: we don't know which samples belong to the first Gaussian and which to the second, so we cannot estimate the parameters of either distribution. Conversely, only when we have accurate estimates of the two distributions' parameters can we tell which people belong to the first distribution and which to the second.

It's a chicken-and-egg problem. The chicken says: without me, who would have laid you? The egg protests: without me, what would you have hatched from? (Hehe, a philosophical question. Of course, scientists later concluded the egg came first, since eggs evolved from earlier birds' eggs.) To break this "you depend on me, I depend on you" circular dependence, one party has to break the deadlock first and say: never mind, I'll just pick some value to start with, see how you change, then adjust my value according to your change, and so on, iterating back and forth, eventually converging to a solution. This is the basic idea of the EM algorithm.

I'm not sure whether that idea came across, so let me go on a bit. In fact, this idea is everywhere.

For example, when you were little, Mom gave you a big bag of candy and told you to split it evenly with your sister. You couldn't be bothered to count the pieces, so you didn't know how many each person should get. What do we usually do? Divide the bag by eye into two piles, heft one in each hand, and see which feels heavier; if the right hand is heavier, the right bag clearly has more candy, so you move a handful from the right bag to the left, heft them again, move a little from the heavier bag into the lighter one, and keep going until the two bags feel about equal. Hehe, and then, to be fair, you let your sister pick her bag first.

The EM algorithm works like that. Suppose we want to know two parameters A and B, both unknown at the start, but knowing A lets us derive B, and knowing B in turn lets us derive A. So: first give A some initial value, use it to obtain an estimate of B, then start from the current value of B to re-estimate A, and continue this process until it converges.

EM means "expectation maximization", in our question above, we are the first to randomly guess the male (height) Normal distribution parameters: such as the mean and variance is how much. For example, the average male is 1.7-meter, The variance is 0.1 meters (of course, it is certainly not that accurate at first), and then calculates that each person is more likely to belong to the first or second normal distribution (for example, the person's height is 1.8-meter, it is obvious that he is the most likely to belong to the male distribution), this is a expectation step. With each person's attribution, or we have roughly according to the above method to divide these 200 people into two parts of boys and girls, we can according to the maximum likelihood of the above, through these are roughly divided into boys by the N individuals to re-estimate the first distribution parameters, the girl's distribution the same method to re-estimate. This is maximization. Then, when we update the two distributions, the probability of each of these two distributions changes again, then we need to adjust the E-step again ... This is repeated until the parameters are no longer changing substantially.

A complete description of each person (sample) i is then the triple y_i = {x_i, z_i1, z_i2}, where x_i is the observed value of sample i, the corresponding person's height, which is observable. z_i1 and z_i2 indicate which of the two Gaussian distributions, the boys' or the girls', generated the value x_i; that is, the two values mark whether this person's height was generated by the male or the female height distribution. These two values are unknown to us: they are the hidden variables. Precisely, z_ij = 1 if x_i was generated by the j-th Gaussian distribution, and 0 otherwise. For example, if a sample's observed value is 1.8 and it came from the boys' Gaussian, we can represent the sample as {1.8, 1, 0}. If the values of z_i1 and z_i2 were known, i.e., every person were labeled boy or girl, we could estimate each Gaussian's parameters with the maximum likelihood algorithm described above. But they are unknown, so we have to use the EM algorithm.
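As an illustration of the loop just described, here is a minimal EM sketch for the two-Gaussian height mixture in Python (assuming numpy; the mixed data and the initial guesses are made up, like the 1.7 m guess in the text). The E-step computes each person's responsibilities, i.e., the posterior probabilities of "boy" vs. "girl" (the expected values of the indicators z_ij); the M-step re-fits each Gaussian by responsibility-weighted maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical mixed sample: 100 'boys' + 100 'girls', labels thrown away.
heights = np.concatenate([rng.normal(1.75, 0.06, 100),
                          rng.normal(1.62, 0.05, 100)])

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Deliberately rough initial guesses, as in the text.
mu    = np.array([1.70, 1.60])
sigma = np.array([0.10, 0.10])
pi    = np.array([0.5, 0.5])      # mixing weights

for step in range(200):
    # E-step: responsibility r[i, j] = P(component j | x_i, current params),
    # the expected value of the indicator z_ij.
    dens = pi * gauss_pdf(heights[:, None], mu, sigma)   # shape (200, 2)
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood update of each Gaussian.
    nj = r.sum(axis=0)
    mu_new    = (r * heights[:, None]).sum(axis=0) / nj
    sigma_new = np.sqrt((r * (heights[:, None] - mu_new)**2).sum(axis=0) / nj)
    pi_new    = nj / len(heights)

    if np.allclose(mu, mu_new, atol=1e-8):
        break
    mu, sigma, pi = mu_new, sigma_new, pi_new

print(f"means: {mu.round(3)}, sigmas: {sigma.round(3)}, weights: {pi.round(3)}")
```

Note that EM here only finds a local maximum: swapping the initial means can swap (or, with bad luck, merge) the recovered components.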

Isn't it exactly that annoying hidden variable (not knowing which distribution each drawn sample came from) that turns an originally simple, solvable problem into a complicated, unsolvable one? What to do? A natural problem-solving instinct is to simplify the complicated. Fine: set the complication aside and suppose for the moment that the hidden variable is already known; then estimating the distribution parameters is easy, just direct maximum likelihood. You'll object: the hidden variable is unknown, on what grounds do you assume it known? Well, here's the trick: first give the distributions some initial parameter values, then compute the expectation of the hidden variables and treat that as their known value; now maximum likelihood can be used to solve for the distribution parameters. Presumably these parameters are better than the previous random ones and describe the true distribution better; then, under the distribution with these parameters, we can compute the expectation of the hidden variables again, maximize again, and obtain even better parameters... Iterating like this, we arrive at a happy result.

At this point you're still not satisfied: "You keep saying iterate, iterate; how do you know the new parameter estimate is better than the old one? Why does this method work at all? Are there cases where it fails? When does it fail? What do you have to watch out for when using it?" So many questions all at once, give me a moment to adjust — though it does show you have real research potential. In fact, these are exactly the questions mathematicians need to settle, and mathematically they can be proved or answered. So let's restate the problem above in mathematical terms. (Here you can see that however complex or simple an idea about the physical world is, it must be abstracted into a mathematical model before it can be put to use and show its power, and the mathematics inside often brings you extra room for imagination; that is the subtlety of mathematics.)

III. Derivation of the EM Algorithm

Suppose we have a sample set {x^(1), …, x^(m)} containing m independent samples, but the class z^(i) of each sample i is unknown (as in clustering); z is the hidden variable. We want to estimate the parameter θ of the probability model p(x, z; θ); because of the hidden variable z, direct maximum likelihood is hard to solve, whereas if z were known, the solution would be easy.

For parameter estimation we again want the θ that maximizes the likelihood, except that, compared with ordinary maximum likelihood, the likelihood now contains one more unknown variable, z; see formula (1) below. That is, our goal is to find θ (and z) that make L(θ) largest. You might think: it's just one more unknown; can't I take partial derivatives with respect to both θ and z, set them to zero, and solve as before?

ℓ(θ) = ∑_i log p(x^(i); θ)
     = ∑_i log ∑_{z^(i)} p(x^(i), z^(i); θ)                                (1)
     = ∑_i log ∑_{z^(i)} Q_i(z^(i)) · [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ]  (2)
     ≥ ∑_i ∑_{z^(i)} Q_i(z^(i)) · log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ]  (3)

where Q_i is any distribution over z^(i), i.e., ∑_z Q_i(z) = 1 and Q_i(z) ≥ 0.

In essence we need to maximize (1). Recall how a marginal density is obtained from a joint density: for each sample i, we sum the joint density over all possible classes z^(i), and the resulting sum on the right of (1) is exactly the marginal density of x on the left, i.e., the likelihood. But notice that (1) contains a "log of a sum": after differentiation the expression becomes very messy (picture differentiating the composite log(f1(x) + f2(x) + f3(x) + …)), so solving for the unknowns z and θ this way is hard. Can we transform (1) somehow? Look at (2): it merely multiplies numerator and denominator inside the sum by the same function, so it still has a "log of a sum" and still can't be solved; why bother, then? Now look at (3): it has become a "sum of logs", which is easy to differentiate. But notice that the equality has turned into an inequality. Why is that change allowed? This is where Jensen's inequality shows its power.

Jensen's inequality:

Let f be a function defined on the real numbers. If the second derivative f''(x) ≥ 0 for all real x, then f is convex. When x is a vector, f is convex if its Hessian matrix H is positive semi-definite. If f''(x) > 0 strictly (or H is positive definite), then f is strictly convex.

Jensen's inequality states:

If f is a convex function and X is a random variable, then: E[f(X)] ≥ f(E[X]).

In particular, if f is strictly convex, equality holds if and only if X is a constant (i.e., X equals its expectation with probability 1).

A diagram makes this clear:

In the figure, the solid curve f is a convex function; X is a random variable that takes the value a with probability 0.5 and the value b with probability 0.5 (like a coin toss). The expected value of X is the midpoint of a and b, and E[f(X)] ≥ f(E[X]) can be read directly off the figure.

f is a (strictly) concave function if and only if −f is a (strictly) convex function.

When Jensen's inequality is applied to a concave function, the inequality direction is reversed: E[f(X)] ≤ f(E[X]).
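As a quick numerical sanity check (an addition, not part of the original derivation), a few lines of Python confirm both directions of Jensen's inequality for the coin-toss random variable described above:

```python
import numpy as np

# X takes the value a with probability 0.5 and b with probability 0.5.
a, b, p = 1.0, 4.0, 0.5
E_x = p * a + (1 - p) * b

convex  = np.square   # f(x) = x^2, convex
concave = np.log      # f(x) = log x, concave (second derivative -1/x^2 < 0)

# Convex f:  E[f(X)] >= f(E[X]).  Concave f: the direction reverses.
assert p * convex(a)  + (1 - p) * convex(b)  >= convex(E_x)
assert p * concave(a) + (1 - p) * concave(b) <= concave(E_x)
print("Jensen's inequality holds in both directions for this example.")
```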

Returning to formula (2): here f(x) = log x is a concave function (its second derivative is −1/x² < 0), so Jensen applies with the direction reversed.

The term inside the log in (2) is an expectation of [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] with respect to the distribution Q_i: recall that E[X] = ∑ x·p(x), and for a function f of X, E[f(X)] = ∑ f(x)·p(x). Applying Jensen's inequality for the concave log to this expectation yields inequality (3). (If this isn't clear, pick up a pen and check it, hehe.)

OK. At this point formula (3) is easy to differentiate, but (2) and (3) are not equal: the maximum of (3) is not the maximum of (2), and it's the maximum of (2) we want. What to do?

Now a little imagination is needed. The inequality between (2) and (3) above can be written: likelihood function L(θ) ≥ J(z, Q), where J(z, Q) is the right-hand side of (3). Then we can keep maximizing the lower bound J so that L(θ) keeps rising, until it finally reaches its maximum.

See: we fix θ and adjust Q(z) so that the lower bound J(z, Q) rises until it equals L(θ) at this θ (in the original figure, from the green curve up to the blue curve); then we fix Q(z) and adjust θ so that the lower bound J(z, Q) reaches its maximum (moving from θ^t to θ^(t+1)); then fix θ and adjust Q(z) again... and so on until convergence to the θ* that maximizes the likelihood L(θ). Two questions remain: when does the lower bound J(z, Q) equal L(θ) at the current θ? And why is convergence guaranteed?

First, in Jensen's inequality, equality holds when the random variable is a constant. Here, that means:

p(x^(i), z^(i); θ) / Q_i(z^(i)) = c,   with c a constant not depending on z^(i).

Next, since Q is a probability distribution of the random variable z^(i), it satisfies ∑_z Q_i(z) = 1. Summing the equal ratios above over all z^(i) (when several fractions all equal c, the sum of the numerators over the sum of the denominators still equals c) gives ∑_z p(x^(i), z; θ) = c, and therefore:

Q_i(z^(i)) = p(x^(i), z^(i); θ) / ∑_z p(x^(i), z; θ)
           = p(x^(i), z^(i); θ) / p(x^(i); θ)
           = p(z^(i) | x^(i); θ)

So we have derived the formula for Q(z): it is simply the posterior of the hidden variable, and this choice makes the lower bound tight at the fixed parameter θ, settling the question of how Q(z) should be chosen. This step is the E-step, which establishes a lower bound on L(θ). The M-step that follows adjusts θ, with Q(z) given, to push the lower bound J of L(θ) higher (with Q(z) fixed, the bound can be raised further by changing θ). The general EM algorithm then goes as follows:

EM algorithm (expectation-maximization):

The expectation-maximization algorithm is a maximum likelihood estimation method for the parameters of a probabilistic model when the data are incomplete or contain missing values (that is, when there are hidden variables).

Algorithm flow of EM:

Initialize the distribution parameter θ;

Repeat the following steps until convergence:

E-step: using the initial parameter values, or the model parameters from the previous iteration, compute the posterior probability of the hidden variables (their expectation) and use it as the current estimate of the hidden variables:

Q_i(z^(i)) := p(z^(i) | x^(i); θ)

M-step: maximize the lower bound of the likelihood to obtain the new parameter values:

θ := argmax_θ ∑_i ∑_{z^(i)} Q_i(z^(i)) · log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ]

Iterating like this, we obtain the parameter θ that maximizes the likelihood function L(θ). Now for the second question: does it converge?

Intuitively: because the lower bound keeps rising, the likelihood increases monotonically, so eventually we reach the maximum (or at least a local maximum) of the likelihood. Analyzed rigorously, one can show that successive iterations satisfy

L(θ^(t)) ≤ L(θ^(t+1)),

that is, the likelihood never decreases.

For the proof, see the derivation in Andrew Ng's lecture notes "The EM algorithm", summarized at:

http://www.cnblogs.com/jerrylead/archive/2011/04/06/2006936.html

IV. Another View of the EM Algorithm

The coordinate ascent method:

The figure in the original post shows the zigzag path of this iterative optimization: each step moves closer to the optimum, and each move runs parallel to a coordinate axis, because each step optimizes only one variable.

This is like seeking the extremum of a function in the x-y plane when the function cannot be differentiated in both variables at once, so gradient methods don't apply directly. But once one variable is fixed, the extremum in the other can be found by differentiation; so the coordinate ascent method fixes one variable at a time, solves for the extremum in the other, and approaches the overall extremum step by step. Mapped onto EM — E-step: fix θ, optimize Q; M-step: fix Q, optimize θ — the alternation pushes the objective to its maximum.
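Here is a minimal sketch of coordinate ascent on a toy objective (the function is made up for illustration; it is not from the original article). Each step fixes one variable and solves for the other in closed form, mirroring EM's alternation between Q and θ:

```python
# Toy concave objective f(x, y) = -x^2 - y^2 + x*y + 3x,
# whose global maximum is at (x, y) = (2, 1) with f = 3.
def f(x, y):
    return -x**2 - y**2 + x * y + 3 * x

x, y = 0.0, 0.0
for step in range(50):
    x = (y + 3) / 2   # argmax over x with y fixed: solve df/dx = -2x + y + 3 = 0
    y = x / 2         # argmax over y with x fixed: solve df/dy = -2y + x = 0

print(f"x = {x:.4f}, y = {y:.4f}, f(x, y) = {f(x, y):.4f}")  # -> 2, 1, 3
```

Because this objective is concave, the alternation converges to the global maximum; EM inherits the monotonic-improvement part of this picture but, in general, only converges to a local optimum.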

V. Applications of EM

The EM algorithm has many applications; the most widespread are Gaussian mixture models (GMM), clustering, HMMs, and so on. For details, see the Machine Learning column of JerryLead's cnblogs:

The EM algorithm

Mixtures of Gaussians and the EM algorithm

The K-means clustering algorithm

And the chicken and the egg fight no more, for both now know that "without you there is no me." From then on, they lived happily together.
