One of the top 10 machine learning algorithms: EM Algorithm


One of the top ten machine learning algorithms: the EM algorithm. Being ranked in the top ten makes it sound impressive. What does impressive mean here? We usually call someone impressive because they can solve problems that others cannot. Why is a master a master? Because a master can do things many others cannot do. So what problems can the EM algorithm solve? Put differently, why has the EM algorithm drawn so many people's attention since it came into the world?

I hope I can explain it in plain words. However, the EM algorithm is genuinely hard to explain in plain language, because it is at once simple and complicated. The simplicity lies in its idea: it accomplishes a powerful job in just two steps. The complexity lies in its mathematical derivation, which involves intricate probability formulas. But if you only walk through the mathematical derivation, the essence of the EM algorithm is lost; on the other hand, combining intuition and rigor is not easy. So I cannot promise to fully succeed, and I hope readers will not hesitate to offer corrections.

 

I. Maximum Likelihood

There are too many questions waiting to be answered. Suppose we run into the following problem:

Suppose we need to investigate the height distribution of boys and girls in our school. What do you do? You cannot possibly measure so many people one by one; you must sample. Suppose you grab 100 boys and 100 girls on campus, 200 people (that is, 200 height measurements) in total. For convenience, below I will say "person" to mean the corresponding height. What next? You shout, "Boys to the left, girls to the right, everyone else stand in the middle!" Then you start with the sample of 100 boys' heights. Assume their heights follow a Gaussian distribution. However, we do not know the mean \mu or the variance \sigma^2 of this distribution; these are the two parameters we need to estimate, written as \theta = [\mu, \sigma^2]^T.

In the language of mathematics: from all the boys in the school (a population of heights), we independently drew 100 heights according to the probability density p(x|\theta), forming the sample set X. We want to use the sample set X to estimate the unknown parameter \theta. Here we know that p(x|\theta) has the form of a Gaussian distribution N(\mu, \sigma^2), whose unknown parameter is \theta = [\mu, \sigma^2]^T. The sample set is X = {x_1, x_2, ..., x_N}, where x_i is the height of the i-th person drawn and N = 100 is the number of samples.

Since each sample is drawn from p(x|\theta) independently (in other words, any of the boys in school could have been among the 100 I caught, and from my perspective these boys have nothing to do with one another), what is the probability that I caught exactly these 100 boys? Because these boys' heights all follow the same Gaussian distribution p(x|\theta), the probability of drawing boy A's height is p(x_A|\theta), and the probability of drawing boy B's height is p(x_B|\theta). Because they are independent, the probability of drawing both boy A and boy B is p(x_A|\theta) * p(x_B|\theta). Likewise, the probability of these 100 boys is the product of their individual probabilities. In a mathematician's tone: the probability of drawing these 100 samples from a population distributed as p(x|\theta), that is, the joint probability of the samples in the set X, is given by the following formula:

L(\theta) = L(x_1, x_2, \ldots, x_N; \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)

This probability reflects how likely the sample set X is when the parameter of the probability density function is \theta. Here X is known: the heights of the 100 people I drew can be measured. \theta, however, is unknown, so the formula above is a function of \theta alone. This function describes the possibility of obtaining the current sample set under different values of \theta. It is therefore called the likelihood function of the parameter \theta given the sample set X, written L(\theta).
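For readers who like to see this numerically, here is a minimal sketch of evaluating L(\theta) for a Gaussian sample. It assumes NumPy, and since the article has no real data it simulates stand-in heights; the values 172 and 5.7 are made-up illustration parameters, not anything from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 100 measured heights (cm): the article has
# no real data, so we simulate from N(mu=172, sigma=5.7) for illustration.
heights = rng.normal(loc=172.0, scale=5.7, size=100)

def gaussian_pdf(x, mu, sigma):
    """Density p(x | theta) of the Gaussian N(mu, sigma^2)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def likelihood(mu, sigma, data):
    """L(theta): joint density of the whole sample, the product over all x_i."""
    return np.prod(gaussian_pdf(data, mu, sigma))

# Parameters close to the data-generating ones explain the sample far better
# than parameters far away from it.
print(likelihood(172.0, 5.7, heights))   # relatively large (though tiny in absolute terms)
print(likelihood(150.0, 5.7, heights))   # astronomically smaller (may underflow to 0.0)
```

Notice how quickly the raw product of 100 densities shrinks toward zero; this numerical fragility is one practical reason for the logarithm introduced a little further below.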

A new concept, the likelihood function, has been introduced here. Do you still remember our goal? We want to estimate the value of \theta given that we have drawn this sample set X. How can we estimate it? What use is the likelihood function? Let's first build intuition for the concept of likelihood.

For example:

A student went out hunting with a hunter when a hare darted across their path. A single shot rang out, and the hare fell. If you had to guess, who fired the shot that hit? You would reason: the hare was brought down with just one shot, and since a hunter's probability of hitting is generally higher than a student's, the shot was presumably fired by the hunter.

The inference in this example reflects the basic idea of the maximum likelihood method.

Another example: after class, a crowd of boys and girls head to the toilets. Bored, you want to know whether more boys or more girls went to the toilet during the break, so you wait by the doors of the men's and women's rooms. After five minutes, a girl comes out first. Excited, you run to tell me that more girls went to the bathroom during the break. Should I believe you, or go in and count? Well, I am not stupid enough to go in and count, so I ask how you know. You say: "Within five minutes, a girl was the first to come out. A girl coming out first is most probable precisely when the women's room holds more people than the men's room." You have just used maximum likelihood estimation. You observed that a girl came out first, and asked: under what condition is a girl most likely to come out first? Precisely when there are more girls inside than boys. That condition is your estimated parameter.

From the two examples above, what conclusions do you get?

Return to the example of the boys' heights. From all the boys in the school, I happened to draw these 100 heights rather than any others. Doesn't that suggest that, across the whole school, these 100 heights are the ones with the highest probability of appearing together? How is that probability expressed? Exactly by the likelihood function L(\theta) above. So we only need to find a parameter \theta that maximizes the likelihood function L(\theta), that is to say, makes drawing these 100 boys' heights most probable. This is the maximum likelihood estimator of \theta, written:

\hat{\theta} = \arg\max_{\theta} L(\theta)

Sometimes, since L(\theta) is a product of many terms, it is convenient for analysis to define the log-likelihood function, which turns the product into a sum:

H(\theta) = \ln L(\theta) = \ln \prod_{i=1}^{N} p(x_i \mid \theta) = \sum_{i=1}^{N} \ln p(x_i \mid \theta)

Well, now we know that to find \theta we only need to make the likelihood function L(\theta) as large as possible; the \theta attaining the maximum is our estimate. This brings us back to a maximization problem. How do we find the maximum of a function? Differentiate, set the derivative to 0, and the \theta solving that equation is the answer (provided, of course, that the function L(\theta) is continuously differentiable). What if \theta is a vector containing multiple parameters? Then take the partial derivative of L(\theta) with respect to each parameter, i.e., the gradient: n unknown parameters give n equations, and the solution of this system of equations is the extreme point of the likelihood function, which yields all n parameters.
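For the Gaussian case, this derivative-setting actually has a closed-form answer: the zero-gradient conditions give the sample mean for \mu and the (biased) sample variance for \sigma^2. Below is a small NumPy sketch, reusing the simulated stand-in heights from the earlier snippet, that computes the closed form and sanity-checks that no nearby parameter value scores a higher log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=172.0, scale=5.7, size=100)  # same stand-in sample as before

def log_likelihood(mu, sigma, data):
    """H(theta) = sum_i ln p(x_i | theta) for the Gaussian density."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

# Setting the partial derivatives of H with respect to mu and sigma^2 to zero
# yields the closed-form Gaussian MLE: sample mean and (biased) sample variance.
mu_hat = heights.mean()
sigma_hat = np.sqrt(np.mean((heights - mu_hat) ** 2))
print("mu_hat =", mu_hat, "sigma_hat =", sigma_hat)

# Sanity check: no nearby theta scores a higher log-likelihood than the MLE.
best = log_likelihood(mu_hat, sigma_hat, heights)
for mu in np.linspace(mu_hat - 2.0, mu_hat + 2.0, 9):
    for sigma in np.linspace(sigma_hat - 1.0, sigma_hat + 1.0, 9):
        assert log_likelihood(mu, sigma, heights) <= best + 1e-9
```

The grid check is illustrative only; the closed form itself follows directly from solving the two zero-gradient equations.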

You can regard maximum likelihood estimation as reverse deduction. In most cases we compute results from known conditions, whereas maximum likelihood estimation already knows the result and seeks the condition that makes that result most likely to appear. For example, suppose that, other conditions being equal, a smoker's risk of lung cancer is 5 times that of a non-smoker. If I already know that a person has lung cancer and ask whether this person smokes, how do you judge? You may know nothing else about this person; all you know is that smoking makes lung cancer more likely. Would you guess that this person does not smoke? I believe you are more likely to say this person smokes. Why? Because of "maximum possibility": I can only say the person "most likely" smokes, since "he smokes" is the condition under which the observed result "lung cancer" is most likely. This is maximum likelihood estimation.
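To make this reverse deduction concrete in likelihood terms, write the non-smoker's risk as an unspecified baseline p (its actual value does not matter; only the 5x ratio comes from the example). For the observed result "this person has lung cancer":

L(\text{smoker}) = P(\text{cancer} \mid \text{smoker}) = 5p, \qquad L(\text{non-smoker}) = P(\text{cancer} \mid \text{non-smoker}) = p

Since 5p > p for any p > 0, maximum likelihood picks "smoker" as the estimate of the unknown condition.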

Having looked at maximum likelihood estimation from several angles, let's summarize:

Maximum likelihood estimation is simply an application of probability theory within statistics, and one of several parameter estimation methods. The setting is this: a random sample is known to follow some probability distribution, but the specific parameters of that distribution are unclear. Parameter estimation observes the results of several trials and uses those results to infer approximate values of the parameters. Maximum likelihood estimation builds on the idea that a parameter value already known to make the observed sample maximally probable is the natural choice; we would hardly pick parameter values under which our sample is improbable. So we simply take that maximizing parameter as our estimate of the true value.
