Machine Learning Notes (10): The EM Algorithm in Practice (with the Gaussian Mixture Model (GMM) as a complete worked example)


Today's topic is the EM algorithm. The first time I saw "EM" I thought of a certain streamer catchphrase ("em... RUA!!!"); I wonder whether anyone reading this blog gets the reference. Anyway, today's algorithm is EM, short for Expectation Maximization. What is it for? You are given a pile of observed samples and asked to estimate the parameters of the model that produced them. Hold on, isn't that exactly what we did back when we discussed the various regressions? Write down the likelihood, take the log, set the derivative to zero, and out come the parameter estimates. I know this routine: it's MLE! But what if there is a hidden variable sitting in the middle of the problem? Then that routine collapses. Let's look at two examples to understand the two situations.

====================================================================

When there is no hidden intermediate variable

Suppose that one day humanity does away with gender differences and everyone is the same sex. I then hand you a group of people's heights and ask you to build a model of height for me.

What do we do? Height feels like it should follow a Gaussian distribution, so assume height obeys N(mu, sigma^2). I now have n height measurements, so I can follow the routine above: first write down the log-likelihood function.
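In the standard notation, with $x_1,\dots,x_n$ the observed heights:

$$\ell(\mu,\sigma^2)=\sum_{i=1}^{n}\log\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)=-\frac{n}{2}\log(2\pi)-n\log\sigma-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$$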


Then set the partial derivatives with respect to the two parameters to zero:
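$$\frac{\partial \ell}{\partial \mu}=\sum_{i=1}^{n}\frac{x_i-\mu}{\sigma^2}=0,\qquad \frac{\partial \ell}{\partial \sigma^2}=-\frac{n}{2\sigma^2}+\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^4}=0$$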


Solving these gives our parameter estimates:
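$$\hat{\mu}=\frac{1}{n}\sum_{i=1}^{n}x_i,\qquad \hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$$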


The result is lovely and agrees with our intuition: just the sample mean and the sample variance. Now let's look at another situation.

====================================================================

When there is a hidden intermediate variable

As you know, height depends strongly on ethnicity, and humanity has many ethnic groups. So if I give you a pile of people's height data from a mixed population and ask you to model height, what should we do?

First, the population is split into several different groups, and the probability of belonging to each group follows some distribution; second, the heights within the different groups follow different distributions. So the height model becomes:
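With $\alpha_k$ the proportion of group $k$ and $N(x\mid\mu_k,\sigma_k^2)$ its Gaussian density:

$$p(x)=\sum_{k=1}^{K}\alpha_k\,N(x\mid\mu_k,\sigma_k^2),\qquad \sum_{k=1}^{K}\alpha_k=1$$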


Here alpha_k is the probability that a sample belongs to group k; which group a sample actually comes from is the hidden intermediate variable. mu_k and sigma_k^2 are the parameters of the individual Gaussians. Following the pattern above, we would differentiate the log-likelihood to obtain the parameter estimates, so first look at the likelihood function:
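$$\ell(\theta)=\sum_{i=1}^{n}\log\sum_{k=1}^{K}\alpha_k\,N(x_i\mid\mu_k,\sigma_k^2)$$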


Now an embarrassing situation arises: there is a sum inside the logarithm, so the differentiation becomes messy and unsolvable; in fact the equations have no analytic solution. But suppose we simply guess some distribution Q over the hidden group membership. Then the log-likelihood can be written as:
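$$\ell(\theta)=\sum_{i=1}^{n}\log\sum_{k=1}^{K}Q_i(k)\,\frac{\alpha_k\,N(x_i\mid\mu_k,\sigma_k^2)}{Q_i(k)}$$

where $Q_i(k)$ is the guessed probability that sample $i$ belongs to group $k$.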


Since Q is a distribution over the hidden group, the term inside the outer sum can be viewed as log(E[X]). Now, log is a concave function, so its secant always lies below the graph of the function, and applying Jensen's inequality in reverse gives log(E[X]) >= E[log(X)]. So the log-likelihood above satisfies:
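$$\ell(\theta)\ \ge\ \sum_{i=1}^{n}\sum_{k=1}^{K}Q_i(k)\,\log\frac{\alpha_k\,N(x_i\mid\mu_k,\sigma_k^2)}{Q_i(k)}$$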


Let's calmly analyze the situation. We now have a lower-bound function for the log-likelihood, and we adopt the strategy of attacking the problem from the side: we maximize the lower bound instead. The updated parameters make the lower-bound function larger than it was at the previous parameter values, and since the bound touches the log-likelihood at the current parameters (which is exactly how Q will be chosen below), the log-likelihood itself must keep increasing as the parameters are updated! Iterating this way, we eventually obtain a better parameter estimate. Still a little dizzy? No problem, here is a picture borrowed from Baidu to make the explanation more vivid.


The red curve is our log-likelihood function, and the blue one is the lower-bound function found at the current parameters. We find the local maximum of the lower bound, the parameters are updated to theta_new, and the log-likelihood value rises as well. Repeating this, can we converge to a local maximum of the log-likelihood? Yes, a local maximum: it is not guaranteed to be the global optimum, but it is an estimate, what more do you want from it?!

So far we seem to have jumped straight to the second step, the parameter update, but there is one thing we must pin down first: Q. What is Q? Without Q there is no lower bound to maximize and no parameters to update. We already know Q is a distribution over the hidden group, and we naturally want the lower bound to hug the original log-likelihood as tightly as possible so that the parameter updates are as effective as possible. The lower bound is largest when the inequality becomes an equality, and in Jensen's inequality equality holds when the quantity inside the expectation is a constant, because then it makes no difference whether the log is taken before or after the expectation. Imposing this condition gives:
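Requiring $\frac{\alpha_k\,N(x_i\mid\mu_k,\sigma_k^2)}{Q_i(k)}$ to be constant in $k$, together with $\sum_k Q_i(k)=1$, gives

$$Q_i(k)=\frac{\alpha_k\,N(x_i\mid\mu_k,\sigma_k^2)}{\sum_{j=1}^{K}\alpha_j\,N(x_i\mid\mu_j,\sigma_j^2)}=p(z_i=k\mid x_i;\theta)$$

where $z_i$ denotes the hidden group of sample $i$.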


Intuitively, this is just the probability that sample i comes from category k. Good, now Q is settled: we update the parameters by the method described above, then update Q, then update the parameters again, and the iteration rolls on.

If you have persisted in reading this far, I can only say well done, because we have actually just walked through the entire derivation of the EM algorithm. Perhaps you are still a little foggy, so let's carefully comb through the process (a minimal code sketch of the loop follows the list):

1 Gather all the observed samples and give an initial parameter estimate based on prior knowledge or preference.

2 Using this parameter estimate and the samples, compute the category distribution Q, which yields the lower-bound function closest to the log-likelihood.

3 Maximize the lower-bound function and update the parameter estimates.

4 Iterate until convergence.
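As a minimal illustration of that loop (and only an illustration: e_step and m_step are hypothetical placeholders for the computations in steps 2 and 3, not functions from any library):

def em(X, init_params, e_step, m_step, max_iter=100, tol=1e-6):
    # e_step(X, params) returns the category distribution Q and the current log-likelihood;
    # m_step(X, Q) maximizes the resulting lower bound and returns new parameters.
    params = init_params
    prev_ll = float('-inf')
    for _ in range(max_iter):
        Q, ll = e_step(X, params)    # step 2: build the lower bound that touches the log-likelihood
        params = m_step(X, Q)        # step 3: maximize the lower bound
        if abs(ll - prev_ll) < tol:  # step 4: stop once the log-likelihood stabilizes
            break
        prev_ll = ll
    return params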

By the way, the EM algorithm is said to be an advanced machine learning algorithm, but at least for now the idea is quite easy to understand. The only step in the whole process that may feel a bit roundabout is the application of Jensen's inequality, so let me ramble on for two more sentences. Jensen's inequality can be understood like this: for a convex function f, the secant always lies above the graph of the function, right? Take two points x1 and x2 and a parameter theta between 0 and 1; then theta*x1 + (1-theta)*x2 is a point between x1 and x2, and at this point the value on the secant through x1 and x2 is greater than the value of the function, so theta*f(x1) + (1-theta)*f(x2) >= f(theta*x1 + (1-theta)*x2). Generalizing this conclusion gives E[f(X)] >= f(E[X]). In the log-likelihood the log is a concave function, so we flip the inequality around and use it the other way. No problem, right?! Once this point is clear, I think the whole EM derivation is easy to follow.
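In symbols, the two versions used above are:

$$\theta f(x_1)+(1-\theta)f(x_2)\ \ge\ f\big(\theta x_1+(1-\theta)x_2\big)\ \Longrightarrow\ E[f(X)]\ \ge\ f(E[X])\quad\text{for convex }f,$$
$$\log E[X]\ \ge\ E[\log X]\quad\text{for the concave }\log.$$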

Now back to our height-model problem: given m samples and K ethnic groups, with the heights in each group following their own Gaussian distribution, this becomes the most representative example of the EM algorithm, the Gaussian mixture model, GMM.

====================================================================

Gaussian mixture model (GMM)

We have just gone through the EM algorithm, so suppose we are in the middle of one of its iterations. What do we do?

E-step: find the best category distribution
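$$w_{ij}=Q_i(z_i=j)=\frac{\psi_j\,N(x_i\mid\mu_j,\sigma_j^2)}{\sum_{l=1}^{K}\psi_l\,N(x_i\mid\mu_l,\sigma_l^2)}$$

with $\psi_j$ the mixing proportion of class $j$ (the GMM counterpart of $\alpha_k$ above).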


w(i,j) can be understood as the probability that the i-th sample belongs to class j.

M-step: update the parameters

Having obtained Q, we have the lower-bound function closest to the original log-likelihood, and maximizing it gives the updated parameters. First take a look at this lower-bound function:
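$$\sum_{i=1}^{m}\sum_{j=1}^{K}w_{ij}\,\log\frac{\psi_j\,\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\left(-\frac{(x_i-\mu_j)^2}{2\sigma_j^2}\right)}{w_{ij}}$$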


Inside the log everything is a product, which is our favorite form: when differentiating with respect to one parameter, the irrelevant terms can simply be thrown away. We differentiate with respect to mu, sigma^2 and psi in turn and set each derivative to zero; for mu_j and sigma_j^2 this gives directly:
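$$\mu_j=\frac{\sum_{i=1}^{m}w_{ij}\,x_i}{\sum_{i=1}^{m}w_{ij}},\qquad \sigma_j^2=\frac{\sum_{i=1}^{m}w_{ij}\,(x_i-\mu_j)^2}{\sum_{i=1}^{m}w_{ij}}$$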


Solving for psi is a bit more involved. First drop all the psi-irrelevant terms from the lower-bound function; then note that, as mixing proportions, the psi_j carry the natural constraint that they sum to 1. So the objective becomes:
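$$\max_{\psi}\ \sum_{i=1}^{m}\sum_{j=1}^{K}w_{ij}\log\psi_j\qquad\text{s.t.}\quad \sum_{j=1}^{K}\psi_j=1$$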


This kind of constrained optimization is something we used the Lagrange multiplier method on countless times back in the SVM posts:
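$$L(\psi,\beta)=\sum_{i=1}^{m}\sum_{j=1}^{K}w_{ij}\log\psi_j+\beta\left(\sum_{j=1}^{K}\psi_j-1\right)$$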


Next, differentiate with respect to psi_j:
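$$\frac{\partial L}{\partial \psi_j}=\frac{\sum_{i=1}^{m}w_{ij}}{\psi_j}+\beta=0\quad\Longrightarrow\quad \psi_j=-\frac{\sum_{i=1}^{m}w_{ij}}{\beta}$$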


Summing both sides over j from 1 to K, the psi term on the left becomes 1 (the psi_j sum to 1), and since the w(i,j) for each sample also sum to 1 over j, the right-hand side becomes -m/beta, where m is the number of samples. That fixes beta = -m, and we obtain the psi update:
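$$\psi_j=\frac{1}{m}\sum_{i=1}^{m}w_{ij}$$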


At this point all the parameter updates are in hand, and we simply iterate.

Let's comb through the GMM algorithm once more.

1 Pick initial values for the parameters and start iterating.

2 Compute, for each sample, the probability of belonging to each category; the technical term for this is the responsibility:
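$$w_{ij}=\frac{\psi_j\,N(x_i\mid\mu_j,\sigma_j^2)}{\sum_{l=1}^{K}\psi_l\,N(x_i\mid\mu_l,\sigma_l^2)}$$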


3 Update the model parameters:
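$$\mu_j=\frac{\sum_{i}w_{ij}x_i}{\sum_{i}w_{ij}},\qquad \sigma_j^2=\frac{\sum_{i}w_{ij}(x_i-\mu_j)^2}{\sum_{i}w_{ij}},\qquad \psi_j=\frac{1}{m}\sum_{i}w_{ij}$$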


4 Repeat steps 2 and 3 until convergence. (A minimal NumPy sketch of this loop follows.)
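For concreteness, here is a bare-bones one-dimensional sketch of steps 2 and 3 in NumPy, following the formulas above (the function name and the initialization choices are mine; it is only an illustration, not a production implementation):

import numpy as np

def gmm_em_1d(x, K, n_iter=100, seed=0):
    # x: 1-D array of m samples; K: number of Gaussian components
    rng = np.random.RandomState(seed)
    m = len(x)
    psi = np.full(K, 1.0 / K)              # mixing proportions
    mu = rng.choice(x, K, replace=False)   # initial means: K random samples
    sigma2 = np.full(K, np.var(x))         # initial variances
    for _ in range(n_iter):
        # Step 2 (E-step): responsibilities w[i, j] = P(sample i belongs to class j)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        w = psi * dens
        w /= w.sum(axis=1, keepdims=True)
        # Step 3 (M-step): responsibility-weighted updates of mu, sigma^2, psi
        Nj = w.sum(axis=0)                 # effective number of samples in each class
        mu = (w * x[:, None]).sum(axis=0) / Nj
        sigma2 = (w * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
        psi = Nj / m
    return psi, mu, sigma2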

Let's take a look at the meaning of these parameters; they actually fit our intuition quite well. w(i,j) can be seen as the probability that sample i belongs to class j, so the effective number of samples belonging to class j is the sum of w(i,j) over i; each sample x_i contributes w(i,j)*x_i to class j, so the weighted average is class j's mu; continuing with the same idea, the weighted variance is class j's sigma^2; and the probability of class j is the effective count of class j divided by the total number of samples. So although the GMM model looks a little scary at first, on reflection the final result matches our intuition: each sample belongs partly to each class, and the parts of all samples belonging to a given class make up that class's distribution. Perfect!!!

====================================================================

That wraps up the theoretical part; now it's time to roll up our sleeves. After finishing the last post I thought: the iris data should also be doable with an unsupervised algorithm; let's withhold the labels, force it to cluster, and see how well it does. So here we run k-means and GMM on the iris data and compare their clustering results.

The code is as follows

import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# read the data (first two features of iris)
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
mu = np.array([np.mean(X[y == i], axis=0) for i in range(3)])
print('actual mean =\n', mu)

# k-means
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=0)
y_hat1 = kmeans.fit_predict(X)
mu1 = np.array([np.mean(X[y_hat1 == i], axis=0) for i in range(3)])
print('k-means mean =\n', mu1)
print('k-means classification accuracy =', np.mean(y_hat1 == y))

# GMM
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)
print('GMM mean =\n', gmm.means_)
y_hat2 = gmm.predict(X)
print('GMM classification accuracy =', np.mean(y_hat2 == y))

The output result is

Actual mean value =

[[5.006 3.418]

[5.936 2.77]

[6.588 2.974]]

K-means mean value =

[[5.77358491 2.69245283]

[5.006 3.418]

[6.81276596 3.07446809]]

The correct rate of classification is 0.233333333333

GMM Mean value =

[[5.01494511 3.44040237]

[6.69225795 3.03018616]

[5.90652226 2.74740414]]

The correct rate of classification is 0.533333333333

Smash the keyboard! What kind of accuracy is that?!! Hold on, wise as we are, let's take a look at the mean matrices. The first row given by k-means appears to be close to the actual second row, and its second row is close to the actual first row. Similarly, the mean matrix given by GMM has the same problem: its second and third rows seem to be swapped. This is not the algorithm's fault: it only clusters the data for you, and there is no way it can guarantee the same label numbering as yours. With three categories there are six possible labelings, and the algorithm can only land on one of them. So now let's remap the predicted labels and see the real accuracy.

import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# read the data (first two features of iris)
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
mu = np.array([np.mean(X[y == i], axis=0) for i in range(3)])
print('actual mean =\n', mu)

# k-means, with the predicted labels remapped to match the true labeling
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=0)
y_hat1 = kmeans.fit_predict(X)
y_hat1[y_hat1 == 0] = 3
y_hat1[y_hat1 == 1] = 0
y_hat1[y_hat1 == 3] = 1
mu1 = np.array([np.mean(X[y_hat1 == i], axis=0) for i in range(3)])
print('k-means mean =\n', mu1)
print('k-means classification accuracy =', np.mean(y_hat1 == y))

# GMM, with the predicted labels remapped to match the true labeling
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)
print('GMM mean =\n', gmm.means_)
y_hat2 = gmm.predict(X)
y_hat2[y_hat2 == 1] = 3
y_hat2[y_hat2 == 2] = 1
y_hat2[y_hat2 == 3] = 2
print('GMM classification accuracy =', np.mean(y_hat2 == y))

The output result is

Actual mean value =

[[5.006 3.418]

[5.936 2.77]

[6.588 2.974]]

K-means mean value =

[[5.006 3.418]

[5.77358491 2.69245283]

[6.81276596 3.07446809]]

The correct rate of classification is 0.82

GMM Mean value =

[[5.01494511 3.44040237]

[6.69225795 3.03018616]

[5.90652226 2.74740414]]

The correct rate of classification is 0.786666666667

Ah, much more satisfying: these results are even better than some of the earlier supervised learning results... As for the label-mismatch problem, I used the dumbest possible manual remapping here. You could of course automate it, for example by matching each row of your computed mean matrix to the closest row of the true mean matrix and relabeling accordingly; that works too, I was just being a little lazy here.
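For example, a minimal sketch of that automatic relabeling idea might look like this (relabel_by_nearest_mean is my own helper, not part of sklearn; it greedily maps each predicted cluster to the true class with the closest mean, so a strict one-to-one assignment is not guaranteed):

import numpy as np

def relabel_by_nearest_mean(y_pred, mu_pred, mu_true):
    # mu_pred[k] is the mean of predicted cluster k (computed from the raw predicted labels),
    # mu_true[j] is the mean of true class j; map each cluster to the class with the closest mean.
    dist = np.linalg.norm(mu_pred[:, None, :] - mu_true[None, :, :], axis=2)
    mapping = dist.argmin(axis=1)   # cluster k -> true class mapping[k]
    return mapping[y_pred]

# usage with the k-means results above:
# mu1_raw = np.array([np.mean(X[y_hat1 == k], axis=0) for k in range(3)])
# y_hat1 = relabel_by_nearest_mean(y_hat1, mu1_raw, mu)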

OK, another happy workday is over, hahaha. Hello, weekend!!!



