Andrew Ng Machine Learning Open Course Notes -- Mixtures of Gaussians and the EM Algorithm


Netease Open Course (Lectures 12 and 13)
Notes: 7A, 7B, 8

This chapter introduces unsupervised learning algorithms.
Among unsupervised algorithms, k-means is the most typical and the simplest; handout 7A covers it directly, so only a minimal sketch is given below.
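As a quick reference only (this sketch is not from the handout; the function and variable names are my own), k-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch. X: (m, n) data matrix, k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct training points (empty clusters are not handled here).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: label each point with the index of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```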

 

Mixtures of Gaussians

To understand mixtures of Gaussians, first go back and review Gaussian discriminant analysis (GDA).

First, Gaussian discriminant analysis is a generative algorithm:

it does not fit p(y | x) directly, but instead models p(x | y) p(y), i.e. the joint distribution p(x, y).

p(y) follows a Bernoulli distribution (or, for multiclass problems, a multinomial distribution).
p(x | y) follows a multivariate Gaussian distribution.

Then we learn the parameters by maximum likelihood, maximizing the joint log-likelihood

\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}, y^{(i)}) = \sum_{i=1}^{m} \left[ \log p(x^{(i)} \mid y^{(i)}) + \log p(y^{(i)}) \right],

which has a closed-form solution, so this problem is solved.
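For reference, the standard closed-form GDA estimates (restated here from the earlier GDA material) are

\phi = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = 1\}, \qquad
\mu_k = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = k\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = k\}}, \qquad
\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T.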

 

For the Gaussian mixture model, the difference is that for the given data points, y is unknown; that is, the setting is unsupervised.
Let's look at the formal definition.

Since y is unknown, we rename it z: a latent random variable (latent random variables are hidden/unobserved).


z follows a multinomial distribution, with parameter \phi_j giving the probability that z = j; therefore \phi_j \ge 0 and \sum_j \phi_j = 1.

x | z = j follows a multivariate Gaussian distribution.

Compared with Gaussian discriminant analysis, one difference is simply that y is replaced by z, indicating that z is unknown and unobserved.
The other difference is that each Gaussian component here has its own parameters (in particular its own covariance \Sigma_j), unlike GDA, where a single covariance is shared.
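Putting the pieces together (a standard way of writing the model, restated here for reference):

z^{(i)} \sim \mathrm{Multinomial}(\phi), \qquad x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j),

so the marginal density of a data point is a weighted sum of the component Gaussians:

p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{j=1}^{k} p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi) = \sum_{j=1}^{k} \phi_j\, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j).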

The log-likelihood is

\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)} = 1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi).

Unlike Gaussian discriminant analysis, the likelihood involves only x, not p(x, y), because y (now z) is unobserved.
But how can this problem be solved?
Picture one-dimensional data with many points drawn from a mixture of several Gaussians. The fitted Gaussians should sit where the points are densest, i.e. where p(x) is relatively high. Assuming our data points are representative, a fit that assigns high p(x) to the observed data is the more reasonable one, and that is exactly what maximizing the likelihood asks for.

How can we solve this problem?
It is difficult to maximize this directly, for example by gradient methods, because of the sum inside the log; try it and you will find there is no closed-form solution.

Of course, if z were known, this would be very simple: it essentially becomes a Gaussian discriminant analysis problem. But here z is unknown.

To solve this problem, we use the EM algorithm (Expectation-Maximization).

The idea of the algorithm is actually very simple, but deriving it and proving its convergence and correctness is more involved.

So let's first look at the idea and the implementation, and then at the derivation.

The idea is simple: since we don't know z, but could solve the problem if we did, we first guess z and then iterate.

The details are as follows,

E-step: with the current parameters (initialized arbitrarily at first), compute for each x^{(i)} the distribution over its z^{(i)}. In fact, all we need is this posterior probability.

The formula is

w_j^{(i)} := p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \phi)},

where p(z^{(i)} = j; \phi) and p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma) are the multinomial and multivariate Gaussian distributions respectively, so the formula is easy to evaluate.

M-step:

use the w's from the E-step to re-estimate the parameters. This is why computing the w's was enough: they are all that is needed to form the new estimates

\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}, \qquad
\mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}, \qquad
\Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_{i=1}^{m} w_j^{(i)}}.

As to why these formulas are used: they can be obtained from the Gaussian discriminant analysis estimates above,

with the hard indicator 1\{y^{(i)} = j\} simply replaced by the soft weight w_j^{(i)}.

By iterating the E and M steps, the algorithm converges, though only to a local optimum; as with k-means, we can try several initial values and keep the best to look for the global optimum. A minimal implementation sketch follows.
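The sketch below is not from the handout (the function name, initialization choices, and use of scipy's multivariate_normal density are my own), but it implements exactly the E-step and M-step formulas above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a mixture of Gaussians. X: (m, n) data matrix, k: number of components."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    # Arbitrary initialization: random data points as means, identity covariances, uniform weights.
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)
    sigma = np.array([np.eye(n) for _ in range(k)])

    for _ in range(n_iters):
        # E-step: w[i, j] = p(z_i = j | x_i) via Bayes' rule.
        w = np.column_stack([
            phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
            for j in range(k)
        ])
        w /= w.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates for phi, mu, sigma.
        nj = w.sum(axis=0)              # effective number of points per component
        phi = nj / m
        mu = (w.T @ X) / nj[:, None]
        for j in range(k):
            d = X - mu[j]
            sigma[j] = (w[:, j, None] * d).T @ d / nj[j]

    return phi, mu, sigma, w
```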

But why does this simple procedure work? How should we understand EM? Read on.

 

The EM Algorithm

We have seen EM used to fit the Gaussian mixture problem, but that is only a special case of EM.

This section derives the general form of EM, which can be applied to all kinds of estimation problems with latent variables.

 

Jensen's Inequality

First, let's introduce Jensen's inequality.

As the figure in the handout illustrates, when f is a convex function and X is a random variable,
E[f(X)] \ge f(E[X]).

Intuition for the convex case: if X takes two values with equal probability, then f(E[X]) is the value of the curve at the midpoint, while E[f(X)] is the midpoint of the chord joining the two function values; for a convex function the chord lies above the curve, so the inequality holds.

Note that an ordinary convex function may have second derivative equal to 0 somewhere, for example along a straight segment; f is strictly convex when f'' > 0 everywhere. For strictly convex f, E[f(X)] = f(E[X]) holds if and only if X = E[X] with probability 1, i.e. X is a constant.

Note also that the result holds for concave functions, but with the inequality reversed; since log is concave, that reversed form is the one we will use below.
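A tiny numeric check (my own example, not from the notes): take the convex function f(x) = x^2 and let X be 0 or 2, each with probability 1/2. Then

E[f(X)] = \tfrac{1}{2}\cdot 0^2 + \tfrac{1}{2}\cdot 2^2 = 2 \ \ge\ f(E[X]) = 1^2 = 1,

and the gap closes only if the two values coincide, i.e. X is a constant.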

 

EM Algorithm

Let's take a look at the EM algorithm,

Given m independent training examples, the log-likelihood is

\ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta).

This is the general form: the parameter is just \theta, and nothing is assumed about the distributions of z and x | z; they can be any distributions.


Maximizing this directly is very difficult, so we use the EM algorithm instead.

The strategy is:

E-step: construct a lower bound on \ell(\theta).
Starting from an arbitrary initialization of the parameters, we build a lower bound on the log-likelihood at the current parameter values.
The distribution Q over z that makes this bound tight is what the E-step computes.

M-step: optimize that lower bound.
That is, re-estimate the parameters by maximizing the lower bound built from the Q obtained in the E-step.

Pictorially, over the iterations the lower bound repeatedly touches the true log-likelihood curve from below and is pushed upward, so the likelihood keeps improving.

 

First, let Q_i be some distribution over the z's, with Q_i(z^{(i)}) \ge 0 and \sum_{z^{(i)}} Q_i(z^{(i)}) = 1.

Then, in order to apply Jensen's inequality, we multiply and divide the term inside the log by Q_i(z^{(i)}), which turns the inner sum into an expectation:

\ell(\theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)    (1)
            = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}    (2)
            \ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}    (3)

Recall the definition of expectation: for z^{(i)} drawn from Q_i, E_{z^{(i)} \sim Q_i}[g(z^{(i)})] = \sum_{z^{(i)}} Q_i(z^{(i)})\, g(z^{(i)}).

(Reference: the EM algorithm notes.)

Matching this to the formula above: here g(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}, and the probability of z^{(i)} is Q_i(z^{(i)}), so each term of (2) is exactly \log E_{z^{(i)} \sim Q_i}[g(z^{(i)})].

Now look at Jensen's inequality, E[f(X)] \ge f(E[X]). Here f is log, which is concave, so the inequality reverses: f(E_{z^{(i)} \sim Q_i}[g(z^{(i)})]) \ge E_{z^{(i)} \sim Q_i}[f(g(z^{(i)}))]. Applying this inside each term gives step (3).

Therefore, for any choice of the distributions Q_i, (3) is a lower bound on \ell(\theta).

In the M-step we will optimize this lower bound, but the Q_i distributions have not been determined yet: which choice of Q is best?

The bound (3) holds for any Q_i, but we would like it to be as tight as possible at the current parameters \theta; ideally, (3) should hold with equality there, so that the lower bound equals \ell(\theta).

From Jensen's inequality, equality holds when the quantity inside the expectation is a constant, that is,

\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c for some constant c not depending on z^{(i)}.

Because \sum_{z^{(i)}} Q_i(z^{(i)}) = 1, summing the numerators and the denominators over all z^{(i)} still gives the same ratio c (for example, (2 + 4)/(1 + 2) is still 2), so we get

Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta).

So the Q_i distribution should simply be the posterior distribution of z^{(i)} given x^{(i)} under the current parameters.

Therefore, the final general EM algorithm is:

Repeat until convergence {
    (E-step) For each i, set Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta).
    (M-step) Set \theta := \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.
}

Comparing this with the mixture-of-Gaussians EM above shows how the special case relates to the general form.

Does this algorithm converge? What the formulas below show is that the likelihood at iteration t + 1 is at least the likelihood at iteration t.

The argument is as follows:

\ell(\theta^{(t+1)}) \ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})}    (4)
                     \ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}    (5)
                     = \ell(\theta^{(t)})    (6)

(4) holds because (3) is a lower bound for any choice of Q and any \theta, in particular for Q^{(t)} and \theta^{(t+1)}.

(5) holds because the M-step chooses \theta^{(t+1)} to maximize this expression with Q fixed at Q^{(t)}, so its value at \theta^{(t+1)} is at least its value at \theta^{(t)}.

(6) holds because Q^{(t)} was chosen precisely so that Jensen's inequality is tight at \theta^{(t)}, which is what had to be verified.

So the likelihood never decreases; like k-means, EM converges, though possibly only to a local optimum.

From another perspective, EM is actually a coordinate ascent algorithm:

in the E-step we solve for the optimal Q with the parameters fixed,

and in the M-step we fix Q and solve for the optimal parameters \theta.
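Concretely (this is the standard way of writing it, restated here), define

J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.

Then \ell(\theta) \ge J(Q, \theta) for every Q; the E-step is coordinate ascent on J with respect to Q, and the M-step is coordinate ascent on J with respect to \theta.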

 

Mixture of Gaussians Revisited

Having seen the general EM algorithm, let's go back and look at the mixture-of-Gaussians algorithm again; it should be clearer now.

The E-step is simple:

in general EM it is Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta), and for the Gaussian mixture this is exactly

w_j^{(i)} = Q_i(z^{(i)} = j) = P(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma),

which is the Bayes-rule computation used before, so it needs no further explanation.

For the M-step, we need to maximize the lower bound with respect to \phi, \mu, and \Sigma (the objective is written out after this paragraph).

The solution process is to take the derivative with respect to each parameter separately and set it to zero; solving gives the update formulas listed earlier. For the detailed derivation, refer to the handout; it is not repeated here.
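For reference, the M-step objective for the Gaussian mixture (the general lower bound with Q_i(z^{(i)} = j) = w_j^{(i)}) is

\sum_{i=1}^{m} \sum_{j=1}^{k} w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) \cdot \phi_j}{w_j^{(i)}}.

For example, setting the derivative with respect to \mu_l to zero gives \mu_l = \frac{\sum_i w_l^{(i)} x^{(i)}}{\sum_i w_l^{(i)}}, matching the update used earlier; the \phi_j update additionally needs a Lagrange multiplier for the constraint \sum_j \phi_j = 1.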

 

Text Clustering: Mixture of Naive Bayes Models

There are no handouts for this part.

This is the Naive Bayes text-classification setting, except that the training set does not have the labels y, so it becomes a clustering problem.
We have m texts, each represented as an n-dimensional vector x^{(i)} \in \{0, 1\}^n, where x_j^{(i)} indicates whether word j appears in text i.

The latent variable z also takes values in \{0, 1\}, indicating one of two classes, so z follows a Bernoulli distribution.

p(x \mid z) follows the Naive Bayes model, i.e. p(x \mid z) = \prod_{j=1}^{n} p(x_j \mid z).

The E-step and M-step follow the same pattern as before; a sketch of their standard form is given after this paragraph.

As before, the M-step is obtained by maximizing the lower bound, i.e. the expected complete-data log-likelihood \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)})}{Q_i(z^{(i)})}, with respect to the parameters.
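Since there is no handout, the following is a reconstruction of the standard formulas under the usual parameterization (my notation: \phi_z = p(z = 1), \phi_{j|z=1} = p(x_j = 1 \mid z = 1), \phi_{j|z=0} = p(x_j = 1 \mid z = 0)).

E-step (Bayes' rule, using the Naive Bayes factorization of p(x \mid z)):

w^{(i)} = P(z^{(i)} = 1 \mid x^{(i)}) = \frac{p(x^{(i)} \mid z^{(i)} = 1)\, \phi_z}{p(x^{(i)} \mid z^{(i)} = 1)\, \phi_z + p(x^{(i)} \mid z^{(i)} = 0)\, (1 - \phi_z)}

M-step (weighted maximum-likelihood updates):

\phi_{j|z=1} = \frac{\sum_{i=1}^{m} w^{(i)} x_j^{(i)}}{\sum_{i=1}^{m} w^{(i)}}, \qquad
\phi_{j|z=0} = \frac{\sum_{i=1}^{m} (1 - w^{(i)}) x_j^{(i)}}{\sum_{i=1}^{m} (1 - w^{(i)})}, \qquad
\phi_z = \frac{1}{m} \sum_{i=1}^{m} w^{(i)}.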
