Gaussian mixture model (GMM)

Source: Internet
Author: User

A quick review:
1. Probability density function (PDF) vs. cumulative distribution function (CDF). The probability density function is usually written "PDF" (Probability Density Function) and is sometimes loosely called the probability distribution function. The cumulative distribution function is the integral of the probability density function. Note the distinction.

Mathematically, the cumulative distribution function F(x) = P(X ≤ x) gives the probability that the random variable X takes a value no greater than x. Its meaning is easy to understand.

The probability density f(x) is the first derivative, i.e. the rate of change, of F(x) at x. If Δx is a very small neighborhood near x, then the probability that the random variable X falls in (x, x + Δx) is about f(x)Δx, that is, P(x < X ≤ x + Δx) ≈ f(x)Δx. In other words, f(x) is the probability that X falls within a "unit width" at x, which is where the word "density" comes from.

Examples:
A. The simplest probability density function is that of the uniform distribution. For a uniform distribution on the interval [a, b], the density is 0 when x is outside the interval, and equals 1/(b − a) on the interval. This density is not continuous everywhere, but it is integrable.

B. The normal distribution is an important probability distribution; its shape changes with its parameters μ and σ. Its probability density function is:

f(x) = 1/(σ√(2π)) · exp(−(x − μ)² / (2σ²))
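As a quick numerical check of the density-times-width idea above, here is a small Python sketch (standard library only; the closed-form CDF via the error function is standard for the normal distribution):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # F(x) = P(X <= x), expressed with the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Density times width approximates the probability of a small interval:
x, dx = 0.5, 1e-4
approx = normal_pdf(x) * dx
exact = normal_cdf(x + dx) - normal_cdf(x)
print(abs(approx - exact) < 1e-8)  # the two agree to high precision
```

The smaller Δx gets, the closer f(x)Δx comes to the exact interval probability F(x + Δx) − F(x).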

2. Prior probability and posterior probability

This article assumes the reader knows what a conditional probability is: P(A|B) denotes the probability that event A occurs given that event B has occurred.

Prior and posterior probabilities:
Textbook explanations are always too abstract, so let's understand the two through an example.

Suppose there are two possible causes for hitting a traffic jam when we go out (just an assumption, don't take it too seriously): too many vehicles, and a traffic accident.

The overall probability of a traffic jam, estimated before we know anything else, is a prior probability.

If, before going out, we hear on the news that there was a traffic accident on the road today, and we then want the probability of a traffic jam, that is a conditional probability: P(traffic jam | traffic accident). This reasons from cause to effect.

If instead we have already gone out and run into a traffic jam, and we then want the probability that a traffic accident caused it, that is called a posterior probability (it is also a conditional probability, but is conventionally named this way): P(traffic accident | traffic jam). This reasons from effect back to cause.

The following definitions are from the Baidu Encyclopedia:

A prior probability is a probability based on past experience and analysis, such as the one appearing in the law of total probability; it often plays the role of the "cause" in problems that reason from cause to effect.

A posterior probability is the probability, recomputed from information carried by the "result", of the cause that most likely produced it, as in Bayes' formula; it is the "cause" in problems that reason from effect back to cause.

----------

So what is the use of these two concepts?

Maximum likelihood estimation. Let's look at an example:

One day a patient went to the hospital. He told the doctor he had a headache, the doctor judged that he had a cold, and gave him some medicine to take home. Someone will surely ask what this example has to do with the maximum likelihood estimation we want to talk about. Quite a lot, actually: the doctor unknowingly used a maximum likelihood estimate (a bit far-fetched, but bear with it ^_^).

There are many possible causes of a headache: a cold, a stroke, a cerebral hemorrhage, and so on (patients whose heads ache just from looking at problems are outside the scope of this discussion!). So why did the doctor say the patient had a cold? The doctor would say it was years of medical experience. Looking at the problem from the perspective of probability, the doctor's brain actually worked like this: it computed

P(cold | headache) (the probability that the headache was caused by a cold; the lines below are read similarly)

P(stroke | headache)

P(cerebral hemorrhage | headache)

...

and found that P(cold | headache) was the largest, so it concluded that the patient had a cold. See? This is called maximum likelihood estimation (MLE).

Now think again: are P(cold | headache), P(stroke | headache), and P(cerebral hemorrhage | headache) prior probabilities or posterior probabilities?

Yes, they are posterior probabilities. So posterior probabilities are useful to the doctor (as long as you can compute them). In practice, that is exactly what posterior probabilities are for: given some observed facts (usually bad ones), analyze the most probable cause of the result, and then address the problem accordingly.

So what is the prior probability good for?

Think about how P(brain damage | headache) could be computed directly: P(brain damage | headache) = (number of people with both brain damage and a headache) / (number of people with a headache). A sample of headache sufferers is easy to find, but that count is hard to survey: if you ask someone with a headache whether they are brain-damaged, they will probably take a swing at you. This is where the prior probability comes in handy.

According to Bayes' formula: P(B|A) = P(A|B) P(B) / P(A)

we have: P(brain damage | headache) = P(headache | brain damage) P(brain damage) / P(headache). Note that P(brain damage) is the prior and P(headache | brain damage) is the likelihood, so with Bayes' formula we can use the prior to compute the posterior.

P(headache | brain damage) = (number of people with brain damage who have a headache) / (number of people with brain damage), so we only need to ask people with brain damage whether they have a headache, which is obviously safer. (As for how we count the brain-damaged, let's assume we have the legendary list. No need to argue, classmate, I didn't say you were on it.) Then we no longer have to grab headache sufferers and interrogate them; asking someone in a good mood is at least safer than asking someone with a headache.

I admit the example above is far-fetched, but it makes the point: in practice the posterior probability is generally hard to compute directly, while the prior is much easier to obtain. Therefore the prior is generally used, via Bayes' formula, to compute the posterior.
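The doctor example can be written out numerically. The probabilities below are invented purely for illustration (none of them come from real data):

```python
# Hypothetical numbers purely for illustration.
p_cause = 0.001          # prior P(cause), e.g. P(brain damage)
p_sym_given_cause = 0.9  # likelihood P(headache | cause), easy to survey
p_sym = 0.1              # marginal P(headache)

# Bayes' formula: P(cause | symptom) = P(symptom | cause) * P(cause) / P(symptom)
p_cause_given_sym = p_sym_given_cause * p_cause / p_sym
print(p_cause_given_sym)  # ≈ 0.009
```

The hard-to-survey posterior falls out of two easy-to-survey quantities and the prior, exactly as the text argues.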

3. Maximum likelihood estimation (MLE), maximum a posteriori estimation (MAP), Bayesian inference, and maximum entropy

Reference: "Parameter Estimation (2): Maximum Likelihood, Maximum a Posteriori, Bayesian Inference, and Maximum Entropy". "Parameter estimation is ... estimating the values of a probability distribution's parameters from measured or empirical data," as Wikipedia puts it.

There are many parameter estimation methods: maximum likelihood estimation, maximum a posteriori estimation, Bayesian inference, maximum entropy estimation, and so on. Although the methods differ, the ideas behind them are basically the same. To understand their connections and differences, just take the simplest possible example: observe a batch of values generated from a Gaussian distribution, and estimate one of the Gaussian's parameters, the mean. Our experimental data are 1000 points drawn from a Gaussian with mean 0; in the accompanying plot the horizontal axis is the index of the data (1:1000) and the vertical axis is the value of each sample point.

1. Maximum likelihood estimation (MLE)

What makes a parameter the "best"? The parameter under which the observed data have the highest probability of occurring (the so-called likelihood) is the best parameter. This simple idea is maximum likelihood estimation (MLE). For an independent and identically distributed (i.i.d.) sample set, the overall likelihood is the product of the likelihoods of the individual samples. In this example the likelihood is clearly:

L(μ) = ∏_i N(x_i; μ, σ²)

In practice, computing this product is cumbersome, and since each probability is small, the product of many of them tends toward 0 and invites numerical underflow bugs, so we usually take the log instead, obtaining the log-likelihood:

log L(μ) = Σ_i log N(x_i; μ, σ²) = −N log(σ√(2π)) − Σ_i (x_i − μ)² / (2σ²)

Taking the log does not change where the likelihood function attains its extremum, so we can maximize over μ directly (a function attains its extrema at points where the derivative is 0, or at boundary points), and we obviously get:

μ̂_MLE = (x_1 + x_2 + ... + x_N) / N

This completes the maximum likelihood estimate of the mean of a Gaussian distribution. It is worth mentioning that the log-likelihood in this example is so simple that the extremum can be found by direct differentiation; in most cases we need gradient descent or other optimization algorithms. Most optimization toolkits minimize a function by default, so don't forget to multiply your log-likelihood by −1 to turn it into a negative log-likelihood before plugging it into an optimizer.
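A minimal sketch of this MLE in Python, assuming, as in the text's experiment, samples drawn from a Gaussian with true mean 0:

```python
import random

random.seed(0)
# 1000 samples from a Gaussian with true mean 0 and standard deviation 1
xs = [random.gauss(0.0, 1.0) for _ in range(1000)]

# For a Gaussian with known variance, maximizing the log-likelihood
# sum_i log N(x_i; mu, sigma^2) over mu gives exactly the sample mean:
mu_mle = sum(xs) / len(xs)
print(mu_mle)  # close to the true mean 0
```

With 1000 samples the estimate lands very close to 0; the MAP section below looks at what happens when far fewer samples are available.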

2. Maximum a posteriori estimation (MAP)

MLE is simple and objective, but excessive objectivity can sometimes lead to overfitting. With few sample points, MLE does not work well. For this reason, the Bayesian school invented maximum a posteriori (MAP) estimation. Let's review prior, likelihood, and posterior through one of the simplest probabilistic graphical models:

Likelihood: for a parameter θ to be estimated, the probability density p(x|θ) with which it produces the observed sample x is called the likelihood of x;

Prior: θ itself is an unobserved variable; since it is not observed, it can be regarded as a random variable, assumed to obey a probability distribution p(θ|α) with hyperparameter α, called the prior of θ;

Posterior: after observing x, our knowledge of θ improves, and its distribution is updated to p(θ|α, x), called the posterior of θ. Bayes' formula gives the posterior distribution:

p(θ|α, x) ∝ p(x|θ) p(θ|α)

That is, the product of the prior and the likelihood. In this article's example, if we know in advance that the mean μ itself obeys a Gaussian distribution with mean μ0 and variance σ0², then after observing the data samples the posterior distribution of μ is:

p(μ | x_1, ..., x_N) ∝ [∏_i N(x_i; μ, σ²)] · N(μ; μ0, σ0²)

The rest is exactly the same as MLE: find the μ that maximizes the posterior. For convenience, fix μ0 at 0 and vary σ0 in several sets of comparison experiments:

In the plot, the horizontal axis is the number of samples used to estimate the parameter, and the vertical axis is the error between the estimate and the true value. σ0 takes the three values 0.01, 0.1, and 1; since it controls the prior's spread, the smaller it is, the stronger the prior.

We can see:

1) MLE is inaccurate when data are scarce;

2) MAP with a strong prior (the red and yellow lines in the plot) achieves better results with a small amount of data;

3) MAP with a weak prior (the blue line in the plot) degenerates to MLE.

What the plot does not show: what if our advance information about μ is wrong, i.e. we choose a strong prior that deviates from reality (say, μ0 set to 5 and σ0 set to 0.01)? The result is then even worse than MLE, which is one of the points for which the Bayesian school is most widely criticized: on what basis do you choose the prior? Most of the time, we choose a conjugate prior that is convenient to compute with but does not carry much information (what is a conjugate prior? That's a story for another time).
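The MAP experiment above can be sketched as follows. The closed-form MAP estimate for a Gaussian mean under a Gaussian prior comes from setting the derivative of the log-posterior to zero; the sample size and prior widths below are illustrative choices:

```python
import random

def map_mean(xs, sigma=1.0, mu0=0.0, sigma0=0.1):
    # MAP estimate of the mean of N(mu, sigma^2) under a N(mu0, sigma0^2) prior:
    # maximize sum_i log N(x_i; mu, sigma^2) + log N(mu; mu0, sigma0^2),
    # which in closed form gives the shrinkage estimator below.
    n = len(xs)
    return (sigma0 ** 2 * sum(xs) + sigma ** 2 * mu0) / (n * sigma0 ** 2 + sigma ** 2)

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(5)]  # few samples: MLE is noisy
mle = sum(xs) / len(xs)
strong_prior = map_mean(xs, sigma0=0.1)   # strong prior pulls toward mu0 = 0
weak_prior = map_mean(xs, sigma0=100.0)   # weak prior: MAP degenerates to MLE
print(mle, strong_prior, weak_prior)
```

The strong prior shrinks the noisy 5-sample estimate toward μ0 = 0 (the true mean here), while the weak prior reproduces the MLE almost exactly, matching observations 2) and 3) above.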

3. Bayesian inference

In fact, MAP not only leaves the frequentist school unhappy; it does not even satisfy the stricter Bayesians.

First, MAP takes only the peak (the mode) of the posterior distribution, and the mode is often not very representative (especially for multi-modal functions);

Second, MAP has a peculiar drawback: it is sensitive to the parameterization. If we estimate the variance, we find that the solution obtained with the variance as the parameter is not the square of the solution obtained with the standard deviation as the parameter. MLE does not have this problem.

So rather than settling for the peak of the posterior, it is better to work out the entire posterior distribution and describe the parameter to be estimated with a full distribution. This is Bayesian inference.

But didn't we just write down the entire posterior distribution in the MAP section? Yes, but only because the example is so simple. In most probabilistic graphical models with more than three nodes we cannot obtain the exact posterior distribution and must rely on various approximation techniques, hence the Laplace approximation, variational inference, Gibbs sampling, and so on. That material is quite involved, so we move on.

4. Maximum entropy estimation

The estimates in the preceding sections are all premised on knowing the form of the distribution and estimating only its parameters. But if the form of the distribution is unknown, can we still estimate it? The answer is not only yes, but reliably so: this is the famous maximum entropy method. On how to derive the maximum entropy estimate from the maximum entropy principle there are already plenty of introductions, so we will not repeat them here.

What we want to point out is that the maximum entropy estimate can also be viewed as an MLE.

First, although we do not know the sample's distribution, as a probability distribution it must satisfy:

1) it is non-negative everywhere;

2) it sums (integrates) to 1.

We can then construct a distribution of the following form:

p(x) = exp(λ · f(x)) / Z(λ)

where

Z(λ) = Σ_x exp(λ · f(x))

The exponential guarantees non-negativity and Z guarantees normalization, while f(x) can be constructed as any function of x; in that vast ocean of choices there is always an f(x) that makes p(x) close to the true distribution of the sample.

Now let's do MLE on this distribution; its average log-likelihood is:

(1/N) Σ_i λ · f(x_i) − log Z(λ)

This log-likelihood is concave in λ, so with a simple optimization algorithm (such as gradient ascent) we can obtain the optimal λ; substituting λ back into the formula for p(x) gives the concrete form of the distribution. In particular, when we take f(x) = (x, x²), the result is a Gaussian distribution. On the other hand, the estimate depends heavily on the choice of f(x), which is somewhat reminiscent of MAP.

This result is completely equivalent to the maximum entropy estimate. In other words, the maximum entropy estimate is equivalent to MLE over models of the above exponential form.

Models of this form are collectively called log-linear models. They are the basis of logistic regression, the maximum entropy model, and the various probabilistic graphical models represented by conditional random fields (CRFs).
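To make the equivalence concrete, here is a rough sketch that fits λ for f(x) = (x, x²) by gradient ascent on the average log-likelihood, over a discretized support standing in for the real line (the grid range, step size, learning rate, and iteration count are arbitrary illustration choices; a proper continuous treatment would replace the sum in Z with an integral):

```python
import math
import random

random.seed(2)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Discretized support standing in for the real line (a simplification)
grid = [i * 0.05 for i in range(-200, 201)]

def features(x):
    # f(x) = (x, x^2); with these features the maxent/MLE solution is Gaussian
    return (x, x * x)

# Empirical feature means: the constraints the fitted distribution must match
emp = [sum(col) / len(xs) for col in zip(*(features(x) for x in xs))]

lam = [0.0, 0.0]
for _ in range(500):  # gradient ascent on the average log-likelihood
    ws = [math.exp(lam[0] * x + lam[1] * x * x) for x in grid]
    z = sum(ws)
    model = [sum(w * features(x)[k] for w, x in zip(ws, grid)) / z
             for k in range(2)]
    # gradient = empirical feature mean - model feature mean
    lam = [l + 0.1 * (e - m) for l, e, m in zip(lam, emp, model)]

# For N(0, 1) the exponent is -x^2/2, so lambda should approach (0, -0.5)
print(lam)
```

The fitted exponent recovers roughly −x²/2, i.e. the standard Gaussian, illustrating the f(x) = (x, x²) claim above.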

Reference 2:

Maximum likelihood estimate:

Maximum likelihood estimation provides a way to estimate model parameters from observed data under the premise "the model is determined, the parameters are unknown." Simply put, suppose we want to survey the height of the country's population. We first assume height follows a normal distribution, but the mean and variance of that distribution are unknown. We have neither the manpower nor the resources to measure everyone's height, but we can obtain the heights of some people by sampling, and then recover the mean and variance of the assumed normal distribution by maximum likelihood estimation.

Sampling in maximum likelihood estimation must satisfy an important assumption: all samples are independent and identically distributed. Let's describe maximum likelihood estimation in detail:

First, assume the samples x1, x2, ..., xn are independent and identically distributed, θ is the model parameter, and f is the model we use. Under the i.i.d. assumption, the probability that a model f with parameter θ produces the above sample is:

f(x1, x2, ..., xn | θ) = f(x1 | θ) · f(x2 | θ) · ... · f(xn | θ)

Returning to "the model is determined, the parameters are unknown": at this point the samples are known and θ is unknown, so the likelihood is defined as:

L(θ | x1, ..., xn) = f(x1, x2, ..., xn | θ)

In practical applications it is common to take the logarithm of both sides, giving:

ln L(θ | x1, ..., xn) = Σ_i ln f(xi | θ)

This is called the log-likelihood, and (1/n) Σ_i ln f(xi | θ) is called the average log-likelihood. What we call the maximum likelihood estimate is the θ that maximizes the average log-likelihood, namely:

θ̂ = argmax_θ (1/n) Σ_i ln f(xi | θ)

Borrowing an example from someone else's blog: suppose there is a jar containing black and white balls; neither the number of balls nor the ratio of the two colors is known. We want to know the proportion of white versus black balls in the jar, but we cannot take all the balls out. Instead, we can shake the jar, draw one ball, record its color, and put it back; this can be repeated, and we can use the recorded colors to estimate the ratio of black to white balls. If in the first 100 draws, 70 were white, what proportion of white balls in the jar is most likely? Many people will immediately answer 70%. But what is the theory behind that?

Assume the proportion of white balls in the jar is p, so the proportion of black balls is 1 − p. Since each drawn ball is put back and the jar is shaken evenly before the next draw, the color of each draw is independent and identically distributed. Call the color of one drawn ball a sample. The probability that 70 out of 100 samples are white is P(Data | M), where Data is all the data and M is the given model, which says each drawn ball is white with probability p. Record the result of the first sample as x1, the second as x2, and so on; then Data = (x1, x2, ..., x100), and

P(Data | M)

= P(x1, x2, ..., x100 | M)

= P(x1 | M) P(x2 | M) ... P(x100 | M)

= p^70 (1 − p)^30.

So for what value of p is P(Data | M) maximal? Differentiate p^70 (1 − p)^30 with respect to p and set the derivative to zero:

70 p^69 (1 − p)^30 − 30 p^70 (1 − p)^29 = 0.

Solving this equation gives p = 0.7.

At the boundary points p = 0 and p = 1, P(Data | M) = 0, so P(Data | M) attains its maximum at p = 0.7. This matches the common-sense answer given by the sample proportion.
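A brute-force numerical check of this result, scanning candidate values of p instead of solving the equation analytically:

```python
def likelihood(p):
    # P(Data | M) = p^70 * (1 - p)^30 for 70 white draws out of 100
    return p ** 70 * (1 - p) ** 30

# Scan candidate values of p on a fine grid; the maximum sits at p = 0.7
candidates = [i / 1000 for i in range(1001)]
best = max(candidates, key=likelihood)
print(best)  # 0.7
```

The scan confirms both the interior maximum at 0.7 and the vanishing likelihood at the boundaries.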

If instead we have a set of sample values of a continuous variable (x1, x2, ..., xn), and we know this data follows a normal distribution with known standard deviation, then what mean of that normal distribution makes the observed data most probable?

P(Data | M) = ?

According to the normal density formula,

P(Data | M) = ∏_i 1/(σ√(2π)) · exp(−(xi − μ)² / (2σ²))

Differentiating the log of this with respect to μ and setting the derivative to zero, the maximum likelihood estimate is μ = (x1 + x2 + ... + xn) / n.

The general procedure for maximum likelihood estimation is as follows:

(1) write down the likelihood function;

(2) take its logarithm and simplify;

(3) take the derivative and set it to zero;

(4) solve the resulting likelihood equation.

Note: maximum likelihood estimation considers only the probability that a model produces the given observations, not the probability of the model parameters themselves. This is what distinguishes it from Bayesian estimation, which will be described in a later blog post.

This article references:

http://en.wikipedia.org/wiki/Maximum_likelihood

http://www.shamoxia.com/html/y2010/1520.html

Maximum a posteriori probability:

Maximum a posteriori (MAP) estimation gives a point estimate of a hard-to-observe quantity based on empirical data. It is similar to maximum likelihood estimation; the biggest difference is that MAP incorporates a prior distribution over the quantity being estimated. MAP can therefore be regarded as a regularized maximum likelihood estimate.

First, recall the maximum likelihood estimate from the previous section: assume x1, ..., xn are i.i.d. samples, θ is the model parameter, and f is the model we use. The maximum likelihood estimate can be expressed as:

θ̂_MLE = argmax_θ f(x1, ..., xn | θ)

Now suppose the prior distribution of θ is g. By Bayes' theorem, the posterior distribution of θ is:

p(θ | x1, ..., xn) = f(x1, ..., xn | θ) g(θ) / ∫ f(x1, ..., xn | θ') g(θ') dθ'

The goal of MAP is the mode of this posterior (the denominator does not depend on θ):

θ̂_MAP = argmax_θ f(x1, ..., xn | θ) g(θ)

Note: the maximum a posteriori estimate can be regarded as a specific form of Bayesian estimation.

For example:

Suppose there are five bags, each containing an unlimited number of cookies (cherry or lemon flavored), with the known flavor ratios of the five bags being:

Cherry 100%

Cherry 75% + Lemon 25%

Cherry 50% + Lemon 50%

Cherry 25% + Lemon 75%

Lemon 100%

Given only the above, the question is: if two lemon cookies are drawn from the same bag, which of the five bags is that bag most likely to be?

We first solve this problem with maximum likelihood estimation by writing down the likelihood function. Let p be the probability that a cookie drawn from the bag is lemon flavored (p identifies which bag we drew from); then the likelihood of drawing two lemon cookies is:

L(p) = p²

Since p takes only the discrete values 0, 25%, 50%, 75%, 100% listed above, we just evaluate which of the five values maximizes the likelihood, and get bag 5. That is the result of the maximum likelihood estimate.

One problem with the maximum likelihood estimate above is that it does not take the probability distribution of the model itself into account. Let's extend the cookie problem.

Suppose the probability of picking bag 1 or bag 5 is 0.1 each, the probability of picking bag 2 or bag 4 is 0.2 each, and the probability of picking bag 3 is 0.4. What is the answer to the same question now? This is when we switch to MAP. Based on the formula

θ̂_MAP = argmax_θ f(x | θ) g(θ)

we write our MAP objective as

p² · g(p)

According to the problem statement, p takes the values 0, 25%, 50%, 75%, 100%, and g takes the values 0.1, 0.2, 0.4, 0.2, 0.1 respectively. The values of the MAP objective are then 0, 0.0125, 0.1, 0.1125, 0.1, so the MAP estimate says the cookies most likely came from the fourth bag.
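The bag computation, with the MLE and MAP answers side by side:

```python
# Flavor ratios: probability of drawing a lemon cookie from each bag
p_lemon = [0.0, 0.25, 0.5, 0.75, 1.0]
prior = [0.1, 0.2, 0.4, 0.2, 0.1]  # probability of having picked each bag

# Likelihood of two lemon cookies from the same bag: p^2
likelihood = [p ** 2 for p in p_lemon]

# MLE ignores the prior: argmax of the likelihood alone
mle_bag = likelihood.index(max(likelihood)) + 1

# MAP weights the likelihood by the prior: argmax of p^2 * g(p)
posterior = [l * g for l, g in zip(likelihood, prior)]
map_bag = posterior.index(max(posterior)) + 1

print(mle_bag, map_bag)  # 5 4
```

The prior on the bags is enough to move the answer from the all-lemon bag 5 (MLE) to bag 4 (MAP).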

Those were all discrete variables; what about continuous ones? Assume the samples x1, ..., xn are i.i.d. draws from a normal distribution with mean μ and known variance σ², and that μ itself has the prior distribution N(μ0, σ0²). We want the maximum a posteriori estimate of μ. Following the earlier description, the MAP objective is:

∏_i N(xi; μ, σ²) · N(μ; μ0, σ0²)

Taking the logarithm of both sides, maximizing the expression above is equivalent to minimizing

Σ_i (xi − μ)² / (2σ²) + (μ − μ0)² / (2σ0²)

Differentiating with respect to μ and setting the derivative to zero gives

μ̂_MAP = (σ0² Σ_i xi + σ² μ0) / (n σ0² + σ²)

This is the MAP solution process for a continuous variable.

What we should note about MAP:

The biggest difference between MAP and MLE is that MAP adds the probability distribution of the model parameter itself, i.e. the prior g(θ), to the objective. In MLE the parameter's distribution is implicitly uniform, that is, a constant, so it does not affect the argmax.

----------

K-means clustering and the Gaussian mixture model (GMM)

Consider the gray-level histogram in digital image processing: the horizontal axis is the gray range (0-255), and the vertical axis is the number of pixels at each gray value; dividing by the total number of pixels gives the probability of each gray value. If the picture contains a background and a target region, the gray-level histogram should look like a mixture of two Gaussian distributions; a more complex image can be modeled as a mixture of several Gaussians, with the number of Gaussians assumed known. We can then use the EM algorithm to estimate the parameters of all the Gaussian mixture components. The two Gaussian functions act as two probability density functions, so each gray value in the original image corresponds to two probabilities, representing the probability of belonging to the foreground and the probability of belonging to the background; this is the core part of graph cuts. For a concrete example, see: using a Gaussian mixture model (GMM) as a classifier in Matlab.
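As a sketch of how the mixture parameters could be estimated, here is a bare-bones EM loop for a two-component 1-D Gaussian mixture on synthetic "gray level" data. The component means, weights, and initialization below are invented for illustration; real use would run on an actual image histogram (e.g. with Matlab's fitgmdist or scikit-learn's GaussianMixture):

```python
import math
import random

random.seed(3)
# Synthetic "image": background pixels near gray 60, foreground near gray 180
data = ([random.gauss(60, 10) for _ in range(600)]
        + [random.gauss(180, 15) for _ in range(400)])

def norm_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# EM for a two-component 1-D Gaussian mixture (bare-bones sketch)
w = [0.5, 0.5]            # mixing weights
mu = [50.0, 200.0]        # initial means (rough guesses)
var = [100.0, 100.0]      # initial variances
for _ in range(50):
    # E step: responsibility of each component for each point
    resp = []
    for x in data:
        p = [w[k] * norm_pdf(x, mu[k], var[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M step: re-estimate weights, means, variances from responsibilities
    for k in range(2):
        nk = sum(r[k] for r in resp)
        w[k] = nk / len(data)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk

print(mu)  # means recover roughly (60, 180)
```

Once the two component densities are fitted, each gray value's foreground/background probabilities are just the per-component responsibilities, which is exactly the quantity the graph-cuts step consumes.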

