Probability Knowledge for Speech Recognition: Likelihood Estimation / Maximum Likelihood Estimation / Gaussian Mixture Model

Document directory
  • 1. Likelihood Estimation: 1.1 Principle, 1.2 Example
  • 2. Maximum A Posteriori Estimation: 2.1 Principle, 2.2 Example
  • 3. Maximum Likelihood Estimation: 3.1 Principle, 3.2 Example

In speech recognition, probability models play a crucial role. Before studying speech recognition technology itself, it is worth carefully reviewing the relevant probability concepts.

1. Likelihood Estimation

1.1 Principle

In mathematical statistics, the likelihood function is a function of the parameters of a statistical model; its value is called the likelihood. Likelihood functions play an important role in statistical inference, for example in maximum likelihood estimation and in Fisher information. "Likelihood" is close in meaning to "probability", both describing the possibility of an event, but in statistics the two are clearly distinguished: probability is used to predict the outcome of subsequent observations when the parameters are known, whereas likelihood is used, given some observed outcomes, to evaluate the parameters of the underlying model.

In this sense, the likelihood function can be understood as a reversal of conditional probability. When the parameter $B$ is known, the probability of event $A$ occurring is

$$P(A \mid B).$$

Using Bayes' theorem,

$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}.$$

Therefore, we can construct the reverse notion, the likelihood: knowing that event $A$ has occurred, we use the likelihood function to estimate the parameter $B$:

$$L(B \mid A) = P(A \mid B).$$

Formally, the likelihood function is also a conditional probability function, but the variable we care about has changed from the event to the parameter.

Note that the likelihood function is not required to satisfy normalization: $\sum_{b \in \mathcal{B}} P(A \mid B = b)$ need not equal 1. Moreover, a likelihood function multiplied by a positive constant is still a likelihood function, so for any constant $\alpha > 0$ there is a likelihood function

$$L(B \mid A) = \alpha\, P(A \mid B).$$
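To make the distinction concrete, the following short Python sketch (my own illustration, not part of the source text) looks at a single coin toss that comes up heads: for a fixed parameter the probabilities of the possible outcomes sum to 1, but the likelihood values over a set of candidate parameter values need not, which is why the normalization requirement is dropped.

    # Probability of heads for a coin whose heads probability is b.
    def prob_heads(b):
        return b  # P(A = heads | B = b)

    # For a fixed parameter b, the outcome probabilities sum to 1.
    b = 0.3
    print(prob_heads(b) + (1.0 - prob_heads(b)))      # 1.0

    # After observing heads, the likelihood L(b | heads) = P(heads | b) = b.
    # Over a set of candidate parameter values it does not sum to 1.
    candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
    print(sum(prob_heads(b) for b in candidates))     # 2.5, not 1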

1.2 Example

 

[Figure: likelihood function of $p_H$ when both tosses land heads]

 

Consider tossing a coin. If the coin is fair, heads and tails are equally likely, so we can work out how likely each possible sequence of tosses is. For example, the probability that two tosses both land heads is 0.25. In conditional-probability notation:

$$P(\text{HH} \mid p_H = 0.5) = 0.5^2 = 0.25,$$

where H indicates heads.

In statistics, what we care about is the reverse: given an observed sequence of tosses, what can we say about the probability $p_H$ that a single toss lands heads? We can set up a statistical model: assume the coin lands heads with probability $p_H$ and tails with probability $1 - p_H$. The conditional probability can then be rewritten as a likelihood function:

$$L(p_H = 0.5 \mid \text{HH}) = P(\text{HH} \mid p_H = 0.5) = 0.25.$$

That is, for this likelihood function, the likelihood of $p_H = 0.5$ when two heads are observed is 0.25 (this does not mean that the probability of $p_H = 0.5$ is 0.25 when two heads are observed).

If we instead consider $p_H = 0.6$, the value of the likelihood function changes as well:

$$L(p_H = 0.6 \mid \text{HH}) = P(\text{HH} \mid p_H = 0.6) = 0.36.$$

[Figure: likelihood function for three tosses, the first two landing heads and the third landing tails]

Note that the value of the likelihood function has increased. This indicates that under the parameter value 0.6, the probability of observing two heads in a row is larger than under the hypothesis $p_H = 0.5$. In other words, the parameter value 0.6 is more convincing, more "reasonable", than 0.5. In short, the importance of a likelihood function lies not in its specific value, but in whether the function gets smaller or larger when the parameter changes. For a given likelihood function, if there is a parameter value that maximizes it, then that value is the most "reasonable" parameter value.

In this example, the likelihood function is in fact

$$L(p_H \mid \text{HH}) = P(\text{HH} \mid p_H) = p_H^2, \qquad 0 \le p_H \le 1.$$

At $p_H = 1$ the likelihood function reaches its maximum value of 1. That is to say, when two heads are observed in a row, the assumption that the coin always lands heads ($p_H = 1$) is the most "reasonable" one.

Similarly, if the coin is tossed three times and the first two tosses land heads while the third lands tails, the likelihood function becomes

$$L(p_H \mid \text{HHT}) = P(\text{HHT} \mid p_H) = p_H^2 (1 - p_H), \qquad 0 \le p_H \le 1,$$

where T indicates tails.

This time the likelihood function reaches its maximum at $p_H = 2/3$. That is, when the observed three tosses are two heads followed by one tail, the most "reasonable" estimate of the single-toss heads probability is $p_H = 2/3$.
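As a quick check of this example, here is a minimal Python sketch (an illustration, not code from the original article) that evaluates the two likelihood functions above and confirms that $L(p_H \mid \text{HHT})$ peaks near $p_H = 2/3$; the grid resolution is an arbitrary choice.

    import numpy as np

    # Likelihoods of observing HH and HHT as functions of the heads probability p_H.
    def lik_hh(p):
        return p ** 2

    def lik_hht(p):
        return p ** 2 * (1.0 - p)

    print(f"L(p_H = 0.5 | HH) = {lik_hh(0.5):.2f}")   # 0.25
    print(f"L(p_H = 0.6 | HH) = {lik_hh(0.6):.2f}")   # 0.36, larger than 0.25

    # The HHT likelihood peaks at p_H = 2/3; confirm with a fine grid search.
    grid = np.linspace(0.0, 1.0, 100001)
    best = grid[np.argmax(lik_hht(grid))]
    print(f"argmax of L(p_H | HHT) is about {best:.4f}")  # close to 0.6667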

2. Maximum A Posteriori Estimation

2.1 Principle

Maximum a posteriori (MAP) estimation estimates quantities that are difficult to observe directly on the basis of empirical data. It is similar to maximum likelihood estimation, but the biggest difference is that MAP incorporates the prior distribution of the quantity being estimated. MAP estimation can therefore be seen as a regularized maximum likelihood estimation.

First, let us review maximum likelihood estimation. Assume $x = (x_1, \dots, x_n)$ is an independent, identically distributed sample, $\theta$ is the model parameter, and $f$ is the model we use. Maximum likelihood estimation can be written as

$$\hat{\theta}_{\mathrm{MLE}}(x) = \arg\max_{\theta} f(x \mid \theta).$$

Now assume the prior distribution of $\theta$ is $g$. By Bayes' theorem, the posterior distribution of $\theta$ is

$$\theta \mapsto \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \vartheta)\, g(\vartheta)\, d\vartheta}.$$

The goal of maximum a posteriori estimation is to maximize this posterior distribution:

$$\hat{\theta}_{\mathrm{MAP}}(x) = \arg\max_{\theta} \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \vartheta)\, g(\vartheta)\, d\vartheta} = \arg\max_{\theta} f(x \mid \theta)\, g(\theta),$$

where the denominator can be dropped because it does not depend on $\theta$.

Note: MAP estimation can be seen as a specific form of Bayesian estimation.

2.2 Example

Assume there are five bags, each containing an unlimited number of cookies (cherry or lemon flavored). The proportions of the two flavors in the five bags are known to be:

  • Bag 1: cherry 100%
  • Bag 2: cherry 75% + lemon 25%
  • Bag 3: cherry 50% + lemon 50%
  • Bag 4: cherry 25% + lemon 75%
  • Bag 5: lemon 100%

Given only the information above, suppose someone draws two lemon cookies in a row from the same bag. Which of the five bags are they most likely to have come from?

We first solve this problem using maximum likelihood estimation by writing down the likelihood function. Let $p$ be the probability of drawing a lemon cookie from the chosen bag (we use this probability $p$ to identify which bag the cookies come from). For two lemon cookies drawn in a row, the likelihood function is

$$L(p) = p^2.$$

Because $p$ only takes the discrete values 0, 25%, 50%, 75%, and 100% listed above, we simply evaluate the likelihood at these five values; it is largest at $p = 100\%$, i.e. bag 5. This is the result of maximum likelihood estimation.

There is a problem with the maximum likelihood estimate above: it does not take into account the probability distribution over the bags themselves. We now extend the cookie example.

Suppose the probability of picking bag 1 or bag 5 is 0.1 each, the probability of picking bag 2 or bag 4 is 0.2 each, and the probability of picking bag 3 is 0.4. Is the answer to the question still the same? This is where MAP changes things. Based on the formula

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} f(x \mid \theta)\, g(\theta),$$

we can write our MAP objective as $p^2\, g(p)$.

According to the description of the problem, the values of $p$ are 0, 25%, 50%, 75%, and 100%, and the corresponding prior probabilities $g$ are 0.1, 0.2, 0.4, 0.2, and 0.1. The MAP objective therefore evaluates to 0, 0.0125, 0.1, 0.1125, and 0.1 for the five bags, so the MAP result is that the cookies most likely come from the fourth bag.
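The calculation above is easy to reproduce. The Python sketch below (an illustration under the stated prior, not code from the source) evaluates the likelihood $p^2$ and the unnormalized MAP objective $p^2\, g(p)$ for the five bags.

    # Probability of drawing a lemon cookie from each bag, and the prior
    # probability of having picked that bag.
    p_lemon = [0.00, 0.25, 0.50, 0.75, 1.00]
    prior   = [0.10, 0.20, 0.40, 0.20, 0.10]

    likelihood = [p ** 2 for p in p_lemon]                   # L(p) = p^2 for two lemon cookies
    map_obj    = [l * g for l, g in zip(likelihood, prior)]  # unnormalized posterior

    for i, (l, m) in enumerate(zip(likelihood, map_obj), start=1):
        print(f"bag {i}: likelihood = {l:.4f}, likelihood * prior = {m:.4f}")

    print("MLE picks bag", 1 + likelihood.index(max(likelihood)))  # bag 5
    print("MAP picks bag", 1 + map_obj.index(max(map_obj)))        # bag 4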

The above dealt with discrete variables; what about continuous variables? Suppose the samples $x_1, \dots, x_n$ are independent and identically distributed as $N(\mu, \sigma^2)$ with $\sigma^2$ known, and that $\mu$ itself has a prior distribution $N(\mu_0, \sigma_0^2)$. We want to find the maximum a posteriori estimate of $\mu$. Following the description above, the MAP objective is

$$\hat{\mu}_{\mathrm{MAP}} = \arg\max_{\mu}\; g(\mu) \prod_{i=1}^{n} f(x_i \mid \mu) = \arg\max_{\mu}\; \frac{1}{\sqrt{2\pi}\,\sigma_0} e^{-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}.$$

Taking the logarithm of both sides, maximizing the expression above is equivalent to minimizing

$$\frac{(\mu - \mu_0)^2}{2\sigma_0^2} + \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}.$$

Differentiating with respect to $\mu$ and setting the result to zero, the resulting $\mu$ is

$$\hat{\mu}_{\mathrm{MAP}} = \frac{\dfrac{\mu_0}{\sigma_0^2} + \dfrac{1}{\sigma^2} \sum_{i=1}^{n} x_i}{\dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}},$$

a precision-weighted average of the prior mean and the data. This is the process of solving MAP for a continuous variable.
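To illustrate the continuous case, here is a small Python sketch (my own check, using the assumptions stated above: known observation variance $\sigma^2$ and a Gaussian prior on $\mu$) that compares the closed-form MAP estimate with a brute-force maximization of the log posterior.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed setup: x_i ~ N(mu, sigma^2) with sigma known, prior mu ~ N(mu0, sigma0^2).
    sigma, mu0, sigma0 = 2.0, 0.0, 1.0
    x = rng.normal(loc=1.5, scale=sigma, size=50)
    n = len(x)

    # Closed-form MAP estimate: precision-weighted average of prior mean and data.
    mu_map = (mu0 / sigma0**2 + x.sum() / sigma**2) / (1.0 / sigma0**2 + n / sigma**2)

    # Brute-force check: maximize the log posterior (up to an additive constant) on a grid.
    def log_posterior(mu):
        return -((mu - mu0) ** 2) / (2 * sigma0**2) - ((x - mu) ** 2).sum() / (2 * sigma**2)

    grid = np.linspace(-2.0, 4.0, 60001)
    mu_grid = grid[np.argmax([log_posterior(m) for m in grid])]

    print(f"closed-form MAP estimate of mu: {mu_map:.4f}")
    print(f"grid-search MAP estimate of mu: {mu_grid:.4f}")  # agrees to grid resolution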

Regarding MAP, we should note the following: the biggest difference between MAP and MLE is that MAP includes a probability distribution over the model parameters, i.e. the prior $g(\theta)$. MLE implicitly treats the model parameter as uniformly distributed, i.e. its prior probability is a fixed constant.

3. Maximum Likelihood Estimation

3.1 Principle

Given a probability distribution $D$ with probability density function (for a continuous distribution) or probability mass function (for a discrete distribution) $f_D$, and a distribution parameter $\theta$, we can draw a sample $x_1, x_2, \dots, x_n$ of size $n$ from this distribution and use $f_D$ to compute its probability:

$$P(x_1, x_2, \dots, x_n) = f_D(x_1, \dots, x_n \mid \theta).$$

However, we may know that the sample comes from the distribution $D$ while not knowing the value of $\theta$. How can we estimate it? A natural idea is to draw a sample $x_1, x_2, \dots, x_n$ of size $n$ from the distribution and then use the sample data to estimate $\theta$.

Once we have $x_1, x_2, \dots, x_n$, we can obtain an estimate of $\theta$. Maximum likelihood estimation looks for the most plausible value of $\theta$: among all possible values of $\theta$, it finds the one that maximizes the "likelihood" of having drawn this particular sample. This is not what every estimation method does; for example, an unbiased estimator may not output the most plausible value, but rather a value that neither systematically overestimates nor underestimates $\theta$.

To carry out maximum likelihood estimation mathematically, we first define the likelihood function

$$\mathrm{lik}(\theta) = f_D(x_1, \dots, x_n \mid \theta)$$

and then maximize it over all values of $\theta$ (for example by setting the first derivative to zero). The value $\hat{\theta}$ that maximizes the likelihood is called the maximum likelihood estimate of $\theta$.

Note:
  • The likelihood function here is a function of the parameter $\theta$, with the observed sample $x_1, \dots, x_n$ held fixed.
  • The maximum likelihood estimate is not necessarily unique, and may not even exist.
3.2 Example: discrete distribution, discrete finite parameter space

Consider a coin-tossing example. Suppose the coin may be biased, i.e. its heads and tails probabilities may differ. We toss the coin 80 times (that is, we obtain a sample and record the number of heads), writing H for heads and T for tails. Denote the probability of heads by $p$ and the probability of tails by $1 - p$ (so $p$ here corresponds to $p_H$ above). Suppose we get 49 heads and 31 tails, i.e. 49 H and 31 T. Suppose further that the coin was taken from a box containing three coins whose heads probabilities are $p = 1/3$, $p = 1/2$, and $p = 2/3$, respectively.
These coins are unmarked, so we cannot tell which one was taken. Using maximum likelihood estimation and the experimental data (i.e. the sample), we can work out which coin is the most likely. The likelihood function takes one of the following three values:

$$\begin{aligned}
P(\text{H}=49, \text{T}=31 \mid p = 1/3) &= \binom{80}{49} (1/3)^{49} (1 - 1/3)^{31} \approx 0.000, \\
P(\text{H}=49, \text{T}=31 \mid p = 1/2) &= \binom{80}{49} (1/2)^{49} (1 - 1/2)^{31} \approx 0.012, \\
P(\text{H}=49, \text{T}=31 \mid p = 2/3) &= \binom{80}{49} (2/3)^{49} (1 - 2/3)^{31} \approx 0.054.
\end{aligned}$$

We can see that the likelihood is largest when $p = 2/3$, so $\hat{p} = 2/3$ is the maximum likelihood estimate.
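These three binomial probabilities are easy to verify. The following Python sketch (an illustrative check, not from the source) computes them with the exact binomial formula.

    from math import comb

    n, heads, tails = 80, 49, 31

    # Binomial likelihood of 49 heads and 31 tails for a given heads probability p.
    def binom_likelihood(p):
        return comb(n, heads) * p ** heads * (1.0 - p) ** tails

    for p in (1/3, 1/2, 2/3):
        print(f"P(H=49, T=31 | p = {p:.4f}) = {binom_likelihood(p):.3f}")
    # Prints roughly 0.000, 0.012, 0.054 -- the likelihood is largest at p = 2/3.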

Discrete distribution, continuous parameter space

Now assume the box in the previous example contains infinitely many coins, with one coin for every heads probability $p \in [0, 1]$. We want to find the maximum of the likelihood function

$$\mathrm{lik}(p) = f_D(\text{H}=49, \text{T}=80-49 \mid p) = \binom{80}{49}\, p^{49} (1-p)^{31}.$$

Here we can use differentiation to find the maximum: differentiate both sides with respect to $p$ and set the derivative to zero.

$$\begin{aligned}
0 &= \frac{d}{dp} \left( \binom{80}{49}\, p^{49} (1-p)^{31} \right) \\
&\propto 49\, p^{48} (1-p)^{31} - 31\, p^{49} (1-p)^{30} \\
&= p^{48} (1-p)^{30} \left[\, 49 (1-p) - 31 p \,\right]
\end{aligned}$$

[Figure: likelihood curve of a binomial process for different values of the proportion parameter, with $t = 3$ and $n = 10$; the maximum likelihood estimate occurs at the mode, i.e. at the maximum of the curve.]

The solutions are $p = 0$, $p = 1$, and $p = 49/80$. The solution that maximizes the likelihood is clearly $p = 49/80$, since $p = 0$ and $p = 1$ make the likelihood zero. Therefore the maximum likelihood estimate is $\hat{p} = 49/80$.

This result is easily generalized. Replace 49 with a letter $t$ representing the number of "successes" in the observed data (i.e. the sample) of Bernoulli trials, and let $n$ denote the total number of Bernoulli trials. The same method gives the maximum likelihood estimate

$$\hat{p} = \frac{t}{n}$$

for any sequence of $n$ Bernoulli trials with $t$ successes.
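As a minimal numerical sanity check of the $\hat{p} = t/n$ result (again my own sketch, not from the source), the snippet below evaluates the binomial likelihood on a fine grid of $p$ values and confirms that the maximum lies at $49/80 = 0.6125$.

    import numpy as np
    from math import comb

    n, t = 80, 49

    # Binomial likelihood over a fine grid of heads probabilities.
    grid = np.linspace(0.0001, 0.9999, 9999)
    lik = comb(n, t) * grid ** t * (1.0 - grid) ** (n - t)

    print(f"grid argmax of the likelihood: {grid[np.argmax(lik)]:.4f}")
    print(f"t / n                        : {t / n:.4f}")  # 0.6125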

Continuous distribution, continuous parameter space

The most common continuous probability distribution is the normal distribution, whose probability density function is

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$

Now suppose we have $n$ samples $x_1, \dots, x_n$ of a normal random variable, and we want to find the normal distribution under which these samples are most likely (i.e. the one that maximizes the product of the probability densities, informally the one whose center the points lie closest to). The joint density of the sample (assuming the points are independent and identically distributed) is

$$f(x_1, \dots, x_n \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{\frac{n}{2}} e^{-\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2}},$$

or, writing $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for the sample mean,

$$f(x_1, \dots, x_n \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{\frac{n}{2}} e^{-\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2}{2\sigma^2}}.$$

This distribution has two parameters, $\mu$ and $\sigma^2$. Some may worry that this differs from the examples above, which maximized the likelihood over a single parameter. In fact, maximizing over two parameters works in much the same way: we simply maximize the likelihood over both parameters. It is somewhat more involved than the one-parameter case, but not fundamentally more complicated. We use the same notation as above.

Maximizing a likelihood function is equivalent to maximizing its natural logarithm, because the natural logarithm is a continuous, concave, strictly increasing function on the range of the likelihood function. [Note: the natural logarithm of the likelihood function, the log-likelihood, is closely related to information entropy and Fisher information.] Taking the logarithm also simplifies the calculation somewhat, as this example shows:

$$\begin{aligned}
0 &= \frac{\partial}{\partial \mu} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{\frac{n}{2}} e^{-\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2}{2\sigma^2}} \right) \\
&= \frac{\partial}{\partial \mu} \left( \frac{n}{2} \log \left( \frac{1}{2\pi\sigma^2} \right) - \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2}{2\sigma^2} \right) \\
&= 0 - \frac{-2 n (\bar{x} - \mu)}{2\sigma^2}
\end{aligned}$$

The solution of this equation is $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. This is indeed the maximum of the function, because it is the only stationary point and the second derivative there is strictly less than zero.

Similarly, we differentiate with respect to $\sigma$ and set the derivative to zero:

$$\begin{aligned}
0 &= \frac{\partial}{\partial \sigma} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{\frac{n}{2}} e^{-\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2}{2\sigma^2}} \right) \\
&= \frac{\partial}{\partial \sigma} \left( \frac{n}{2} \log \left( \frac{1}{2\pi\sigma^2} \right) - \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2}{2\sigma^2} \right) \\
&= -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2}{\sigma^3}
\end{aligned}$$

The solution of this equation is $\sigma^2 = \frac{1}{n}\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2 \right)$; substituting $\mu = \hat{\mu} = \bar{x}$ gives $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$.

Therefore, the maximum likelihood estimates for the normal distribution are:

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
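The closed-form result can be checked numerically. The sketch below (my own illustration, assuming the SciPy optimizer is available) draws a Gaussian sample and compares $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i}(x_i - \bar{x})^2$ with a direct numerical maximization of the log-likelihood.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = rng.normal(loc=3.0, scale=2.0, size=500)
    n = len(x)

    # Closed-form maximum likelihood estimates for a normal distribution.
    mu_hat = x.mean()
    var_hat = ((x - mu_hat) ** 2).mean()   # divides by n, not n - 1

    # Numerical check: minimize the negative log-likelihood over (mu, sigma),
    # dropping the additive constant (n/2) * log(2 * pi).
    def neg_log_lik(params):
        mu, sigma = params
        return n * np.log(sigma) + ((x - mu) ** 2).sum() / (2 * sigma ** 2)

    res = minimize(neg_log_lik, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])

    print(f"closed form: mu = {mu_hat:.4f}, sigma^2 = {var_hat:.4f}")
    print(f"numerical  : mu = {res.x[0]:.4f}, sigma^2 = {res.x[1] ** 2:.4f}")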

 

 

 
