Parameter estimation method of text language model--maximum likelihood estimation, MAP, Bayesian estimation

Reposted from: http://blog.csdn.net/woshizhouxiang/article/details/17556241

The emphasis is on mastering the methods of model parameter estimation in order to optimize the model.


Text language models, represented by pLSA and LDA, are a hot topic in statistical natural language processing today. Such models usually specify a probabilistic graphical model for the text generation process and then estimate the model parameters from observed corpus data. Once we have the language model and its estimated parameters, many important applications become possible, such as text feature dimensionality reduction and text topic analysis. This article introduces three parameter estimation methods used in text analysis: maximum likelihood estimation (MLE), maximum a posteriori estimation (MAP), and Bayesian estimation.


1. Maximum likelihood estimation (MLE)

First, recall Bayes' formula:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$
This formula, also called the inverse probability formula, expresses the posterior probability as a computation based on the likelihood function and the prior probability, i.e.

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
Maximum likelihood estimation takes the parameter value at which the likelihood function attains its maximum as the estimate. For a set of independent observations $X = \{x_1, \dots, x_N\}$, the likelihood function can be written as

$$L(\theta \mid X) = p(X \mid \theta) = \prod_{x \in X} p(x \mid \theta)$$
Because of the product, it is usually simpler to work with the logarithm of the likelihood function, i.e. the log-likelihood. The maximum likelihood estimation problem can then be written as

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \mathcal{L}(\theta \mid X) = \arg\max_{\theta} \sum_{x \in X} \log p(x \mid \theta)$$
This is an optimization problem over $\theta$. It is usually solved by differentiating the log-likelihood and setting the derivative to zero to find the extremum; the value of $\theta$ at which the function attains its maximum is the model parameter we estimate.
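As a quick numerical illustration (my addition, not part of the original post), the sketch below maximizes a Bernoulli log-likelihood directly with an off-the-shelf optimizer and compares the result with the closed form obtained by setting the derivative to zero; the counts n1 and n0 are made-up example data.

# Minimal sketch: maximize a Bernoulli/binomial log-likelihood numerically.
# n1 and n0 are hypothetical counts of heads and tails, not data from the post.
import numpy as np
from scipy.optimize import minimize_scalar

n1, n0 = 7, 3  # hypothetical observed counts

def neg_log_likelihood(p):
    # log L(p | X) = n1*log(p) + n0*log(1 - p); negated for the minimizer
    return -(n1 * np.log(p) + n0 * np.log(1.0 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)        # numerical maximizer, approximately 0.7
print(n1 / (n1 + n0))  # closed form from setting the derivative to zero: 0.7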

Taking the familiar coin-toss experiment as an example, the results of N independent trials follow a binomial distribution with parameter p, the probability of the event in each trial; here we take p to be the probability of getting heads. To estimate p by maximum likelihood, the likelihood function can be written as

$$P(X \mid p) = \prod_{i=1}^{N} p(x_i \mid p) = p^{n_1} (1 - p)^{n_0}$$
where $n_i$ denotes the number of trials whose outcome is $i$ ($n_1$ heads, $n_0$ tails). Setting the derivative of the log-likelihood with respect to p to zero, the extremum point of the likelihood function satisfies

$$\frac{\partial \log P(X \mid p)}{\partial p} = \frac{n_1}{p} - \frac{n_0}{1 - p} = 0$$
Solving, the maximum likelihood estimate of the parameter p is

$$\hat{p} = \frac{n_1}{n_1 + n_0} = \frac{n_1}{N}$$
It can be seen that the maximum likelihood estimate of the per-trial probability p in the binomial distribution is simply the relative frequency with which the event occurs in the N independent repeated trials.

If we perform 20 trials and observe 12 heads and 8 tails,

then according to maximum likelihood estimation the parameter value is p = 12/20 = 0.6.
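A tiny sketch of this computation (my addition), with the 12-heads/8-tails outcome encoded as a list of 1s and 0s:

# MLE for the coin example: the estimate is just the relative frequency of heads.
outcomes = [1] * 12 + [0] * 8  # 12 heads and 8 tails, as in the example above
p_mle = sum(outcomes) / len(outcomes)
print(p_mle)  # 0.6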


2. Maximum a posteriori estimation (MAP)

Maximum a posteriori estimation is similar to maximum likelihood estimation; the difference is that a prior $p(\theta)$ is allowed in the function being maximized. That is, we no longer require the likelihood function alone to be maximal, but instead require the whole posterior probability computed by Bayes' formula to be maximal:

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \arg\max_{\theta} p(X \mid \theta)\, p(\theta)$$
Note that $p(X)$ here does not depend on the parameter, so maximizing the posterior is equivalent to maximizing the numerator. Compared with maximum likelihood estimation, the objective now carries an extra term: the logarithm of the prior probability. In practical applications, this prior can encode general knowledge that people already know or accept. For example, in the coin-toss experiment, the probability of heads on each toss should itself follow a probability distribution that attains its maximum at 0.5; that distribution is the prior distribution. The parameters of the prior distribution are called hyperparameters, i.e.

$$p(\theta) = p(\theta \mid \alpha)$$
By the same reasoning, when the posterior probability above attains its maximum, we obtain the parameter value given by the MAP estimate. Given the observed sample data, the probability of a new value $\tilde{x}$ occurring is then

$$p(\tilde{x} \mid X) \approx p(\tilde{x} \mid \hat{\theta}_{\mathrm{MAP}})$$
Returning to the coin-toss example, we want a prior distribution for p that attains its maximum at 0.5; for this we can choose the Beta distribution,

$$\mathrm{Beta}(p \mid \alpha, \beta) = \frac{p^{\alpha - 1} (1 - p)^{\beta - 1}}{B(\alpha, \beta)}$$
where the Beta function expands as

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\, \Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
When $x$ is a positive integer, the Gamma function reduces to the factorial:

$$\Gamma(x) = (x - 1)!$$
The support of the Beta distribution is $[0, 1]$, so it can be used to generate normalized probability values. The probability density function of the Beta distribution for different parameter settings is shown below.

[Figure omitted: Beta probability density functions for several settings of α and β.]
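In place of the omitted figure, here is a small sketch (my addition, using scipy.stats) that evaluates the Beta density for a few parameter settings; with α = β the density is symmetric and peaks at p = 0.5, which is the shape we want for a "fair coin" prior.

# Evaluate the Beta probability density for a few hyperparameter settings.
import numpy as np
from scipy.stats import beta

p = np.linspace(0.001, 0.999, 999)
for a, b in [(2, 2), (5, 5), (2, 5), (5, 2)]:
    density = beta.pdf(p, a, b)
    mode = p[np.argmax(density)]
    print(f"alpha={a}, beta={b}: density peaks near p = {mode:.2f}")
# alpha=2, beta=2 and alpha=5, beta=5 peak at 0.50;
# alpha=2, beta=5 peaks at 0.20; alpha=5, beta=2 peaks at 0.80.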


We take $\alpha = \beta = 5$, so that the prior distribution attains its maximum at 0.5. Now we solve for the extremum point of the MAP objective; taking the derivative with respect to p in the same way as before, we have

$$\frac{\partial}{\partial p} \log \big( P(X \mid p)\, \mathrm{Beta}(p \mid \alpha, \beta) \big) = \frac{n_1 + \alpha - 1}{p} - \frac{n_0 + \beta - 1}{1 - p} = 0$$
Solving, the maximum a posteriori estimate of the parameter p is

$$\hat{p}_{\mathrm{MAP}} = \frac{n_1 + \alpha - 1}{n_1 + n_0 + \alpha + \beta - 2}$$
Compared with the maximum likelihood result, the estimate now contains the extra pseudo-counts $\alpha - 1$ and $\beta - 1$; this is the prior at work. And the larger the hyperparameters, the more observations are needed to change the belief expressed by the prior, since the corresponding Beta distribution is then more tightly concentrated around its maximum.

If we perform 20 trials with 12 heads and 8 tails, then with $\alpha = \beta = 5$

the parameter estimated by MAP is p = 16/28 ≈ 0.571, smaller than the maximum likelihood estimate of 0.6. This shows the effect on the parameter estimate of the prior belief that a coin is generally fair.
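A minimal sketch of this MAP calculation (my addition; the Beta(5, 5) prior and the 12/8 counts are the values used above), which also shows how a stronger prior pulls the estimate further toward 0.5:

# MAP estimate for the coin with a Beta(alpha, beta) prior:
# p_MAP = (n1 + alpha - 1) / (n1 + n0 + alpha + beta - 2)
def map_estimate(n1, n0, alpha, beta):
    return (n1 + alpha - 1) / (n1 + n0 + alpha + beta - 2)

n1, n0 = 12, 8  # 12 heads, 8 tails, as in the example above

print(map_estimate(n1, n0, alpha=5, beta=5))    # 16/28, about 0.571
print(map_estimate(n1, n0, alpha=1, beta=1))    # uniform prior: reduces to the MLE, 0.6
print(map_estimate(n1, n0, alpha=50, beta=50))  # stronger prior pulls the estimate toward 0.5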


3. Bayesian estimation

Bayesian estimation goes a step further than MAP: the value of the parameter is not estimated directly; instead, the parameter is treated as following a probability distribution. Recall Bayes' formula once more:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$
Now we no longer require the posterior probability to be maximal, but we do need to compute the probability of the observed evidence, which is expanded by the law of total probability:

$$p(X) = \int_{\theta} p(X \mid \theta)\, p(\theta)\, d\theta$$
When new data are observed, the posterior distribution can be updated accordingly. However, computing this marginal probability in general is the tricky part of Bayesian estimation.

So how do we make predictions with Bayesian estimation? If we want the probability of a new value $\tilde{x}$, it can be computed as

$$p(\tilde{x} \mid X) = \int_{\theta} p(\tilde{x} \mid \theta)\, p(\theta \mid X)\, d\theta$$
Note that the second factor in the integrand is the full posterior distribution $p(\theta \mid X)$, not a single point estimate of $\theta$; this is a major difference from MLE and MAP.
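To make the integral concrete, here is a small sketch (my addition) that evaluates the posterior predictive probability of heads for the coin example by numerical integration, building the posterior from likelihood × prior and normalizing by the evidence:

# Posterior predictive p(next toss is heads | X) for the coin, by numerical integration.
# Prior: Beta(a, b); data: n1 heads and n0 tails (values from the example above).
from scipy.integrate import quad
from scipy.stats import beta

a, b = 5.0, 5.0   # hyperparameters of the Beta prior
n1, n0 = 12, 8    # observed heads and tails

def likelihood(p):
    return p ** n1 * (1.0 - p) ** n0

def prior(p):
    return beta.pdf(p, a, b)

# Evidence p(X): integrate likelihood * prior over p (law of total probability).
evidence, _ = quad(lambda p: likelihood(p) * prior(p), 0.0, 1.0)

def posterior(p):
    return likelihood(p) * prior(p) / evidence

# p(heads | X): integrate p * posterior(p) over p.
p_heads, _ = quad(lambda p: p * posterior(p), 0.0, 1.0)
print(p_heads)  # about 0.567, i.e. (n1 + a) / (n1 + n0 + a + b)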

We again use the coin toss as the example to illustrate. As in MAP, we assume the prior is a Beta distribution, but when constructing the Bayesian estimate we do not approximate the parameter by the value that maximizes the posterior; instead we take the expectation of p under the posterior (which is again a Beta distribution), giving

$$E[p \mid X] = \frac{n_1 + \alpha}{n_1 + n_0 + \alpha + \beta}$$
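A closed-form check of this expectation (my addition), matching the numerical integration sketched above:

# Bayesian estimate of p: the posterior is Beta(alpha + n1, beta + n0),
# and the estimate is its expectation (the posterior mean).
def bayes_estimate(n1, n0, alpha, beta):
    return (n1 + alpha) / (n1 + n0 + alpha + beta)

print(bayes_estimate(12, 8, alpha=5, beta=5))  # 17/30, about 0.567: between the prior mean 0.5 and the MLE 0.6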

