From: http://blog.csdn.net/yangliuy/article/details/8296481
Text language models, represented by pLSA and LDA, are a hot topic in statistical natural language processing. Models of this kind generally propose a probabilistic graphical model of the text generation process and then estimate the model parameters from observed corpus data. With the language model and the corresponding model parameters in hand, many important applications become possible, such as text feature dimensionality reduction and text topic analysis. This article introduces three parameter estimation methods used in text analysis: maximum likelihood estimation (MLE), maximum a posteriori estimation (MAP), and Bayesian estimation.
1. Maximum Likelihood Estimation (MLE)
First, let's review Bayes' formula:

$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)}$$

This formula is also called the inverse probability formula. It expresses the posterior probability in terms of the likelihood function and the prior probability, that is,

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
Maximum likelihood estimation takes, as the estimate, the parameter value at which the likelihood function attains its maximum. For i.i.d. samples $X = \{x_1, \dots, x_N\}$, the likelihood function can be written as

$$L(\theta|X) = p(X|\theta) = \prod_{i=1}^{N} p(x_i|\theta)$$
Since the likelihood is a product, it is usually easier to work with its logarithm, the log-likelihood function. The maximum likelihood estimation problem can then be written as

$$\hat\theta_{MLE} = \arg\max_\theta \log L(\theta|X) = \arg\max_\theta \sum_{i=1}^{N} \log p(x_i|\theta)$$
This is an optimization problem in θ. It is usually solved by taking the derivative with respect to θ and finding the extreme points where the derivative is 0; the maximizer of the log-likelihood is the estimated model parameter.
Take the Bernoulli experiment of coin tossing as an example: the outcomes of N tosses follow a binomial distribution with parameter p, the probability of the event in each trial, which we take to be the probability of heads. To estimate p by maximum likelihood, the likelihood function can be written as

$$L(p) = p^{n_1}(1-p)^{n_0}$$

where $n_i$ is the number of trials whose outcome is i ($n_1$ heads, $n_0$ tails, $n_0 + n_1 = N$). We now find the maximum of the likelihood function by setting the derivative of the log-likelihood to zero:

$$\frac{d}{dp}\left[\,n_1 \log p + n_0 \log(1-p)\,\right] = \frac{n_1}{p} - \frac{n_0}{1-p} = 0$$

which gives the maximum likelihood estimate of the parameter p:

$$\hat p = \frac{n_1}{n_0 + n_1} = \frac{n_1}{N}$$
It can be seen that under the binomial distribution, the probability p of the event is estimated by the relative frequency of the event in N independent repeated trials.
If we run 20 trials and observe 12 heads and 8 tails, then according to maximum likelihood estimation the parameter value is p = 12/20 = 0.6.
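As a minimal sketch of this computation (the variable names are mine, not from the original post), the closed-form MLE is just the relative frequency of heads; a numerical check maximizes the Bernoulli log-likelihood directly:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# 20 coin tosses encoded as 1 = heads, 0 = tails: 12 heads, 8 tails
X = np.array([1] * 12 + [0] * 8)
n1 = int(X.sum())
n0 = len(X) - n1

# Closed-form MLE: the relative frequency of heads
p_mle = n1 / len(X)

# Numerical check: maximize the Bernoulli log-likelihood directly
def neg_log_lik(p):
    return -(n1 * np.log(p) + n0 * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(p_mle, res.x)  # both approximately 0.6
```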
2. Maximum A Posteriori Estimation (MAP)
Maximum a posteriori estimation is similar to maximum likelihood estimation, except that a prior over the parameter is added to the objective. Instead of maximizing the likelihood alone, we now maximize the posterior probability given by Bayes' formula, that is,

$$\hat\theta_{MAP} = \arg\max_\theta \frac{p(X|\theta)\,p(\theta)}{p(X)} = \arg\max_\theta p(X|\theta)\,p(\theta)$$
Note that p(X) does not depend on the parameter, so maximizing the posterior is equivalent to maximizing the numerator. Compared with maximum likelihood estimation, the objective gains the logarithm of the prior probability. In practical applications, the prior can encode general knowledge that people already hold or accept. For example, in the coin-tossing experiment, the probability of heads should follow a distribution that attains its maximum at 0.5; this distribution is the prior distribution. The parameters of the prior distribution are called hyperparameters.
Similarly, the parameter value at which the posterior probability attains its maximum is the MAP estimate. Given the observed sample data X, the probability of a new observation $\tilde x$ is then

$$p(\tilde x|X) = p(\tilde x|\hat\theta_{MAP})$$
Continuing the example, we want the prior distribution to attain its maximum at 0.5. We can choose the Beta distribution, that is,

$$Beta(p;\, \alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}$$

where the Beta function expands to

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

and when x is a positive integer, $\Gamma(x) = (x-1)!$.
The random variable of a Beta distribution ranges over [0, 1], so it yields normalized probability values. The shape of its probability density function varies with the parameters.
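The original post showed a figure of these densities here; as a small substitute sketch, scipy.stats.beta can evaluate the density for a few symmetric parameter settings:

```python
from scipy.stats import beta

# Evaluate the Beta(a, b) density at a few points for several parameter pairs;
# a = b > 1 gives a density that peaks at 0.5
for a, b in [(0.5, 0.5), (1, 1), (2, 2), (5, 5)]:
    values = [round(beta.pdf(x, a, b), 3) for x in (0.1, 0.5, 0.9)]
    print(f"a=b={a}:", values)
```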
Let us take α = β = 5, so that the prior distribution attains its maximum at 0.5. Now we solve for the extreme point of the MAP objective. As before, we have

$$\hat p = \arg\max_p \left[\, n_1 \log p + n_0 \log(1-p) + (\alpha-1)\log p + (\beta-1)\log(1-p) \,\right]$$

Setting the derivative to zero, the maximum a posteriori estimate of the parameter p is

$$\hat p = \frac{n_1 + \alpha - 1}{n_0 + n_1 + \alpha + \beta - 2}$$
Comparing with the maximum likelihood result, we find that pseudo-counts (α-1 and β-1) have been added to the observed counts, which is the prior at work. In addition, the larger the hyperparameters, the more observations are needed for the data to shift the belief away from the prior; correspondingly, the Beta density becomes more concentrated on both sides of its maximum.
If we run 20 trials and observe 12 heads and 8 tails, then the MAP estimate of the parameter p is (12 + 4)/(20 + 8) = 16/28 ≈ 0.571, which is smaller than the maximum likelihood estimate of 0.6. This shows the influence on the estimate of the prior belief that "coins are usually fair".
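A minimal sketch of the same computation (again, the names are mine): with a Beta(5, 5) prior, the MAP estimate simply adds the pseudo-counts before normalizing:

```python
# MAP estimate for the coin example with a Beta(a, b) prior
a, b = 5, 5      # hyperparameters: the prior peaks at 0.5
n1, n0 = 12, 8   # observed heads and tails

p_map = (n1 + a - 1) / (n1 + n0 + a + b - 2)
print(p_map)  # 16/28 ≈ 0.571, pulled toward 0.5 relative to the MLE of 0.6
```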
3. Bayesian Estimation
Bayesian estimation extends MAP one step further. Here the parameter is no longer estimated as a single value; instead, it is allowed to follow a probability distribution. Recall Bayes' formula

$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)}$$

Now the posterior probability is not maximized; instead, the probability of the observed evidence is obtained from the law of total probability:

$$p(X) = \int_\theta p(X|\theta)\,p(\theta)\,d\theta$$
When new data are observed, the posterior distribution can be updated accordingly. However, computing this marginal probability is usually the hard part of Bayesian estimation.
So how can we use Bayesian estimation for prediction? If we want the probability of a new observation $\tilde x$, we compute

$$p(\tilde x|X) = \int_\theta p(\tilde x|\theta)\,p(\theta|X)\,d\theta$$

Note that the second factor is the full posterior distribution over θ, no longer a point mass concentrated at a single value; this is a big difference from MLE and MAP.
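For the Beta-Bernoulli case this integral has a closed form: the predictive probability of heads is the posterior mean (α + n₁)/(α + β + N). A small sketch (assuming the Beta(5, 5) prior and the 12/8 data from above) checks this numerically:

```python
from scipy.integrate import quad
from scipy.stats import beta

a, b, n1, n0 = 5, 5, 12, 8  # Beta(5, 5) prior, 12 heads and 8 tails

# Conjugacy (derived just below): the posterior over p is Beta(a + n1, b + n0)
posterior = beta(a + n1, b + n0)

# Predictive probability of heads: integrate p * posterior_pdf(p) over [0, 1]
pred, _ = quad(lambda p: p * posterior.pdf(p), 0.0, 1.0)
print(pred, (a + n1) / (a + b + n1 + n0))  # both 17/30 ≈ 0.567
```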
We again take the coin-tossing Bernoulli experiment as an example. As in MAP, we assume the prior is a Beta distribution, but when constructing the Bayesian estimate we do not approximate the parameter by the posterior maximum; instead we take the expectation of the parameter p under its posterior distribution. We have

$$p(p|X) = \frac{p(X|p)\,p(p)}{\int_0^1 p(X|p)\,p(p)\,dp} = \frac{p^{n_1+\alpha-1}(1-p)^{n_0+\beta-1}}{B(n_1+\alpha,\; n_0+\beta)} = Beta(p;\; n_1+\alpha,\; n_0+\beta)$$
Note that this uses the identity

$$\int_0^1 p^{a-1}(1-p)^{b-1}\,dp = B(a, b)$$
In the two-dimensional (binary-outcome) case this identity applies to the Beta distribution; in the multi-dimensional case, the analogous identity applies to the Dirichlet distribution.
From this result we can see that, under Bayesian estimation, the parameter p follows a new Beta distribution. Recall that the prior we chose for p was a Beta distribution; the posterior obtained by Bayesian estimation for a binomial likelihood with parameter p is again a Beta distribution. We therefore say that the binomial distribution and the Beta distribution are conjugate. In probabilistic language models we usually choose a conjugate distribution as the prior, which brings computational convenience. The most typical case is LDA: the topic distribution of each document follows a multinomial distribution, and its prior is chosen to be the conjugate distribution, namely the Dirichlet distribution; the word distribution under each topic also follows a multinomial distribution, and its prior is likewise chosen to be the conjugate Dirichlet distribution.
From the mean and variance formulas of the Beta distribution,

$$E[p] = \frac{\alpha}{\alpha+\beta}, \qquad Var[p] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

we can see that the expectation of p now differs from the MLE and MAP estimates. If we again run 20 trials with 12 heads and 8 tails, then by Bayesian estimation p follows a Beta distribution with parameters 12 + 5 = 17 and 8 + 5 = 13. Its mean and variance are 17/30 ≈ 0.567 and 17·13/(30² · 31) ≈ 0.0079, respectively. The expectation of p obtained here is smaller than the estimates from MLE and MAP, and closer to 0.5.
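A minimal sketch of the full Bayesian update (names mine): the conjugate update just adds the observed counts to the hyperparameters, and scipy confirms the posterior mean and variance:

```python
from scipy.stats import beta

a, b, n1, n0 = 5, 5, 12, 8

# Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + n1, b + n0)
posterior = beta(a + n1, b + n0)  # Beta(17, 13)

print(posterior.mean())  # 17/30 ≈ 0.5667
print(posterior.var())   # 17*13 / (30**2 * 31) ≈ 0.0079
```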
To sum up, we can compare how MLE, MAP, and Bayesian estimation treat the parameter. The estimation results are:

- MLE: $\hat\theta_{MLE} = \arg\max_\theta p(X|\theta)$
- MAP: $\hat\theta_{MAP} = \arg\max_\theta p(X|\theta)\,p(\theta)$
- Bayesian estimation: the full posterior distribution $p(\theta|X) = \dfrac{p(X|\theta)\,p(\theta)}{\int_\theta p(X|\theta)\,p(\theta)\,d\theta}$
In my opinion, moving from MLE to MAP to Bayesian estimation, the representation of the parameter becomes progressively richer, the estimates move progressively closer to the prior probability of 0.5, and the results better reflect the true parameter underlying the samples.
References
Gregor Heinrich. Parameter Estimation for Text Analysis. Technical report.
Wikipedia: Beta distribution, http://en.wikipedia.org/wiki/Beta_distribution