http://blog.csdn.net/pipisorry/article/details/51482120
Three parameter estimation methods for text analysis: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, and Bayesian estimation.
Parameter Estimation
In parameter estimation we mainly face two problems: (1) how to estimate the value of the parameter; (2) once the parameter has been estimated, how to compute the probability of a new observation, i.e. regression analysis and prediction.
First define some notation: the data set is X, made up of samples x_i. The x_i are assumed to be independent and identically distributed, so the probability of X factorizes into a product over the individual x_i.
Bayesian Formula
This formula, also called the inverse probability formula, expresses the posterior probability in terms of the likelihood function and the prior probability, i.e.
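In symbols, with θ the parameter and X the observed data, the formula reads:

$$ p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} $$

where p(X | θ) is the likelihood, p(θ) the prior, and p(X) the evidence.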
[ Probabilistic Graph Model: Bayesian networks and naive Bayesian networks]
Maximum Likelihood Estimation (MLE)
As the name implies, the goal is to find the parameter value that makes the likelihood L as large as possible. Why maximize it? Because X has already been observed: among all candidate parameter values, we pick the one under which the observed data is most probable.
Maximum likelihood estimation takes as the estimate the parameter value at which the likelihood function attains its maximum. The likelihood is a product over the samples because they are independent of each other, and because of this product it is usually simpler to work with the logarithm of the likelihood function, the log-likelihood. The maximum likelihood estimation problem can then be written as follows.
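In symbols (a standard formulation, using the i.i.d. assumption from the notation above):

$$ L(\theta \mid X) = p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta), \qquad \ell(\theta) = \log L(\theta \mid X) = \sum_{i=1}^{N} \log p(x_i \mid \theta), $$

$$ \hat{\theta}_{ML} = \arg\max_{\theta}\, \ell(\theta). $$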
This optimization problem is usually solved by taking the derivative and setting it to zero to find the extremum point. The value at which the function attains its maximum is the model parameter we estimate.
Given the observed sample data, the probability of a new value occurring is
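Writing x̃ for the new observation, the prediction rule is approximated as:

$$ p(\tilde{x} \mid X) \approx p(\tilde{x} \mid \hat{\theta}_{ML}). $$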
Finding the value of the parameter is not the ultimate goal; the ultimate goal is to predict, based on this parameter, the probability that a new event will occur.
Note: there is an approximately-equal sign because an approximation is made: θ is replaced by its point estimate to simplify the calculation. That is, the next sample is assumed to be distributed according to the estimated parameter θ̂_ML.
Example: the coin-toss experiment
Take the coin-toss experiment as an example. The outcomes of the N trials follow a binomial distribution with parameter p, the probability of the event in each trial, which we take to be the probability of getting heads. To estimate p by maximum likelihood, the likelihood function can be written as follows.
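Writing c_i ∈ {0, 1} for the outcome of trial i and n_1, n_0 for the number of heads and tails, a sketch of the likelihood is:

$$ L(p \mid X) = \prod_{i=1}^{N} p^{\,c_i} (1-p)^{\,1-c_i} = p^{\,n_1} (1-p)^{\,n_0}. $$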
Here n_i denotes the number of trials whose outcome is i. The extremum point of the likelihood function is found by setting the derivative of the log-likelihood to zero, and this yields the maximum likelihood estimate of the parameter p.
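In symbols, this derivation reads:

$$ \frac{\partial}{\partial p}\big[\, n_1 \log p + n_0 \log(1-p) \,\big] = \frac{n_1}{p} - \frac{n_0}{1-p} = 0 \;\Longrightarrow\; \hat{p}_{ML} = \frac{n_1}{n_1 + n_0} = \frac{n_1}{N}. $$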
It can be seen that the maximum likelihood estimate of the binomial parameter p equals the relative frequency of the event in the N independent repeated random trials.
If we perform 20 trials and observe 12 heads and 8 tails, the maximum likelihood estimate of the parameter p is 12/20 = 0.6.
Maximum A Posteriori (MAP) Estimation
The maximum a posteriori estimate is similar to the maximum likelihood estimate; the difference is that a prior over the parameter is included in the function being maximized. That is, we no longer maximize just the likelihood function, but the whole posterior probability given by Bayes' formula.
Note: here p(X) does not depend on the parameter, so maximizing the posterior is equivalent to maximizing the numerator.
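In symbols, the MAP problem is:

$$ \hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid X) = \arg\max_{\theta} \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \arg\max_{\theta}\, p(X \mid \theta)\, p(\theta). $$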
By adding this prior distribution we can encode additional information and reduce the risk of overfitting the parameters.
Compared with maximum likelihood estimation, we now also add the logarithm of the prior probability. In practice this prior can describe general knowledge that people already accept. For example, in the coin-toss experiment, the probability of heads on each throw should itself follow a probability distribution that takes its maximum at 0.5; that distribution is the prior distribution. The parameters of the prior distribution are called hyperparameters. In other words, we believe that θ itself follows a prior distribution, and α is its hyperparameter.
In the same way, the MAP estimate of the parameter is the value at which the posterior probability above attains its maximum.
Given the observed sample data, the probability of a new value occurring is
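As in the MLE case, a sketch of the prediction rule, now with the MAP point estimate:

$$ p(\tilde{x} \mid X) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid X)\, d\theta \;\approx\; p(\tilde{x} \mid \hat{\theta}_{MAP}) \int p(\theta \mid X)\, d\theta = p(\tilde{x} \mid \hat{\theta}_{MAP}). $$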
Note: with the MAP approximation the first factor no longer depends on θ (it uses the MAP value), and the integral of the posterior over θ is 1 (the posterior does not change with the new data), so the prediction reduces to the first factor.
The coin-toss example again
We want the prior distribution to take its maximum at 0.5, and for this we can choose the Beta distribution (LZ: the real reason the Beta distribution is chosen is that the Beta distribution is the conjugate prior of the binomial distribution). Its normalizing constant is the Beta function B(α, β), which expands into Gamma functions; when x is a positive integer, Γ(x) reduces to a factorial.
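Stating these standard definitions explicitly:

$$ \mathrm{Beta}(p \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, p^{\alpha-1} (1-p)^{\beta-1}, \qquad B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}, \qquad \Gamma(x) = (x-1)! \ \text{ for positive integer } x. $$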
The random variable of the Beta distribution ranges over [0, 1], so it can be used to produce normalized probability values. The figure shows the probability density function of the Beta distribution for different parameter settings.
We take α = β = 5 so that the prior distribution attains its maximum at 0.5 (looking at the figure above: because we believe a priori that p is roughly 0.5, we set the hyperparameters α and β equal, and here choose them equal to 5).
Now we can solve for the extremum of the MAP objective: taking the derivative with respect to p and setting it to zero, exactly as before, yields the maximum a posteriori estimate of p. The two extra terms compared with the MLE case come from differentiating log p(p | α, β).
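In symbols, with the Beta(α, β) prior added to the log-likelihood:

$$ \frac{\partial}{\partial p}\big[\, n_1 \log p + n_0 \log(1-p) + (\alpha-1)\log p + (\beta-1)\log(1-p) \,\big] = 0 \;\Longrightarrow\; \hat{p}_{MAP} = \frac{n_1 + \alpha - 1}{n_1 + n_0 + \alpha + \beta - 2}. $$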
Compared with the result of maximum likelihood estimation, two extra terms appear; we call them pseudo-counts. Their effect is to pull the estimate of p toward 0.5, because our prior belief is that p is approximately 0.5. The pseudo-counts are how the prior takes effect: the larger the hyperparameters, the more observations are needed to change the belief expressed by the prior, and the more concentrated the corresponding Beta density is around its maximum.
If we perform 20 trials with 12 heads and 8 tails, then according to the MAP estimate the parameter p is 16/28 ≈ 0.571, smaller than the maximum likelihood value of 0.6, which shows the effect on the parameter estimate of the prior belief that "a coin is generally fair".
[ mathematical model in the topic model Topicmodel:lda ]
Bayesian Estimation
Bayesian estimation goes one step beyond MAP: instead of directly estimating a point value for the parameter, it lets the parameter follow a probability distribution. Maximum likelihood estimation and maximum a posteriori estimation both return a single value of the parameter θ (one exactly, one as an approximation); Bayesian inference does not. It extends the MAP method: starting from the prior distribution p(θ) of the parameter and a series of observations X, it obtains the posterior distribution p(θ | X) of the parameter θ, and then takes the expected value of θ under this posterior as the final estimate. In addition, the variance of the parameter can be used to evaluate the accuracy or confidence of the estimate.
Bayesian formula
Here we no longer require the posterior probability to be maximal; instead, the probability of the observed evidence must be expanded with the law of total probability.
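Expanding p(X) with the law of total probability gives:

$$ p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}. $$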
When new data are observed, the posterior probability can be updated automatically. However, computing the probability of the evidence is the tricky part of Bayesian estimation.
Making predictions with Bayesian estimation
If we want to find the probability of a new value occurring, it can be calculated with the following formula.
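The prediction integrates over the posterior distribution of the parameter:

$$ p(\tilde{x} \mid X) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid X)\, d\theta. $$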
At this point the integral over the second factor no longer equals 1, which is a major difference from MLE and MAP.
The coin-toss example again
As in the MAP example above, consider N Bernoulli trials and let the prior distribution of the parameter p (the probability of heads) be the Beta distribution with parameters (5, 5). We then derive the posterior distribution of p from this prior and the outcomes of the N trials. We again assume a Beta prior, but in Bayesian estimation we do not take the parameter value that maximizes the posterior; instead we take the expectation of p under its posterior Beta distribution. That is, we write out the posterior distribution of the parameter directly and then compute its expectation, giving:
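A sketch of the posterior, writing C = {c_1, ..., c_N} for the observed outcomes and n_1, n_0 for the counts of heads and tails:

$$ p(p \mid C, \alpha, \beta) = \frac{p(C \mid p)\, p(p \mid \alpha, \beta)}{\int_0^1 p(C \mid p)\, p(p \mid \alpha, \beta)\, dp} = \frac{p^{\,n_1+\alpha-1} (1-p)^{\,n_0+\beta-1}}{B(n_1+\alpha,\; n_0+\beta)} = \mathrm{Beta}(p \mid n_1+\alpha,\; n_0+\beta), $$

and the Bayesian estimate is the expectation of p under this Beta posterior.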
Note:
1. C is the set of all trial outcomes, with c_i = 1 or 0.
2.
3. The Beta-function formula given above is used here for the normalization.
4. The derivation can also be found in [the mathematical model in the topic model Topicmodel:lda: beta-binomial conjugate part].
From this result, under Bayesian estimation the parameter p follows a new Beta distribution. Recall that the prior distribution we selected for p was a Beta distribution; the posterior obtained by Bayesian estimation for the binomial likelihood with parameter p is still a Beta distribution. This is why we say the binomial distribution and the Beta distribution are conjugate. In the two-outcome case the Beta distribution applies; in the multi-outcome case the Dirichlet distribution applies.
Based on the expectation and variance formulas of the Beta distribution, we have:
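The standard moment formulas for a Beta(a, b) distribution are:

$$ E[p] = \frac{a}{a+b}, \qquad \mathrm{Var}[p] = \frac{ab}{(a+b)^2 (a+b+1)}, \qquad \text{so here } E[p \mid C] = \frac{n_1 + \alpha}{N + \alpha + \beta}. $$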
It can be seen that the expectation of p estimated here differs from the MLE and MAP estimates. If we again perform 20 trials with 12 heads and 8 tails, then by Bayesian estimation p follows a Beta distribution with parameters 12+5 and 8+5, whose mean and variance are 17/30 ≈ 0.567 and 17·13/(31·30²) ≈ 0.0079. The expectation of p is smaller than the estimates obtained by MLE and MAP and is closer to 0.5.
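The following minimal Python sketch (not from the original post; names are illustrative) reproduces the three estimates for this example:

```python
# Coin-toss example from the text: 20 trials, 12 heads, 8 tails, Beta(5, 5) prior.
n_heads, n_tails = 12, 8
alpha, beta = 5, 5                      # hyperparameters of the assumed Beta prior
N = n_heads + n_tails

# Maximum likelihood estimate: relative frequency of heads.
p_mle = n_heads / N                     # 12/20 = 0.6

# MAP estimate: mode of the Beta(n_heads + alpha, n_tails + beta) posterior.
p_map = (n_heads + alpha - 1) / (N + alpha + beta - 2)   # 16/28 ≈ 0.571

# Bayesian estimate: mean (and variance) of the same Beta posterior.
a_post, b_post = n_heads + alpha, n_tails + beta          # Beta(17, 13)
p_bayes = a_post / (a_post + b_post)                      # 17/30 ≈ 0.567
var_bayes = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))  # ≈ 0.0079

print(p_mle, p_map, p_bayes, var_bayes)
```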
Comparison of the parameter estimates from MLE, MAP, and Bayesian estimation
In summary, the estimates of the parameter produced by MLE, MAP, and Bayesian estimation can be compared as follows.
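In formulas, restating the results above:

$$ \hat{\theta}_{ML} = \arg\max_{\theta} \prod_i p(x_i \mid \theta), \qquad \hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta) \prod_i p(x_i \mid \theta), \qquad \hat{\theta}_{Bayes} = E_{p(\theta \mid X)}[\theta]. $$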
LZ: from MLE to MAP to Bayesian estimation, the representation of the parameter becomes more and more complete (and the computation goes from easy to difficult); the resulting estimates move closer and closer to the prior value 0.5 and better and better reflect the true parameter given the sample.
Why doesn't the MLE work well?
While the MLE is guaranteed to maximize the probability of the observed data, we are actually interested in finding estimators that perform well on new data. A serious problem arises from this perspective because the MLE assigns zero probability to elements that have not been observed in the corpus. This means it would assign zero probability to any sequence containing a previously unseen element.
from:http://blog.csdn.net/pipisorry/article/details/51482120
Ref: Gregor Heinrich: Parameter estimation for text analysis*
Parameter estimation (maximum likelihood estimation, maximum a posteriori estimation, Bayesian estimation)*
Parameter estimation for text language models: maximum likelihood estimation, MAP, and Bayesian estimation
Parameter estimation in text analysis, with LDA as an example, English version: Heinrich-gibbslda.pdf
Reading note: Parameter estimation for text analysis and LDA learning summary
Parameter estimation methods for text analysis