Maximum entropy model notes


Over the last two days I have been looking at the maximum entropy model; these are some simple notes, to be extended later. The maximum entropy model is widely used in natural language processing (NLP), for example in text classification. The notes cover three aspects: one, the mathematical definition of entropy; two, the source of the formal definition of entropy; three, the maximum entropy model.

Note: Entropy here refers to information entropy.

One: The mathematical definition of entropy

The definitions of entropy, joint entropy, conditional entropy, relative entropy and mutual information are given below.

Entropy: if the possible values of a random variable X are X = {x1, x2, ..., xn}, with probability distribution P(X = xi) = pi (i = 1, ..., n), the entropy of the random variable X is defined as:

H(X) = -\sum_{i=1}^{n} p_i \log p_i

Moving the leading minus sign inside the logarithm, it becomes:

H(X) = \sum_{i=1}^{n} p_i \log \frac{1}{p_i}
Whichever of the two formulas above is used, they are equivalent and mean the same thing (both will be used below).
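As a quick sanity check, here is a minimal Python sketch (not from the original notes) that evaluates both forms on an example distribution:

```python
import math

p = [0.5, 0.25, 0.25]  # an example probability distribution

# Form 1: H(X) = -sum_i p_i * log(p_i)
h1 = -sum(pi * math.log(pi, 2) for pi in p)

# Form 2: H(X) = sum_i p_i * log(1 / p_i)
h2 = sum(pi * math.log(1 / pi, 2) for pi in p)

print(h1, h2)  # both are 1.5 (bits, since base-2 logarithms are used)
```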

Joint entropy: two random variables X and Y form a joint entropy through their joint distribution, written H(X, Y).

Conditional entropy: the entropy of the random variable Y given that X has occurred is defined as the conditional entropy of Y, written H(Y|X); it measures the remaining uncertainty of Y when X is known.
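Written out explicitly, the standard formulas are:

```latex
H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)
\qquad
H(Y \mid X) = -\sum_{x,y} p(x, y) \log p(y \mid x)
```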

The following identity holds: H(Y|X) = H(X, Y) - H(X); the whole expression says that the entropy of (X, Y) occurring together, minus the entropy of X occurring alone, gives the conditional entropy. As for how this is obtained, see the derivation:

H(X, Y) - H(X)
  = -\sum_{x,y} p(x, y) \log p(x, y) + \sum_{x} p(x) \log p(x)
  = -\sum_{x,y} p(x, y) \log p(x, y) + \sum_{x,y} p(x, y) \log p(x)
  = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)}
  = -\sum_{x,y} p(x, y) \log p(y \mid x)
  = H(Y \mid X)
Relative entropy: also known as the Kullback–Leibler divergence, Kullback entropy, or discrimination information. Let p(x) and q(x) be two probability distributions over the values of X; then the relative entropy of p with respect to q is:

D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
To some extent, relative entropy measures the "distance" between two distributions. Note that D(p||q) ≠ D(q||p) in general, and D(p||q) ≥ 0, which is proved using Jensen's inequality.
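A small Python sketch (illustrative only; the two distributions are made up) makes the asymmetry visible:

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl(p, q))  # about 0.74
print(kl(q, p))  # about 0.53 -> D(p||q) != D(q||p), and both are >= 0
```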

Mutual information: the mutual information of two random variables X and Y is defined as the relative entropy between their joint distribution and the product of their marginal distributions, written I(X, Y):

I(X, Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
That is, I(X, Y) = D(p(x, y) || p(x)p(y)). Now let us compute H(Y) - I(X, Y):

H(Y) - I(X, Y)
  = -\sum_{y} p(y) \log p(y) - \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
  = -\sum_{x,y} p(x, y) \log p(y) - \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
  = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)}
  = -\sum_{x,y} p(x, y) \log p(y \mid x)
  = H(Y \mid X)
The calculation shows that H(Y) - I(X, Y) = H(Y|X). So the definition of conditional entropy, H(Y|X) = H(X, Y) - H(X), expands via the mutual information to H(Y|X) = H(Y) - I(X, Y); combining the two gives I(X, Y) = H(X) + H(Y) - H(X, Y), which is how most texts define mutual information.
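These identities can be checked numerically on any small joint distribution; here is a minimal sketch (the joint distribution below is made up for illustration):

```python
import math

# A made-up joint distribution p(x, y) over two binary variables
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(v for (a, _), v in pxy.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in pxy.items() if b == y) for y in (0, 1)}

def H(dist):
    """Entropy of a distribution given as a dict of probabilities."""
    return -sum(p * math.log(p, 2) for p in dist.values() if p > 0)

# Mutual information as the relative entropy D(p(x,y) || p(x)p(y))
I = sum(v * math.log(v / (px[x] * py[y]), 2) for (x, y), v in pxy.items() if v > 0)

print(abs((H(py) - I) - (H(pxy) - H(px))) < 1e-12)  # H(Y) - I(X,Y) == H(Y|X)
print(abs(I - (H(px) + H(py) - H(pxy))) < 1e-12)    # I(X,Y) == H(X)+H(Y)-H(X,Y)
```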

Two: The source of the formal definition of entropy

Here are a few simple examples:

Example 1: Suppose there are 5 coins, numbered 1, 2, 3, 4, 5, one of which is fake and lighter than the others. There is a balance scale; each use of the balance compares two piles of coins, and the result can be one of three outcomes:

- the left side is lighter than the right

- the right side is lighter than the left

- the two sides weigh the same

Question: how many times must the balance be used to be sure of finding the fake coin?

(a primary-school math contest problem :P)

Solution: Let X denote the index of the fake coin, x ∈ X = {1, 2, 3, 4, 5}; let Y denote the outcome of one weighing, y ∈ Y = {1, 2, 3}, where 1 means the left side is lighter, 2 means the right side is lighter, and 3 means the two sides balance. Using the balance n times produces a sequence of results y1 y2 ... yn; the number of all possible sequences is 3^n. We want y1 y2 ... yn to determine x, so each sequence y1 y2 ... yn can correspond to at most one value of x.

Since x can be any value in X and we must always be able to identify it, every value of x must have at least one sequence y1 y2 ... yn corresponding to it. By the pigeonhole principle:

|Y|^n ≥ |X|. Taking the logarithm of both sides gives n·log|Y| ≥ log|X|. Note also that log|X| = (1/|X|)·log|X| + (1/|X|)·log|X| + ... (one term per coin: each of the |X| coins is the fake with probability p = 1/|X| and contributes log(1/p)). So log|X| can be seen as the uncertainty about which coin is fake, and log|Y| as the expressive power of one weighing; n then says how much expressive power is needed to cover that uncertainty. Therefore n ≥ log|X| / log|Y|.
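Plugging in the numbers of this problem (a quick check, not part of the original notes):

```python
import math

X, Y = 5, 3  # 5 candidate fake coins, 3 outcomes per weighing
print(math.ceil(math.log(X) / math.log(Y)))  # 2 -> at least two weighings, since 3^1 < 5 <= 3^2
```

Two weighings are indeed enough: weigh coins {1, 2} against {3, 4}; if they balance, coin 5 is the fake, otherwise weigh the two coins on the lighter side against each other.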

Why use log|Y| (which equals H(Y) when the outcomes are equally likely) to measure the expressive power of Y? Because n weighings can distinguish |Y|^n outcomes; taking the logarithm of this form lets n be pulled out as a factor, and since n is not a property of Y itself it is dropped, leaving log|Y|.

"uncertainty" and "descriptive ability" both express the extent to which a variable can change. This is the ability to express when the variable is used to represent another variable. This is the degree of uncertainty when the variable is represented by a variable. And this degree of variability is the entropy of a variable (Entropy). Obviously: entropy is not related to the meaning of the variable itself, but only to the range of possible values of the variable.

A variant of the problem:

Suppose there are 5 coins, numbered 1, ..., 5, one of which is fake and lighter than the others. The probability that coin 1 is the fake is 1/3, the probability that coin 2 is the fake is 1/3, and each of the remaining coins is the fake with probability 1/9.

There is a balance scale; each use of the balance compares two piles of coins, and the result can be one of three outcomes:

- the left side is lighter than the right

- the right side is lighter than the left

- the two sides weigh the same

Suppose the balance is used n times to find the fake coin. Question: what is the minimum expected value of n?

(no longer a primary-school problem :P)

Following the same idea, we can easily work out H(X) = (1/3)·log 3 + (1/3)·log 3 + 3·(1/9)·log 9, which is again the expected uncertainty about which coin is the fake.

H(Y) = log 3, so the expected number of weighings n must be at least H(X)/H(Y).
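Numerically (a small sketch; base-3 logarithms are used so that one weighing carries one unit of information):

```python
import math

# P(coin i is the fake): coins 1 and 2 with probability 1/3, coins 3-5 with 1/9
probs = [1/3, 1/3, 1/9, 1/9, 1/9]

Hx = sum(p * math.log(1 / p, 3) for p in probs)  # entropy measured in base 3
Hy = 1.0                                          # H(Y) = log_3(3) = 1

print(Hx / Hy)  # ~1.333 -> the expected number of weighings is at least 4/3
```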

This, then, is where the formal definition of entropy comes from; I hope it helps the intuition.

Three: Maximum entropy model

For this part, see July's blog post: http://blog.csdn.net/v_july_v/article/details/40508465

The main points recorded here:

1: The maximum entropy principle is to acknowledge what is known (the knowledge) and to make no assumptions about what is unknown, i.e., to remain unbiased. From an investment point of view, this is the least risky approach; from an information-theoretic point of view, it preserves the greatest uncertainty, that is, it maximizes the entropy (the analogue of minimizing investment risk).

2: The complete description of the maximum entropy model is as follows:

\max_{P \in C} \; H(P) = -\sum_{x,y} \tilde P(x)\, P(y \mid x)\, \log P(y \mid x)

The constraints are:

E_P(f_i) = E_{\tilde P}(f_i), \quad i = 1, 2, \ldots, n
\sum_{y} P(y \mid x) = 1

The Lagrangian function is:

L(P, \lambda) = -H(P) + \lambda_0 \Big(1 - \sum_{y} P(y \mid x)\Big) + \sum_{i=1}^{n} \lambda_i \big(E_{\tilde P}(f_i) - E_P(f_i)\big)

Solving yields the following result:

P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_{i=1}^{n} \lambda_i f_i(x, y)\Big)

where

Z_\lambda(x) = \sum_{y} \exp\Big(\sum_{i=1}^{n} \lambda_i f_i(x, y)\Big)

This leads to the final problem to be solved, the dual optimization problem:

\max_{\lambda} \; \Psi(\lambda), \quad \text{with } \Psi(\lambda) = \min_{P} L(P, \lambda)
The maximum entropy model belongs to the family of log-linear models. Because it contains exponential functions, it is almost impossible to obtain an analytic solution; in other words, the solution has to be found numerically. Can we then construct an approximating function F(λ) and find its maximum/minimum instead?
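As an aside before the optimization algorithms: here is a minimal Python sketch of how P_λ(y|x) is evaluated once the weights are known. The feature functions, weights, and the tiny text-classification example are hypothetical, and training (finding λ) is not shown:

```python
import math

def p_lambda(x, y, labels, features, lam):
    """Conditional probability P_lambda(y|x) of a maximum entropy model.

    features: list of feature functions f_i(x, y) returning 0.0 or 1.0
    lam:      list of weights lambda_i, one per feature function
    labels:   all possible values of y (needed for the normalizer Z_lambda(x))
    """
    def score(yy):
        return sum(li * fi(x, yy) for li, fi in zip(lam, features))

    z = sum(math.exp(score(yy)) for yy in labels)  # Z_lambda(x)
    return math.exp(score(y)) / z

# Hypothetical example: two binary features for a tiny text-classification task
features = [
    lambda x, y: 1.0 if ("ball" in x and y == "sports") else 0.0,
    lambda x, y: 1.0 if ("vote" in x and y == "politics") else 0.0,
]
lam = [1.2, 0.8]                 # assumed weights (in practice learned, e.g. by IIS)
labels = ["sports", "politics"]

print(p_lambda({"ball", "score"}, "sports", labels, features, lam))  # about 0.77
```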

IIS (Improved Iterative Scaling) is a commonly used optimization algorithm for the maximum entropy model, and usually works better than plain gradient descent (the dual is an unconstrained optimization problem whose derivative does not yield an analytic solution, so gradient methods, Newton's method, quasi-Newton methods, and Generalized Iterative Scaling (GIS) can of course also be used).

The core idea of improved iterative scaling (IIS): suppose the current parameter vector of the maximum entropy model is λ; we look for a new parameter vector λ + δ that increases the log-likelihood L of the current model. This process is repeated until the maximum of the log-likelihood is reached.
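Below is a minimal sketch of this iterative idea, under the simplifying assumption that ∑_i f_i(x, y) = M is the same for every (x, y); in that special case the update for each weight has the closed form δ_i = (1/M)·log(E_P̃[f_i] / E_P[f_i]), which coincides with the GIS update. The data layout and names are assumptions, not the original author's code:

```python
import math

def train_iis_simplified(samples, labels, features, M, iters=100):
    """Simplified iterative scaling for a maximum entropy model.

    Assumes sum_i f_i(x, y) == M for every (x, y) (the GIS condition), so each
    update has the closed form delta_i = (1/M) * log(E_emp[f_i] / E_model[f_i]).
    samples is a list of (x, y) pairs defining the empirical distribution.
    """
    lam = [0.0] * len(features)
    n = len(samples)

    # Empirical expectations E_P~[f_i] (each feature is assumed to fire in the data)
    e_emp = [sum(f(x, y) for x, y in samples) / n for f in features]

    def p_lam(x, y):
        def score(yy):
            return sum(li * fi(x, yy) for li, fi in zip(lam, features))
        z = sum(math.exp(score(yy)) for yy in labels)
        return math.exp(score(y)) / z

    for _ in range(iters):
        # Model expectations E_P[f_i] under the current parameters lambda
        e_mod = [sum(p_lam(x, yy) * f(x, yy) for x, _ in samples for yy in labels) / n
                 for f in features]
        # Each step increases the log-likelihood of the training data
        lam = [li + math.log(ee / em) / M for li, ee, em in zip(lam, e_emp, e_mod)]
    return lam
```

In the general case, IIS solves a one-dimensional equation for each δ_i (for example by Newton's method) instead of using this closed form.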

3: Substituting the optimal solution P_λ(y|x) into the formula for maximum likelihood estimation, we find that the maximum likelihood objective in the parameters λ is the same as the maximum entropy (dual) objective. It follows that the maximum entropy solution (the unbiased treatment of uncertainty) is also the solution that best fits the sample distribution, which further justifies the maximum entropy model: the maximum entropy model treats uncertainty without bias, and maximum likelihood estimation treats the knowledge in the data without bias.
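Concretely (following the standard derivation, e.g. in Statistical Learning Methods, with P̃ denoting the empirical distribution), substituting P_λ(y|x) into the conditional log-likelihood gives:

```latex
L_{\tilde P}(P_\lambda)
  = \sum_{x,y} \tilde P(x, y) \log P_\lambda(y \mid x)
  = \sum_{x,y} \tilde P(x, y) \sum_{i=1}^{n} \lambda_i f_i(x, y)
    - \sum_{x} \tilde P(x) \log Z_\lambda(x)
```

which is exactly the dual function Ψ(λ) of the maximum entropy problem, so maximizing the dual and maximizing the likelihood yield the same λ.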


Questions: (1) One thing I have not figured out yet: when explaining the maximum entropy model, why the feature function is introduced to derive the constraints, and what a concrete, intuitive example of it looks like. This probably requires working through a specific text-classification case; if you know, please leave a comment. Thank you!

(2) I have not yet gone through the mathematical derivation of IIS; I will leave it for when I actually need to use it.


Reference documents:

1: http://blog.csdn.net/v_july_v/article/details/40508465 (July, derivation of the maximum entropy model)

2: http://blog.csdn.net/daoqinglin/article/details/6906421

3: Statistical Learning Methods, Li Hang

4: http://jiangtanghu.com/docs/cn/maxent.pdf (maximum entropy reading notes)
