Mathematical deduction in the maximum entropy model


0 Introduction

After finishing the SVM article, I had always wanted to continue the machine learning series, but because my time was unstable and my understanding of the models and algorithms was not deep enough, I kept putting off writing. Coincidentally, just as the rewrite of KMP benefited from the algorithms class I organized this April, continuing this machine learning series benefits from the machine learning class organized this October.

In the 6th session of the machine learning class on October 26, Shambo, one of the lecturers, presented the maximum entropy model: starting from the concept of entropy, he explained why we maximize entropy, derived the maximum entropy model, and presented the IIS method for solving its parameters. The whole lecture was very fluent, especially the mathematical derivations. That evening I shared his PPT publicly on Weibo, but friends who had not attended the class would find the PPT very jumpy on its own, so I decided to write a series of blog posts based on the class, which also continues the unfinished machine learning series on this blog.

In short, this article combines Shambo's maximum entropy PPT with other related material; it can be regarded as course notes or a study summary, and it focuses mainly on the derivations. If you have any suggestions or comments, please feel free to point them out in the comments. Thanks.

1 Preliminary knowledge

To better understand this article, you need some basic facts from probability theory:
    1. A capital letter X denotes a random variable, and a lowercase x denotes a specific value taken by the random variable X;
    2. P(X) denotes the probability distribution of the random variable X, P(X, Y) denotes the joint probability distribution of the random variables X and Y, and P(Y|X) denotes the conditional probability distribution of the random variable Y given the random variable X;
    3. P(X = x) denotes the probability that the random variable X takes the specific value x, abbreviated as P(x);
    4. P(X = x, Y = y) denotes a joint probability, abbreviated as P(x, y); P(Y = y|X = x) denotes a conditional probability, abbreviated as P(y|x); and P(x, y) = P(x) * P(y|x).
The facts you need about differentiation and extrema are:
    1. If the function y = f(x) is continuous on [a, b] and differentiable on (a, b), and its derivative f'(x) > 0, then f(x) is monotonically increasing on [a, b]; if f'(x) < 0 it is monotonically decreasing. If the second derivative f''(x) > 0, the function curves upward (is convex) on [a, b]; conversely, if f''(x) < 0, it curves downward (is concave) on [a, b].
    2. When the function f(x) attains an extremum at x0, its derivative there satisfies f'(x0) = 0.
    3. Taking the bivariate function z = f(x, y) as an example, fix y and treat x as the only independent variable; the derivative with respect to x is then called the partial derivative of z = f(x, y) with respect to x.
    4. To convert a constrained extremum problem into an unconstrained one, introduce Lagrange multipliers, build the Lagrange function, set its derivatives equal to 0, and solve for the extremum (a small worked example follows this list).
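As a small worked example of item 4 (added here for illustration; it is not part of the original notes, but it previews the "unbiased principle" of Section 3): maximize the entropy H(p) = -\sum_{i=1}^{n} p_i \log p_i subject to \sum_{i=1}^{n} p_i = 1. The Lagrange function is

    L(p, \lambda) = -\sum_{i=1}^{n} p_i \log p_i + \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big)

    \frac{\partial L}{\partial p_i} = -\log p_i - 1 + \lambda = 0 \;\Rightarrow\; p_i = e^{\lambda - 1}

so every p_i takes the same value, and the normalization constraint then forces p_i = 1/n: with no constraints other than normalization, the maximum entropy distribution is the uniform distribution.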

For more background, see the textbooks Advanced Mathematics and Probability Theory and Mathematical Statistics, or refer to this blog post: Knowledge of probability theory and mathematical statistics required in data mining.

2 What is entropy?

Judging from the name alone, entropy gives people a vague, hard-to-grasp feeling. In fact, the definition of entropy is simple: it represents the uncertainty of a random variable. The reason people find it vague is probably that they do not know why it was given such a name, or how it is used.

The concept of entropy originated in physics, where it measures the degree of disorder of a thermodynamic system. In information theory, entropy is a measure of uncertainty.

2.1 Introduction of Entropy

In fact, the English term is "entropy", originally proposed by the German physicist Rudolf Clausius, whose expression (for a reversible process) is:

    dS = \frac{\delta Q}{T}

It describes the state a system settles into when it is not disturbed from outside, i.e. its most stable state. Later, when a Chinese scholar translated "entropy", he considered that entropy is the quotient of the heat Q and the temperature T, and that it is related to fire, so he rendered it with the character 熵, built from the fire radical and the character for "quotient".

We know that the normal state of any particle is random motion, that is, "disordered motion", and that if particles are to become "ordered", energy must be expended. Therefore, temperature (thermal energy) can be regarded as a measure of "ordering", while "entropy" can be regarded as a measure of "disorder".

If there is no input of external energy, a closed system tends to become more and more chaotic (its entropy increases). For example, if a room is never cleaned, it will not become cleaner (more ordered) by itself; it can only become messier (more disordered). To make a system more ordered, external energy must be put in.

In 1948, Claude E. Shannon introduced information entropy, which is defined in terms of the probabilities of occurrence of discrete random events. The more ordered a system is, the lower its information entropy; conversely, the more chaotic a system is, the higher its information entropy. Information entropy can therefore be regarded as a measure of the degree of order of a system.
If not noted otherwise, all of the entropy mentioned below is information entropy.

2.2 Entropy, joint entropy, conditional entropy, relative entropy, and mutual information

The definitions of entropy, joint entropy, conditional entropy, relative entropy, and mutual information are given below, respectively.

Entropy: if the possible values of a random variable X are {x1, x2, ..., xn}, with probability distribution P(X = xi) = pi (i = 1, 2, ..., n), then the entropy of the random variable X is defined as:

    H(X) = -\sum_{i=1}^{n} p_i \log p_i

Moving the leading minus sign inside the logarithm, it becomes:

    H(X) = \sum_{i=1}^{n} p_i \log \frac{1}{p_i}

Whichever of the two formulas above you use for entropy, they are the same: the two are equivalent and have one meaning (both forms will be used below).
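As a quick numerical check, here is a minimal Python sketch (added here; it is not part of the original notes) showing that the two forms give the same value:

    import math

    def entropy_form1(p):
        # H(X) = -sum_i p_i * log(p_i)
        return -sum(pi * math.log(pi) for pi in p if pi > 0)

    def entropy_form2(p):
        # H(X) = sum_i p_i * log(1 / p_i), i.e. the minus sign moved inside the log
        return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)

    p = [0.5, 0.25, 0.125, 0.125]
    print(entropy_form1(p), entropy_form2(p))  # both print about 1.2130 (natural log, so in nats)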

Joint entropy: two random variables X and Y have a joint distribution, which defines the joint entropy, denoted H(X, Y):

    H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)
Conditional entropy: given the random variable X, the remaining uncertainty in the occurrence of the random variable Y is defined as the conditional entropy of Y, denoted H(Y|X); it measures the uncertainty of the random variable Y when the random variable X is known:

    H(Y|X) = -\sum_{x,y} p(x, y) \log p(y|x)

The following identity also holds: H(Y|X) = H(X, Y) - H(X); that is, the entropy of (X, Y) occurring jointly minus the entropy of X occurring alone. To see how it is obtained, see the derivation:
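The derivation, reconstructed here from the line-by-line explanation that follows, is:

    H(X, Y) - H(X)
    = -\sum_{x,y} p(x, y) \log p(x, y) + \sum_{x} p(x) \log p(x)
    = -\sum_{x,y} p(x, y) \log p(x, y) + \sum_{x} \Big( \sum_{y} p(x, y) \Big) \log p(x)
    = -\sum_{x,y} p(x, y) \log p(x, y) + \sum_{x,y} p(x, y) \log p(x)
    = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)}
    = -\sum_{x,y} p(x, y) \log p(y|x)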

A brief explanation of the derivation above. The whole equation has six lines, of which:

    • the second line becomes the third because the marginal distribution P(x) equals the sum of the joint distribution P(x, y) over y;
    • the third line becomes the fourth by multiplying the common factor log P(x) into the inner sum and then writing the sums over x and y together;
    • the fifth line follows because both sums contain P(x, y): extract the common factor P(x, y), and the remaining -(log P(x, y) - log P(x)) is written as -log (P(x, y)/P(x));
    • the fifth line becomes the sixth because P(x, y) = P(x) * P(y|x), so P(x, y)/P(x) = P(y|x).

Relative entropy: also known as mutual entropy, cross entropy, discrimination information, Kullback entropy, or Kullback-Leibler divergence. Let P(x) and Q(x) be two probability distributions over the values of X; then the relative entropy of P with respect to Q is:

    D(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

To some extent, relative entropy can measure the "distance" between two distributions, although D(P||Q) ≠ D(Q||P) in general. It is also worth mentioning that D(P||Q) is always greater than or equal to 0.
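A minimal Python sketch (added here, not part of the original notes) that illustrates both properties on a made-up pair of distributions:

    import math

    def kl_divergence(p, q):
        # D(P || Q) = sum_x P(x) * log(P(x) / Q(x)); assumes Q(x) > 0 wherever P(x) > 0
        return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

    p = [0.7, 0.2, 0.1]
    q = [0.3, 0.4, 0.3]
    print(kl_divergence(p, q))  # about 0.345, non-negative
    print(kl_divergence(q, p))  # about 0.353, a different value, so D(P||Q) != D(Q||P)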

Mutual information: the mutual information of two random variables X and Y is defined as the relative entropy between the joint distribution of X and Y and the product of their marginal distributions, denoted I(X, Y):

    I(X, Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}

so that I(X, Y) = D(P(X, Y) || P(X)P(Y)). Next, let us calculate H(Y) - I(X, Y), as follows:
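The calculation, reconstructed here from the standard definitions above, is:

    H(Y) - I(X, Y)
    = -\sum_{y} p(y) \log p(y) - \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
    = -\sum_{x,y} p(x, y) \log p(y) - \sum_{x,y} p(x, y) \log p(x, y) + \sum_{x,y} p(x, y) \log \big( p(x)\, p(y) \big)
    = -\sum_{x,y} p(x, y) \log p(x, y) + \sum_{x,y} p(x, y) \log p(x)
    = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)}
    = -\sum_{x,y} p(x, y) \log p(y|x) = H(Y|X)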

Through the calculation above, we find that H(Y) - I(X, Y) = H(Y|X). So the definition of conditional entropy gives H(Y|X) = H(X, Y) - H(X), while expanding via the definition of mutual information gives H(Y|X) = H(Y) - I(X, Y); combining the two yields I(X, Y) = H(X) + H(Y) - H(X, Y), which is the definition of mutual information used in most of the literature.

3 The maximum entropy

Entropy measures the uncertainty of a random variable: the greater the uncertainty, the greater the entropy, and if the random variable degenerates to a fixed value, its entropy is 0. If there is no external disturbance, a random variable always tends toward disorder, and after evolving steadily for long enough it should reach its maximum degree of entropy.

To estimate the state of a random variable as accurately as possible, we generally and habitually maximize the entropy, holding that the maximum entropy model is the best model among the set of all possible probability models (distributions). In other words, under the premise of the known knowledge, the most reasonable inference about the unknown part of the distribution is the one that is most uncertain, or most random, while still conforming to that knowledge. The principle is to acknowledge what is known (the knowledge) and to make no assumptions about the unknown, without any bias.

For example, when throwing a die, if asked "what is the probability of each face coming up", you would say they are equally likely, i.e. each face appears with probability 1/6. Because nothing is certain about this die we know nothing about, assuming each face equally probable is the most reasonable choice. From the perspective of investment, this is the least risky approach; from the perspective of information theory, it retains the greatest uncertainty, that is, it maximizes the entropy.

3.1 The unbiased principle

Here is the example cited by most articles on the maximum entropy model. Suppose the word "learning" appears in an article: is it a subject, a predicate, or an object? In other words, "learning" may be a verb or a noun, so it could be tagged as subject, predicate, object, attributive, and so on.
    • Let x1 denote that "learning" is tagged as a noun, and x2 that "learning" is tagged as a verb.
    • Let y1 denote that "learning" is tagged as the subject, y2 that it is tagged as the predicate, y3 the object, and y4 an attributive.
These probabilities must each sum to 1, that is, P(x1) + P(x2) = 1 and P(y1) + P(y2) + P(y3) + P(y4) = 1. According to the unbiased principle, we take each value of a distribution to be equally probable, and so obtain:

    P(x1) = P(x2) = 0.5, and P(y1) = P(y2) = P(y3) = P(y4) = 0.25

Because there is no prior knowledge, this judgment is reasonable. But what if there is some prior knowledge?

Going further, suppose it is known that the probability of "learning" being tagged as an attributive is very small, only 0.05, i.e. P(y4) = 0.05. The rest still follows the unbiased principle, which gives:

    P(x1) = P(x2) = 0.5, and P(y1) = P(y2) = P(y3) = 0.95/3
Further still, suppose that when "learning" is tagged as the noun x1, the probability that it is tagged as the predicate y2 is 0.95, i.e. P(y2|x1) = 0.95. At this point we still need to adhere to the unbiased principle and make the probability distribution as even as possible. But how do we obtain a distribution that is as unbiased as possible?

Both practical experience and theoretical calculation show that, in the completely unconstrained case, the uniform distribution is exactly the maximum entropy distribution (under constraints, the maximum entropy distribution is not necessarily the equal-probability uniform one). For example, given the mean and the variance, the maximum entropy distribution is the normal distribution. The problem is therefore transformed into: compute the distribution of X and Y such that H(Y|X) reaches its maximum value while satisfying the following conditions:

This also reveals the essence of the maximum entropy model: the problem it solves is, knowing X, to compute the conditional probability of Y while keeping the distribution of Y as unbiased, i.e. of maximum entropy, as possible (in practice, X might be the context of a word and Y the probabilities that the word is translated as me, I, us or we, respectively). Making the most accurate inference about the unknown from the known information is exactly the problem the maximum entropy model solves.

Equivalently: given X, compute Y so that the conditional distribution has the largest possible entropy. Converted into a formula, this means maximizing the following H(Y|X):

    H(Y|X) = -\sum_{x,y} p(x, y) \log p(y|x)

while satisfying the following four constraints (the two normalization conditions plus the two pieces of prior knowledge from the example):

    P(x1) + P(x2) = 1,  P(y1) + P(y2) + P(y3) + P(y4) = 1,  P(y4) = 0.05,  P(y2|x1) = 0.95

3.2 Representation of the maximum entropy model

Thus, with the objective function and the constraints in hand, we can write the general expression of the maximum entropy model as follows:

    \max_{P \in \mathcal{P}} \; H(Y|X)

where \mathcal{P} = {P | P is a probability distribution on X that satisfies the constraints}. Before proceeding, define features, samples, and feature functions as follows.

Feature: a pair (x, y), where
    • y is the information that needs to be determined in this feature;
    • x is the contextual information in this feature.
Sample: a set of instances of a feature (x, y), describing how the grammatical phenomenon expressed by the feature is distributed in the reference corpus; it consists of pairs (xi, yi), where yi is an instance of y and xi is the context of yi. For a feature (x0, y0), define the feature function:

    f(x, y) = 1  if y = y0 and x = x0;  f(x, y) = 0  otherwise

The expected value of the feature function with respect to the empirical distribution of the sample is:

    E_{\tilde p}(f) = \sum_{x,y} \tilde p(x, y)\, f(x, y)

where \tilde p(x, y) = count(x, y) / N is the empirical joint probability and N is the sample size. The expected value of the feature function with respect to the model P(y|x) and the empirical distribution \tilde p(x) is:

    E_{P}(f) = \sum_{x,y} \tilde p(x)\, P(y|x)\, f(x, y)

In other words, if the model is able to capture the information in the training data, then these two expected values should be equal, namely:

    E_{P}(f) = E_{\tilde p}(f)

However, because the true marginal P(x) is hard to obtain in practice, the empirical probability \tilde p(x) of x appearing in the sample is used in place of the probability of x in the overall distribution, which gives the complete expression of the maximum entropy model:

    \max_{P} \; H(Y|X) = -\sum_{x,y} \tilde p(x)\, P(y|x) \log P(y|x)

The constraints are:

    E_{P}(f_i) = E_{\tilde p}(f_i), \quad i = 1, 2, \ldots, n, \qquad \sum_{y} P(y|x) = 1

The problem is now: given several conditions, find the values of several variables that maximize the objective function (the entropy). Its mathematical essence is an optimization problem. Since the constraints are linear equations and the objective function is nonlinear, the problem is a nonlinear program with linear constraints, and the original constrained optimization can be transformed into an unconstrained dual problem by introducing the Lagrange function.

Duality problem in 3.3 convex optimization

Considering that many problems in machine learning revolve around some form of "optimization", and that convex optimization is the most common kind, the duality problem in convex optimization is briefly explained here as a natural transition.

A general optimization problem can be expressed in the following form:

    \min_{x} \; f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \ldots, m, \qquad h_j(x) = 0, \; j = 1, \ldots, q

Here, what follows "subject to" are the constraints: the f_i(x) ≤ 0 are the inequality constraints and the h_j(x) = 0 are the equality constraints.

The Lagrange function can then be established by introducing the Lagrange multipliers λ and ν, as follows:

    L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{q} \nu_j h_j(x)

For fixed x, the Lagrange function L(x, λ, ν) is an affine function of λ and ν.
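For completeness (this step is standard in convex optimization and is only implicit in the surviving text): the Lagrange dual function is

    g(\lambda, \nu) = \inf_{x} L(x, \lambda, \nu)

and the dual problem is

    \max_{\lambda \ge 0, \, \nu} \; g(\lambda, \nu)

whose optimal value is a lower bound on the optimal value of the original (primal) problem; when strong duality holds, the two values coincide, which is what allows the constrained maximum entropy problem below to be solved through its dual.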

3.4 Exponential-form solution of the dual problem

To solve the original problem, introduce the Lagrange multipliers λ0, λ1, λ2, ..., λi, define the Lagrange function, and convert the problem into its dual:

Then take the partial derivative with respect to P(y|x):

Note: the partial derivative above is taken with respect to P(y|x); that is, only P(y|x) is treated as the unknown and everything else is a constant. Therefore, when differentiating with respect to a particular P(y0|x0), only the terms that actually contain P(y0|x0) contribute; the other (x, y) terms do not multiply P(y0|x0), they are constant terms, and constant terms vanish when the partial derivative is taken.

Setting the partial derivative above equal to 0 and solving gives:

Rearranging further:

    P(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \lambda_i f_i(x, y) \Big)

where Z (x) is called a normalization factor.

According to one of the previous constraints, \sum_{y} P(y|x) = 1, we therefore have

thereby obtaining

    Z(x) = \sum_{y} \exp\Big( \sum_{i} \lambda_i f_i(x, y) \Big)
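To make the exponential form concrete, here is a minimal Python sketch (the feature functions, weights and words are made up for illustration and are not from the original notes):

    import math

    def p_y_given_x(x, y, labels, features, lambdas):
        # Exponential-form conditional probability with normalization factor Z(x)
        def score(x_, y_):
            return sum(lam * f(x_, y_) for lam, f in zip(lambdas, features))
        z_x = sum(math.exp(score(x, y_)) for y_ in labels)   # Z(x): sum over all candidate labels
        return math.exp(score(x, y)) / z_x

    # Hypothetical binary features for a toy tagging problem
    features = [
        lambda x, y: 1.0 if x == "learning" and y == "noun" else 0.0,
        lambda x, y: 1.0 if x == "learning" and y == "verb" else 0.0,
    ]
    lambdas = [1.2, 0.3]          # example weights
    labels = ["noun", "verb"]
    print(p_y_given_x("learning", "noun", labels, features, lambdas))  # about 0.71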

Now substitute the optimal solution P*(y|x) obtained above back into the previously established Lagrange function L

to get an equation in λ alone:

Note: in the final step of the derivation, simply substitute the results obtained earlier into the calculation.

Next, look back at the equation:

It can be seen that the maximum entropy model belongs to the family of log-linear models. Because it contains an exponential function, an analytic solution for the parameters is almost impossible; in other words, even with the analytic form of the solution in hand, a numerical method is still needed. So, can we find another, approximate route: construct a function F(λ) and find its maximum or minimum?

The problem is equivalently converted into: among the probability distribution models, find the distribution closest to the sample. How do we find it? You might think of maximum likelihood estimation.

3.5 Maximum likelihood estimation of the maximum entropy model

I recall saying on Weibo on January 13: so-called maximum likelihood means "maximally likely". In the situation where the model is fixed but the parameter θ is unknown, it is an idea, or method, for estimating θ from the observed data; in other words, it answers the question of which value of θ makes the probability of producing the observed data the largest.

For example, suppose we want to survey the heights of a country's population. First assume that height follows a normal distribution, but with unknown mean and variance. Since there is not enough manpower and money to measure the height of every person in the country, we can obtain the heights of some people by sampling (all samples are required to be independent and identically distributed), and then estimate the mean and variance of the assumed normal distribution by maximum likelihood.

The general form of maximum likelihood estimation (MLE) can be expressed as:

    L_{\tilde p}(P) = \prod_{x} P(x)^{\tilde p(x)}

where P(x) is the probability distribution estimated by the model and \tilde p(x) is the empirical distribution obtained from the observed results.

This can be converted further:

Taking the logarithm on both sides of the above:

Because the second term of the final result above is a constant (it involves only the empirical joint probability of the sample and the empirical marginal of x, both of which are fixed values), the end result is:
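A reconstruction of this chain of steps (under the standard setup, with \tilde p(x, y) the empirical joint distribution of the sample and the joint probability modeled as \tilde p(x) P(y|x)):

    L_{\tilde p}(P) = \log \prod_{x,y} P(x, y)^{\tilde p(x, y)}
    = \sum_{x,y} \tilde p(x, y) \log P(x, y)
    = \sum_{x,y} \tilde p(x, y) \log \big( \tilde p(x)\, P(y|x) \big)
    = \sum_{x,y} \tilde p(x, y) \log P(y|x) + \sum_{x,y} \tilde p(x, y) \log \tilde p(x)

Since the second term does not depend on the model, maximizing the likelihood is equivalent to maximizing \sum_{x,y} \tilde p(x, y) \log P(y|x).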

At this point, we notice that the maximum likelihood objective and the definition of conditional entropy look very similar, so we can make the bold guess that they are very likely the same thing, in the sense that the objective functions they lead to coincide. Let us derive it and verify this guess.

Substituting the maximum entropy solution obtained earlier into the MLE objective and computing (the right-hand side is simply the left-hand side carried a few more steps), we obtain:

Note: here \tilde p(x, y) = \tilde p(x) * P(y|x), and \sum_{y} P(y|x) = 1.

Record this as the result of the maximum likelihood estimation.

The maximal solution of the dual problem obtained earlier:

differs from it by only a "-" sign, so by also attaching a minus sign to the maximization of the dual problem, it is equivalently converted into a minimization of the dual problem:

which has exactly the same objective function as the result of the maximum likelihood estimation.

In other words, minimizing the (negated) dual problem of the maximum entropy model is equivalent to maximum likelihood estimation of the maximum entropy model.

Given the soundness of MLE, we can conclude that the maximum entropy solution (which treats uncertainty without bias) is exactly the solution that best fits the sample data, which further justifies the maximum entropy model. Comparing the two: entropy measures uncertainty, while likelihood expresses agreement with knowledge; going further, the maximum entropy model distributes the uncertainty without bias, while maximum likelihood estimation is an unbiased understanding of the knowledge.

4 Solving for the parameters: IIS

First, recall the solution of the maximum entropy model obtained earlier:

    P_\lambda(y|x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_{i} \lambda_i f_i(x, y) \Big)

where

    Z_\lambda(x) = \sum_{y} \exp\Big( \sum_{i} \lambda_i f_i(x, y) \Big)

The log-likelihood function is:

    L(\lambda) = \sum_{x,y} \tilde p(x, y) \sum_{i} \lambda_i f_i(x, y) - \sum_{x} \tilde p(x) \log Z_\lambda(x)

The current problem is thus equivalent to solving the maximum entropy model by maximizing the likelihood, that is, finding the parameter vector λ that maximizes the log-likelihood function. This is usually done with an iterative algorithm, such as the improved iterative scaling method (IIS), gradient descent, Newton's method, or a quasi-Newton method. Below we mainly introduce the improved iterative scaling method, IIS.

The core idea of the improved iterative scaling method (IIS) is: assume the current parameter vector of the maximum entropy model is λ, and look for a new parameter vector λ + δ such that the log-likelihood of the current model increases. Repeat this process until the maximum of the log-likelihood is reached.

Below, we compute the change in the log-likelihood when the parameter goes from λ to λ + δ, namely L(λ + δ) - L(λ). Using the inequality -ln x ≥ 1 - x for x > 0, a lower bound on the increase of the log-likelihood can be obtained, as follows:

Record the lower bound above as A(δ|λ). To push the bound further down into a form that can be optimized one coordinate at a time, i.e. to bound A(δ|λ) from below, introduce the quantity:

    f^{\#}(x, y) = \sum_{i} f_i(x, y)

Since each f_i is a binary function, f^{\#}(x, y) counts how many features are active on (x, y). Then, using Jensen's inequality, we obtain:

Take the resulting lower bound of A(δ|λ) in the expression above as B(δ|λ):

That is, B(δ|λ) is a new lower bound on the increase of the log-likelihood, which can be written as L(λ + δ) - L(λ) ≥ B(δ|λ).

Next, take the partial derivative of B(δ|λ) with respect to δ_i:
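Written out following the standard IIS derivation (compare reference 5), this partial derivative takes the form:

    \frac{\partial B(\delta|\lambda)}{\partial \delta_i}
    = \sum_{x,y} \tilde p(x, y)\, f_i(x, y) - \sum_{x} \tilde p(x) \sum_{y} P_\lambda(y|x)\, f_i(x, y)\, e^{\delta_i f^{\#}(x, y)}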

The resulting partial derivative contains δ_i only and no other variables besides δ. Setting it equal to 0, we obtain:

    \sum_{x} \tilde p(x) \sum_{y} P_\lambda(y|x)\, f_i(x, y)\, e^{\delta_i f^{\#}(x, y)} = E_{\tilde p}(f_i)

Solving this equation for δ completes the update.

It is worth mentioning that, while solving for δ, if f^{\#}(x, y) is a constant, say f^{\#}(x, y) = M for every (x, y), then δ_i has the closed form (a small code sketch of this case is given at the end of this section):

    \delta_i = \frac{1}{M} \log \frac{E_{\tilde p}(f_i)}{E_P(f_i)}

Otherwise, δ_i must be found numerically, for example with Newton's method, iterating

    \delta_i^{(k+1)} = \delta_i^{(k)} - \frac{g(\delta_i^{(k)})}{g'(\delta_i^{(k)})}

where g(δ_i) denotes the left-hand side of the equation above minus E_{\tilde p}(f_i).

Once δ is obtained, the weight vector is updated to λ + δ; finally, the converged λ is substituted back into the formula:

    P_\lambda(y|x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_{i} \lambda_i f_i(x, y) \Big)

and the optimal estimate of the maximum entropy model is obtained.
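As promised above, here is a minimal Python sketch of the constant f^{\#}(x, y) = M case (a GIS-style iteration on a tiny invented dataset; the words, features and iteration count are all made up for illustration and are not from the original notes):

    import math

    # Tiny made-up training sample: (context x, label y) pairs
    samples = [("learning", "noun"), ("learning", "noun"), ("learning", "verb"),
               ("machine", "noun"), ("machine", "noun"), ("machine", "adj")]
    labels = sorted({y for _, y in samples})
    n = len(samples)

    # Binary feature functions f_i(x, y); every (x, y) here activates at most one of them, so M = 1
    features = [
        lambda x, y: 1.0 if x == "learning" and y == "noun" else 0.0,
        lambda x, y: 1.0 if x == "machine" and y == "noun" else 0.0,
    ]
    M = 1.0
    lambdas = [0.0] * len(features)

    def p_y_given_x(x, y):
        score = lambda y_: sum(l * f(x, y_) for l, f in zip(lambdas, features))
        z = sum(math.exp(score(y_)) for y_ in labels)   # Z(x)
        return math.exp(score(y)) / z

    # Empirical expectations E_p~[f_i] over the sample
    emp = [sum(f(x, y) for x, y in samples) / n for f in features]

    for _ in range(200):   # a fixed iteration count stands in for a convergence test
        # Model expectations E_P[f_i] = sum_x p~(x) sum_y P(y|x) f_i(x, y)
        model = [sum(p_y_given_x(x, y_) * f(x, y_) for x, _ in samples for y_ in labels) / n
                 for f in features]
        # Constant-count update: delta_i = (1/M) * log(E_p~[f_i] / E_P[f_i])
        lambdas = [l + (1.0 / M) * math.log(e / m) for l, e, m in zip(lambdas, emp, model)]

    print(p_y_given_x("learning", "noun"))   # approaches the empirical frequency 2/3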

5 References
      1. Wikipedia: thermodynamic entropy: http://zh.wikipedia.org/zh-mo/%E7%86%B5; information entropy: http://zh.wikipedia.org/wiki/%E7%86%B5_(%E4%BF%A1%E6%81%AF%E8%AE%BA); Baidu Encyclopedia: http://baike.baidu.com/view/401605.htm;
      2. The sociological meaning of entropy: http://www.ruanyifeng.com/blog/2013/04/entropy.html;
      3. Shambo's maximum entropy model PPT from the Beijing October machine learning workshop: http://pan.baidu.com/s/1qWLSehI;
      4. Shambo's convex optimization PPT from the Beijing October machine learning course: http://pan.baidu.com/s/1sjhmj2d;
      5. Statistical Learning Methods, Hang Li;
      6. Maximum entropy study notes: http://blog.csdn.net/itplus/article/details/26549871;
      7. Discussion of maximum likelihood estimation on Weibo in 2013: http://weibo.com/1580904460/zfUsAgCl2?type=comment#_rnd1414644053228;
      8. Maximum likelihood estimation: http://www.cnblogs.com/liliu/archive/2010/11/22/1883702.html;
      9. Knowledge of probability theory and mathematical statistics required in data mining: http://blog.csdn.net/v_july_v/article/details/8308762;
      10. The Beauty of Mathematics, part 16: on the maximum entropy model: http://www.cnblogs.com/kevinyang/archive/2009/02/01/1381798.html.
