MIT Natural Language Processing V: Maximum Entropy and Log-Linear Models (Part I)

Natural Language Processing: Maximum Entropy and Log-Linear Models

Author: Regina Barzilay (MIT EECS Department, October 1, 2004)

Translator: 52nlp (www.52nlp.cn, April 25, 2009)

Review of the previous lecture:

* Transformation-based tagger

* HMM-based tagger

Leftovers from last time:

a) Part-of-speech (POS) tag distribution

i. Number of word types in the Brown corpus, by degree of ambiguity:

Unambiguous (1 tag): 35,340

Ambiguous (2-7 tags): 4,100

2 tags: 3,764

3 tags: (count illegible)

4 tags: (count illegible)

5 tags: (count illegible)

6 tags: 2

7 tags: 1

b) Unsupervised TBL

i. Initialization: a list of allowable part-of-speech tags

ii. Transformations: change the tag of a word from χ to Y in context C, where Y ∈ χ.

Example: "From NN VBP to VBP if the previous tag is NNS"

iii. Scoring criterion:

Today's main content:

* Maximum entropy models

* Connection to log-linear models

* Optimization methods

General problem:

a) We have some input domain X;

b) We have some label set Y;

c) Goal: learn a conditional probability P(y|x) for any x ∈ X and y ∈ Y.

One, POS tagging:

a) Example: our/PRP$ enemies/NNS are/VBP innovative/JJ and/CC resourceful/JJ ,/, and/CC so/RB are/VBP we/PRP ?/.

i. Input domain: X is the set of possible "histories";

ii. Label set: Y is the set of all possible tags;

iii. Goal: learn a conditional probability P(tag|history).

b) Representation:

i. A "history" is a 4-tuple (t1, t2, w[1:n], i);

ii. t1, t2 are the previous two tags;

iii. w[1:n] are the n words in the input sentence;

iv. i is the index of the word being tagged;

X is the set of all possible histories.

To be continued: Part II

Attached: MIT page for the course, with courseware PDF downloads:

http://people.csail.mit.edu/regina/6881/

**MIT Natural Language Processing V: Maximum Entropy and Log-Linear Models (Part II)**


Translator: 52nlp (www.52nlp.cn, April 29, 2009)

One, POS tagging (continued):

c) Feature vector representation

i. A feature is a function f:

ii. We have m features f_k, for k = 1…m

d) POS representation

i. Word/tag features for all word/tag pairs:

ii. Spelling features for all prefixes/suffixes of a certain length:

iii. Contextual features:

iv. For a given history x ∈ X, every label in Y is mapped to a different feature vector:

v. Goal: learn a conditional probability P(tag|history)
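As a minimal sketch of this representation (my own Python, not the lecture's code; the feature names are invented for illustration), a history can be held as a 4-tuple, and each (history, tag) pair maps to a set of active binary features:

```python
from collections import namedtuple

# A "history" is the 4-tuple (t1, t2, w[1:n], i) described above.
History = namedtuple("History", ["t1", "t2", "words", "i"])

def features(h, tag):
    """Map a (history, tag) pair to its active binary features,
    represented sparsely as a set of feature names."""
    word = h.words[h.i]
    return {
        f"word={word}+tag={tag}",              # word/tag feature
        f"suffix3={word[-3:]}+tag={tag}",      # spelling feature (suffix)
        f"prev_tags={h.t1},{h.t2}+tag={tag}",  # contextual feature
    }

# Tagging "are" (index 2) in the example sentence; t1, t2 are the
# previous two tags.
words = "our enemies are innovative and resourceful".split()
h = History(t1="NNS", t2="PRP$", words=words, i=2)
print(sorted(features(h, "VBP")))
```

Note how the same history paired with a different tag (say NN) yields a different feature vector, which is exactly point iv above.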

Two, maximum entropy:

a) Motivating example:

i. Estimate the probability distribution P(a, b), given the constraint P(x, 0) + P(y, 0) = 0.6, where a ∈ {x, y} and b ∈ {0, 1}:

ii. One way to satisfy the constraints:

iii. Another way to satisfy the constraints:
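To make the example concrete, here is a brute-force sketch (my own code, not from the lecture): among all distributions over {x, y} × {0, 1} that satisfy P(x, 0) + P(y, 0) = 0.6, the entropy-maximizing one spreads probability evenly within each constrained group:

```python
# Brute-force the motivating example: a in {x, y}, b in {0, 1},
# constraint P(x,0) + P(y,0) = 0.6. Among all distributions that
# satisfy it, find the one with maximum entropy.
from math import log

def entropy(ps):
    return -sum(p * log(p) for p in ps if p > 0)

best, best_h = None, -1.0
steps = 200
for i in range(steps + 1):
    p_x0 = 0.6 * i / steps          # P(x,0); then P(y,0) = 0.6 - P(x,0)
    for j in range(steps + 1):
        p_x1 = 0.4 * j / steps      # P(x,1); then P(y,1) = 0.4 - P(x,1)
        dist = (p_x0, 0.6 - p_x0, p_x1, 0.4 - p_x1)
        h = entropy(dist)
        if h > best_h:
            best, best_h = dist, h

print(best)
```

The search returns (0.3, 0.3, 0.2, 0.2) for (P(x,0), P(y,0), P(x,1), P(y,1)): the constrained mass 0.6 is split evenly, and so is the remaining 0.4.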

b) Maximum entropy modeling

i. Given a set of training examples, we wish to find a distribution which:

1. Satisfies the input constraints

2. Maximizes the uncertainty (entropy)

ii. Supplement (translator's note):

The maximum entropy principle was proposed by E. T. Jaynes in 1957. Its main idea is that, when we have only partial knowledge about an unknown distribution, we should choose the probability distribution that agrees with that knowledge but otherwise has maximum entropy, because in this situation more than one probability distribution may be consistent with what is known. Entropy is in fact the uncertainty of a random variable: when entropy is maximal, the random variable is most uncertain, in other words most random, and its behavior is hardest to predict accurately. In this sense, the essence of the maximum entropy principle is that, given what is known, the most reasonable inference about the unknown distribution is the one that is most uncertain, or most random, while still agreeing with the known facts. This is the only unbiased choice we can make; any other choice implicitly adds constraints and assumptions that cannot be justified by the information we actually hold. (From "Maximum Entropy Models for Natural Language Processing" by Professor Baobao Chang of Peking University)

To be continued: Part III


**MIT Natural Language Processing V: Maximum Entropy and Log-Linear Models (Part III)**


Translator: 52nlp (www.52nlp.cn, May 5, 2009)

Two, maximum entropy (continued):

b) Maximum entropy modeling (continued)

iii. Constraints:

The observed expectation of each feature has to be the same as the model's expectation of the feature:

E_{p'}[f_j] = E_p[f_j], that is, Σ_x p'(x) f_j(x) = Σ_x p(x) f_j(x)

iv. The Principle of Maximum Entropy:

Treat the known facts as constraints, and take as the correct probability distribution the one that maximizes entropy:

v. Supplement (translator's note):

Many problems in natural language processing can be cast as statistical classification problems, and many machine learning methods apply here. In natural language processing, statistical classification usually means estimating the probability p(a, b) that a class a and a context b co-occur; in different problems, the classes and the contexts mean different things. In part-of-speech tagging, a class is a tag from the tag set, while the context is, for example, the words and tags preceding the word currently being processed, or the words and tags following it. In general, a context may consist of words, POS tags, earlier decisions in the history, and so on. Large-scale corpora usually contain co-occurrence information about a and b, but b is often sparse in the corpus: no corpus is ever large enough to estimate a reliable p(a, b) directly for every possible pair (a, b). The problem, then, is to find a method that reliably estimates p(a, b) under sparse-data conditions; different approaches estimate it in different ways.

The advantage of the maximum entropy model is that, when modeling, one only has to focus on selecting features, without spending effort on how those features are used. Feature selection is very flexible: many different types of features can be used, and features are easy to change. With maximum entropy modeling, it is generally unnecessary to make the independence assumptions commonly required by other modeling methods, and parameter smoothing can be handled through feature selection rather than by a separate conventional smoothing algorithm, although using a classic smoothing algorithm is of course not ruled out. The contribution of each feature to the probability distribution is determined by its parameter α, which can be obtained by iterative training algorithms.

(Note: the two paragraphs above are from "Maximum Entropy Models for Natural Language Processing" by Professor Baobao Chang of Peking University)

Three, the maximum entropy model in detail

a) Outline

i. We will first show that the distribution p* satisfying the conditions above has the following form:

p*(x) = π ∏_{j=1}^{k} α_j^{f_j(x)}

where π is a normalization constant and the α's are the model parameters

ii. Then, we'll consider an estimation procedure for finding the α's

b) Notation

i. X is the set of possible histories

ii. Y is the set of all possible tags

iii. S is a finite training sample of events

iv. p'(x) is the observed probability of x in S

v. p(x) is the model's probability of x

vi. The remaining notation: P is the set of distributions satisfying the feature-expectation constraints, P = {p : E_p[f_j] = E_{p'}[f_j] for j = 1…k}; Q is the set of log-linear distributions q(x) = π ∏_{j=1}^{k} α_j^{f_j(x)}; H(p) = -Σ_x p(x) log p(x) is the entropy of p; L(q) = Σ_x p'(x) log q(x) is the log-likelihood of the sample under q

To be continued: Part IV


**MIT Natural Language Processing V: Maximum Entropy and Log-Linear Models (Part IV)**


Translator: 52nlp (www.52nlp.cn, May 9, 2009)

Three, the maximum entropy model in detail (continued)

c) Relative entropy (Kullback-Leibler divergence)

i. Definition: the relative entropy D between two probability distributions p and q is given by:

D(p, q) = Σ_x p(x) log (p(x) / q(x))

ii. Lemma 1: for any two probability distributions p and q, D(p, q) ≥ 0, and D(p, q) = 0 if and only if p = q

iii. Lemma 2 (Pythagorean property): if p ∈ P, q ∈ Q, and p* ∈ P ∩ Q, then D(p, q) = D(p, p*) + D(p*, q)

Note: for the proofs, please refer to the English lecture notes (Lec5.pdf) of MIT NLP.
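Lemma 1 is also easy to probe numerically; here is a small sketch (my own code, not from the lecture) that checks non-negativity of D(p, q) on random distribution pairs:

```python
# Numeric illustration of Lemma 1: D(p, q) >= 0, with equality when
# p = q, for the relative entropy D(p, q) = sum_x p(x) log(p(x)/q(x)).
import random
from math import log

def kl(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n):
    ws = [random.random() for _ in range(n)]
    s = sum(ws)
    return [w / s for w in ws]

random.seed(0)
for _ in range(1000):
    p, q = random_dist(4), random_dist(4)
    assert kl(p, q) >= 0.0          # Lemma 1: non-negativity
    assert abs(kl(p, p)) < 1e-12    # equality when p = q
print("Lemma 1 holds on 1000 random distribution pairs")
```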

d) The maximum entropy solution

i. Theorem 1: if p* ∈ P ∩ Q, then p* = argmax_{p ∈ P} H(p); furthermore, p* is unique

Note: see the original MIT NLP lecture notes for the proof, which mainly uses Lemma 1 and Lemma 2.

e) The maximum likelihood solution

i. Theorem 2: if p* ∈ P ∩ Q, then p* = argmax_{q ∈ Q} L(q); furthermore, p* is unique

Note: see the original MIT NLP lecture notes for the proof, which mainly uses Lemma 1 and Lemma 2.

f) The duality theorem

i. There is a unique distribution p* such that:

1. p* ∈ P ∩ Q

2. p* = argmax_{p ∈ P} H(p) (the maximum entropy solution)

3. p* = argmax_{q ∈ Q} L(q) (the maximum likelihood solution)

ii. Implications:

1. The maximum entropy solution can be written in log-linear form

2. Finding the maximum-likelihood solution also gives the maximum entropy solution
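As a concrete check of implication 1 (my own sketch; the single feature f and the values π = 0.2 and α = 1.5 are chosen by hand for this toy case), the max-ent solution of the Part II motivating example, P(x,0) = P(y,0) = 0.3 and P(x,1) = P(y,1) = 0.2, is itself a log-linear model:

```python
# Check that the maximum entropy solution of the Part II example can be
# written in log-linear form p(a, b) = pi * alpha**f(a, b), using the
# single feature f(a, b) = 1 if b == 0 else 0.
maxent = {("x", 0): 0.3, ("y", 0): 0.3, ("x", 1): 0.2, ("y", 1): 0.2}

pi, alpha = 0.2, 1.5   # normalizer and one model parameter

def f(a, b):
    return 1 if b == 0 else 0

loglinear = {(a, b): pi * alpha ** f(a, b) for (a, b) in maxent}
assert all(abs(loglinear[e] - maxent[e]) < 1e-9 for e in maxent)
print("max-ent solution equals the log-linear model:", loglinear)
```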

To be continued: Part V


**MIT Natural Language Processing V: Maximum Entropy and Log-Linear Models (Part V)**


Translator: 52nlp (www.52nlp.cn, May 14, 2009)

Three, the maximum entropy model in detail (continued)

g) The GIS algorithm (Generalized Iterative Scaling)

i. Background:

The earliest training method for maximum entropy models is an iterative algorithm called generalized iterative scaling (GIS). The principle of GIS is not complicated and can be summarized in the following steps:

1. Assume the initial model of the 0th iteration is the uniform distribution.

2. Use the model of the Nth iteration to estimate the distribution of each feature in the training data; if a feature's expectation exceeds the observed value, make the corresponding model parameter smaller; otherwise, make it larger.

3. Repeat step 2 until convergence.

GIS was first proposed by Darroch and Ratcliff in the 1970s. However, the two did not explain the physical meaning of the algorithm well; that was later done by the mathematician Csiszár, so when people talk about the algorithm they usually cite both the Darroch and Ratcliff paper and Csiszár's. Each iteration of GIS takes a long time, many iterations are needed to converge, and the algorithm is not very stable; it can even overflow on a 64-bit computer. As a result, few people use GIS in practical applications; its main value is in helping us understand how maximum entropy models are trained.

In the 1980s, the talented Della Pietra twins made two improvements to the GIS algorithm at IBM and proposed the improved iterative scaling (IIS) algorithm, which reduced the training time of maximum entropy models by one to two orders of magnitude. Only then did maximum entropy models become practical; even so, at the time only IBM had the resources to use them. (From "The Beauty of Mathematics, Part 16" by Jun Wu of Google)

ii. Goal: find a distribution of the form p(x) = π ∏_{j=1}^{k} α_j^{f_j(x)} that obeys the following constraints:

E_p[f_j] = E_{p'}[f_j]

iii. GIS constraints:

1. Σ_{j=1}^{k} f_j(x) = C for all x, where C is a constant (achieved by adding a correctional feature f_{k+1}(x) = C - Σ_{j=1}^{k} f_j(x))

2. Each feature is non-negative: f_j(x) ≥ 0

iv. Theorem: the following procedure will converge to p* ∈ P ∩ Q:

α_j^{(n+1)} = α_j^{(n)} · (E_{p'}[f_j] / E_{p^{(n)}}[f_j])^{1/C}
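A minimal runnable sketch of the procedure (my own Python, not the lecture's code; the toy problem is the motivating example from Part II, with one feature plus its correctional feature, so C = 1):

```python
# Generalized iterative scaling (GIS) on the toy example from Part II:
# events (a, b), one feature f1(a, b) = [b == 0] with observed
# expectation 0.6, plus the correctional feature so features sum to C.
from math import prod

events = [(a, b) for a in ("x", "y") for b in (0, 1)]
feats = [
    lambda a, b: 1 if b == 0 else 0,   # f1
    lambda a, b: 1 if b == 1 else 0,   # correction feature: C - f1, C = 1
]
C = 1
observed = [0.6, 0.4]                  # E_p'[f_j], matching the constraint

alphas = [1.0, 1.0]                    # iteration 0: uniform model
for _ in range(200):
    # current model p(e) = pi * prod_j alpha_j ** f_j(e)
    weights = {e: prod(a ** f(*e) for a, f in zip(alphas, feats)) for e in events}
    z = sum(weights.values())
    p = {e: w / z for e, w in weights.items()}
    # model expectation of each feature, then the GIS update
    expected = [sum(p[e] * f(*e) for e in events) for f in feats]
    alphas = [a * (obs / exp) ** (1 / C)
              for a, obs, exp in zip(alphas, observed, expected)]

print({e: round(p[e], 3) for e in events})  # converges to the max-ent solution
```

The update α_j ← α_j · (E_{p'}[f_j] / E_p[f_j])^{1/C} settles within a couple of iterations at the max-ent solution of the toy problem, P(·, 0) = 0.3 and P(·, 1) = 0.2, which is log-linear by construction, as the duality theorem promises.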

v. Computation:

where S = {(a_1, b_1), …, (a_n, b_n)} is a training sample

Because there are too many possible pairs (a, b), the model expectation is approximated, to reduce the amount of computation, by the following formula:

E_p[f_j] ≈ Σ_{i=1}^{n} p'(a_i) Σ_{b ∈ Y} p(b | a_i) f_j(a_i, b)

Running time of one iteration: O(NPA)

where N is the training set size, P is the number of predictions, and A is the average number of features that are active for a given event (a, b)

Four, maximum entropy classifiers (ME classifiers)

a) Can handle lots of features

b) Sparsity is an issue

i. Apply smoothing and feature selection

c) Feature interaction.