An easy-to-understand introduction to conditional random fields (CRF)

The best way to understand conditional random fields is through a realistic example. But Chinese articles on CRFs written that way are rare; perhaps the authors are experts who disdain examples. So I translated this article, in the hope that it helps other learners.
The original is here [http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/]

Readers who prefer English can go straight to the original. I did not stick rigidly to the original text; in many places I added my own understanding, so in academic terms this is a free translation. (Voice-over: enough preamble, get on with it.) OK, the translation starts below.

Suppose you have many photos of Xiao Ming taken at different times of day, covering every period from pulling on his trousers in the morning to taking them off at night (Xiao Ming is a photo addict). The task is to classify these photos. For example, some photos show him eating, so they get an "eating" label; some were taken while running, so they get a "running" label; some were taken in a meeting, so they get a "meeting" label. Here's the question: how would you do it?

A simple, intuitive approach is to ignore the chronological order of the photos and train a multiclass classifier: use some well-labeled photos as training data and train a model that classifies each photo directly from its features. For example, if a photo was taken at 6:00 a.m. and the picture is dark, give it a "sleeping" label; if a photo contains a car, give it a "driving" label.

Is this feasible?

At first glance, yes. But in fact, because we ignore an important piece of information, the chronological order of the photos, this classifier is flawed. For example, take a photo in which Xiao Ming's mouth is closed: how should it be classified? It is hard to judge directly; we need to look at the photo taken just before it. If the previous photo shows Xiao Ming eating, then this closed-mouth photo probably shows him chewing and about to swallow, so we can give it an "eating" label. If the previous photo shows Xiao Ming singing, then this closed-mouth photo is probably a snapshot taken mid-song, so we can give it a "singing" label.

So, for our classifier to perform better, when classifying a photo we must take into account the label information of the photos adjacent to it. This is exactly what conditional random fields (CRFs) do.

Starting from an example: the POS tagging problem

What is the POS tagging problem?

It is very simple: give each word in a sentence a part of speech. For example, for "Bob drank coffee at Starbucks", the tagging result would be: "Bob (noun) drank (verb) coffee (noun) at (preposition) Starbucks (noun)".

Below, we solve this problem with a conditional random field.

Take the sentence above as an example. It has 5 words; we treat (noun, verb, noun, preposition, noun) as one labeling sequence, called l. There are many candidate labeling sequences; for example, l could also be (noun, verb, verb, preposition, noun). Among all these candidates, we want to pick out the most plausible labeling sequence for this sentence.

How do we judge whether a labeling sequence is plausible?

Comparing the two labeling sequences shown above, the second is obviously worse than the first, because it tags both the second and third words as verbs, and a verb followed directly by another verb usually does not appear in a sentence.

Suppose we give every labeling sequence a score, with higher scores meaning more plausible sequences. Then we can at least say: any labeling sequence in which a verb is immediately followed by another verb should receive a negative score for that pattern.

The rule "a verb following a verb scores negatively" can be written as a feature function. We can define a whole set of such feature functions and use the set to score a labeling sequence, then pick the most plausible one. That is, each feature function assigns the labeling sequence a score, and adding up the scores from all the feature functions in the set gives the sequence's final score.

Defining the feature functions of a CRF

Now we formally define the feature functions of a CRF. A feature function is a function that accepts four parameters:
- the sentence s (the sentence we want to tag with parts of speech);
- i, the position of a word in the sentence s;
- l_i, the label that the candidate labeling sequence assigns to the i-th word;
- l_{i-1}, the label that the candidate labeling sequence assigns to the (i-1)-th word.

Its output value is 0 or 1: 0 means the candidate labeling sequence does not match this feature, and 1 means it does.
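
To make this concrete, here is a minimal Python sketch of such a feature function (the function name and tag strings are my own illustrative choices, not from the original article):

```python
# A minimal sketch of a CRF feature function (names are illustrative).
# It takes the sentence s, the position i, the label l_i of the current
# word, and the label l_prev of the previous word, and returns 0 or 1.

def verb_after_verb(s, i, l_i, l_prev):
    """Fires when a verb directly follows another verb -- the pattern
    we said should be penalized."""
    return 1 if l_i == "verb" and l_prev == "verb" else 0

sentence = ["Bob", "drank", "coffee", "at", "Starbucks"]
labels = ["noun", "verb", "verb", "preposition", "noun"]

# Evaluate the feature at position 2 ("coffee" wrongly tagged as a verb):
print(verb_after_verb(sentence, 2, labels[2], labels[1]))  # -> 1
```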

Note: here our feature functions judge a labeling sequence using only the label of the current word and the label of the word immediately before it. A CRF restricted in this way is called a linear-chain CRF, the simplest kind of CRF. For simplicity, this article considers only linear-chain CRFs.

From feature functions to probabilities

Having defined a set of feature functions, we assign each feature function f_j a weight λ_j. Now, given a sentence s and a labeling sequence l, we can use the feature functions defined earlier to score l:

score(l|s) = Σ_j Σ_i λ_j · f_j(s, i, l_i, l_{i-1})

There are two summations in the formula above: the outer sum over j adds up the score from each feature function f_j, and the inner sum over i adds up the feature values at every position i in the sentence.

By exponentiating and normalizing this score, we obtain a probability value p(l|s) for the labeling sequence l:

p(l|s) = exp(score(l|s)) / Σ_{l'} exp(score(l'|s))
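
As a rough sketch of these two formulas (my own toy code, not from the article; the brute-force normalization is only feasible for very short sentences), the scoring and probability computation might look like this:

```python
import math
from itertools import product

def score(s, labels, feature_functions, weights):
    # Outer sum over feature functions j, inner sum over positions i.
    total = 0.0
    for f, w in zip(feature_functions, weights):
        for i in range(len(s)):
            l_prev = labels[i - 1] if i > 0 else "<start>"
            total += w * f(s, i, labels[i], l_prev)
    return total

def prob(s, labels, tagset, feature_functions, weights):
    # p(l|s) = exp(score(l|s)) / sum over all l' of exp(score(l'|s)).
    z = sum(math.exp(score(s, list(l2), feature_functions, weights))
            for l2 in product(tagset, repeat=len(s)))
    return math.exp(score(s, labels, feature_functions, weights)) / z

# Toy usage: the verb-after-verb feature from the earlier sketch,
# given a negative weight to penalize verb-verb pairs.
def verb_after_verb(s, i, l_i, l_prev):
    return 1 if l_i == "verb" and l_prev == "verb" else 0

s = ["Bob", "drank", "coffee", "at", "Starbucks"]
good = ["noun", "verb", "noun", "preposition", "noun"]
bad = ["noun", "verb", "verb", "preposition", "noun"]
tagset = ["noun", "verb", "preposition"]
p_good = prob(s, good, tagset, [verb_after_verb], [-2.0])
p_bad = prob(s, bad, tagset, [verb_after_verb], [-2.0])
print(p_good > p_bad)  # -> True: the verb-verb sequence is less probable
```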

A few example feature functions

We already gave one example of a feature function above; now let's look at a few concrete examples to build intuition.

f1: when l_i is "adverb" and the i-th word ends in "ly", f1 = 1; in all other cases f1 = 0. It is not hard to see that the weight λ1 of f1 should be positive, and the larger λ1 is, the more we prefer labeling sequences that tag words ending in "ly" as adverbs.

f2: if i = 1, l_i is "verb", and the sentence s ends with a question mark, then f2 = 1; in all other cases f2 = 0. Similarly, λ2 should be positive, and the larger λ2 is, the more we prefer labeling sequences that tag the first word of a question as a "verb".

f3: when l_{i-1} is a preposition and l_i is a noun, f3 = 1; otherwise f3 = 0. λ3 should also be positive, and the larger λ3 is, the more we believe that a preposition should be followed by a noun.

f4: if both l_i and l_{i-1} are prepositions, then f4 = 1; otherwise f4 = 0. Here λ4 should be negative, and the larger the absolute value of λ4 is, the more strongly we reject labeling sequences in which one preposition follows another.
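
Written as code, the four example features might look like this (a sketch; the tag strings are mine, and code positions are 0-indexed, so the text's i = 1 becomes i == 0):

```python
# Sketch of the four example feature functions described above.
# Tag strings are illustrative; positions are 0-indexed in code.

def f1(s, i, l_i, l_prev):
    # The i-th word ends in "ly" and is tagged as an adverb.
    return 1 if l_i == "adverb" and s[i].endswith("ly") else 0

def f2(s, i, l_i, l_prev):
    # The first word of a question (sentence ending in "?") is a verb.
    return 1 if i == 0 and l_i == "verb" and s[-1].endswith("?") else 0

def f3(s, i, l_i, l_prev):
    # A noun following a preposition.
    return 1 if l_prev == "preposition" and l_i == "noun" else 0

def f4(s, i, l_i, l_prev):
    # A preposition following a preposition (to be penalized).
    return 1 if l_prev == "preposition" and l_i == "preposition" else 0

# Plausible signs for the weights, following the discussion above:
# positive for f1, f2, f3; negative for f4. Values are invented.
weights = [2.0, 1.5, 1.0, -3.0]
```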

Good, a conditional random field is now built. Let's summarize:
To build a conditional random field, we first define a set of feature functions, each of which takes as input the whole sentence s, the current position i, and the labels at positions i and i-1. We then give each feature function a weight. For every labeling sequence l, we compute the weighted sum over all feature functions, and if needed we can convert this sum into a probability value.

CRF versus logistic regression

Look again at the formula:

p(l|s) = exp(Σ_j Σ_i λ_j · f_j(s, i, l_i, l_{i-1})) / Σ_{l'} exp(Σ_j Σ_i λ_j · f_j(s, i, l'_i, l'_{i-1}))

Doesn't it have a bit of the flavor of logistic regression?
In fact, a conditional random field is a sequence version of logistic regression: logistic regression is a log-linear model for classification, while a conditional random field is a log-linear model for sequence labeling.

CRF versus HMM

The POS tagging problem can also be solved with an HMM. The HMM's approach is generative: given the sentence s to be tagged, it judges the probability of generating a labeling sequence l, as follows:

p(l, s) = p(l_1) · Π_i p(l_i | l_{i-1}) · p(w_i | l_i)

Here:
p(l_i | l_{i-1}) is the transition probability. For example, if l_{i-1} is a preposition and l_i is a noun, then this p is the probability that the word after a preposition is a noun.
p(w_i | l_i) is the emission probability. For example, if l_i is "noun" and w_i is the word "ball", then this p is the probability of emitting the word "ball" while in the noun state.
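
As a toy sketch of this generative computation (all probability values below are invented purely for illustration):

```python
# Sketch of the HMM joint probability
#   p(l, s) = p(l_1) * prod over i of p(l_i | l_{i-1}) * p(w_i | l_i).
# All probability tables are invented toy values.

start = {"noun": 0.6, "verb": 0.2, "preposition": 0.2}
trans = {("noun", "verb"): 0.5, ("verb", "noun"): 0.4,
         ("noun", "preposition"): 0.3, ("preposition", "noun"): 0.8}
emit = {("noun", "Bob"): 0.1, ("verb", "drank"): 0.2,
        ("noun", "coffee"): 0.1, ("preposition", "at"): 0.5,
        ("noun", "Starbucks"): 0.05}

def hmm_joint_prob(words, labels):
    # Start with p(l_1) * p(w_1 | l_1), then multiply in the transition
    # and emission probabilities at each later position.
    p = start.get(labels[0], 0.0) * emit.get((labels[0], words[0]), 0.0)
    for i in range(1, len(words)):
        p *= trans.get((labels[i - 1], labels[i]), 0.0)  # p(l_i | l_{i-1})
        p *= emit.get((labels[i], words[i]), 0.0)        # p(w_i | l_i)
    return p

words = ["Bob", "drank", "coffee", "at", "Starbucks"]
labels = ["noun", "verb", "noun", "preposition", "noun"]
print(hmm_joint_prob(words, labels))
```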

So how does the HMM compare with the CRF?
The answer: the CRF is more powerful than the HMM. It can solve every problem the HMM can solve, plus many problems the HMM cannot. In fact, we can take the logarithm of the HMM model above and turn it into this:

log p(l, s) = log p(l_1) + Σ_i log p(l_i | l_{i-1}) + Σ_i log p(w_i | l_i)

Compare this equation with the CRF scoring formula:

score(l|s) = Σ_j Σ_i λ_j · f_j(s, i, l_i, l_{i-1})

It is not hard to see that if we regard each log-form probability in the HMM equation as the weight of a feature function in the CRF formula, then the two have exactly the same form.

In other words, we can construct a CRF that is identical to the logarithmic form of the HMM. How do we construct it?

For each transition probability p(l_i = y | l_{i-1} = x) in the HMM, we can define a feature function:

f_{x,y}(s, i, l_i, l_{i-1}) = 1 when l_i = y and l_{i-1} = x, and 0 otherwise.

The weight of this feature function is:

w_{x,y} = log p(l_i = y | l_{i-1} = x)

Similarly, for each emission probability in the HMM we can define a corresponding feature function and set that feature function's weight to the log of the emission probability.

With feature functions and weights of this form, the CRF's p(l|s) is essentially the same as the logarithmic HMM model.
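
Here is a minimal sketch of that construction (my own illustration, with toy probability tables): one indicator feature per transition and per emission, each weighted by the corresponding log probability.

```python
import math

# Sketch: turn an HMM's probability tables into CRF feature functions
# plus weights, following the recipe above. Tables are toy values.

trans = {("preposition", "noun"): 0.8, ("noun", "verb"): 0.5}
emit = {("noun", "ball"): 0.1, ("verb", "drank"): 0.2}

features, weights = [], []

# One transition feature per pair (x, y): fires when l_{i-1}=x and l_i=y.
for (x, y), p in trans.items():
    def f(s, i, l_i, l_prev, x=x, y=y):  # default args bind x, y per feature
        return 1 if l_prev == x and l_i == y else 0
    features.append(f)
    weights.append(math.log(p))  # weight = log transition probability

# One emission feature per pair (y, w): fires when l_i=y and s[i] is w.
for (y, w), p in emit.items():
    def g(s, i, l_i, l_prev, y=y, w=w):
        return 1 if l_i == y and s[i] == w else 0
    features.append(g)
    weights.append(math.log(p))  # weight = log emission probability

# Summing weight * feature over all positions now reproduces the
# log-form HMM score (up to the start-probability term).
```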

In a word, the relationship between the HMM and the CRF is this:
Every HMM model is equivalent to some CRF
Every HMM model is equivalent to some CRF
Every HMM model is equivalent to some CRF

However, the CRF is more powerful than the HMM, mainly for two reasons:

1. The CRF can define far more, and far richer, kinds of feature functions. The HMM model is inherently local: in an HMM, the current word depends only on the current label, and the current label depends only on the previous label. This locality restricts the HMM to feature functions of exactly the kind we constructed above. The CRF, however, can look at the whole sentence s and define more global feature functions, such as this one from earlier:

f2: if i = 1, l_i is "verb", and the sentence s ends with a question mark, then f2 = 1; in all other cases f2 = 0.

2. The CRF can use weights with arbitrary values. When we view the logarithmic HMM model as a CRF, each feature function's weight is a log-form probability, so every weight is less than or equal to 0; moreover, the probabilities must satisfy the corresponding constraints, such as

Σ_y p(l_i = y | l_{i-1} = x) = 1

In a CRF, by contrast, the weight of each feature function can take any value, without these restrictions.

Author: milter
Links: https://www.jianshu.com/p/55755fc649b1
Source: jianshu

Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
