Conditional Random Field (CRF) - 1 - Introduction

Source: Internet
Author: User

Statement:

1. This article is my personal study summary of Li Hang's "Statistical Learning Methods" (PDF); it is not for commercial use. Reprinting is welcome, but please indicate the source (i.e. this address).

2. Because I had forgotten a lot of the required math when I started, I consulted many other materials in order to understand the content, so small parts of this post may echo other people's posts. If you are the original author of such a post, please send me a private message and I will add a link to your post below.

3. If any of the content is wrong or inaccurate, please correct me.

4. It would be great if this post could help you.

First, let's understand what a "conditional random field" is, and then explore its details.

So, let's first introduce a few terms.

Markov chain

For example: a person wants to travel from A to destination F, and along the way he must pass through B, C, D, and E, so we have these states:

To reach B, you must first pass through A;

To reach C, you must first pass through A and B;

And so on; finally,

To reach F, you must have passed through A, B, C, D, and E.

If the above states are written as a sequence, it is: {arrive at A, arrive at B, arrive at C, ..., arrive at F}, and clearly each state value in the sequence depends on whether the previous states have been satisfied.

So, a sequence like this, in which "each state value depends on the preceding finite number of states", is a Markov chain.

TIP:

The word "chain" in this name is still very image, because you can understand that a "series of light Bulbs" is a chain bar, that if you want to light the last bulb (the lamp farthest from the plug), you must let the current from the plug through all the light bulb.

So, if the above sequence of states {arrives at a, arrives at B, arrives at C, ..., and reaches every state in F} as a light bulb, then this sequence is a "chain that connects the bulbs together", such as:

but note that the Markov chain is defined as "each state value of the state sequence depends on the previous finite state", noting that it is a "finite state" and not "all states". Therefore, for the Markov chain, which contains "want to reach the destination F, then only need to reach the destination E can be, and the previous destination A, B, C, D is not OK to not arrive" this situation.

Note again: Here is a "concatenation" example for the sake of understanding, the real Markov chain is the case of "parallelism" , because it is defined as "each state value of the state sequence depends on the previous finite state", which includes "a person wants to reach the destination C, Then you have to drive from a and buy breakfast in B. In this case, first to a or first to B is irrelevant (whether it is to go to McDonald's to buy breakfast before going to the car, or the first car and then go to McDonald's to buy breakfast), as long as a and b to meet the good, such as:

In short, each element in the Markov chain can be either one-to-many or a-to-more, or many-to-many or many-to-more, with reference to:
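As a minimal illustration (not from the original text; the transition probabilities below are invented), this sketch samples such a state sequence from a first-order Markov chain, where each step depends only on the current state:

```python
import random

# Hypothetical transition probabilities for a first-order Markov chain:
# the next state depends only on the current state, not on the full history.
transitions = {
    "A": {"B": 1.0},
    "B": {"C": 0.7, "D": 0.3},
    "C": {"D": 1.0},
    "D": {"E": 0.6, "F": 0.4},
    "E": {"F": 1.0},
}

def sample_chain(start="A", end="F"):
    """Walk the chain from `start` until `end` is reached."""
    state, path = start, [start]
    while state != end:
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        state = random.choices(next_states, weights=weights)[0]
        path.append(state)
    return path

print(sample_chain())  # e.g. ['A', 'B', 'C', 'D', 'F']
```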


Hidden Markov model (HMM)

Here is only a brief description of HMM; for the details see my summary "Hidden Markov Model (HMM) - 1 ~ 4".

Let's reuse the example above of a person going from A to F.

But here we need to change the terms and the content:

Condition change:

This person wants to visit all of A, B, C, D, E, F today, but we do not know in which order he will visit them.

Content addition:

Each time this person arrives at a place, he buys you a gift, but the gifts can repeat, so in the end you receive this observation sequence: {gift 1, gift 2, gift 1, gift 3, gift 2, gift 2} (so you still do not know which place he visited first).

As a result, we do not know the state sequence (we do not know the order in which he visited the places), but we do know the observation sequence, and each observation must have been generated by some state (each gift must have been bought at a place he actually visited).

So this example describes: a process in which the state sequence (i.e. the Markov chain) is unknown, but a random sequence of observations generated from those states is known; such a process is a hidden Markov model.

The mathematical definition is:

A hidden Markov model describes the process in which a hidden Markov chain randomly generates an unobservable state sequence, and each state then generates an observation, thereby producing a random sequence of observations.
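To make this generative process concrete, here is a minimal toy sketch (not from the original; all probabilities are invented) in the spirit of the gifts example: the hidden states are places, the observations are gifts, and only the gifts are visible:

```python
import random

# Toy HMM: hidden states are the places visited, observations are the gifts.
places = ["A", "B", "C", "D", "E", "F"]

# Transition probabilities between places (uniform over the other places here).
trans = {p: {q: 1.0 / 5 for q in places if q != p} for p in places}

# Emission probabilities: which gift each place tends to produce (made up).
emit = {
    "A": {"gift1": 0.7, "gift2": 0.3},
    "B": {"gift2": 0.6, "gift3": 0.4},
    "C": {"gift1": 0.5, "gift2": 0.5},
    "D": {"gift3": 0.8, "gift1": 0.2},
    "E": {"gift2": 1.0},
    "F": {"gift2": 0.5, "gift3": 0.5},
}

def generate(length=6):
    """Generate (hidden state sequence, observation sequence) from the HMM."""
    state = random.choice(places)              # initial state, uniform here
    states, observations = [state], []
    for _ in range(length):
        gifts = list(emit[state])
        observations.append(random.choices(gifts, [emit[state][g] for g in gifts])[0])
        nxt = list(trans[state])
        state = random.choices(nxt, [trans[state][q] for q in nxt])[0]
        states.append(state)
    return states[:length], observations

hidden, observed = generate()
print("hidden places :", hidden)     # what we do NOT see
print("observed gifts:", observed)   # what we DO see
```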

OK, the definition of HMM is done; next let's look at the limitations of HMM.

The limitations of HMM

1. The model defines a joint probability and must enumerate all possible observation sequences, which is infeasible in most domains.

2. It assumes that the elements of the observation sequence are conditionally independent of each other, i.e. the observation at any moment depends only on the corresponding state in the state sequence. But most real observation sequences are formed by multiple interacting features and by long-range dependencies among the elements of the sequence.

PS: The conditional random field fixes this second limitation.

Generative models and discriminative models

If you already know HMM, you know that what HMM computes is "the joint probability of the observation sequence (input) and the state sequence (output)", i.e. P(state sequence, observation sequence), the probability that the state sequence and the observation sequence occur together.

So, for an input x (the observation sequence) and an output y (the tag sequence):

A model that constructs their joint probability distribution P(y, x) is a generative model; it can generate samples according to the joint probability. Examples: HMM, BNs, MRF.

PS: HMM is the hidden Markov model.

A model that constructs their conditional probability distribution P(y | x) is a discriminative model; since it does not model the distribution of y itself, it can only discriminate among samples. Examples: CRF, SVM, MEMM.

PS: CRF is the conditional random field discussed here.

Generative model: infinite samples → probability density model = generative model → prediction

Discriminative model: finite samples → discriminative function = predictive model → prediction

Example

Take four samples (x, y): (1, 0), (1, 0), (2, 0), (2, 1)

Generative model: compute P(x, y)

Because the pair (1, 0) appears twice among the four samples, P(1, 0) = 2/4 = 1/2.

Similarly: P(1, 1) = 0, P(2, 0) = 1/4, P(2, 1) = 1/4.

Discriminative model: compute P(y | x)

Among the four samples, whenever x = 1 we always have y = 0, so P(0 | 1) = 1.

Similarly: P(1 | 1) = 0, P(0 | 2) = 1/2, P(1 | 2) = 1/2.
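As a quick sanity check (this sketch is not part of the original article), these numbers can be reproduced by simple counting:

```python
from collections import Counter

# The four samples (x, y) from the example above.
data = [(1, 0), (1, 0), (2, 0), (2, 1)]

# Generative view: estimate the joint distribution P(x, y) from counts.
joint = Counter(data)
p_joint = {pair: count / len(data) for pair, count in joint.items()}
print(p_joint)                      # {(1, 0): 0.5, (2, 0): 0.25, (2, 1): 0.25}

# Discriminative view: estimate the conditional distribution P(y | x).
x_counts = Counter(x for x, _ in data)
p_cond = {(y, x): joint[(x, y)] / x_counts[x] for (x, y) in joint}
print(p_cond)                       # {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.5}
```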

Comparison

Generative model: models the distribution of the data from a statistical point of view; it reflects the similarity of data within the same class and does not care about the decision boundary.

Advantages :

It actually carries richer information than a discriminative model, and it handles single-class problems more flexibly than a discriminative model.

It can make fuller use of prior knowledge.

The model can be obtained by incremental learning.

Disadvantages :

The learning process is more complicated

Higher error rate on classification problems.

Discriminative model: looks for the optimal separating surface between different classes, reflecting the differences between data of different classes.

Advantages :

Classification boundaries are more flexible and more sophisticated than those obtained with purely probabilistic methods or generative models.

It can clearly distinguish among multiple classes, or distinguish one class from all the others.

It performs well in clustering and under viewpoint changes, partial occlusion and scale variations.

It is suitable for identification among many categories.

Disadvantages :

Does not reflect the characteristics of the training data itself

Its ability is limited: it can tell you whether a sample is class 1 or class 2, but it cannot describe the data as a whole.

The relationship between the two:

A discriminative model can be derived from a generative model, but not the other way around.

Conditional Random Field (CRF)

Well, we finally get to CRF. So what is a CRF?

Here's an example:

Suppose you have a set of photos of someone's daily life (the photos are in chronological order), and you are asked to label (tag) each one, for example: this one is eating, this one is sleeping, this one is singing. What would you do?

If you follow HMM's idea, it goes like this:

I already have the set of observations (the photos of the day), and I am asked for the set of states (tags) corresponding to those observations. OK, I start tagging. Hmm... this picture is dark, so it is probably sleeping; this picture is colorful, looks like a KTV, so it is singing; and what on earth is this close-up with a wide-open mouth? Eating? Singing? Well... I can't tell. Let me ask them... (turns head) eh? Where did they go? Fine, I'll just guess: singing! And so, just like that, it is pleasantly decided.

So, as the example above shows, although HMM will eventually give an answer, its flaw that "the elements of the observation sequence are mutually independent of each other" means the state it "pairs" with an observation can be rather unfounded.

But we are humans; we do not just guess blindly. In such a situation, what do we do? Everyone knows: we look at what the previous picture is. If the previous picture was taken in a KTV, then this one is very likely to be singing; if the previous picture was taken in a kitchen, then the open mouth is quite likely to be eating.

From this example we can see that, when deciding the output (the state/tag) for an input (an observation/feature), the context (the neighboring observations and labels) must be taken into account; otherwise the accuracy drops sharply.

And this "HMM-enhanced version of the idea" is the CRF.

OK, let's now compare CRF with HMM step by step to make CRF clear.

CRF and Hmm

First, let's simply translate the example above: "photos" become "words", "photo tags" become "part-of-speech tags" (noun, verb, adjective, etc.), and "tagging the photos" becomes "POS tagging" (deciding whether each word is a noun, a verb, or something else).

The section "Generative models and discriminative models" above already stated that CRF is a discriminative model. In more detail, the essence of CRF is: a Markov chain over the hidden variables (here the POS tags are the hidden variables) + conditional probabilities from the observable states to the hidden variables.

Okay, here we go.

PS: In what follows, "POS tags" are the hidden variables and "words" are the observed states.

First, the Markov chain part:

Both CRF and HMM assume that the POS tags satisfy the Markov property, i.e. the current tag has a probability transition relationship only with the previous tag and is unrelated to tags elsewhere. For example: after an adjective, the probability that the next tag is a noun is 0.5, the probability that it is a modifier particle (such as "的") is 0.5, and the probability that it is a verb is 0.

Therefore, by counting over an annotated corpus, it is easy to obtain a probability transition matrix, i.e. the probability that any POS tag A is immediately followed by any POS tag B can be estimated from counts.
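For instance, a minimal counting sketch over a tiny made-up annotated corpus (the corpus and tag names are invented for illustration) could look like this:

```python
from collections import Counter, defaultdict

# A tiny, made-up annotated corpus: each sentence is a list of (word, POS tag).
corpus = [
    [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB")],
    [("a", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
]

# Count how often tag B immediately follows tag A.
bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    tags = [tag for _, tag in sentence]
    unigrams.update(tags[:-1])
    bigrams.update(zip(tags, tags[1:]))

# Transition probability P(B | A) = count(A, B) / count(A).
transition = defaultdict(dict)
for (a, b), count in bigrams.items():
    transition[a][b] = count / unigrams[a]

print(dict(transition))
# e.g. {'DET': {'ADJ': 0.67, 'NOUN': 0.33}, 'ADJ': {'NOUN': 1.0}, 'NOUN': {'VERB': 1.0}}
```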

For HMM, this part is all there is.

But CRF adds word features on top of this two-dimensional transition matrix, for example: when A and B are adjacent, A is a verb, and the word's length exceeds 3, the probability that B is a noun is xx.

In this small example, only one word A is considered when judging B, i.e. we estimate P(B | A); that of course leaves plenty of data to estimate from. But what if several words had to be considered when judging B, e.g. P(B | ASDFGH)? Then we might hit a data-sparsity problem, because the sequence ASDFGH may simply never appear in the dataset. Note that data sparsity has an enormous effect on machine learning, so the Markov chain in CRF gives up a certain amount of global information in exchange for more sufficient data; experiments show that, for POS tagging, this trade-off pays off.

Next, the mapping probabilities between POS tags (the hidden variables) and words (the observed states):

With HMM, the approach is to enumerate all tag combinations, compute the probability that each tag combination generates the word combination, and then choose the tag combination with the highest probability.
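A rough sketch of this brute-force view (all probabilities are invented; a real HMM tagger would use the Viterbi algorithm rather than full enumeration):

```python
from itertools import product

tags = ["NOUN", "VERB"]

# Made-up HMM parameters: initial, transition and emission probabilities.
start = {"NOUN": 0.6, "VERB": 0.4}
trans = {("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2}
emit = {"NOUN": {"dog": 0.5, "barks": 0.1},
        "VERB": {"dog": 0.1, "barks": 0.6}}

def joint_prob(words, tag_seq):
    """P(tag sequence, word sequence) under the toy HMM."""
    p = start[tag_seq[0]] * emit[tag_seq[0]].get(words[0], 0.0)
    for prev, cur, w in zip(tag_seq, tag_seq[1:], words[1:]):
        p *= trans[(prev, cur)] * emit[cur].get(w, 0.0)
    return p

words = ["dog", "barks"]
# Enumerate every possible tag combination and keep the most probable one.
best = max(product(tags, repeat=len(words)), key=lambda seq: joint_prob(words, seq))
print(best, joint_prob(words, best))   # ('NOUN', 'VERB') 0.126
```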

CRF works the other way around: it mines the features of the word itself, turning the word into a k-dimensional feature vector, and then computes, for each feature, the conditional probability from that feature to the POS tag; the conditional probability of a word taking a candidate tag is then the combination (here, the average) of all its feature conditional probabilities. For example, suppose there are only two features, and P(noun | "word length > 3") is 0.9, while P(noun | "the word is at the end of the sentence") is 0.4; then for a word satisfying both features, the conditional probability of being a noun is (0.9 + 0.4) / 2 = 0.65. In this way, CRF can combine these conditional probabilities with the Markov property of the tags to find the best POS tagging sequence.
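Sticking with the author's two-feature illustration (the 0.9 and 0.4 are the made-up numbers from the text, and the simple averaging is the author's simplification rather than the real CRF formulation), a sketch:

```python
# Two hypothetical features and their conditional probabilities of indicating
# the NOUN tag (numbers taken from the illustration in the text).
feature_to_noun_prob = {
    "word_length_gt_3": 0.9,
    "word_at_sentence_end": 0.4,
}

def noun_score(word, position, sentence):
    """Average the conditional probabilities of the features the word fires."""
    fired = []
    if len(word) > 3:
        fired.append(feature_to_noun_prob["word_length_gt_3"])
    if position == len(sentence) - 1:
        fired.append(feature_to_noun_prob["word_at_sentence_end"])
    return sum(fired) / len(fired) if fired else 0.0

sentence = ["she", "reads", "newspapers"]
print(noun_score("newspapers", 2, sentence))   # (0.9 + 0.4) / 2 = 0.65
```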

That covers what CRF is; next, let's look at its internal details (see the next section).

Reference documents:

http://baike.baidu.com/link?url=MFzkgH1giyI1MVlkYHPivN_hY1nf6HsGtGqr-OaJEuYB_reXJCQGJYqUn20CnhjRj313nTWpqsl6Ie_Z5MDa3q

http://wenku.baidu.com/link?url=7lbbxikpwapnqyexmbohz4icusny6ayg3m53ls0iivkdqlq-9ypnaiw3wkj5ugihjwkmm4ytpahiieu75bb_mm_q1qicaligroiwhuo8ktu

http://blog.sina.com.cn/s/blog_6d15445f0100n1vm.html

http://blog.sina.com.cn/s/blog_605f5b4f010109z3.html

http://lhdgriver.gotoip1.com/%E6%9D%A1%E4%BB%B6%E9%9A%8F%E6%9C%BA%E5%9C%BA%E7%AE%80%E4%BB%8bintroduction-to-conditional-random-fields/

http://blog.csdn.net/heavendai/article/details/7228621

http://wenku.baidu.com/link?url=kBOg_LBYQDm8ftgIT5xm8rmFC1NN247Ubhp7t_Lngjbjbifwgqzoffnzmbkq5lpeltdjo0fi0mf8vrimin7jtsqwuyhyyjkwgav3kj-f9fy
