Prelude: Stepping into Conditional Random Fields
Bai Ningsu
August 2, 2016 13:59:46
"Abstract": the condition with the airport for sequence labeling, data segmentation and other natural language processing, showing a good effect. In Chinese word segmentation, Chinese name recognition and ambiguity resolution and other tasks have been applied. This paper is based on the understanding of the condition with the airport and the application of natural language processing in the process of tagging the sentence recognition sequence. Written mainly from natural language processing, machine learning, statistical learning methods and some of the online data on the introduction of CRF related related, and finally a large number of research and consolidation into the system knowledge. The article is arranged as follows: The first section introduces the basic statistical knowledge related to CRF, the second section introduces the CRF introduction based on the natural language angle, the third section introduces the CRF based on machine learning, the fourth section introduces the relevant knowledge based on the statistical learning angle, and the fifth section introduces the CRF in the depth of statistical learning. ( This article original, reproduced please specify the source : Strolling conditions with the Airport series article. )
Contents
"Natural language Processing: Walking conditions with the Airport series article (i)": Foreplay : Go into the condition with the airport
"Natural language Processing: Walking conditions with Airport series article (ii)": talking about CRF based on natural language processing
"Natural language Processing: Walking conditions with Airport series article (iii)": talking about CRF based on machine learning perspective
"Natural language Processing: Walking conditions with Airport series article (iv)": talking about CRF based on machine learning perspective
"Natural language Processing: Strolling conditions with the Airport series article (v)": conditional with the airport knowledge expansion
1 Generative and discriminative models in machine learning. What are generative and discriminative models, and which kind of model is the conditional random field?
Supervised machine learning methods can be divided into generative methods and discriminative methods (a small code sketch contrasting the two follows the pros-and-cons lists below):
1) Generative models: model the joint distribution directly, for example: mixture Gaussian models, hidden Markov models, Markov random fields, and so on.
2) Discriminative models: model the conditional distribution, for example: conditional random fields, support vector machines, logistic regression.
Advantages and disadvantages of generative models:
Advantages:
1) Because the joint distribution is modeled, one can not only compute the conditional distribution from the joint distribution (the converse does not hold) but also obtain other information. For example, if the marginal probability of an input sample is small, one may suspect that the learned model is not well suited to classifying that sample, and the classification result may not be good.
2) Generative models converge faster; that is, as the number of samples grows, the model converges to the true model more quickly.
3) Generative models can cope with latent variables; for example, the mixture Gaussian model is a generative method with latent variables.
Disadvantages:
1) There is no free lunch: the joint distribution provides more information, but it also needs more samples and more computation. In particular, estimating the class-conditional distributions accurately enough for classification requires more samples, and much of the information in those class-conditional probabilities is never used for classification, so if we only need to do a classification task, this wastes computing resources.
2) In addition, in practice discriminative models perform better in most cases.
Advantages and disadvantages of discriminative models:
Advantages:
1) Corresponding to the disadvantages of the generative model: first, it saves computing resources; in addition, the number of samples required is smaller than for the generative model.
2) The accuracy is often higher than that of the generative model.
3) Because it learns directly without having to solve for the class-conditional probabilities, it allows us to abstract the input (for example, dimensionality reduction or feature construction), thereby simplifying the learning problem.
Disadvantages:
1) It lacks the above-mentioned advantages of the generative model.
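To make the contrast concrete, here is a minimal sketch (not from the original series) that fits one generative and one discriminative classifier on the same toy data with scikit-learn; the dataset and parameter choices are illustrative assumptions, and on small problems either model may come out ahead.

```python
# A minimal sketch contrasting a generative and a discriminative classifier
# on the same toy data. scikit-learn is assumed to be installed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # generative: models P(X, Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y | X)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gen = GaussianNB().fit(X_tr, y_tr)                         # learns class priors and class-conditional densities
disc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # learns the decision boundary directly

print("generative (GaussianNB) accuracy:           ", gen.score(X_te, y_te))
print("discriminative (LogisticRegression) accuracy:", disc.score(X_te, y_te))
```

The point of the sketch is only what each model estimates: the generative model fits the full joint distribution and derives the classification from it, while the discriminative model fits the conditional distribution it actually needs.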
2 An easy-to-understand explanation of conditional random fields
Inference in a linear-chain conditional random field, as in a linear-chain hidden Markov model, is generally done with the Viterbi algorithm. This algorithm is one of the simplest dynamic programming algorithms.
First, the inference goal: given an X, find the Y that maximizes P(Y|X). Now consider Z(x): each x corresponds to one Z, so with x fixed it is a constant, and the optimization does not depend on it (the value of y does not affect Z). The exponential is also monotonically increasing, so we can drop it and directly optimize what is inside the exp. The final optimization objective therefore becomes the linear sum inside the exp, which is the weighted sum of every feature at every position. For example, with two states, the total score is the score of moving from the start to the first state plus the score of moving from the first state to the second state, where each "score" is just the weighted sum inside the exp. With this structure we can use Viterbi: first compute, for the first position, the score of taking each label; then, for the second position, compute for each label the maximum attainable score, that is, which previous label the best transition comes from and how large that score is, and remember that transition (what the previous label was). Then do the same for the third position, and so on. After the whole chain has been processed, you know which label the last word most likely takes, and by following the remembered back-pointers (which previous label led to each choice) you can recover the label of every earlier position. That's it. Here I call the weighted sums inside the exp "scores" rather than probabilities, because multiplying two probabilities corresponds exactly to adding two weighted sums, and nothing else changes.
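The decoding step described above can be written down directly. Below is a minimal Viterbi sketch, assuming the per-position label scores (the weighted feature sums inside the exp) have already been computed; the array names emit and trans are my own illustrative choices, not from any CRF library.

```python
# Minimal Viterbi decoding over precomputed scores.
import numpy as np

def viterbi(emit, trans):
    """emit: (T, K) score of each of K labels at each of T positions;
       trans: (K, K) score of moving from label i to label j.
       Returns the highest-scoring label sequence."""
    T, K = emit.shape
    score = emit[0].copy()                 # best score ending in each label at position 0
    back = np.zeros((T, K), dtype=int)     # back-pointers: best previous label
    for t in range(1, T):
        # candidate[i, j] = best score ending in i at t-1, then moving i -> j
        candidate = score[:, None] + trans + emit[t][None, :]
        back[t] = candidate.argmax(axis=0)  # remember which previous label was best
        score = candidate.max(axis=0)
    # backtrack from the best final label
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy example: 4 positions, 3 labels, random scores
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```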
Learning
This is a typical unconstrained optimization problem, and basically all of the optimization methods I know of are used to optimize the likelihood function. Typical choices are gradient descent and its upgraded versions (Newton's method, quasi-Newton methods, BFGS, L-BFGS). The most advanced of these is L-BFGS, so L-BFGS is generally used. In addition, the EM algorithm can also be used to optimize this problem.
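As a small illustration of this kind of learning, here is a hedged sketch that minimizes a negative log-likelihood with SciPy's L-BFGS-B optimizer, using a simple logistic-regression likelihood as a stand-in for the much larger CRF likelihood; the data and function names are assumptions made for the example.

```python
# Minimizing a negative log-likelihood with L-BFGS (scipy).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

def neg_log_likelihood(w):
    z = X @ w
    # negative log-likelihood of a logistic model, plus a small L2 term
    # to keep the (nearly separable) toy problem well-conditioned
    return np.sum(np.log1p(np.exp(-z)) + (1.0 - y) * z) + 0.5 * w @ w

result = minimize(neg_log_likelihood, np.zeros(5), method="L-BFGS-B")
print("estimated weights:", result.x)   # should point roughly in the direction of true_w
```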
3 The past and present of probabilistic graphical models and Markov random fields
The probabilistic undirected graphical model, also known as the Markov random field, is a joint probability distribution that can be represented by an undirected graph.
A graph is a set of nodes together with the edges connecting those nodes. ( Readers familiar with data structures or algorithms will know this part well, so it is not explained in depth here. )
Note: An undirected graph is a graph whose edges have no direction. Although the edges have no direction, their weights can be directional, such as a transition probability: the transition probability from "I" to "love" is 0.5.
A probabilistic graphical model is a probability distribution represented by a graph. Given a joint probability distribution P(Y) over a set of random variables, it is represented by an undirected graph G = <V, E>: in graph G, each node v ∈ V represents a random variable, and each edge e ∈ E represents a probabilistic dependency between random variables, as described in detail in the first part.
Given a joint probability distribution P(Y) and an undirected graph G, how do the pairwise Markov property, the local Markov property, and the global Markov property, which the undirected graph expresses between random variables, differ?
1) Pairwise Markov property: given all the other random variables, two random variables whose nodes are not connected by an edge are conditionally independent.
2) Local Markov property: given the random variables of its neighboring nodes, a random variable is conditionally independent of all the random variables outside that neighborhood.
3) Global Markov property: if node sets A and B are separated in graph G by a node set C, then, given the random variables of C, the random variables of A and the random variables of B are conditionally independent.
Definition of the probabilistic undirected graphical model
Given a joint probability distribution P(Y) represented by an undirected graph G = <V, E>, in which the nodes represent random variables and the edges represent relationships (weighted dependencies) between the random variables, if the joint probability distribution P(Y) satisfies the pairwise, local, or global Markov property, then this joint distribution is called a probabilistic undirected graphical model, or Markov random field.
4 Computing the joint probability distribution: factorization of the probabilistic undirected graphical model
For a given probabilistic graphical model, the essence is that the joint probability can be rewritten as a product of several smaller joint probabilities, that is, a factorization of the joint probability. First, two concepts are introduced: the clique and the maximal clique.
Clique: a subset of nodes of graph G in which any two nodes are connected by an edge is called a clique.
Maximal clique: if C is a clique of the undirected graph G and no further node of G can be added to it to form a larger clique, then C is called a maximal clique.
Note: {Y1, Y2, Y3, Y4} is not a clique, because Y1 is not connected to Y4. (A small code sketch illustrating cliques and maximal cliques follows.)
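Here is a small sketch, on a hypothetical five-edge graph rather than the article's figure, showing the difference between a clique and a maximal clique; networkx.find_cliques enumerates the maximal cliques of an undirected graph.

```python
# Illustrating cliques and maximal cliques on a hypothetical undirected graph.
import networkx as nx

G = nx.Graph()
# Y1-Y2-Y3 form a triangle; Y4 is connected only to Y2 and Y3, not to Y1.
G.add_edges_from([("Y1", "Y2"), ("Y1", "Y3"), ("Y2", "Y3"),
                  ("Y2", "Y4"), ("Y3", "Y4")])

print(list(nx.find_cliques(G)))
# -> [['Y2', 'Y1', 'Y3'], ['Y2', 'Y4', 'Y3']]  (order may vary)
# {Y1, Y2, Y3, Y4} is not a clique here because Y1 and Y4 share no edge.
```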
Factorization of the probabilistic undirected graphical model:
The joint probability distribution of the probabilistic graphical model is expressed as a product of functions of the random variables on its maximal cliques. Why? Because the joint probability would otherwise be too complicated: what if there are more than 10,000 nodes? (If each node is a Chinese character, and we assume each maximal clique is a chapter and the book has 10 chapters, then the joint probability is a product over just 10 maximal cliques.)
The factorization formula for the joint probability distribution P(Y) of the probabilistic graphical model:
Given a probabilistic graphical model with graph G, let C be a maximal clique of G and let Y_C denote the random variables of C. Then the joint probability distribution P(Y) of the probabilistic graphical model can be written as a product, over all maximal cliques C of the graph, of functions ψ_C(Y_C), namely:
P(Y) = (1/Z) ∏_C ψ_C(Y_C),   with Z = Σ_Y ∏_C ψ_C(Y_C)
Here ψ_C(Y_C) is the potential function, C ranges over the maximal cliques, and Z is the normalization factor. The normalization factor guarantees that P(Y) constitutes a probability distribution.
Because the potential function ψ_C(Y_C) is required to be strictly positive, it is usually defined as an exponential function: ψ_C(Y_C) = exp{-E(Y_C)}, where E(Y_C) is an energy function defined on the clique.
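To see the factorization in action, here is a minimal sketch (my own illustration, not from the article) that builds P(Y) from exponential clique potentials on a tiny three-node chain Y1 - Y2 - Y3, whose maximal cliques are {Y1, Y2} and {Y2, Y3}, and checks that the normalized probabilities form a distribution.

```python
# P(Y) = (1/Z) * prod_C psi_C(Y_C) with exponential potentials on a 3-node chain.
import itertools
import numpy as np

labels = [0, 1]
# energy E(y_i, y_j) for each pairwise clique; psi = exp(-E)
E = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # lower energy (higher potential) when neighboring labels agree

def unnormalized(y1, y2, y3):
    return np.exp(-E[y1, y2]) * np.exp(-E[y2, y3])

# normalization factor Z sums the unnormalized product over all assignments
Z = sum(unnormalized(*y) for y in itertools.product(labels, repeat=3))
for y in itertools.product(labels, repeat=3):
    print(y, unnormalized(*y) / Z)   # these probabilities sum to 1
```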
5 References
"1" The beauty of mathematics Wu
"2" machine learning Zhou Zhihua
"3" Statistical natural Language Processing Zongchengqing (second edition)
"4" Statistical learning Method (191---208) Hangyuan li
"5" Network resources
6 Related natural language processing series articles
"Natural Language Processing":"NLP" revealing Markov model mystery series articles
"Natural Language Processing":the "NLP" Big Data Line, a little: Talk about how much the corpus knows
"Natural Language Processing":"NLP" looks back: Talk about the evaluation of Learning Models series articles
"Natural Language Processing":"NLP" quickly understand what natural language processing is
"Natural Language Processing":"NLP" natural language processing applied in real life
Statement: For each part of this article, my aim has been to organize the material and present it plainly and clearly. I systematically read the related bibliography and summarized and organized the material, with the goal of technical sharing and knowledge accumulation. I thank the authors above for the selfless work of bringing this material together in their books. Secondly, my level is limited and this writing serves my own knowledge accumulation, so some subjective understandings are inevitably inappropriate and may inconvenience the reader; in view of this, readers are welcome to give feedback so that corrections can be made in time. This article is original; when reposting, please indicate the source: Prelude: Stepping into Conditional Random Fields.
"NLP" Walking conditions with Airport series article (i)