The conditional random field (CRF) is a discriminative graphical model that has been widely used because of its strong expressive power and excellent performance. From the most general point of view, a CRF is essentially a Markov random field conditioned on a set of observations. Here we approach the CRF from this most general point of view, and in the end we will see that the linear-chain CRF and the so-called higher-order CRFs are simply CRFs with particular graph structures.
1. Random Fields
In a nutshell, a random field can be seen as a set of random variables defined over the same sample space. There may be dependencies among these random variables, and in general it is only when such dependencies exist that grouping them together as a random field has practical meaning.
2. Markov Random Fields (MRF)
A Markov random field is a random field with an additional Markov property. A Markov random field corresponds to an undirected graph: each node of the graph corresponds to a random variable, and an edge between two nodes indicates a probabilistic dependency between the corresponding variables. The structure of a Markov random field therefore essentially reflects our prior knowledge of the problem: which dependencies between variables must be modeled, and which can be ignored. The Markov property states that, for any random variable in the field, its distribution conditioned on all other variables in the field equals its distribution conditioned only on its neighbor nodes in the graph. This is immediately reminiscent of the definition of a Markov chain: both embody the idea that factors far away from the current one (where "far" is defined by the structure itself) have little effect on it.
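As a concrete statement of this local Markov property (the notation N(i) for the neighbor set of node i is introduced here purely for illustration):

```latex
% Local Markov property: conditioning on all other variables in the field
% is equivalent to conditioning on the neighbours N(i) of y_i alone.
P\bigl(y_i \mid \{y_j : j \neq i\}\bigr)
  \;=\;
P\bigl(y_i \mid \{y_j : j \in N(i)\}\bigr)
```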
If the Markov property can be regarded as the microscopic attribute of a Markov random field, then its macroscopic attribute is the form of its joint probability.
Suppose the variable set of the MRF is S = {y1, ..., yn}. Then its joint probability takes the form

P(y1, ..., yn) = 1/Z * exp{ -1/T * U(y1, ..., yn) },
where Z is the normalization factor, obtained by summing the numerator over all configurations of y1, ..., yn. U(y1, ..., yn) is usually called the energy function and is defined as the sum of all clique potentials on the MRF. T is called the temperature and is generally taken to be 1. What is a clique potential? In the graph of the MRF, each clique is associated with a function called its clique potential. A joint probability of this form is also called a Gibbs distribution. The Hammersley-Clifford theorem establishes the equivalence of these two properties (the Markov property and the Gibbs form).
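To make the Gibbs form concrete, here is a minimal brute-force sketch in Python: a three-variable chain MRF with binary variables, whose edge potentials are invented purely for illustration and are not from the original text.

```python
import itertools
import math

# Brute-force Gibbs distribution (T = 1) on a tiny 3-variable chain MRF
# y1 - y2 - y3 with binary variables.  The clique potentials are made up.

def pair_energy(a, b):
    """Energy contribution of an edge clique: lower when neighbours agree."""
    return 0.0 if a == b else 1.0

def energy(y):
    """U(y) = sum of clique potentials (here: the two edge cliques)."""
    y1, y2, y3 = y
    return pair_energy(y1, y2) + pair_energy(y2, y3)

configs = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(y)) for y in configs)   # normalization factor

for y in configs:
    p = math.exp(-energy(y)) / Z
    print(y, round(p, 4))
```

Enumerating all configurations like this is of course only feasible for toy examples; real models rely on the inference algorithms discussed in section 5.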
If the definition of the clique potential does not depend on the position of the clique in the graph, the MRF is said to be homogeneous; if it does not depend on the orientation of the clique in the graph, the MRF is said to be isotropic. In general, to simplify computation, the MRF is assumed to be both homogeneous and isotropic.
3. From Markov Random Fields to CRFs
Now, given an MRF, suppose each random variable also has observations attached to it, and we want to determine the distribution of the MRF conditioned on a given set of observations; this MRF is then called a conditional random field (CRF). Its conditional distribution has almost the same form as the distribution of an MRF, except for the extra observation set X:

p(y1, ..., yn | X) = 1/Z(X) * exp{ -1/T * U(y1, ..., yn, X) },

where U(y1, ..., yn, X) is still a sum of clique potentials.
4. Training
Given a set of training samples, we want to obtain the corresponding distribution of the CRF and then use this distribution to classify test samples, that is, to determine the value of each random variable in a test sample.
In practice, the clique potentials are built primarily from user-defined feature functions: the user defines a set of functions that help describe the distribution of the random variables. The strength and sign of each feature function are expressed by a set of trained weights, so in practice we need to specify the feature functions and the weight-sharing relationships among them (different feature functions may share the same weight). A clique potential is then essentially a linear combination of the corresponding feature functions, and these weights are the parameters of the CRF. Thus, in essence, the structure of the graph is determined by the definition of the feature functions (for example, if there are only unary feature functions, the corresponding graph has no edges), and the distribution of the CRF takes a log-linear form.
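Below is a small sketch of this idea: a clique potential built as the exponential of a weighted sum of feature functions. The two feature functions, the weights, and the label names are all invented for illustration; only the log-linear structure itself comes from the text.

```python
import math

def f_same_label(y_i, y_j, x):
    """Pairwise feature: fires when two neighbouring labels agree."""
    return 1.0 if y_i == y_j else 0.0

def f_capitalised_name(y_i, y_j, x):
    """Pairwise feature that also looks at the observation x."""
    return 1.0 if x[0].isupper() and y_i == "NAME" else 0.0

features = [f_same_label, f_capitalised_name]
weights  = [0.8, 1.5]          # learned parameters of the CRF

def clique_potential(y_i, y_j, x):
    """exp of the weighted sum of feature functions on this clique."""
    score = sum(w * f(y_i, y_j, x) for w, f in zip(weights, features))
    return math.exp(score)

print(clique_potential("NAME", "NAME", "Alice"))
```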
Seeing this form of distribution, we would naturally think of training with the maximum-likelihood criterion. After taking the logarithm, the objective turns out to be convex, so any optimum found is the global optimum, which is a very encouraging property. Moreover, the gradient has an analytic form, so the optimum can be found with L-BFGS.
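The analytic gradient mentioned here has the familiar "empirical count minus expected count" form. Assuming the usual log-linear parameterization p(y | x) ∝ exp(Σ_k w_k f_k(y, x)) (notation introduced here, not from the original), the gradient for a single training pair (x, y) is:

```latex
% Gradient of the conditional log-likelihood w.r.t. the weight w_k of
% feature f_k: empirical feature value minus the model's expected value.
\frac{\partial \log p(y \mid x; w)}{\partial w_k}
  \;=\;
f_k(y, x) \;-\; \sum_{y'} p(y' \mid x; w)\, f_k(y', x)
```

The second term is the model expectation of the feature, which is exactly why each gradient evaluation requires inference.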
Alternatively, training can be done under the maximum-entropy criterion, in which case the more mature GIS and IIS algorithms can be used. Because the distribution is log-linear, the maximum-entropy criterion and the maximum-likelihood criterion are essentially equivalent, so the difference between the two is not large.
In addition, since both of the previous training methods require inference at every iteration, training can be very slow. Therefore an approximate objective function called the pseudo-likelihood is commonly used. It replaces the original likelihood with the product of the conditional distributions of the individual random variables, each conditioned on all the other variables. By the Markov property, each such conditional distribution depends only on the variable's neighbors (its Markov blanket), so no global inference is needed during the iterations and training becomes much faster. My own experience is that when many of the feature functions are real-valued, the pseudo-likelihood performs about as well as maximum likelihood, sometimes even slightly better. But with a large number of binary-valued features, the pseudo-likelihood performs very poorly.
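Written out, the pseudo-likelihood objective for one sample is simply the product of the per-variable conditionals, each depending only on that variable's Markov blanket MB(i) (notation introduced here for illustration):

```latex
% Pseudo-likelihood: product of local conditionals; each factor only needs
% the Markov blanket MB(i), so no global inference is required.
PL(y \mid x; w)
  \;=\;
\prod_{i=1}^{n} p\bigl(y_i \,\big|\, \{y_j : j \in MB(i)\},\, x;\, w\bigr)
```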
5. Inference
As mentioned earlier, we need probabilistic inference during training, and at classification time we need to find the most probable joint assignment; both involve inference. This is essentially a probabilistic inference problem on a graphical model. For the simplest structure, the linear chain, we can use the Viterbi algorithm. If the graph is a tree, we can use belief propagation: the sum-product variant gives the marginal probabilities, and the max-product variant gives the optimal configuration. However, these methods do not apply to arbitrary graphs. An approximate algorithm called loopy belief propagation runs belief propagation on a non-tree structure and obtains an approximate solution by propagating messages repeatedly around loops. It is said to work well in some settings. However, if approximate inference is used during training, training may fail to converge for a long time.
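As a sketch of the linear-chain case, here is a minimal Viterbi implementation in Python. The unary and pairwise score arrays are invented for illustration; in a real CRF they would be the log-domain clique potentials evaluated on the observations.

```python
import numpy as np

def viterbi(unary, pairwise):
    """unary[t, s]: score of label s at position t; pairwise[s, s2]: transition score."""
    T, S = unary.shape
    score = np.zeros((T, S))           # best score of any path ending in (t, s)
    back  = np.zeros((T, S), dtype=int)
    score[0] = unary[0]
    for t in range(1, T):
        for s in range(S):
            cand = score[t - 1] + pairwise[:, s] + unary[t, s]
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]]
    # Backtrack from the best final label to recover the optimal configuration.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))

unary = np.array([[2.0, 0.5], [0.1, 1.0], [1.5, 0.2]])   # 3 positions, 2 labels
pairwise = np.array([[0.5, -0.5], [-0.5, 0.5]])
print(viterbi(unary, pairwise))        # -> [0, 0, 0]
```

Working in the log domain keeps the dynamic program numerically stable and turns products of potentials into sums.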
The probabilistic inference algorithm that works on arbitrary graphs is the junction tree algorithm, which guarantees exact inference on any graph. It first triangulates the original graph, then enumerates the cliques of the triangulated graph as nodes of a new graph (in effect merging feature functions); whenever two cliques intersect, an edge is added between the corresponding nodes. A maximum spanning tree of this clique graph is the junction tree. Finally, running belief propagation on the junction tree guarantees an exact solution.
In essence, these three algorithms all follow the idea of dynamic programming. Viterbi is the most intuitive. Belief propagation first converts the feature functions into factors and combines them with the random variables to form a factor graph, then applies dynamic programming on the factor graph (that is, some preprocessing is done). The junction tree algorithm builds a new graph by merging the original feature functions, which guarantees the no-after-effect property that dynamic programming requires, so that exact inference can be performed (a more complex form of preprocessing).
It is worth noting that although the junction tree algorithm largely avoids combinatorial explosion, it merges feature functions and searches for cliques; if the user-defined feature functions span too many variables, the resulting cliques are very large, and inference then becomes very inefficient, because it has to traverse all configurations of every clique, and the number of configurations is exponential in the clique size. For example, a clique over 5 label variables that each take 10 values already has 10^5 configurations. Therefore, users should avoid defining features over too many variables.