Previously, we gave a rough introduction to the first three representative linear dynamic models and drew a diagram to show the relationship between them. This article introduces the last method, the CRF. We recommend reviewing the maximum entropy model at least once before reading on, so that the relationship between the two becomes clear. You may also want to revisit the supplementary article, Introduction to dynamic models and solutions, to get a clear picture of the representation used.
From the diagram we can see that, just as the HMM is the sequential version of Naive Bayes, the CRF is the sequential version of the maximum entropy (ME) model. Of course, you could say that the maximum entropy Markov model is also a sequence model,
but that article has already pointed out the shortcomings of that model. The CRF takes a different line of thinking; it can also be regarded as a variant of the Markov network. Now let's talk about what a conditional random field is. From the earlier model introductions we know that dynamic models can all be represented by graphs, and from the article Introduction to dynamic models and solutions we know that undirected graphs are used to represent the discriminative modeling approach. Suppose we have an undirected graph G = (V, E), in which each vertex v in V corresponds to a random variable Y_v in the label sequence Y. If, conditioned on the observation sequence X, each random variable Y_v obeys the Markov property with respect to the graph, we say that (X, Y) forms a conditional random field. Let's explain this with a picture. Assume we have a graph G, shown as follows:
If this model satisfies the Markov property, that is,

P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v)

where w ~ v means that w and v are neighbors in G, then (X, Y) is a conditional random field. For example, since the graph above is a conditional random field, the corresponding conditional independence formula holds for each of its vertices.
Some readers may ask: isn't this just a standard factor graph? In fact, the graph structures that satisfy the conditional random field definition can be quite diverse, because the definition is essentially a statement about the conditional independence structure of the label sequence. However, for conditional random fields we usually prefer the following figure as the canonical example:
A conditional random field is a kind of Markov random field and must satisfy the following three equivalent Markov independence properties:
- Pairwise Markov property: any two vertices v_i and v_j that are not connected by an edge have conditionally independent random variables X_i and X_j, given all the other random variables.
- Local Markov property: given the random variables of its adjacent vertices, the random variable X_i is conditionally independent of all the other random variables.
- Global Markov property: if I and J are two disjoint vertex sets separated by a third vertex set, the corresponding random variable sets X_I and X_J are conditionally independent given the random variables of the separating set.
Okay, now that the properties are given, how do we analyze and derive the conditional random field algorithm? In previous articles we have talked about the conditional independence graph. According to the definition of conditional independence, if there is no edge between two vertices in the figure, the random variables represented by those two vertices are conditionally independent given all the other random variables; this is the global Markov property again. In layman's terms, wherever an edge is missing, an independence relationship exists. Given this, the joint probability distribution over all elements of Y represented by the conditional random field graph can be factorized into a product of potential functions, where each potential function acts on an adjacent vertex pair Y_i and Y_{i+1}. Normalization is then needed to ensure that the product of the potential functions forms a valid probability distribution over the random variables at the vertices of G.
Note: at the request of some readers, let's talk about the potential function. Here, we can regard a potential function as a factor in the normalized decomposition of the joint probability density. The scope of a potential function is a maximal clique.
Anyone who has studied graph theory should have heard of the clique concept. For a given graph G = (V, E), where V = {1, ..., N} is the vertex set of G and E is the edge set, a clique of G is a set of vertices in which every pair is connected by an edge. If a clique is not contained in any other clique, that is, it is not a proper subset of any other clique, it is called a maximal clique of G. The maximal clique with the most vertices is called the maximum clique of G. For more information, see Wikipedia. In fact, strictly speaking, for a CRF that is not chain-structured, a potential function is defined for every clique in the graph, not only for the maximum clique.
Formally, a potential function is a non-negative real-valued function that represents the state of the corresponding clique. For example, for a Markov network, the joint probability distribution can be written as

P(X = x) = (1/Z) Π_k φ_k(x^(k))

where x^(k) denotes the state of the k-th clique, that is, the assignment of the variables that appear in that clique, and Z is the normalization constant.
Each clique in the graph has a state, which is expressed by its potential function. That state is built from a weighted combination of multiple features, because a clique contains multiple nodes and the random variable at each node corresponds to one or more features. As mentioned later in the article, in our analysis we will use the simplest binary (indicator) model as the feature for each position.
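To make this factorization concrete, here is a minimal sketch (my own toy illustration, not taken from any paper): a chain of three binary variables whose joint distribution is the normalized product of two pairwise clique potentials. All the potential values are made up for demonstration.

```python
import itertools
import numpy as np

# Two pairwise cliques on the chain y1 - y2 - y3, each with a made-up
# non-negative potential table phi[a, b] scoring the joint state (a, b).
phi_12 = np.array([[4.0, 1.0],
                   [1.0, 4.0]])    # clique (y1, y2)
phi_23 = np.array([[3.0, 1.0],
                   [1.0, 3.0]])    # clique (y2, y3)

def unnormalized(y):
    """Product of clique potentials for a full assignment y = (y1, y2, y3)."""
    return phi_12[y[0], y[1]] * phi_23[y[1], y[2]]

# Normalization constant Z: sum of the product over every possible assignment.
Z = sum(unnormalized(y) for y in itertools.product([0, 1], repeat=3))

# Joint probability of one assignment: P(y) = (1/Z) * prod_k phi_k(y^(k)).
print(unnormalized((0, 0, 1)) / Z)
```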
Since a conditional random field can be decomposed into a product of potential functions, we should start from the potential function to derive the conditional random field model. In the original paper, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, the authors define the model as follows: given the observation sequence x, the probability of a particular label sequence y is defined as a normalized product of potential functions, each of the form

exp( Σ_j λ_j t_j(y_{i-1}, y_i, x, i) + Σ_k μ_k s_k(y_i, x, i) )
Here, t_j(y_{i-1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i-1 and i, and s_k(y_i, x, i) is a state feature function of the entire observation sequence and the label at position i. The λ_j and μ_k are parameters to be estimated from training data. For specific y and x sequences, a feature reflects either a transition between labels or the state of the current label.
Next, just as in the analysis of the maximum entropy model, we choose the simplest feature functions for analysis: binary (indicator) features. The transition feature function can then be defined as an indicator: it equals 1 if y_{i-1} and y_i take particular values, and 0 otherwise.
For convenience of notation, we also write the state feature function in the same four-argument form, s(y_i, x, i) = s(y_{i-1}, y_i, x, i). In this way, we can define a global feature function over x and y:

F_j(y, x) = Σ_i f_j(y_{i-1}, y_i, x, i)

where each f_j may be either a state feature function or a transition feature function.
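As a concrete sketch of such binary features and the global feature function, here is a toy example; the tags, words, and feature choices are all assumptions made for illustration:

```python
def t_transition(y_prev, y_cur, x, i):
    """Transition feature: fires (returns 1) only for the label pair N -> V."""
    return 1 if (y_prev, y_cur) == ("N", "V") else 0

def s_state(y_prev, y_cur, x, i):
    """State feature written in the same 4-argument form: fires when the
    current word is 'run' and its label is 'V' (y_prev is ignored)."""
    return 1 if (x[i] == "run" and y_cur == "V") else 0

def global_feature(f, y, x):
    """Global feature F_j(y, x) = sum over positions of f_j(y_{i-1}, y_i, x, i).
    Position 0 is skipped for brevity; a real implementation would use a
    special start label."""
    return sum(f(y[i - 1], y[i], x, i) for i in range(1, len(x)))

x = ["dogs", "run", "fast"]
y = ["N", "V", "ADV"]
print(global_feature(t_transition, y, x), global_feature(s_state, y, x))
```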
Well, the last step is to derive the conditional probability of the label sequence y given the observation sequence x:

p(y | x) = (1/Z(x)) * exp( Σ_j λ_j F_j(y, x) )

Z(x) is the normalization function, obtained by summing the exponential term over all possible label sequences y. Equivalently, the per-position potential function is

exp( Σ_j λ_j f_j(y_{i-1}, y_i, x, i) )

and p(y | x) is the normalized product of these potentials over all positions. The above is the simplest linear-chain CRF model.
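The following minimal sketch (again with made-up features and weights) computes p(y | x) exactly by enumerating every candidate label sequence to obtain Z(x). This brute-force normalization is only feasible for toy problems; real implementations use the forward-backward algorithm instead:

```python
import itertools
import math

LABELS = ["N", "V", "ADV"]

def score(y, x, weights):
    """Unnormalized log-score: sum over positions of the weighted, fired features."""
    total = 0.0
    for i in range(1, len(x)):
        if (y[i - 1], y[i]) == ("N", "V"):       # transition feature t(y_{i-1}, y_i, x, i)
            total += weights["N->V"]
        if x[i] == "run" and y[i] == "V":        # state feature s(y_i, x, i)
            total += weights["run=V"]
    return total

def prob(y, x, weights):
    """p(y | x) = exp(score(y, x)) / Z(x), with Z(x) summed over all label sequences."""
    z = sum(math.exp(score(list(cand), x, weights))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, weights)) / z

weights = {"N->V": 1.5, "run=V": 2.0}   # made-up parameter values
x = ["dogs", "run", "fast"]
print(prob(["N", "V", "ADV"], x, weights))
```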
Earlier in this series we derived the final form of the maximum entropy model:

p(y | x) = (1/Z(x)) * exp( Σ_j λ_j f_j(x, y) )

Do you see the difference from the conditional random field model? Doesn't this confirm the relationship shown in the diagram at the beginning of this article?
Some may ask: the assumption that the current state depends only on the previous state (or on nothing at all) gives just a first-order linear-chain conditional random field; what if the order is higher? That's not a problem. The feature function we defined earlier has the form

f_j(y_{i-1}, y_i, x, i)

and its general form is

f_j(y_{i-k+1}, ..., y_i, x, i)

where k is the number of consecutive labels the feature looks at. What we just discussed is the first-order case, where k = 2. If k > 2, we get a higher-order chain; simply include the additional labels in the feature functions.
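For example, a second-order indicator feature (k = 3; the label trigram below is a made-up illustration) simply receives one more preceding label:

```python
def second_order_feature(y_prev2, y_prev, y_cur, x, i):
    """Second-order (k = 3) indicator feature: fires only for the trigram N -> V -> ADV."""
    return 1 if (y_prev2, y_prev, y_cur) == ("N", "V", "ADV") else 0
```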
There are two key problems when using linear-chain conditional random fields:

- Training: given label sequences y and the corresponding observation sequences x, how do we find the parameters that maximize the objective? Commonly used methods include the L-BFGS algorithm, max-margin algorithms, and gradient tree boosting.
- Inference: given an observation sequence x and the estimated parameters, how do we find the label sequence with the highest probability? In linear-chain CRFs the Viterbi and forward-backward algorithms are commonly used (see the sketch after this list). These are, in principle, the same problems as for the hidden Markov model.
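Here is a minimal Viterbi sketch for a linear-chain CRF. It assumes the weighted state features and transition features have already been summed into a per-position score matrix and a label-transition score matrix; the matrix names, shapes, and numbers are my own illustration:

```python
import numpy as np

def viterbi(node_scores, trans_scores):
    """Find the highest-scoring label sequence.

    node_scores:  (T, L) array, score of label l at position t
                  (sum of weighted state features).
    trans_scores: (L, L) array, score of moving from label a to label b
                  (sum of weighted transition features).
    """
    T, L = node_scores.shape
    delta = np.zeros((T, L))            # best score of any path ending in (t, l)
    backptr = np.zeros((T, L), dtype=int)

    delta[0] = node_scores[0]
    for t in range(1, T):
        # candidate[a, b] = best path to (t-1, a) + transition a->b + node score of b
        candidate = delta[t - 1][:, None] + trans_scores + node_scores[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        delta[t] = candidate.max(axis=0)

    # Trace back the best path from the last position.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Toy example with 3 positions and 2 labels (all numbers made up).
node = np.array([[2.0, 0.5], [0.2, 1.5], [1.0, 1.0]])
trans = np.array([[0.5, 1.0], [0.1, 0.3]])
print(viterbi(node, trans))
```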
Here we sketch one approach to finding the parameters during training: maximum likelihood estimation. Given a training data set {(x^(n), y^(n))}, we seek the maximum of a function of the form

L(λ) = Σ_n log p(y^(n) | x^(n)) - penalty(λ)

The second term is a penalty (regularization) term, typically a Gaussian prior such as Σ_j λ_j² / (2σ²), used to avoid overfitting. The formula for computing p(y | x) was given above. The first term uses the logarithm because maximizing a likelihood function is equivalent to maximizing its natural logarithm: the logarithm is a continuous, strictly increasing function over the range of the likelihood. The maximum is then found by differentiation.
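As a self-contained sketch of this idea (not the original paper's implementation), the snippet below maximizes the penalized log-likelihood of a toy linear-chain CRF with two made-up features, computing Z(x) by brute-force enumeration and leaving the optimization to SciPy's L-BFGS-B routine:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

LABELS = ["N", "V"]
# Tiny made-up training set: one observation sequence with its label sequence.
DATA = [(["dogs", "run"], ["N", "V"])]

def features(y_prev, y_cur, x, i):
    """Two binary features: a transition indicator and a state indicator."""
    return np.array([
        1.0 if (y_prev, y_cur) == ("N", "V") else 0.0,
        1.0 if (x[i] == "run" and y_cur == "V") else 0.0,
    ])

def log_score(y, x, w):
    """Unnormalized log-score: sum over positions of the weighted features."""
    return sum(w @ features(y[i - 1], y[i], x, i) for i in range(1, len(x)))

def neg_penalized_loglik(w, sigma2=10.0):
    """Negative of: sum_n log p(y_n | x_n) - sum_j w_j^2 / (2 sigma^2)."""
    ll = 0.0
    for x, y in DATA:
        log_z = np.log(sum(np.exp(log_score(list(cand), x, w))
                           for cand in itertools.product(LABELS, repeat=len(x))))
        ll += log_score(y, x, w) - log_z
    return -(ll - np.sum(w ** 2) / (2 * sigma2))

result = minimize(neg_penalized_loglik, x0=np.zeros(2), method="L-BFGS-B")
print(result.x)   # learned weights for the two features
```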
The conditional random field is not an easy model to understand. Although it was proposed only a few years ago, it has become more and more widely used. If you are interested in studying this model further, there is plenty of material available online.
Well, it's finally over. To sum up.
In this series, we first introduced the Naive Bayes and hidden Markov models. Compared with Naive Bayes, the hidden Markov model has the advantage of considering the relationship between state variables (labels). The biggest drawback of both is that, because of the output independence assumption, they cannot take into account the relationships between contextual observation features, which limits the choice of observation features. Note: this does not mean that Naive Bayes and the HMM cannot process non-independent observation sequences. The root cause is that both are generative models, which model the joint probability density. Remember that X and Y are random variables running through the entire observation and label sequences; to evaluate their joint probability density you would have to consider enumerating all the variables, which is almost unrealistic for a long sequence, so we have to assume output independence.
Discriminative models, represented by the CRF and including the maximum entropy model and the maximum entropy Markov model (MEMM), solve this problem because they define the probability conditionally: the p(y | x) model focuses on the conditional probability of the label sequence given a specific observation sequence, which avoids modeling the probability of the observation sequence in a joint distribution. However, for both the maximum entropy model and the MEMM, because each node has to be normalized separately, only a local optimum can be found. At the same time, all conditionally normalized Markov models, including these, suffer from label bias. Simply put, any situation that does not appear in the training set tends to be ignored entirely.
The conditional random field solves the above problems. First, it considers the state variables and observation variables of the full sequence. Second, it does not normalize each node separately but normalizes all features globally, so the global optimum can be obtained. What are the disadvantages of such a good model? Because of the larger scope it has to process, parameter estimation converges more slowly and each iteration is more expensive, so the overall complexity is higher.
This concludes our introduction to several common modeling methods for dynamic models. There are, of course, many classic methods we have not covered, such as neural networks; if you are interested, study them on your own. The methods covered so far are either generative models or discriminative models, each with its own advantages and disadvantages; you should choose between them based on your own application.
××××××××××××××××××××××××××××××××
Series links:
Introduction to dynamic models and solutions
Introduction to dynamic models and solutions - Part 1
Introduction to dynamic models and solutions - Part 2
Introduction to dynamic models and solutions - Part 3
Published 2009-07-09
Filed in algorithm and tagged algorithm, CRF, Conditional Random Field