1. Background
With a powerful log-linear model in hand, almost anything can be thrown in as a classification feature (everything but the kitchen sink), so of course we want to put it to use. The input to a log-linear model is a set of feature functions, which is a very natural abstraction and lends itself well to semantic recognition. Semantic recognition has an important step, often called "labeling a sentence": in short, given a sentence, identify certain characteristics in it, such as the presence of a person name, place, date, or product name, and from those determine the attributes of the sentence (make a trade, issue a task, change a setting, and so on). Being able to accurately identify these tags in a sentence helps you understand it. But how do you decide whether a word is a place name or a person's name? Simply checking whether the first letter is capitalized, or matching the word against a dictionary, is clearly not good enough.
To do this task better, bringing in context becomes a meaningful tool. For natural language, as for most signals, the signal immediately preceding a given signal carries important information (for example, if I have just used the word "important", the next word is most likely a noun). So we build a model that takes adjacent signals as its feature functions, and that model is the conditional random field (CRF).
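As a purely hypothetical illustration (the sentence, the entity labels, and the intent below are invented here, not taken from the original post), the labeling task turns a raw sentence into something like this:

```python
# A made-up example of "labeling a sentence": token-level tags plus a
# sentence-level attribute inferred from them.
sentence = ["Ship", "two", "iPhones", "to", "Alice", "in", "Berlin", "on", "Friday"]
labels   = ["O",    "O",   "PRODUCT", "O",  "PERSON", "O", "PLACE",  "O",  "DATE"]
intent   = "place an order"   # the attribute of the sentence as a whole
```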
The normal log-linear model looks like this:
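The formula itself seems to have been dropped from the post (it was probably an image); the standard log-linear form being referred to is

$$ p(y \mid x; w) \;=\; \frac{\exp\Bigl(\sum_{j} w_j \, F_j(x, y)\Bigr)}{\sum_{y'} \exp\Bigl(\sum_{j} w_j \, F_j(x, y')\Bigr)} $$

where the F_j(x, y) are the feature functions and the w_j are their weights.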
If context is brought into the feature functions, then a feature function looks roughly like this:
Here x denotes the entire sentence and y the label sequence. A sentence of n words (the length of x) clearly has n labels, each drawn from a tag set of size m. Because sentence lengths vary while the tag set stays fixed (there are only so many parts of speech), we need a fixed collection of feature functions whose number does not depend on the sentence length; each F_j instead sums a low-level feature over all positions of the sentence.
Consider a feature function of the form above: we traverse the whole sentence, and at each position i we evaluate a low-level feature f_j on the pair formed by y_{i-1} (meaning the (i-1)-th label of the label sequence) and the i-th label, together with the sentence x and the position i. The results over all positions are then summed to give the value of the overall feature function F_j. The low-level features can take forms like the following (each returns 1 if its condition holds, and 0 otherwise):
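Again the equation appears to be missing; in a linear-chain CRF the feature function is usually written as

$$ F_j(x, y) \;=\; \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i) $$

i.e. a low-level feature f_j evaluated on the previous tag, the current tag, the whole sentence x, and the position i, summed over every position.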
F1. The previous tag is a noun and the current word begins with "M"
F2. The previous tag is an adverb and the current tag is an adjective
F3. The previous tag is an adjective and the current word ends with "y"
...
Obviously, for an ordinary well-formed sentence paired with its correct label sequence, these low-level features sum to a larger value of F_j (because the sentence conforms to the rules of grammar). Note that each f_j involves only two tags. When we have many such rules, the correct rules (for example, about what tends to follow a verb) get trained to higher weights and the wrong rules get lower weights, so the final part-of-speech sequence can be predicted with higher accuracy.
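A minimal sketch of what such feature functions can look like in code; the tag names, string tests, and weights below are all made up for illustration and are not part of any particular library:

```python
# Low-level feature functions f_j(prev_tag, cur_tag, sentence, i):
# each returns 1 if its pattern holds at position i, otherwise 0.
def f1(prev_tag, cur_tag, sentence, i):
    # previous tag is a noun, current word begins with "m" (case-insensitive here)
    return 1 if prev_tag == "NOUN" and sentence[i].lower().startswith("m") else 0

def f2(prev_tag, cur_tag, sentence, i):
    # previous tag is an adverb, current tag is an adjective
    return 1 if prev_tag == "ADV" and cur_tag == "ADJ" else 0

def f3(prev_tag, cur_tag, sentence, i):
    # previous tag is an adjective, current word ends with "y"
    return 1 if prev_tag == "ADJ" and sentence[i].lower().endswith("y") else 0

features = [f1, f2, f3]

def F(j, sentence, tags):
    """Overall feature function F_j(x, y): sum f_j over every position.
    A padding tag "START" stands in for y_0."""
    padded = ["START"] + list(tags)
    return sum(features[j](padded[i], padded[i + 1], sentence, i)
               for i in range(len(sentence)))

# Score one candidate tag sequence with some made-up weights w_j.
sentence = ["the", "very", "merry", "man"]
tags = ["DET", "ADV", "ADJ", "NOUN"]
weights = [0.5, 1.2, 0.8]
print(sum(w * F(j, sentence, tags) for j, w in enumerate(weights)))
```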
2. Mathematical analysis of the CRF model
Once we have the CRF model, we need to train a proper set of parameters w_j to build the classifier. But before estimating the parameters, some preparatory work is needed, such as deriving the expression for the derivative with respect to each parameter. Because we introduced the low-level feature functions f_j, and F_j is a sum over the f_j, the algebra here is a bit more involved.
The whole expression looks like this:
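Reconstructing the missing image, the expression is the log-linear model written with an explicit normalizer Z:

$$ p(y \mid x; w) \;=\; \frac{1}{Z(x, w)} \exp\Bigl(\sum_{j} w_j \, F_j(x, y)\Bigr), \qquad Z(x, w) \;=\; \sum_{y'} \exp\Bigl(\sum_{j} w_j \, F_j(x, y')\Bigr) $$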
If F_j is expanded and substituted in, the expression becomes the following:
(The normalizing denominator around it is omitted.)
The ultimate goal is to rewrite it as:
What happens here is essentially the distributive law: the order of the two sums is exchanged so that the w_j and f_j are grouped by position, and the sum over j is collapsed into a g function. Is it a single g function? Not exactly.
For each position i, g_i is a different function. Each g_i takes two arguments, both of which are tags. So if the tag set has m elements, each g_i has m^2 possible values (one for every combination of two tags). In other words, for a sentence of length n, precomputing all of these tables takes on the order of n*m^2 evaluations.
The amount of computation is still a problem: evaluating the normalizing denominator naively means summing over all m^n possible tag sequences. A recursive algorithm is therefore used to bring the computational cost down; it is not derived in detail here.
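Presumably the missing expression is the numerator with F_j replaced by its sum over positions:

$$ \exp\Bigl(\sum_{j} w_j \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)\Bigr) $$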
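The target form (again a reconstruction of the missing equation) groups the terms by position:

$$ \exp\Bigl(\sum_{i=1}^{n} g_i(y_{i-1}, y_i)\Bigr) \;=\; \prod_{i=1}^{n} \exp\bigl(g_i(y_{i-1}, y_i)\bigr), \qquad g_i(u, v) \;=\; \sum_{j} w_j \, f_j(u, v, x, i) $$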
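A minimal sketch of that recursion, under the assumption that the g_i values have already been collected into an n × m × m array (the array below is random, purely to show the shape of the computation). It computes the normalizer Z(x, w) without enumerating all m^n tag sequences, and checks the result against brute-force enumeration on a tiny case:

```python
import numpy as np
from itertools import product

def partition_function(G):
    """Forward recursion for Z(x, w).

    G has shape (n, m, m) with G[i, u, v] = g_i(u, v) = sum_j w_j * f_j(u, v, x, i),
    where u is the previous tag and v the current tag; tag 0 plays the role of
    the START tag before position 0.  alpha[v] accumulates, over all tag
    prefixes ending in v, the product of exp(g_i) factors, so sum(alpha) at the
    end equals Z(x, w).
    """
    n, m, _ = G.shape
    alpha = np.exp(G[0, 0, :])          # position 0: previous tag is START (= 0)
    for i in range(1, n):               # alpha_new[v] = sum_u alpha[u] * exp(g_i(u, v))
        alpha = np.exp(G[i]).T @ alpha
    return alpha.sum()

# Brute-force check: sum exp(score) over every one of the m**n tag sequences.
rng = np.random.default_rng(0)
n, m = 4, 3
G = rng.normal(size=(n, m, m))

brute = 0.0
for tags in product(range(m), repeat=n):
    prev, total = 0, 0.0                # START tag = 0
    for i, v in enumerate(tags):
        total += G[i, prev, v]
        prev = v
    brute += np.exp(total)

print(partition_function(G), brute)     # the two numbers agree
```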
3. Gradients used in model iterations
To calibrate the log-linear model, we of course need to find the parameters that best fit the training set. "Best fit" here means finding the set of parameters that maximizes the likelihood of the training data. It is convenient to take the logarithm of the probability, which turns the exponentials into a linear sum (plus the log of the normalizer) and makes differentiation straightforward.
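The derivative expression itself appears to have been cut from the post; the standard result for a log-linear model, for each weight w_j on a training pair (x, y), is

$$ \frac{\partial}{\partial w_j} \log p(y \mid x; w) \;=\; F_j(x, y) \;-\; \sum_{y'} p(y' \mid x; w)\, F_j(x, y') \;=\; F_j(x, y) \;-\; \mathbb{E}_{y' \sim p(\cdot \mid x; w)}\bigl[F_j(x, y')\bigr] $$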
OK, so far we have the expression for the derivative with respect to each parameter. The first term, F_j(x, y), is very easy to compute; for any training example it is known. The later term, the expectation E, is more troublesome: it requires plugging every feasible tag sequence into F_j and weighting by p (here, too, p is known once the w_j are given). That makes the gradient computation very expensive, since each iteration would have to traverse the whole space of tag sequences. Clever computer scientists, however, have designed algorithms that compute E from the distribution p using recursions in the same spirit as above; the specific algorithm is not repeated here.

In short, the conditional random field is a model for classifying objects that have many factors and many labels. Training it requires supervised learning, and for robot vision supervised learning is not a simple thing: the shape of an object is hard to tie to its label (is that round thing a cup or a tea canister?). So applying conditional random fields to computer vision should be more effective, and it is worth exploring what they can do with the texture, color, shape, and other information in a two-dimensional image.