http://www.cnblogs.com/liufanping/p/4899842.html
There are many approaches to Chinese word segmentation. Dictionary-based segmentation is the most basic, but the major Internet companies no longer rely on a dictionary alone: machine-learning segmentation is the mainstay, with dictionary methods as a supplement. A while ago I wrote about using the hidden Markov model (HMM) for Chinese word segmentation; the conditional random field (CRF) is in effect an upgrade of the hidden Markov model. There are plenty of articles online about CRF segmentation, but most are hard to follow, perhaps because they come from papers whose authors pile up complicated formulas; I read several and still could not tell what the authors meant. This article therefore takes a practical angle and presents a Chinese word segmentation solution based on the conditional random field model.
Step One: Prepare the corpus
First of all, you need a corpus that has already been segmented into words for the machine to learn from, as shown in the following illustration:
If you do not have one, you can download one here.
Step Two: Preliminary corpus tagging
As with the hidden Markov model, the conditional random field also analyzes words by learning character states. Each character has one of four states: word beginning (Begin), word middle (Middle), word end (End), and single-character word (Single), abbreviated B, M, E, S.
We therefore need to take the corpus from step one and work out the state of every character. For example, the two-character word 收益 (revenue) is tagged as "收|B 益|E". Adding this state information to every character in the corpus gives something like the following illustration:
Of course, if you are feeling lazy, that is fine too: I provide the already-tagged data here, which you can download.
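If you would rather generate the tags yourself, a minimal sketch of the conversion (my own illustration, not the author's code; it assumes the corpus separates words with whitespace) could look like this:
/**
 * Tag each character of a pre-segmented line with its B/M/E/S state.
 * For example, "收益 分析" becomes "收|B 益|E 分|B 析|E".
 */
private static String tagLine(String line) {
    StringBuilder sb = new StringBuilder();
    for (String word : line.trim().split("\\s+")) {
        int n = word.length();
        for (int i = 0; i < n; i++) {
            // A single-character word is S; otherwise the first character
            // is B, the last is E, and anything in between is M.
            char state = (n == 1) ? 'S'
                    : (i == 0) ? 'B'
                    : (i == n - 1) ? 'E' : 'M';
            sb.append(word.charAt(i)).append('|').append(state).append(' ');
        }
    }
    return sb.toString().trim();
}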
Step Three: Feature learning
With the preliminary tagging from step two in place, the third step is comparatively easy. Feature learning is the most important part of the whole process, and before learning anything we must settle one question: which features do we learn? They are as follows:
1. How many times does this character occur in the corpus in total? For example, the character 我 (I) occurs 256 times altogether.
2. The probability of this character appearing as a word beginning (B), word middle (M), word end (E), or single-character word (S). That is, out of the 256 occurrences of 我, what fraction of the time is 我 a word beginning, what fraction a word middle, and so on.
3. When this character is in a given state, say B, what is the probability of transferring to each possible state of the next character? Every character has its own state, but the character that follows it also has a state, so we ask: from the current character's state, what is the probability of the next character being in each of B, M, E, S? For example, when 我 is in state B, the next character's state is B 0 times, M 10 times, E 20 times, and S 0 times; likewise we count the next character's state when 我 is in states M, E, and S. This process is commonly known as computing the state transition probabilities, and it forms a 4x4 matrix for each character.
Having seen the first three features, you may feel there is not much difference from the hidden Markov model. In theory, however, the conditional random field is more accurate than the hidden Markov model, because the CRF also learns contextual relationships, at the cost of more computation: when a character appears, what is the character before it, what is the character after it, and with what probability? That is our fourth feature.
4. When this character appears in a given state, which characters appear around it, and with what probability? For example, when 我 is in state B, the next character is 们 (forming 我们, "we") with probability 67.9%; for the same 我, the previous character is 的 with probability 21%, the next character is 爱 (love) with probability 17.8%, and so on. Recording the context of each character in each of the four states is a very important step. Here we only record the relationship with the immediately preceding and following character; if you have the capacity, you can record the two characters before and the two characters after instead.
Expressed in code, this looks like the following:
/**
 * Features for each character.
 *
 * @author liufanping
 */
private static class Feature {

    /**
     * The state transition matrix (feature 3).
     */
    private double[][] transfer;

    /**
     * The state probabilities of this character (feature 2).
     */
    private double[] status;

    /**
     * The previous-character context (feature 4).
     */
    private TreeMap<Integer, Double> preStatus;

    /**
     * The next-character context (feature 4).
     */
    private TreeMap<Integer, Double> nextStatus;

    /**
     * Total count (feature 1).
     */
    private int cnt;

    ... ...
}
Two things to note in the code above:
1. Every character gets such a Feature object; you can use a hash table, with the character as the key and its Feature as the value.
2. preStatus records the context before the current character, and nextStatus the context after it. The key is a hash of the neighboring character combined with the current character's state. For example, if the current character is 我, its state is B, and the next character is 们, then the key in nextStatus is the hash of "们_B", and the value is the probability of that combination occurring.
Learning these four features from the corpus should not be difficult; I believe you can complete the code from the description above.
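As a rough sketch of what that learning code could look like (my own illustration, not the author's implementation; it assumes the Feature fields have been initialized, e.g. status = new double[4] and transfer = new double[4][4], and that step two delivers each sentence as parallel character and state arrays):
private static final String STATES = "BMES";

/**
 * Accumulate raw counts for features 1-4 from one tagged sentence.
 */
private static void learn(char[] chars, char[] states,
                          java.util.Map<Character, Feature> model) {
    for (int i = 0; i < chars.length; i++) {
        Feature f = model.computeIfAbsent(chars[i], k -> new Feature());
        int s = STATES.indexOf(states[i]);
        f.cnt++;                   // feature 1: occurrence count
        f.status[s]++;             // feature 2: B/M/E/S counts
        if (i + 1 < chars.length) {
            // feature 3: transition from this state to the next character's state
            f.transfer[s][STATES.indexOf(states[i + 1])]++;
            // feature 4: next-character context, keyed by hash of "char_state"
            f.nextStatus.merge((chars[i + 1] + "_" + states[i]).hashCode(),
                    1.0, Double::sum);
        }
        if (i > 0) {
            // feature 4: previous-character context
            f.preStatus.merge((chars[i - 1] + "_" + states[i]).hashCode(),
                    1.0, Double::sum);
        }
    }
}
After the whole corpus has been processed, dividing the raw counts by the appropriate totals turns them into the probabilities described above.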
Step Four: Segmentation
Now that the features are trained, how do we actually segment a sentence? Say the user inputs 希腊的经济结构较特殊 ("Greece's economic structure is rather special"): how is it segmented? It is actually very simple; what follows is just arithmetic.
1. Split 希腊的经济结构较特殊 into a character array: 希, 腊, 的, 经, 济, 结, 构, 较, 特, 殊.
2. Retrieve the features of each character (the data produced in step three).
Now that every character's features have been retrieved, what comes next? Think about what we are really doing: deciding, for each character, which of the four states B, M, E, S it is in. So we can draw a matrix with one column per character and one row per state.
Since we have a matrix and we are looking for a path through it, the Viterbi algorithm naturally comes to mind. We only need to compute the values of the first character 希 in each of the states B, M, E, S, and then work through the rest of the sentence one character at a time.
Call the value matrix S. For the current character c in state t:
S[c][t] = max over t' in {B, M, E, S} of ( P[t'][t] × S[prev][t'] ) + Wpre[prev_t][c] + Wnext[next_t][c] + R[t]
where R is feature two (the state probability of the current character), P is feature three (the state transition matrix), Wpre is the previous-character part of feature four, and Wnext is the next-character part of feature four.
For example, for 腊, the value for state B is:
S[腊][B] = max( P[B][B] × S[希][B], P[M][B] × S[希][M], P[E][B] × S[希][E], P[S][B] × S[希][S] ) + Wpre[希_B][腊] + Wnext[的_B][腊] + R[B]
and S[腊][M], S[腊][E], S[腊][S] are computed in the same way.
The remaining characters follow by analogy. Because 希 is the first character, no other state can transfer into it, so the transition term drops out: S[希][B] = Wnext[腊_B][希] + R[B], and the other values for 希 are computed in turn.
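Putting the recurrence together, a Viterbi sketch (again my own illustration, not the author's code; it assumes every character of the sentence is present in the model) might look like this:
/**
 * Returns the B/M/E/S state sequence for the sentence.
 */
private static char[] viterbi(char[] sent, java.util.Map<Character, Feature> model) {
    int n = sent.length;
    double[][] S = new double[n][4];   // S[i][s]: best score of character i in state s
    int[][] back = new int[n][4];      // back-pointers for the trace-back
    for (int s = 0; s < 4; s++)
        S[0][s] = local(sent, 0, s, model);              // first character: no transition term
    for (int i = 1; i < n; i++) {
        double[][] p = model.get(sent[i - 1]).transfer;  // feature 3 of the previous character
        for (int s = 0; s < 4; s++) {
            S[i][s] = Double.NEGATIVE_INFINITY;
            for (int prev = 0; prev < 4; prev++) {
                double v = p[prev][s] * S[i - 1][prev];
                if (v > S[i][s]) { S[i][s] = v; back[i][s] = prev; }
            }
            S[i][s] += local(sent, i, s, model);
        }
    }
    char[] states = new char[n];
    int s = 0;
    for (int k = 1; k < 4; k++) if (S[n - 1][k] > S[n - 1][s]) s = k;  // best final state
    for (int i = n - 1; i >= 0; i--) { states[i] = "BMES".charAt(s); s = back[i][s]; }
    return states;
}

/**
 * Wpre + Wnext + R for character i in state s.
 */
private static double local(char[] sent, int i, int s,
                            java.util.Map<Character, Feature> model) {
    Feature f = model.get(sent[i]);
    char st = "BMES".charAt(s);
    double v = f.status[s];                                             // R, feature 2
    if (i > 0)                                                          // Wpre, feature 4
        v += f.preStatus.getOrDefault((sent[i - 1] + "_" + st).hashCode(), 0.0);
    if (i + 1 < sent.length)                                            // Wnext, feature 4
        v += f.nextStatus.getOrDefault((sent[i + 1] + "_" + st).hashCode(), 0.0);
    return v;
}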
I hope you will work through the process above a few times; once it clicks, you will find it very simple. The computation produces the following matrix:
Looking at it, you can almost read off the segmentation result. After every value in the matrix there is a parenthesis that stores the path, that is, which previous cell the value came from (computing the max involves a path choice); recording it lets us trace the path back.
Once the matrix is complete, we simply take the largest value for the last character 殊, namely 1.1867, whose state is End; its recorded path points back to the character 特 in state Begin; 特 in turn points back to 较 in state Single, and so on to the first character. The resulting states are:
希/B 腊/E 的/S 经/B 济/E 结/B 构/E 较/S 特/B 殊/E
which reads as the segmentation 希腊 / 的 / 经济 / 结构 / 较 / 特殊, i.e. "Greece's economic structure is rather special".
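As a hypothetical usage example (viterbi and model come from my sketches above, not from the original post): joining the characters into words at every E or S state reproduces exactly this segmentation:
char[] sent = "希腊的经济结构较特殊".toCharArray();
char[] st = viterbi(sent, model);
StringBuilder out = new StringBuilder();
for (int i = 0; i < sent.length; i++) {
    out.append(sent[i]);
    if (st[i] == 'E' || st[i] == 'S') out.append(' ');   // a word ends at E or S
}
System.out.println(out.toString().trim());               // 希腊 的 经济 结构 较 特殊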
That completes the segmentation process. If any of it still feels vague, I recommend taking a close look at the Viterbi algorithm.
Finally, thank you for reading all the way to the end. This is a relatively simple use of conditional random fields for word segmentation; more complex, optimized versions can be built on top of it.
Sincere thanks.