Compared with shortest-path and N-shortest-path segmentation based on hidden Markov models, word segmentation based on conditional random fields (CRF) handles out-of-vocabulary words better. HanLP implements CRF model loading and Viterbi decoding in pure Java, stores the feature functions internally in a double-array trie (DoubleArrayTrie), and so obtains a high-performance Chinese word segmenter.
About CRF
CRF is a commonly used model for sequence labeling. It can use more features than an HMM and is more resistant to the label bias problem than an MEMM.
CRF Training
Time-consuming work like training is still left to CRF++, which is implemented in C++. For the model file that CRF++ outputs, refer to the CRF++ model format description.
CRF decoding
Decoding is implemented with the Viterbi algorithm, slightly modified. In plain language it works as follows:
The label of any character depends not only on its own parameters but also on the label of the previous character. The first character has no predecessor, so where does that label come from? The first character is therefore handled slightly differently: assume a virtual character before it carries label X, iterate over every X to score the labels of the first character, and keep the one with the highest score.
How is the score of a label for a character computed? Based on the templates stored in the CRF model, a character generates a series of feature functions, and the output of each function multiplied by its weight yields a score. That is only the node function score; the edge function score must be added to it. The edge function is simplified in this model to f(s', s), where s' is the label of the previous character and s is the label of the current character. The edge function can therefore be described by a 4×4 matrix, which plays the role of the transition probabilities in an HMM.
With the scoring function in place, decoding continues from the second character onward and gives every character one of the labels B, E, M, S.
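The scoring rule above can be sketched in a few lines of Java. This is a minimal illustration, not HanLP's actual code: the class name `LabelScorer`, the method names, and the choice to represent each fired feature function as an array of per-label weights are all assumptions made for the example.

```java
import java.util.List;

/** Sketch of the per-character scoring rule; every name here is hypothetical. */
class LabelScorer {
    /** Label order assumed for the weight arrays and the 4x4 matrix. */
    static final String[] TAGS = {"B", "E", "M", "S"};

    /**
     * Node score: each feature function that fires at this position
     * contributes its weight for the candidate label.
     * featureWeights holds one double[TAGS.length] per fired feature.
     */
    static double nodeScore(List<double[]> featureWeights, int tag) {
        double score = 0.0;
        for (double[] w : featureWeights) {
            score += w[tag];
        }
        return score;
    }

    /** Total score = edge score f(s', s) read from the matrix + node score. */
    static double score(double[][] transition, int prevTag, int tag,
                        List<double[]> featureWeights) {
        return transition[prevTag][tag] + nodeScore(featureWeights, tag);
    }
}
```

Keeping the edge scores in a dense matrix makes the lookup for f(s', s) a single array access, which matters because it sits in the innermost loop of decoding.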
Example
Take the classic "商品和服务" (goods and services) as an example. HanLP's CRFSegment segmenter first splits it into a table:
商 null
品 null
和 null
服 null
务 null
null indicates that the segmenter has not yet labeled this character.
Code
After all that explanation, the implementation itself is quite concise:
// requires java.util.LinkedList
/**
 * Label the table with the Viterbi backward algorithm
 * @param table
 */
public void tag(Table table)
{
    int size = table.size();
    // Start below any reachable score, so negative scores can still win
    double bestScore = Double.NEGATIVE_INFINITY;
    int bestTag = 0;
    int tagSize = id2tag.length;
    LinkedList<double[]> scoreList = computeScoreList(table, 0);    // feature functions hit at position 0
    for (int i = 0; i < tagSize; ++i)    // iterate over the labels at position -1
    {
        for (int j = 0; j < tagSize; ++j)    // iterate over the labels at position 0
        {
            double curScore = matrix[i][j] + computeScore(scoreList, j);
            if (curScore > bestScore)
            {
                bestScore = curScore;
                bestTag = j;
            }
        }
    }
    table.setLast(0, id2tag[bestTag]);
    int preTag = bestTag;
    // position 0 is scored; now score the remaining positions
    for (int i = 1; i < size; ++i)
    {
        scoreList = computeScoreList(table, i);    // feature functions hit at position i
        // Double.MIN_VALUE is the smallest positive double, so it cannot serve as a lower bound
        bestScore = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < tagSize; ++j)    // iterate over the labels at position i
        {
            double curScore = matrix[preTag][j] + computeScore(scoreList, j);
            if (curScore > bestScore)
            {
                bestScore = curScore;
                bestTag = j;
            }
        }
        table.setLast(i, id2tag[bestTag]);
        preTag = bestTag;
    }
}
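The tag method above fixes the label of each position as soon as it has been scored and then moves on. For comparison, a textbook Viterbi pass keeps the best score for every label at every position and recovers the path by back-tracing at the end. The following is a minimal sketch of that variant, not HanLP's code: the class `ViterbiSketch`, its inputs `node` and `trans`, and the decision to score the first position without a virtual start label are assumptions made for the illustration.

```java
/** Textbook Viterbi over precomputed scores; a sketch with hypothetical names. */
class ViterbiSketch {
    /**
     * @param node  node[i][t]: node score of label t at position i
     * @param trans trans[s][t]: edge score of moving from label s to label t
     * @return the index of the chosen label for every position
     */
    static int[] decode(double[][] node, double[][] trans) {
        int n = node.length;
        if (n == 0) {
            return new int[0];
        }
        int tagSize = trans.length;
        double[][] best = new double[n][tagSize]; // best[i][t]: best path score ending in t at i
        int[][] back = new int[n][tagSize];       // back[i][t]: previous label on that path
        // Assumption: no virtual start label, the first position uses node scores only.
        System.arraycopy(node[0], 0, best[0], 0, tagSize);
        for (int i = 1; i < n; i++) {
            for (int t = 0; t < tagSize; t++) {
                double max = Double.NEGATIVE_INFINITY;
                for (int s = 0; s < tagSize; s++) {
                    double candidate = best[i - 1][s] + trans[s][t];
                    if (candidate > max) {
                        max = candidate;
                        back[i][t] = s;
                    }
                }
                best[i][t] = max + node[i][t];
            }
        }
        // Pick the best final label, then follow the back-pointers.
        int last = 0;
        for (int t = 1; t < tagSize; t++) {
            if (best[n - 1][t] > best[n - 1][last]) {
                last = t;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int i = n - 1; i > 0; i--) {
            path[i - 1] = back[i][path[i]];
        }
        return path;
    }
}
```

The two approaches agree whenever the locally best label is also on the globally best path. The back-pointer version costs an extra factor of the label count per position, which is small with only four labels.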
Labeling Results
Print the table after labeling:
CRF标注结果
商 B
品 E
和 S
服 B
务 E
Final processing
Merge the characters according to their BEMS labels to get:
[商品/null, 和/null, 服务/null]
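The merge step can be sketched as follows, assuming the characters and their labels are available as two parallel arrays. The class `BemsMerger` and its `merge` method are hypothetical names used only for this illustration, and the handling of an unterminated B/M run is an assumption rather than documented HanLP behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of merging labeled characters into words; names are hypothetical. */
class BemsMerger {
    /** Joins characters into words: a word is closed at every E or S label. */
    static List<String> merge(char[] chars, char[] labels) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            current.append(chars[i]);
            if (labels[i] == 'E' || labels[i] == 'S') {
                words.add(current.toString());
                current.setLength(0);
            }
        }
        // Assumption: a trailing B/M run with no closing E is still emitted as a word.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

Called with the characters of "商品和服务" and the labels B, E, S, B, E, this yields the three words 商品, 和, and 服务.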
Each word is then looked up in the dictionary. Words that are not found are temporarily tagged nx and their positions are recorded (because such a word is new, its final part of speech is set to null to mark it as special). Viterbi is then used again to tag the parts of speech.
New word recognition
CRF is good at recognizing new words, for example:
CRFSegment segment = new CRFSegment();
segment.enableSpeechTag(true);
System.out.println(segment.seg("你看过穆赫兰道吗"));
Output:
CRF标注结果
你 S
看 S
过 S
穆 B
赫 M
兰 M
道 E
吗 S
[你/rr, 看/v, 过/uguo, 穆赫兰道/null, 吗/y]
null marks a new word.
Pure Java implementation of CRF word segmentation