Pure Java implementation of CRF word segmentation

Source: Internet
Author: User

Compared with shortest-path segmentation and N-shortest-path segmentation based on hidden Markov models, word segmentation based on conditional random fields (CRF) handles out-of-vocabulary words better. HanLP implements the loading of the CRF model and Viterbi decoding in pure Java, stores the feature functions internally in a double-array trie (DoubleArrayTrie), and obtains a high-performance Chinese word segmenter.

About CRF

CRF is a commonly used model for sequence labeling. It can use more features than HMM and is more resistant to the label bias problem than MEMM.
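
For reference, the comparison above can be read off the usual textbook definition of a linear-chain CRF. The notation here (weights \lambda_k, feature functions f_k, normalizer Z) is the standard one and is an assumption, not something taken from HanLP or CRF++:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y'_{i-1}, y'_i, x, i) \Big)
```

The normalizer Z(x) sums over whole label sequences, whereas MEMM normalizes at every step; that global normalization is what reduces label bias. The feature functions may inspect any part of the input x, which is why more features are available than in HMM.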

CRF Training

This time-consuming task is still left to CRF++, which is implemented in C++. For the model that CRF++ outputs, refer to the CRF++ model format description.

CRF decoding

Decoding is implemented with a slightly modified Viterbi algorithm. In plain language, it works as follows:

The label of any character depends not only on its own parameters but also on the label of the previous character. But nothing precedes the first character, so where would that label come from? The first character is therefore handled slightly differently: assume the imaginary character before it has label x, traverse all values of x to score each label of the first character, and take the label with the highest score.

How is the score of a label for a character calculated? Each character generates a series of feature functions from the templates provided by the CRF model, and the output value of each function multiplied by its weight gives a score. That is only the node function score; the edge function score must be added to it. In this model the edge function is simplified to f(s', s), where s' is the label of the previous character and s is the label of the current character. The edge function can therefore be described by a 4×4 matrix, which plays the role of the transition probabilities in HMM.
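
The scoring rule described above can be sketched as follows. This is a minimal illustration, not HanLP's code: the class name `ScoreSketch`, the tag order and all weights are made up, and each matched feature function is assumed to be binary and to carry one weight per tag.

```java
import java.util.Arrays;
import java.util.List;

class ScoreSketch {
    // Hypothetical tag order, assumed for illustration only: B=0, E=1, M=2, S=3.
    static final int TAG_SIZE = 4;

    /**
     * Node score: sum of the weights of all matched feature functions for one tag.
     * featureWeights.get(k)[tag] is the weight of the k-th matched feature for that tag.
     */
    static double nodeScore(List<double[]> featureWeights, int tag) {
        double score = 0.0;
        for (double[] w : featureWeights) {
            score += w[tag]; // a binary feature outputs 1, so output * weight = weight
        }
        return score;
    }

    /** Total score: edge score f(s', s) from the 4x4 matrix plus the node score. */
    static double score(double[][] matrix, int preTag, List<double[]> featureWeights, int tag) {
        return matrix[preTag][tag] + nodeScore(featureWeights, tag);
    }

    public static void main(String[] args) {
        double[][] matrix = new double[TAG_SIZE][TAG_SIZE];
        matrix[0][1] = 0.5; // made-up weight for the transition B -> E
        List<double[]> hits = Arrays.asList(
                new double[]{0.1, 1.0, -0.2, 0.0},
                new double[]{0.0, 0.25, 0.3, -0.1});
        // Score of label E after label B: 0.5 + (1.0 + 0.25)
        System.out.println(score(matrix, 0, hits, 1));
    }
}
```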

With the scoring function in place, Viterbi backward decoding can be applied from the second character onward to give every character a BEMS label.
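
For comparison, here is a minimal sketch of the textbook Viterbi recursion, which keeps the best score for every tag at every position together with back pointers and recovers the path at the end. It is a generic illustration rather than HanLP's implementation; the class name `ViterbiSketch`, the precomputed `node` array and the two-tag toy scores are all assumptions.

```java
import java.util.Arrays;

class ViterbiSketch {
    /**
     * @param node  node[i][t] = node score of tag t at position i (assumed precomputed)
     * @param trans trans[p][t] = edge score for moving from tag p to tag t
     * @return the highest-scoring tag sequence
     */
    static int[] decode(double[][] node, double[][] trans) {
        int n = node.length;
        if (n == 0) return new int[0];
        int tagSize = trans.length;
        double[][] best = new double[n][tagSize]; // best score of any path ending in tag t at i
        int[][] back = new int[n][tagSize];       // previous tag on that best path

        // Position 0 has no predecessor, so this sketch uses the node score alone.
        best[0] = node[0].clone();

        for (int i = 1; i < n; ++i) {
            for (int t = 0; t < tagSize; ++t) {
                best[i][t] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < tagSize; ++p) {
                    double s = best[i - 1][p] + trans[p][t] + node[i][t];
                    if (s > best[i][t]) {
                        best[i][t] = s;
                        back[i][t] = p;
                    }
                }
            }
        }

        // Pick the best final tag, then follow the back pointers to the start.
        int last = 0;
        for (int t = 1; t < tagSize; ++t) {
            if (best[n - 1][t] > best[n - 1][last]) last = t;
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int i = n - 1; i > 0; --i) {
            path[i - 1] = back[i][path[i]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Two tags and three positions, with made-up scores.
        double[][] node = {{1.0, 0.0}, {0.0, 0.5}, {0.0, 2.0}};
        double[][] trans = {{0.0, 1.0}, {1.0, 0.0}};
        System.out.println(Arrays.toString(decode(node, trans)));
    }
}
```

Keeping one score per tag costs O(n · tagSize²) time, which is negligible for the four BEMS tags.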

Instance

Take the classic "商品和服务" (goods and services) as an example. First, HanLP's CRFSegment segmenter splits it into a table:

    商   null
    品   null
    和   null
    服   null
    务   null

null indicates that the segmenter has not yet labeled this character.

Code

After all that explanation, the implementation itself is very concise:

    /**
     * Tag the table using the Viterbi backward algorithm.
     * Requires: import java.util.LinkedList;
     *
     * @param table the character table to label
     */
    public void tag(Table table)
    {
        int size = table.size();
        // Scores can be negative, so start below any reachable score
        double bestScore = Double.NEGATIVE_INFINITY;
        int bestTag = 0;
        int tagSize = id2tag.length;
        LinkedList<double[]> scoreList = computeScoreList(table, 0);    // feature functions hit at position 0
        for (int i = 0; i < tagSize; ++i)   // iterate over the label at position -1
        {
            for (int j = 0; j < tagSize; ++j)   // iterate over the label at position 0
            {
                double curScore = matrix[i][j] + computeScore(scoreList, j);
                if (curScore > bestScore)
                {
                    bestScore = curScore;
                    bestTag = j;
                }
            }
        }
        table.setLast(0, id2tag[bestTag]);
        int preTag = bestTag;
        // Position 0 is scored; now score the remaining positions
        for (int i = 1; i < size; ++i)
        {
            scoreList = computeScoreList(table, i);    // feature functions hit at position i
            bestScore = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < tagSize; ++j)   // iterate over the label at position i
            {
                double curScore = matrix[preTag][j] + computeScore(scoreList, j);
                if (curScore > bestScore)
                {
                    bestScore = curScore;
                    bestTag = j;
                }
            }
            table.setLast(i, id2tag[bestTag]);
            preTag = bestTag;
        }
    }

Labeling Results

Print the table after labeling:

    CRF labeling result
    商   B
    品   E
    和   S
    服   B
    务   E

Final processing

Merging the characters according to their BEMS labels gives:

    [商品/null, 和/null, 服务/null]
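
The merge step can be sketched as below, assuming the labels arrive as one character per input character. The class name `BemsMergeSketch` and the handling of a dangling B or M at the end of the input are illustrative choices, not HanLP's code.

```java
import java.util.ArrayList;
import java.util.List;

class BemsMergeSketch {
    /** Joins characters into words: a word ends at every E or S label. */
    static List<String> merge(char[] chars, char[] tags) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < chars.length; ++i) {
            current.append(chars[i]);
            if (tags[i] == 'E' || tags[i] == 'S') {
                words.add(current.toString());
                current.setLength(0); // start collecting the next word
            }
        }
        if (current.length() > 0) {
            // Dangling B/M at the end of the input: emit what is left as a word
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Labels taken from the labeling result printed earlier
        System.out.println(merge("商品和服务".toCharArray(), "BESBE".toCharArray()));
        // prints [商品, 和, 服务]
    }
}
```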

Each word is then looked up in the dictionary. A word that is not found is temporarily treated as nx and its position is recorded (because it is a new word, its final part of speech is set to null to mark it as special). Viterbi is then used once more, this time to tag the parts of speech:

    [商品/n, 和/cc, 服务/vn]

New word recognition

CRF is good at recognizing new words, for example:

    CRFSegment segment = new CRFSegment();
    segment.enableSpeechTag(true);
    System.out.println(segment.seg("你看过穆赫兰道吗"));

Output:

    CRF labeling result
    你   S
    看   S
    过   S
    穆   B
    赫   M
    兰   M
    道   E
    吗   S
    [你/rr, 看/v, 过/uguo, 穆赫兰道/null, 吗/y]

null marks a new word.
