Design and Implementation of a Fast Word Segmentation System
(*** Computer Science Institute)
Abstract: Building on an analysis of existing word segmentation algorithms, this paper first combines hash and trie tree structures to improve the dictionary and thereby the segmentation speed. Second, it adds rules on top of the existing models to safeguard segmentation accuracy. Experiments show that both the speed and the accuracy of the whole system improve to some extent.
Key words: Word Segmentation; hash; trie tree; rules
1. Overview
Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications. In English text, words are separated by spaces, which act as natural delimiters. In Chinese, only sentences and paragraphs are delimited, by punctuation and paragraph breaks; there is no such delimiter between words. Processing Chinese on a computer is therefore much more complex than processing English.
Words are the smallest meaningful language units that can be used independently. All of a computer's language knowledge comes from machine dictionaries (which hold various information about words), from syntax rules (which describe how words combine in terms of word classes), and from semantic, contextual, and pragmatic knowledge bases about words and sentences. Whenever a Chinese information processing system involves syntax or semantics (as in retrieval, translation, summarization, and proofreading), the word must be its basic unit. Only after the Chinese characters of a sentence have been grouped into words can text processing such as syntactic analysis, sentence understanding, automatic summarization, automatic classification, and machine translation proceed. Word segmentation is thus the foundation of machine linguistics.
An implementation of a word segmentation algorithm has to consider both accuracy and speed. Whichever segmentation method is used, a large share of the time goes into computing the candidate words of the sentence to be segmented; statistical or syntactic rules are then applied to those candidates to obtain the most likely correct segmentation. If this initial candidate search can be accelerated, the whole segmentation algorithm becomes faster.
Building on previous work, this paper designs an efficient dictionary structure that speeds up lookup and hence segmentation. In addition, two segmentation rules are added to safeguard accuracy. Together these address both the speed and the accuracy of word segmentation.
2. Dictionary structure
To solve the speed problem, we first introduce two data structures:
1. Hash. A hash algorithm converts an input of any length, also called the pre-image, into an output of fixed length, the hash value. This conversion is a compressing mapping: the space of hash values is usually much smaller than the input space, different inputs may hash to the same output, and the input therefore cannot be uniquely determined from the hash value. Simply put, a hash is a function that compresses a message of any length into a fixed-length message digest.
How can a hash determine the position of a Chinese character during lookup? That depends on how the character encoding is laid out. Here we give one hash function strategy.
For GB2312 encoding, let the input Chinese character be gbword. Its index gbindex is given by the formula (C1 - 176) * 94 + (C2 - 161), where C1 is the first byte and C2 is the second byte. In code:
gbindex = ((unsigned char)gbword.at(0) - 176) * 94 + ((unsigned char)gbword.at(1) - 161);
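The formula can also be written as a small standalone C function. This is a minimal sketch, not the system's code: the name gb2312_index, the macro GB2312_HANZI_SLOTS, and the range check are assumptions added for illustration. The check relies on the fact that GB2312 Chinese characters occupy first bytes 0xB0 to 0xF7 and second bytes 0xA1 to 0xFE.

```c
/* Slots needed for first bytes 0xB0..0xF7, with 94 second bytes each. */
#define GB2312_HANZI_SLOTS ((0xF7 - 0xB0 + 1) * 94)

/* Return the table index of the two-byte GB2312 Chinese character at
 * gbword, or -1 if the bytes lie outside the Chinese-character area.
 * (Hypothetical helper; the range check is an added assumption.) */
int gb2312_index(const unsigned char *gbword)
{
    unsigned char c1 = gbword[0];
    unsigned char c2;

    if (c1 < 0xB0 || c1 > 0xF7)
        return -1;
    c2 = gbword[1];
    if (c2 < 0xA1 || c2 > 0xFE)
        return -1;
    return (c1 - 176) * 94 + (c2 - 161);
}
```

Because every valid character maps to a distinct index below GB2312_HANZI_SLOTS, the index can address a plain array directly, with no collision handling.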
2. Trie tree
A trie, also known as a dictionary tree or word search tree, is a tree structure for storing a large number of strings. Its advantage is that strings share their common prefixes, which saves storage space.
Its basic properties can be summarized as follows:
1. The root node does not contain a character. Every other node contains exactly one character.
2. Concatenating the characters on the path from the root node to a node yields the string corresponding to that node.
3. The child nodes of any one node all contain different characters.
The basic operations are query, insert, and delete. Delete operations are rare; here only deletion of the entire tree is implemented, although deleting a single word is also simple.
A dictionary entry can be looked up as follows:
(1) Start the search from the root node.
(2) Take the first letter of the keyword, select the corresponding subtree, and continue the search in that subtree.
(3) In that subtree, take the second letter of the keyword and select the corresponding subtree.
(4) Repeat this process for the remaining letters.
(5) When all letters of the keyword have been consumed at some node, read the information attached to that node to complete the search.
Other operations are similar.
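The lookup steps above can be sketched in C as follows. This is an illustrative sketch under simplifying assumptions, not the system's code: keys are restricted to the lowercase letters a to z, the attached information is reduced to a single is_word flag, and all names (TrieNode, trie_new, trie_insert, trie_contains, trie_free) are hypothetical.

```c
#include <stdlib.h>

#define ALPHABET 26   /* assumption: keys use only 'a'..'z' */

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    int is_word;      /* stands in for the information attached to a node */
} TrieNode;

/* Allocate an empty node; calloc leaves all children NULL. */
TrieNode *trie_new(void)
{
    return calloc(1, sizeof(TrieNode));
}

/* Insert key. Returns 0 on success, -1 on bad input or failed allocation. */
int trie_insert(TrieNode *root, const char *key)
{
    TrieNode *node = root;
    for (; *key != '\0'; key++) {
        int i;
        if (*key < 'a' || *key > 'z')
            return -1;
        i = *key - 'a';
        if (node->child[i] == NULL) {
            node->child[i] = trie_new();
            if (node->child[i] == NULL)
                return -1;
        }
        node = node->child[i];
    }
    node->is_word = 1;
    return 0;
}

/* Follow one child per letter, starting at the root (steps 1 to 4),
 * then read the flag on the node that was reached (step 5). */
int trie_contains(const TrieNode *root, const char *key)
{
    const TrieNode *node = root;
    for (; *key != '\0'; key++) {
        if (*key < 'a' || *key > 'z')
            return 0;
        node = node->child[*key - 'a'];
        if (node == NULL)
            return 0;
    }
    return node->is_word;
}

/* Delete the entire tree. */
void trie_free(TrieNode *node)
{
    int i;
    if (node == NULL)
        return;
    for (i = 0; i < ALPHABET; i++)
        trie_free(node->child[i]);
    free(node);
}
```

A fixed array of children keeps each step a single array access at the cost of memory; a sibling list per node would trade some speed for space.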
Suppose the dictionary contains the words b, abc, abd, bcd, abcd, efg, and hii. The trie built from them looks like this:
Figure 1 Trie tree structure
3. Combining the hash and the trie tree gives the dictionary structure we want:
Figure 2 hash_trie tree dictionary structure
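The hash lookup takes O(1) time. A trie lookup takes time proportional to the length of the word rather than to the size of the dictionary; because word length is bounded by a small constant, it can likewise be treated as O(1). Looking up a whole word therefore takes O(1) time with respect to dictionary size.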
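One plausible way to realize such a combined structure is sketched below. It is an assumption-laden illustration, not the system's actual layout: the first GB2312 character selects a slot through the index formula given earlier, and the remaining bytes of the word are stored in a first-child/next-sibling trie hanging off that slot. All names (Node, HashTrie, ht_new, ht_insert, ht_longest_prefix, ht_free) are hypothetical.

```c
#include <stdlib.h>

#define SLOTS ((0xF7 - 0xB0 + 1) * 94)   /* GB2312 Chinese-character area */

typedef struct Node {
    unsigned char byte;     /* one byte of the word (unused in slot nodes) */
    int is_word;            /* a dictionary word ends at this node */
    struct Node *child;     /* first child */
    struct Node *sibling;   /* next node on the same level */
} Node;

typedef struct {
    Node *slot[SLOTS];      /* slot[i]: node for first character i */
} HashTrie;

static int first_index(const unsigned char *s)
{
    if (s[0] < 0xB0 || s[0] > 0xF7 || s[1] < 0xA1 || s[1] > 0xFE)
        return -1;
    return (s[0] - 176) * 94 + (s[1] - 161);
}

/* Find the child of parent that holds byte b; create it if requested. */
static Node *step(Node *parent, unsigned char b, int create)
{
    Node *n;
    for (n = parent->child; n != NULL; n = n->sibling)
        if (n->byte == b)
            return n;
    if (!create || (n = calloc(1, sizeof *n)) == NULL)
        return NULL;
    n->byte = b;
    n->sibling = parent->child;
    parent->child = n;
    return n;
}

HashTrie *ht_new(void)
{
    return calloc(1, sizeof(HashTrie));
}

/* Insert a NUL-terminated GB2312 word. Returns 0 on success, -1 on error. */
int ht_insert(HashTrie *ht, const unsigned char *word)
{
    Node *n;
    int i = first_index(word);
    if (i < 0)
        return -1;
    if (ht->slot[i] == NULL) {
        ht->slot[i] = calloc(1, sizeof(Node));
        if (ht->slot[i] == NULL)
            return -1;
    }
    n = ht->slot[i];
    for (word += 2; *word != '\0'; word++) {
        n = step(n, *word, 1);
        if (n == NULL)
            return -1;
    }
    n->is_word = 1;
    return 0;
}

/* Length in bytes of the longest dictionary word starting at text, or 0. */
size_t ht_longest_prefix(const HashTrie *ht, const unsigned char *text)
{
    Node *n;
    size_t len = 2, best;
    int i = first_index(text);
    if (i < 0 || ht->slot[i] == NULL)
        return 0;
    n = ht->slot[i];
    best = n->is_word ? 2 : 0;
    for (text += 2; *text != '\0'; text++) {
        n = step(n, *text, 0);
        if (n == NULL)
            break;
        len++;
        if (n->is_word)
            best = len;
    }
    return best;
}

static void node_free(Node *n)
{
    while (n != NULL) {
        Node *next = n->sibling;
        node_free(n->child);
        free(n);
        n = next;
    }
}

void ht_free(HashTrie *ht)
{
    int i;
    if (ht == NULL)
        return;
    for (i = 0; i < SLOTS; i++)
        node_free(ht->slot[i]);
    free(ht);
}
```

In this sketch the cost of a lookup depends on the length of the word and on the number of siblings per level (at most 256), not on the total number of dictionary entries.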
3. Rule-based word segmentation algorithm
This word segmentation system uses a dictionary-based, improved maximum matching algorithm.
There are many word segmentation algorithms; the most common are forward and reverse maximum matching. However, neither handles sentences with multiple ambiguities well. Take the sentences "Changchun mayor's Spring Festival speech" (长春市长春节讲话) and "Changchun City Changchun Pharmacy" (长春市长春药店) as examples:
Scanning the first sentence from left to right yields the dictionary words Changchun (长春), Changchun City (长春市), mayor (市长), Changchun (长春), Spring Festival (春节), and speech (讲话). Forward maximum matching segments it as Changchun City / 长 / Spring Festival / speech, whereas reverse maximum matching segments it as Changchun / mayor / Spring Festival / speech.
Scanning the second sentence yields Changchun (长春), Changchun City (长春市), mayor (市长), Changchun (长春), chunyao (春药), and pharmacy (药店). Forward maximum matching segments it as Changchun City / Changchun / pharmacy, whereas reverse maximum matching segments it as Changchun / mayor / spring pharmacy.
As these examples show, forward and reverse maximum matching can produce different results.
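For reference, plain forward maximum matching can be sketched as follows. This is a toy illustration, not the system's code: each ASCII letter stands in for one Chinese character, and the dictionary DICT, the limit MAX_WORD_LEN, and the function names are all hypothetical.

```c
#include <string.h>

/* Hypothetical toy dictionary; one ASCII letter plays the role of one character. */
static const char *const DICT[] = { "ab", "abc", "cd", "de" };
#define DICT_SIZE (sizeof DICT / sizeof DICT[0])
#define MAX_WORD_LEN 3   /* length of the longest dictionary word */

static int in_dict(const char *s, size_t len)
{
    size_t i;
    for (i = 0; i < DICT_SIZE; i++)
        if (strlen(DICT[i]) == len && strncmp(DICT[i], s, len) == 0)
            return 1;
    return 0;
}

/* Segment text by forward maximum matching and write the pieces,
 * separated by '/', into out. The caller must provide at least
 * 2 * strlen(text) + 1 bytes in out. */
void fmm_segment(const char *text, char *out)
{
    size_t n = strlen(text);
    size_t pos = 0;
    char *p = out;

    while (pos < n) {
        size_t len = (n - pos < MAX_WORD_LEN) ? n - pos : MAX_WORD_LEN;
        /* Try the longest candidate first and shrink it. A single
         * character is always accepted so that the scan can advance. */
        while (len > 1 && !in_dict(text + pos, len))
            len--;
        if (p != out)
            *p++ = '/';
        memcpy(p, text + pos, len);
        p += len;
        pos += len;
    }
    *p = '\0';
}
```

With this toy dictionary, fmm_segment turns "abcde" into "abc/de". The greedy choice at each position is what makes the result depend on scan direction: reverse maximum matching applies the same idea from the end of the sentence.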
The algorithm of this system improves on forward maximum matching. In one sentence: choose the first word combination that leaves the fewest characters of the sentence unmatched; if several combinations tie on the number of unmatched characters, choose the one with the fewest matched words. Combinations are matched from left to right. Take the two sentences above as examples:
Scanning "Changchun mayor's Spring Festival speech" in forward maximum matching order can produce the following word combinations:
1) Changchun / mayor / Spring Festival / speech: 4 matched words, 0 unmatched characters
2) Changchun City / Spring Festival / speech: 3 matched words, 1 unmatched character
The first combination has the fewest unmatched characters, so it is chosen.
Scanning "Changchun City Changchun Pharmacy" in forward maximum matching order can produce the following word combinations:
1) Changchun City / Changchun / pharmacy: 3 matched words, 0 unmatched characters
2) Changchun City / spring pharmacy: 2 matched words, 1 unmatched character
3) Changchun City / chunyao: 2 matched words, 2 unmatched characters
4) Changchun City / pharmacy: 2 matched words, 2 unmatched characters
5) Changchun / mayor / spring pharmacy: 3 matched words, 0 unmatched characters
6) Changchun / mayor / chunyao: 3 matched words, 1 unmatched character
7) Changchun / mayor / pharmacy: 3 matched words, 1 unmatched character
8) Changchun / mayor / pharmacy: 3 matched words, 1 unmatched character
9) Changchun / Changchun / pharmacy: 3 matched words, 1 unmatched character
Combinations 1 and 5 have the fewest unmatched characters and the same number of matched words, but combination 1 comes first in matching order, so combination 1 is chosen.
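The selection rule applied in these two examples can be sketched as a comparison over candidate combinations. This is a minimal illustration under assumptions: the struct Combination, its field names, and the function best_combination are hypothetical, and the candidates are assumed to be given already in matching order.

```c
#include <stddef.h>

typedef struct {
    unsigned matched;     /* number of dictionary words in the combination */
    unsigned unmatched;   /* amount of the sentence left uncovered */
} Combination;

/* Return the index of the preferred combination: smallest unmatched
 * count first, then smallest matched word count. The strict comparisons
 * keep the earliest candidate when both counts tie. n must be at least 1. */
size_t best_combination(const Combination *c, size_t n)
{
    size_t best = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        if (c[i].unmatched < c[best].unmatched ||
            (c[i].unmatched == c[best].unmatched &&
             c[i].matched < c[best].matched))
            best = i;
    }
    return best;
}
```

Fed the nine (matched, unmatched) pairs listed above for the second sentence, the function returns index 0, which corresponds to combination 1.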
4. Implementation
The flowchart of the whole segmentation process is as follows:
Figure 3 System flowchart
As the figure shows, the core of the algorithm consists of two parts. The first organizes the dictionary into a hash structure whose entries are stored as trie trees; with this structure, all the words that may occur in the sentence being processed can be found quickly. This structure is what makes the algorithm fast.
The second uses a game tree to search for the best match. The core of this search is the system's custom rule: the combination that leaves the fewest characters of the sentence unmatched and, among those, contains the fewest matched words is the best match.
The following flowchart shows how the game tree is searched for the best match:
Figure 4 flowchart for optimal matching
The final code for finding the optimal match is attached:
// Obtain the optimal combination.
// preinfo holds the words chosen so far (NULL on the first call);
// pbegin is the word at which this branch starts.
// The result is left in g_segresult, with minblank and mincnt updated.
unsigned int getbestcomb(unsigned char first, sentwords *preinfo, strword *pbegin)
{
    sentwords *pcurs;
    strword *pcurw;
    strword *prew;
    unsigned int i;

    // On the first call, start from the head of the candidate word list
    if (first == 1 || pbegin == NULL)
    {
        pbegin = g_words.head;
        first = 0;
    }

    // Allocate the combination built by this call
    pcurs = (sentwords *)malloc(sizeof(sentwords));
    if (pcurs == NULL)
    {
        // SYS_ERR is an assumed error code. The original returned NULL here,
        // which does not match the unsigned int return type.
        return SYS_ERR;
    }
    initcomsent(pcurs);

    if (preinfo != NULL)
    {
        // Branch: copy the parent's words except the last one,
        // then put pbegin in place of the last word
        for (i = 0; i < preinfo->wordcnt - 1; i++)
        {
            pcurs->words[i] = preinfo->words[i];
        }
        pcurs->wordcnt = preinfo->wordcnt;
        pcurs->blanks = preinfo->blanks + (pbegin->pos - preinfo->words[preinfo->wordcnt - 1]->pos);
        pcurs->words[pcurs->wordcnt - 1] = pbegin;
    }
    else
    {
        // Root: everything before the first word is unmatched
        pcurs->blanks += pbegin->pos;
        pcurs->words[pcurs->wordcnt] = pbegin;
        pcurs->wordcnt++;
    }

    prew = pbegin;
    pcurw = prew->next;

    // Walk the remaining candidate words
    while (pcurw != NULL)
    {
        if (prew->pos + strlen(prew->word) <= pcurw->pos)
        {
            // pcurw does not overlap prew: accept it and count the gap
            pcurs->blanks = pcurs->blanks + (pcurw->pos - (prew->pos + strlen(prew->word)));
            pcurs->words[pcurs->wordcnt] = pcurw;
            pcurs->wordcnt++;
            prew = pcurw;
            pcurw = pcurw->next;
        }
        else
        {
            // pcurw overlaps prew: explore it as a separate branch
            getbestcomb(first, pcurs, pcurw);
            // Do not set prew = pcurw here; that line cost two hours of
            // debugging in a function that took three hours to write.
            pcurw = pcurw->next;
        }
    }

    // Keep this combination if it has fewer blanks, or equally many
    // blanks and fewer words, than the best one found so far
    if (minblank > pcurs->blanks ||
        (minblank == pcurs->blanks && mincnt > pcurs->wordcnt))
    {
        if (g_segresult != NULL)
        {
            free((void *)g_segresult);
        }
        mincnt = pcurs->wordcnt;
        minblank = pcurs->blanks;
        g_segresult = pcurs;
    }
    else
    {
        free((void *)pcurs);
    }
    return SYS_OK;
}