The Smartcn word segmenter is a simplified Java port of ICTCLAS.
Smartcn segments text in three steps: 1) atomic segmentation; 2) finding every candidate word that can be formed from adjacent atoms; 3) N-shortest path: rough segmentation of the Chinese words.
For example, take the sentence "他说的确实在理" ("what he said is indeed reasonable").
1) Atomic segmentation splits the sentence into single Chinese characters. After atomic segmentation the sentence becomes "始##始/他/说/的/确/实/在/理/末##末", where 始##始 and 末##末 mark the start and the end of the sentence.
2) Then, using the dictionary coredict, find every candidate word that can be formed from adjacent atoms. After the dictionary lookup the sentence becomes "始##始/他/说/的/的确/确/确实/实/实在/在/在理/理/末##末".
3) N-shortest path: rough segmentation of the Chinese words. Smartcn uses the 1-shortest path. First, we need the distance between every pair of adjacent candidate words, which requires looking up their weights in the BigramDict dictionary.
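Steps 1) and 2) can be illustrated with a minimal Python sketch. It is not Smartcn's code: the sample sentence 他说的确实在理 (roughly "what he said is indeed reasonable"), the four-word toy dictionary standing in for coredict, and the function names are all assumptions chosen for illustration, and each character is treated as one atom.

```python
# Toy stand-in for the core dictionary; the real coredict is far larger.
TOY_DICT = {"的确", "确实", "实在", "在理"}

START, END = "始##始", "末##末"  # sentence boundary markers


def atomic_segment(sentence):
    """Step 1: split into atoms (here simply one character per atom)."""
    return [START] + list(sentence) + [END]


def candidate_words(atoms, dictionary, max_len=4):
    """Step 2: list (start, end, word) for every single atom and for every
    dictionary word formed by consecutive atoms; end is exclusive."""
    inner = atoms[1:-1]
    found = [(0, 1, START)]
    for i in range(len(inner)):
        found.append((i + 1, i + 2, inner[i]))  # a single atom is always kept
        for j in range(i + 2, min(i + max_len, len(inner)) + 1):
            word = "".join(inner[i:j])
            if word in dictionary:
                found.append((i + 1, j + 1, word))
    found.append((len(atoms) - 1, len(atoms), END))
    return found


atoms = atomic_segment("他说的确实在理")
words = [w for _, _, w in candidate_words(atoms, TOY_DICT)]
print("/".join(atoms))  # 始##始/他/说/的/确/实/在/理/末##末
print("/".join(words))  # 始##始/他/说/的/的确/确/确实/实/实在/在/在理/理/末##末
```

The (start, end) positions kept with each candidate are what later allow two candidates to be linked when one ends where the next begins.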
The shortest path is then easy to obtain with dynamic programming:
For example, the cost of reaching node 5 from node 0 via nodes 1, 2, 3 and 5 is 3.3 + 2.2 + 4.1 + 4.1 = 13.7.
The cost of reaching node 4 from node 0 via nodes 1, 2 and 4 is 3.3 + 2.2 + 7.1 = 12.6.
The cost of reaching node 7 is min(5->7, 4->7) = min(13.7 + 11.6, 12.6 + 11.5) = min(25.3, 24.1) = 24.1, so the path goes through 4->7.
...
Once the shortest path is found, the words along it give the segmentation of the sentence.
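The dynamic-programming step can be sketched in Python as follows. The edge list is an assumption: it is reconstructed from the sums quoted above, on the reading that the listed numbers are the nodes along each path, and the function name is hypothetical. It is a generic shortest-path pass over a graph whose edges always go from a lower to a higher node number, not Smartcn's own implementation.

```python
def shortest_path(edges, source, target):
    """1-shortest path over a DAG whose node numbers are already in
    topological order (every edge goes from a lower to a higher number)."""
    best = {source: 0.0}  # cheapest known cost to reach each node
    prev = {}             # predecessor on that cheapest route
    # Sorting by source node means a node's cost is final before it is used.
    for u, v, cost in sorted(edges):
        if u in best and best[u] + cost < best.get(v, float("inf")):
            best[v] = best[u] + cost
            prev[v] = u
    # Walk the predecessors back from the target to recover the route.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]


# Assumed edges (from, to, cost), taken from the figures in the text.
edges = [
    (0, 1, 3.3), (1, 2, 2.2), (2, 3, 4.1), (3, 5, 4.1),
    (2, 4, 7.1), (5, 7, 11.6), (4, 7, 11.5),
]
total, path = shortest_path(edges, 0, 7)
print(round(total, 1), path)  # 24.1 [0, 1, 2, 4, 7]
```

Node 7 is reached more cheaply through node 4 (12.6 + 11.5) than through node 5 (13.7 + 11.6), so the recovered route ends with 4->7.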
To sum up the core of Smartcn: coredict stores the words and is used to expand the atoms into candidate words.
BigramDict stores the transition frequency between words. Finally, a shortest-path algorithm picks the best segmentation. BigramDict is derived from a training corpus. The shortest-path solution therefore takes some of the context of each word into account, at the cost of having to train BigramDict.
The Smartcn dictionary cannot simply be extended, because a new word would have no corresponding entries in BigramDict; to extend it, both dictionaries would have to be extended together.
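To make the idea of training transition weights concrete, here is a small Python sketch that counts adjacent word pairs in an already segmented corpus and turns the counts into path costs. Everything in it is an assumption for illustration: the function name, the tiny corpus, and the cost formula (the negative log of a linearly smoothed conditional probability, a common textbook choice) are not taken from Smartcn, whose actual BigramDict format and weighting may differ.

```python
import math
from collections import Counter


def train_bigram_costs(segmented_corpus, smoothing=0.1):
    """Count adjacent word pairs and turn them into path costs.
    cost(a, b) = -log of a smoothed estimate of P(b | a), so frequent
    transitions are cheap and unseen ones are expensive but finite."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in segmented_corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    total = sum(unigrams.values())

    def cost(a, b):
        p_b = unigrams[b] / total if total else 0.0
        p_b_given_a = bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0
        p = smoothing * p_b + (1 - smoothing) * p_b_given_a
        return -math.log(max(p, 1e-12))  # floor avoids log(0)

    return cost


# Hypothetical training data: each sentence is a list of words.
corpus = [["的确", "实在"], ["的确", "实在"], ["的", "确实"]]
cost = train_bigram_costs(corpus)
print(cost("的确", "实在") < cost("的确", "确实"))  # True
```

A word that never appears in the corpus gets only the floor probability on every transition, which is one way to see why adding a word to the word dictionary alone is not enough.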
This write-up is a little hasty and leaves out many good details. For more, see:
http://www.ictclas.org/content_c_005.html
http://www.cnblogs.com/zhenyulu/articles/668035.html