How Lucene smartcn works (illustrated)

Source: Internet
Author: User

The smartcn tokenizer is a simplified Java port of ICTCLAS.

smartcn segments text in three steps: 1) atomic segmentation; 2) finding every candidate word that can be formed from the atoms; 3) N-shortest-path rough segmentation of the Chinese words.

For example, take the sentence "他说的确实在理" (roughly, "what he said is indeed reasonable").

1) Atomic segmentation splits the sentence into single Chinese characters. After atomic segmentation the sentence becomes "始##始/他/说/的/确/实/在/理/末##末", where 始##始 and 末##末 mark the start and the end of the sentence.

2) Next, using the dictionary "coredict", find every word that can be formed from consecutive atoms. After the dictionary lookup the sentence becomes "始##始/他/说/的/的确/确/确实/实/实在/在/在理/理/末##末".
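Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not smartcn's actual code: `TOY_DICT` is a hypothetical four-word stand-in for coredict, the sample sentence is chosen for illustration, and the function names and the `max_len` parameter are invented here.

```python
# Hypothetical stand-in for coredict: a tiny set of multi-character words.
TOY_DICT = {"的确", "确实", "实在", "在理"}

# Sentence boundary markers in the ICTCLAS style.
START, END = "始##始", "末##末"


def atomic_split(sentence):
    """Step 1: one atom per character, wrapped in boundary markers."""
    return [START] + list(sentence) + [END]


def candidate_words(atoms, dictionary, max_len=4):
    """Step 2: keep each atom and add every dictionary word starting at it."""
    out = []
    last = len(atoms) - 1  # index of the END marker, never part of a word
    for i, atom in enumerate(atoms):
        out.append(atom)
        if atom in (START, END):
            continue
        # Try spans of 2..max_len atoms that stop before the END marker.
        for j in range(i + 2, min(i + max_len, last) + 1):
            word = "".join(atoms[i:j])
            if word in dictionary:
                out.append(word)
    return out


atoms = atomic_split("他说的确实在理")
print("/".join(atoms))
print("/".join(candidate_words(atoms, TOY_DICT)))
```

A real dictionary lookup would use a trie or hash table over the full lexicon; the nested loop here just keeps the idea visible.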

3) N-shortest-path rough segmentation of the Chinese words. smartcn uses the 1-shortest path (N = 1). First, compute the distance between every pair of adjacent candidate words; this requires looking up the weights in the bigram dictionary, BigramDict.
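One common way to turn bigram frequencies into a distance is a smoothed negative log probability, so that frequent word pairs are "close" and rare pairs are "far". The sketch below assumes that general form; the function name, the parameters, and the constants `smooth` and `tiny` are illustrative assumptions and are not taken from smartcn's source.

```python
import math


def bigram_cost(pair_freq, first_word_freq, total_freq, smooth=0.1, tiny=1e-6):
    """Distance between two adjacent words from their corpus frequencies.

    pair_freq:       how often the two words occur next to each other
    first_word_freq: how often the first word occurs at all
    total_freq:      total word count of the corpus (assumed > 0)
    """
    # Unigram estimate of the first word, with add-one smoothing.
    unigram = (1.0 + first_word_freq) / total_freq
    # Conditional estimate of the pair, kept strictly above zero.
    conditional = (1.0 - tiny) * pair_freq / (1.0 + first_word_freq) + tiny
    # Interpolate the two estimates and convert to a cost.
    return -math.log(smooth * unigram + (1.0 - smooth) * conditional)


print(bigram_cost(50, 100, 10000))  # pair seen often: small cost
print(bigram_cost(0, 100, 10000))   # pair never seen: large cost
```

Working with negative logs means that multiplying probabilities along a path becomes adding costs, which is what lets a shortest-path search pick the most likely segmentation.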

The shortest path is then easy to find with dynamic programming:

For example, the cost of reaching node 5 from node 0 along 1, 2, 3, 5 is 3.3 + 2.2 + 4.1 + 4.1 = 13.7

The cost of reaching node 4 from node 0 along 1, 2, 4 is 3.3 + 2.2 + 7.1 = 12.6

The cost of node 7 is min(5 -> 7, 4 -> 7) = min(13.7 + 11.6, 12.6 + 11.5) = min(25.3, 24.1) = 24.1, reached by the path 4 -> 7

...

Once the shortest path has been found, the words along it give the segmentation of the sentence.
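The dynamic-programming step can be sketched as follows. The edge costs are the ones from the worked example; the exact shape of the graph is an assumption (the figure is not available), namely that the numbers 1, 2, 3, 5 and 1, 2, 4 name the nodes visited and that nodes are numbered in topological order. The function name is invented for this sketch.

```python
import math


def shortest_path(edges, source, target):
    """1-shortest path on a DAG whose nodes are numbered topologically.

    edges maps (u, v) -> cost and every edge satisfies u < v.
    Returns (total_cost, path) or None if target is unreachable.
    """
    best = {source: (0.0, None)}  # node -> (cheapest cost so far, predecessor)
    # Sorting by (u, v) relaxes all edges into a node before any edge out of it.
    for (u, v), cost in sorted(edges.items()):
        if u not in best:
            continue
        candidate = best[u][0] + cost
        if v not in best or candidate < best[v][0]:
            best[v] = (candidate, u)
    if target not in best:
        return None
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(best[path[-1]][1])
    path.reverse()
    return best[target][0], path


# Assumed graph layout, using the costs from the worked example.
EDGES = {
    (0, 1): 3.3, (1, 2): 2.2, (2, 3): 4.1, (3, 5): 4.1,
    (2, 4): 7.1, (5, 7): 11.6, (4, 7): 11.5,
}

cost, path = shortest_path(EDGES, 0, 7)
print(round(cost, 1), path)
```

With these costs the route through node 4 (12.6 + 11.5) beats the route through node 5 (13.7 + 11.6), so node 7 is reached via 4.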

To sum up the core of smartcn: coredict stores the words and is used to expand the atoms into candidate words.

BigramDict stores the transition frequencies between words. Finally, a shortest-path algorithm picks the best segmentation. Where does BigramDict come from? It is built from a training corpus. The shortest-path approach brings a degree of semantic analysis into the segmentation, at the cost of having to train BigramDict.

The smartcn dictionary cannot simply be extended, because a newly added word has no corresponding entries in BigramDict. To add words, BigramDict has to be extended together with the dictionary.
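To make the role of the training corpus concrete, here is a minimal sketch of how transition counts could be collected from already segmented text. The two-sentence corpus and the function name are made up for illustration, and this is not how smartcn's shipped dictionary files were actually produced.

```python
from collections import Counter


def count_bigrams(segmented_sentences):
    """Count adjacent word pairs, including the sentence boundary markers."""
    counts = Counter()
    for words in segmented_sentences:
        padded = ["始##始"] + list(words) + ["末##末"]
        counts.update(zip(padded, padded[1:]))
    return counts


# Hypothetical, hand-segmented training corpus.
corpus = [
    ["他", "说", "的", "确实", "在理"],
    ["他", "说", "的", "确实", "不错"],
]
bigrams = count_bigrams(corpus)

print(bigrams[("的", "确实")])    # pair seen in both sentences
print(bigrams[("确实", "新词")])  # a word absent from the corpus has no counts
```

The second lookup shows the extension problem in miniature: a word that never appeared in the training data has a count of zero with every neighbour, so the shortest-path search has nothing to weigh it with.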

This was written in something of a hurry, and many of the finer points are not covered. For more details, see:

http://www.ictclas.org/content_c_005.html

http://www.cnblogs.com/zhenyulu/articles/668035.html
