Paoding Word Segmentation: Custom Dictionaries

Source: Internet
Author: User

Saved for later reference:

Reposted from: http://blog.csdn.net/askpp/archive/2009/09/08/4532355.aspx

Loading a custom dictionary in the Paoding word segmenter

Once you have downloaded the Paoding Chinese word segmenter, configured it in MyEclipse, and set the path of the dic dictionary folder in a Windows environment variable, the next question is how to load a custom dictionary. It turns out to be very simple. Go to the dic folder and find the file paoding-dic-names.properties. Opened in a text editor, its content looks like this:

#dictionary character encoding
#paoding.dic.charset=utf-8

#dictionaries which is skip
#paoding.dic.skip.prefix=x-

#chinese/CJK charactors that would not token
#paoding.dic.noise-charactor=x-noise-charactor

#chinese/CJK words that would not token
paoding.dic.noise-word=x-noise-word

#unit words, like "ge", "zhi", ...
#paoding.dic.unit=x-unit

#like "Wang", "Zhang", ...
#paoding.dic.confucian-family-name=x-confucian-family-name

#like "Upan", "Cdhe"
#paoding.dic.for-combinatorics=x-for-combinatorics

Add your own dictionary here, or remove the # in front of an existing dictionary entry. The next time the program runs, the change is detected automatically.
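To see why removing the # matters, here is a small Java sketch (Paoding itself is a Java library). It is not Paoding's own loader: it only runs the standard java.util.Properties class over an abridged copy of the text above, on the assumption that the file follows the ordinary .properties format. The class name DicNamesDemo and the SAMPLE constant are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.TreeSet;

class DicNamesDemo {
    // Abridged copy of the file shown above: one active entry, two commented out.
    static final String SAMPLE =
            "#paoding.dic.charset=utf-8\n"
          + "paoding.dic.noise-word=x-noise-word\n"
          + "#paoding.dic.unit=x-unit\n";

    // Parse .properties text; lines that start with '#' are comments and are ignored.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        // Print only the entries that are actually in effect, in sorted order.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

Run as is, this should print only the paoding.dic.noise-word entry, because the other two lines are still commented out.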

By the way, a word on what some of the dictionaries are for: those whose names start with x- are used to block sensitive words. Just put the words you do not want into the corresponding file.

Reposted from: http://hi.baidu.com/xwx520/blog/item/c288ee3eb0f5b9f0838b137f.html

Custom dictionaries for the Paoding word segmenter [custom dictionaries]

I realized I have not posted an update in a long time, and this section in particular has not moved forward for quite a while. Learning is like rowing upstream: if you do not advance, you fall behind. It is worth studying more whenever there is spare time.
First of all, here are the reference sources, since this is not original work.
(1) http://blog.csdn.net/askpp/archive/2009/09/08/4532355.aspx
(2) http://qipei.javaeye.com/blog/365207
Now, on to the steps:
1. Download paoding-analysis-2.0.4-alpha2.zip from http://code.google.com/p/paoding/downloads/list
2. Unzip it, find the dic folder, and copy it to the folder where you want to keep it

3. Configure the environment variable. If it is not configured, the program fails at run time, and the error message (which is in Chinese) also says that the environment variable needs to be configured

4. Delete the .compiled file

5. Create a new text file with the .dic suffix and save it, encoded as UTF-8, in the E:/paodingtest/dic/locale directory

6. Next, write a test program for the word segmenter

7. Segmentation with the custom dictionary in place: the first thing you see is the dictionary compilation output

8. Segmentation results with the custom dictionary

9. Delete the custom dictionary and the .compiled file, then segment again


10. Put side by side, the two results show that the custom dictionary does make some difference

11. If, during segmentation, we need "transport" and "move" to come out as separate words, note that by default they are not separated


12. Add the two words to the dictionary

13. Of course, to get the most out of this segmenter you also need to understand it and think carefully about how it behaves. Take "I am an athlete": although we added the word "athlete" to the custom dictionary, the sentence was still not cut into "I", "am", "athlete"; instead an unrelated word, "mobilization", showed up. Meanwhile "movement" ought to be split into "transport", "move", "movement", which also takes some thought to use properly. This is of course tied to the particular nature of the Chinese language as well. For example, "Table tennis Auction is over" is ambiguous without context: the same characters can be read as "the table tennis auction is over" or as "the table tennis paddles are sold out".
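The "mobilization" surprise in step 13 is easier to see with a toy example. The Java sketch below is not Paoding's algorithm and does not call its API. It simply reports every dictionary word that occurs anywhere in the sentence, which is one plausible way for overlapping words to show up side by side. It assumes the Chinese behind the translated examples is 我是运动员 ("I am an athlete"), 运动 ("movement"), 动员 ("mobilization") and 运动员 ("athlete"). The file name my-words.dic and all class and method names are hypothetical. The strings are written as Unicode escapes so the source stays ASCII-only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CustomDicDemo {
    static final String YUNDONG = "\u8fd0\u52a8";                    // "movement"
    static final String DONGYUAN = "\u52a8\u5458";                   // "mobilization"
    static final String YUNDONGYUAN = "\u8fd0\u52a8\u5458";          // "athlete"
    static final String SENTENCE = "\u6211\u662f\u8fd0\u52a8\u5458"; // "I am an athlete"

    // Write a hypothetical custom dictionary, one word per line, encoded as UTF-8.
    static Path writeCustomDic(Path dir, String... words) {
        try {
            return Files.write(dir.resolve("my-words.dic"),
                    Arrays.asList(words), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a UTF-8 dictionary file back, skipping blank lines.
    static Set<String> loadWords(Path dicFile) {
        try {
            Set<String> words = new LinkedHashSet<>();
            for (String line : Files.readAllLines(dicFile, StandardCharsets.UTF_8)) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            return words;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Report every dictionary word found in the text, by start position, then by length.
    static List<String> allMatches(String text, Set<String> dict) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                String candidate = text.substring(start, end);
                if (dict.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for the built-in dictionary: it lacks "athlete".
        Set<String> base = new LinkedHashSet<>(Arrays.asList(YUNDONG, DONGYUAN));
        Path dir = Files.createTempDirectory("paodingtest");
        Set<String> withCustom = new LinkedHashSet<>(base);
        withCustom.addAll(loadWords(writeCustomDic(dir, YUNDONGYUAN)));
        System.out.println("without custom word: " + allMatches(SENTENCE, base));
        System.out.println("with custom word:    " + allMatches(SENTENCE, withCustom));
    }
}
```

With the stand-in base dictionary the sketch finds "movement" and "mobilization". After the custom word is added it finds "movement", "athlete" and "mobilization". So adding a word adds a match, but it does not remove the overlapping ones, which is consistent with what step 13 observes.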
