KTDictSeg Word Segmentation Component v1.3 Algorithm Discussion: Word Segmentation Granularity

Author: Xiao Bo
KTDictSeg v1.3 is almost complete, with only one feature remaining. During its development, many friends followed and supported the project; in particular, several word segmentation experts offered excellent opinions and gave pertinent suggestions about my segmentation algorithm, and I would like to sincerely thank them. The pre-segmentation stage in v1.3 still follows the modified maximum-matching algorithm of the previous version. That algorithm has fundamental defects, and a future version 2.0 may replace it with a more advanced one. In pre-segmentation, v1.3 adds support for specialized English terms: for example, C++ and C# are no longer broken apart, as long as they appear in the dictionary. Version 1.3 also adds word-frequency judgments to pre-segmentation, and after pre-segmentation it improves Chinese name matching and the recognition of out-of-vocabulary words. In addition, v1.3 adds support for Lucene.Net and for dictionary management. Many friends suggested changing every ArrayList to List<>, so in v1.3 all ArrayLists in the original code have been replaced with List<>.
Starting today, I plan to gradually publish some of the major algorithms in the new version for your reference. Given my limited skill, many of these algorithms may fall short of your needs, and I welcome corrections.
This post mainly discusses the question of word segmentation granularity.
Chinese word segmentation is mainly used in search engines. When a search engine builds its index, segmentation that is too coarse means only the exact keyword will retrieve the corresponding results; segmentation that is too fine hurts search precision. A simple example: if the phrase "Central Hotel" is indexed as the single token "Central Hotel", a user must type the whole phrase to find the relevant articles; typing just "Central" or "hotel" returns nothing. In practice, we usually want a query for "Central" or "hotel" to find the Central Hotel. On the other hand, if the granularity is too fine, say Central / meal / shop (note that "meal" and "shop" become separate tokens), then a query for "shoe shop" (segmented as shoes / shop) will also match the Central Hotel, which is not what we expect. The best segmentation is therefore Central / hotel.

For a single known phrase we can choose the segmentation by hand, but when the phrase is not known in advance, how do we decide the granularity? More than a month ago, the netizen "The Sun Walking in the Rain" proposed forcibly limiting the granularity: set a fixed maximum length, say 2 or 3 characters, and forcibly split any word that exceeds it. This is an acceptable solution, but Chinese words generally run from one to four characters, and four-character idioms must not be split; yet if we set the limit to 4, the "Central Hotel" case above (itself four characters) would never be split. Of course, we could instead delete every splittable phrase from the dictionary, but the workload would be enormous, and some deletion decisions are genuinely hard: for certain four-character phrases, whether to split them into their two-character halves at all is a matter of personal judgment.

Version 1.3 therefore decides the segmentation granularity by word frequency. The algorithm is: for a long word that can be composed of several shorter words, take each candidate combination of short words and find the minimum word frequency among them. If that minimum exceeds the frequency of the long word itself, split according to that combination; when several combinations qualify, choose the one whose minimum frequency is largest.
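The forced-granularity idea described above can be sketched briefly. This is only an illustration of the drawback being discussed, not KTDictSeg code; the function name and the length limit are my own:

```python
def force_split(token, max_len=4):
    """Forcibly break a token into pieces of at most max_len characters."""
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]

# A four-character word survives intact with max_len=4 ...
print(force_split("abcd"))   # → ['abcd']
# ... but anything longer is cut blindly, regardless of meaning,
# which is the drawback noted above.
print(force_split("abcde"))  # → ['abcd', 'e']
```

The rule is purely length-based, so it cannot tell a four-character idiom (which must stay whole) from a four-character compound like "Central Hotel" (which should be split).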
Take the "Central Hotel" example above, and suppose the word frequencies are: Central = 1000, hotel = 900, meal = 200, shop = 600, and Central Hotel (as a single word) = 500.
The candidate combinations are:
Central Hotel (unsplit): frequency 500
Central / hotel: minimum frequency min(1000, 900) = 900
Central / meal / shop: minimum frequency min(1000, 200, 600) = 200
Selecting the combination with the largest minimum frequency, the phrase is split into Central / hotel.
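The frequency rule in this example can be sketched as follows. The English token names and the small frequency table are stand-ins for the component's real Chinese dictionary, and the function is an illustration of the rule, not the actual KTDictSeg implementation:

```python
# Example frequencies from the article; "CentralHotel" is the long word
# kept unsplit as a single dictionary entry.
FREQ = {
    "Central": 1000,
    "hotel": 900,
    "meal": 200,
    "shop": 600,
    "CentralHotel": 500,
}

def best_split(candidates):
    """Pick the segmentation whose minimum word frequency is largest."""
    return max(candidates, key=lambda words: min(FREQ[w] for w in words))

candidates = [
    ("CentralHotel",),            # unsplit: min = 500
    ("Central", "hotel"),         # min(1000, 900) = 900
    ("Central", "meal", "shop"),  # min(1000, 200, 600) = 200
]
print(best_split(candidates))  # → ('Central', 'hotel')
```

Because a rare component word drags its whole combination down to its own frequency, the minimum acts as a weakest-link score, and taking the maximum over combinations favors splits made entirely of common words.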

KTDictSeg word segmentation component v1.3: new feature list and download location
