Comparing Chinese Word Segmentation Techniques: Single-Character Segmentation vs. Chinese Word Segmentation

Source: Internet
Author: User

In full-text information retrieval systems, the question of which segmentation method to use when building the inverted index has been debated for a long time without a settled conclusion.
 
As far as I know, some papers say that "studies have pointed out" that building the index with bigram (binary) segmentation is "the best"; I have also seen a fellow blogger in this community argue that single-character segmentation is the most accurate (sorry, I have forgotten the exact source); and of course it is also very popular to wrap a Chinese word segmentation component, based on a dictionary or on co-occurrence frequency, and add it to your project.
 
With so many opinions and practices around, people inevitably want to settle which one is better or more accurate.
 
However, as a mature and rational young man, I think there is no need to settle it. The reason is that an information retrieval system is judged by several criteria: recall, precision, and query efficiency. These three indicators conflict with each other, and they can only be traded off, never fully reconciled. People who care about different indicators will naturally arrive at different ideas and adopt different practices.

If you are building a Web search engine, query efficiency must be guaranteed first, because the massive data volume and the concurrent requests are a natural obstacle. Between recall and precision, you will lean toward precision, because the relationship between the end user and a Web search engine is just like the relationship between a man and an infatuated woman: users want the most satisfying results as quickly as possible, abandon you the next instant, and come back only when they need you again. (Of course, if you provide a bid ranking service named "Good morni", then to avoid complaints from paying customers you had better care about recall. That is why the conflict of interest between the vast majority of ordinary users and a small group of VIPs is profound, long-term, and irreconcilable ...)

For a traditional library information retrieval system, the situation is very different. Books and articles come with good keyword indexes and well-defined structured data such as title, author, abstract, body, and recording time, and the document set is relatively stable and relatively small. All of this pushes your decisions toward increasing the recall of the system. The reason is simple: you have the opportunity, or the inherent advantage, to do so.
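To make the tension between recall and precision concrete, here is a minimal Python sketch of the two standard measures. The function name `precision_recall` and the document numbers are made up for illustration, and returning 0.0 for an empty set is just a convention chosen here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document numbers the system returned
    relevant:  document numbers that truly answer the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}

# A "generous" system returns everything relevant plus a lot of noise.
print(precision_recall({1, 2, 3, 4, 5, 6, 7, 8}, relevant))  # (0.5, 1.0)

# A "strict" system returns only sure hits and misses half of the answers.
print(precision_recall({1, 2}, relevant))  # (1.0, 0.5)
```

The same set of relevant documents gives opposite scores depending on how generous the matching is, which is the trade-off described above.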
 
Now that we have established that an information retrieval system is judged by several indicators, let's look at how different index segmentation strategies affect them.
 
First, let's compare two opposing strategies: single-character segmentation vs. Chinese word segmentation.
 
The strongest argument from supporters of single-character segmentation is:
"World Cup" (世界杯) is one word. With single-character segmentation, a search for "world" (世界) also hits this document, but with Chinese word segmentation it does not.
Supporters of Chinese word segmentation counter:
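Take the sentence "I have participated in the World Cup". With single-character segmentation, a query made of characters that merely happen to sit next to each other in it can still hit this document, even though nobody actually meant that match.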
 
From the statements above we can conclude that single-character segmentation raises the system's recall and lowers its precision. Chinese word segmentation does the opposite: it raises precision and lowers recall, and the coarser the segmentation granularity (the longer the average word length), the more pronounced this trend becomes.
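The two arguments can be tried out with a small Python sketch. Everything in it is an assumption made for illustration: the tiny `DICTIONARY`, the use of forward maximum matching as a stand-in for a real Chinese word segmenter, and the second test sentence 发展中国家 ("developing countries"), which is a common textbook example and not one taken from the discussion above. A query counts as a hit when its tokens appear as a contiguous run in the document's tokens.

```python
# Tiny hypothetical dictionary; a real segmenter would use a far larger one.
DICTIONARY = {"世界杯", "世界", "发展中", "国家", "中国"}


def single_char_tokens(text):
    """Single-character segmentation: every character is its own term."""
    return [ch for ch in text if not ch.isspace()]


def word_tokens(text, dictionary=DICTIONARY, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def hits(query, document, tokenize):
    """True if the query's tokens occur as a contiguous run in the document's tokens."""
    q, d = tokenize(query), tokenize(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Recall: "world" should arguably find a document about the "World Cup".
print(hits("世界", "世界杯", single_char_tokens))  # True
print(hits("世界", "世界杯", word_tokens))         # False

# Precision: "China" (中国) only appears by accident inside "developing countries".
print(hits("中国", "发展中国家", single_char_tokens))  # True (a false hit)
print(hits("中国", "发展中国家", word_tokens))         # False
```

Single-character segmentation finds both documents, including the accidental one; word segmentation rejects the accidental one but also misses the World Cup document.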
 
This conclusion seems to explain why Google, Baidu, and other Web search engines, which in theory need higher precision, all use Chinese word segmentation. But if our understanding stops here, it is too superficial: the fact is that a high-throughput Web search engine must use Chinese word segmentation to process Chinese content.
 
Think of an inverted index as a table. Each row holds a term text and the list of the numbers of all documents containing that term. When we query a keyword, we can then fetch every document containing it in one step instead of searching the original document set one by one. Building the index with different segmentation strategies really means scattering the set of document numbers across the rows of the index to different degrees. Single-character segmentation scatters the least: the number of rows is only as large as the number of distinct Chinese characters, so the whole inverted index table becomes very "wide". Conversely, coarse-grained Chinese word segmentation spreads the document numbers over many more rows, which reduces the width of the table. As the segmentation granularity grows, the width keeps shrinking. The most extreme case is to treat each whole document as one "word", at which point the width of the inverted index table is 1.
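The "rows and width" picture can be checked on a toy corpus. The sketch below is only an illustration: the four documents are invented, the word-level tokens are segmented by hand (standing in for the output of a real segmenter), and "width" is measured as the average length of a row's document list.

```python
from collections import defaultdict


def build_index(tokenized_docs):
    """Map each term to the set of numbers of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_docs):
        for token in tokens:
            index[token].add(doc_id)
    return index


def shape(index):
    """Return (number of rows, average document-list length per row)."""
    rows = len(index)
    total_entries = sum(len(doc_ids) for doc_ids in index.values())
    return rows, total_entries / rows


# Hand-segmented toy documents built from only five distinct characters.
docs_as_words = [
    ["上海", "山水"],
    ["海上", "下山"],
    ["山上", "海水"],
    ["水上", "山下"],
]
# The same documents under single-character segmentation.
docs_as_chars = [[ch for word in doc for ch in word] for doc in docs_as_words]

print(shape(build_index(docs_as_chars)))  # (5, 3.2)
print(shape(build_index(docs_as_words)))  # (8, 1.0)
```

With single characters there are few rows and each one lists most of the documents; with words there are more rows and each list is short.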
 
Based on the above discussion, we can see the following two points:
1. When the document set is very large, system throughput is limited by the performance of the disks that store the inverted index files. Using Chinese word segmentation to shrink the width of the inverted index table therefore helps increase throughput.
2. Whether you use a Boolean query or a position-based query (such as PhraseQuery in Lucene), single-character segmentation does no better than Chinese word segmentation.
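A rough way to see both points is to count how many index entries a query has to read. The following Python sketch uses a hand-rolled positional index, not Lucene's actual implementation; the toy documents, the hand-made word segmentation, and the `examined` counter (a crude proxy for disk reads) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_positional_index(tokenized_docs):
    """Map term -> {document number -> list of positions of the term}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(tokenized_docs):
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index


def phrase_search(index, query_tokens):
    """Return (matching document numbers, position entries that had to be read)."""
    if not query_tokens or any(t not in index for t in query_tokens):
        return set(), 0
    postings = [index[t] for t in query_tokens]
    examined = sum(len(positions) for p in postings for positions in p.values())
    # Boolean AND first: keep documents that contain every query term.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    # Then the position check: the terms must be adjacent and in order.
    matches = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            if all(start + k in postings[k][doc_id] for k in range(1, len(postings))):
                matches.add(doc_id)
                break
    return matches, examined


docs_as_words = [
    ["上海", "山水"],
    ["海上", "下山"],
    ["山上", "海水"],
    ["水上", "山下"],
]
docs_as_chars = [[ch for word in doc for ch in word] for doc in docs_as_words]

# Searching for "seawater" (海水) against both kinds of index.
print(phrase_search(build_positional_index(docs_as_chars), ["海", "水"]))  # ({2}, 6)
print(phrase_search(build_positional_index(docs_as_words), ["海水"]))      # ({2}, 1)
```

Both indexes return the same document here, but the single-character index reads six entries and needs a position check, while the word index reads one.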

 
Seen this way, it is not hard to understand why Web search engines use Chinese word segmentation. By the same reasoning, when the document set is small, a single-character segmentation strategy causes no major problems.
 
As for bigram (binary) segmentation, in my opinion it tries to reach the golden mean with the rough temperament of battlefield surgery: it is a compromise between single-character segmentation and Chinese word segmentation that remains unsatisfactory in some respects (ambiguity, meaningless bigrams, and so on). In terms of implementation, it is closer to single-character segmentation than to Chinese word segmentation. My own guess is that a standard Chinese word segmentation strategy, improved so that it eases the loss of recall to some extent, might become a more balanced solution in all respects.
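For reference, here is a minimal Python sketch of bigram segmentation, assuming it means emitting every pair of adjacent characters as a term. The function name `bigram_tokens` and the sample strings are chosen for illustration only.

```python
def bigram_tokens(text):
    """Bigram segmentation: every pair of adjacent characters becomes a term."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# "World Cup": the real word 世界 is kept, but the meaningless pair 界杯 is indexed too.
print(bigram_tokens("世界杯"))  # ['世界', '界杯']

# "Developing countries": the accidental pair 中国 ("China") still becomes a term.
print(bigram_tokens("发展中国家"))  # ['发展', '展中', '中国', '国家']
```

The output shows both weaknesses mentioned above: meaningless bigrams such as 界杯 and 展中, and the ambiguous 中国 that a dictionary-based segmenter could have avoided.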
