Automatic word segmentation and Chinese search engines

Source: Internet
Author: User
Keywords: search engine, precision


The author has long worked on automatic Chinese word segmentation, partly in the simple belief that this research would help Chinese search engines on the WWW, but has often been troubled that automatic segmentation in an open environment struggles to reach satisfactory accuracy. Recently a few things seem to have become clearer, and some of those thoughts are written down here.

A "fun" experience with Chinese search engines
Let me start with the author's "interesting" experience. One day, on a whim, I wanted to look for information about the Japanese "kimono" on the WWW. I opened the Yahoo search engine (http://cn.yahoo.com/) and naturally chose "kimono" as the query.
The search results were completely unexpected: 255 "related sites" were found, but very few had anything to do with kimonos. A typical hit was "China HR hotline GB-Provide recruitment and job search information and services." Checking all 255 sites one by one was more than I could bear, so I started a new search (that is, independent of the previous search results; the same applies below) with "kimono" and "Japan", hoping to narrow the scope. This time only one kimono-related site turned up: "Ningbo jiangdong star Silk Belt factory GB-engaged in Japanese kimono belt embroidery and manufacturing."




I could not believe that a search engine as large as Yahoo held only this one lonely result, so I tried "kimono" and "clothing". A total of 45 sites were returned, but the only relevant one was still "Ningbo Jiangdong star Silk Belt factory", a search precision of 1/45. I was truly puzzled: was I really going to enter the treasure mountain and come back empty-handed? Then a good word popped into my mind: "Japanese style". I quickly typed "kimono" and "Japanese style" and finally dug up a lot of "treasure": 1140 pages were returned (I do not know why; I had ticked "related sites" and operated exactly as before, yet the feedback stubbornly came back as "related pages"), many of them with kimono-related content, such as "Kimono Culture" and "the following is kimono, Japanese clothing products market and other fiber products market comparison chart ...". "Done at last", I thought, and relaxed for a moment. Afterward I realized it was not that simple: had I not thought of the word "Japanese style", how many other words would I have had to try? How many related pages did I still not know about? The uncertainty is too great to be comfortable. Retrieval seems to be an "art" rather than a "technology".




A preliminary test of Chinese search engine performance
This experience prompted me to do a preliminary survey of the performance of Chinese search engines. At the time I was lecturing at the University of Hong Kong, and I asked 50 students there each to type one word of interest into Yahoo Hong Kong (http://hk.yahoo.com/) and then examine the search precision. Search precision is defined here as the proportion of retrieved Web sites (pages) that are truly relevant to the query. If more than 50 sites (pages) were retrieved, only the first 50 were examined.
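The measure described above amounts to precision over at most the first 50 results. A minimal sketch of that calculation follows; the function name `precision_at_k` and the list-of-booleans input are illustrative assumptions, not something the survey itself specifies.

```python
def precision_at_k(relevance, k=50):
    """Fraction of the first k retrieved results that were judged relevant.

    `relevance` is a list of booleans in ranked order, one per retrieved
    site or page (a hypothetical representation of the manual judgments).
    """
    examined = relevance[:k]          # only the first k results are examined
    if not examined:                  # nothing retrieved: report 0 by convention
        return 0.0
    return sum(examined) / len(examined)


# The "kimono" + "clothing" query from the story: 45 sites, 1 relevant.
kimono_clothing = [True] + [False] * 44
print(f"{precision_at_k(kimono_clothing):.3f}")
```

With fewer than 50 results the denominator is simply the number retrieved, which matches the 1/45 figure quoted earlier in the article.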




The 50 search terms and the corresponding retrieval precision (%) are shown in Table 1.
The search results show that Yahoo Hong Kong did no word segmentation: the average retrieval precision is only 48.8%, so roughly half of what it returns is rubbish. Table 2 lists some of the retrieval instances. The retrieval errors are quite varied and touch on many aspects of automatic Chinese word segmentation, including overlapping ("cross") ambiguity (such as "Research on ecology Theory and Application"; underlining marks the retrieval word, here and below), combinational ambiguity (such as "promote people-oriented education"), Chinese personal names (such as "Shandong Ann Lily Law Firm"), foreign personal names (such as "Helen and John", "introduce Sakai"), Chinese place names (such as "Miyang County Double Temple Street Township"), foreign place names (such as "Egypt and Jordan"), organization names (such as "Palm Weather Therapy Center"), abbreviations (such as "medium and large ERP software") and so on.
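To make the "cross" ambiguity above concrete, here is a small sketch. The Chinese strings are my own reconstruction and do not appear in this translated text: I assume the query was 和服 ("kimono") and that the recruitment page contained 信息和服务 ("information and services"), where 和服 straddles 和 ("and") and 服务 ("services"). The segmenters are textbook forward and backward maximum matching over a four-word toy lexicon, not the Cseg&tag system used in the survey.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:   # single characters always pass
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation: take the longest lexicon word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


LEXICON = {"信息", "和", "服务", "和服"}   # toy lexicon, assumed for illustration
doc = "信息和服务"                          # "information and services"
query = "和服"                              # "kimono"

substring_hit = query in doc               # what a no-segmentation engine sees
fmm = forward_max_match(doc, LEXICON)
bmm = backward_max_match(doc, LEXICON)
print(substring_hit, fmm, bmm)
```

Plain substring matching reports a hit. Forward matching also produces a spurious 和服 token, while backward matching yields 信息 / 和 / 服务 and no hit, which shows both why segmentation helps and why a segmenter can still get such cases wrong.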




To estimate roughly how a word segmentation system would affect a Chinese search engine, the author ran Cseg&tag, the Chinese word segmentation system developed at Tsinghua University, on 122 typical example sentences that Yahoo Hong Kong returned for these 50 words: 78 "incorrectly retrieved" sentences and 44 "correctly retrieved" sentences (some of which are shown in Table 2). The segmentation results are shown in Table 3.




Overall, segmentation is correct for 76.2% of the 122 sentences. Assuming that this partly reflects the segmentation results for all sentences retrieved for the 50 words, retrieval precision could rise from 48.8% to 76.2%. Evidently, although current word segmentation systems are still a considerable distance from ideal, and their effect on a search engine is a case of "some benefit, some harm", the benefits outweigh the harm. In other words, word segmentation technology is usable in search engines.




Further analysis of the 29 sentences that Cseg&tag segmented incorrectly shows that they fall into two categories. In the first (11 sentences), a word missing from the lexicon was not handled correctly and was cut into pieces, but fortunately its boundaries did not become entangled with neighboring words (such as "Joint Machinery Co., Ltd."). In the second (18 sentences), either a word boundary was placed wrongly (such as "palm Day Qigong Therapy Center") or characters that should not form a "word" were joined together (such as "with the Institute and the tenth session of the Conference of the Asian Medical Association"). For a search engine, the first category has exactly the same effect as doing no word segmentation at all, so if these 11 sentences are counted as well, the retrieval precision for the 50 words could be expected to rise from 76.2% to 85.2%. The second category is fatal for a search engine and is the situation we least want, and most fear, to encounter. On closer analysis, some of these cases can be solved with simple rules (for example, "and" followed by a numeral should generally be split off), but most are not easy to handle; in the WWW environment we cannot even predict how many similar cases we will meet, let alone solve them effectively. Experience tells us that however hard we try, a word segmenter will never be perfect in an open environment. This means that when we build a Chinese search engine we must first accept a basic assumption: even a robust Chinese word segmentation system will inevitably make unexpected errors on real text. Reaching 90% segmentation accuracy would already be something to be thankful for; errors are inevitable and normal. Any study of Chinese search engine mechanisms or algorithms that aims to improve recall or precision must rest on this basic assumption; otherwise the effort is futile.
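The percentages in this analysis follow directly from the sentence counts given in the text (122 sentences, of which 11 + 18 = 29 were segmented incorrectly). A short check, with variable names chosen here only for illustration:

```python
total_sentences = 122
benign_errors = 11     # first category: word cut apart, boundaries not entangled
harmful_errors = 18    # second category: wrong boundary or wrong merge

correct = total_sentences - (benign_errors + harmful_errors)   # 93 sentences

accuracy = correct / total_sentences
accuracy_with_benign = (correct + benign_errors) / total_sentences

print(f"{accuracy:.1%}")              # 76.2%
print(f"{accuracy_with_benign:.1%}")  # 85.2%
```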




Future research and development direction
In view of the above discussion, the author believes that a Chinese word segmentation system for search engines must be based on a hybrid model that mixes characters and words, and that the corresponding text retrieval mechanism must likewise mix characters and words. Research on this model and mechanism is bound to become a frontier and hot topic in the development of Chinese automatic word segmentation systems and Chinese search engines over the next few years.
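The article does not spell out what such a mixed model looks like. One possible reading, offered only as a sketch, is to index every document twice, once by segmented words and once by character bigrams, so that a segmentation error does not make a document unreachable. The class `HybridIndex`, the bigram choice, the toy segmenter output and the Chinese strings are all assumptions introduced here for illustration.

```python
from collections import defaultdict


def char_bigrams(text):
    """Overlapping two-character units; a one-character text is its own unit."""
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


class HybridIndex:
    """Toy inverted index that keeps word postings and character postings."""

    def __init__(self, segment):
        self.segment = segment                  # callable: text -> list of words
        self.word_postings = defaultdict(set)
        self.char_postings = defaultdict(set)

    def add(self, doc_id, text):
        for word in self.segment(text):
            self.word_postings[word].add(doc_id)
        for gram in char_bigrams(text):
            self.char_postings[gram].add(doc_id)

    def search(self, query):
        """Return (word-level hits, extra hits found only at character level)."""
        word_hits = set(self.word_postings.get(query, set()))
        # Intersecting bigram postings approximates substring matching;
        # it does not verify that the bigrams are adjacent in the document.
        gram_sets = [self.char_postings.get(g, set()) for g in char_bigrams(query)]
        char_hits = set.intersection(*gram_sets)
        return word_hits, char_hits - word_hits


# Hypothetical segmenter output, standing in for a real segmentation system.
TOY_SEGMENTATION = {
    "日本和服文化": ["日本", "和服", "文化"],   # "Japanese kimono culture"
    "信息和服务": ["信息", "和", "服务"],       # "information and services"
}

index = HybridIndex(TOY_SEGMENTATION.__getitem__)
for doc_id, text in enumerate(TOY_SEGMENTATION, start=1):
    index.add(doc_id, text)

confident, fallback = index.search("和服")
print(confident, fallback)
```

In this arrangement the word-level hits can be ranked first and the character-only hits kept as a lower-confidence fallback, so precision benefits from segmentation while recall does not depend on it.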




Another insight I gained is that a Chinese search engine responds very differently to different words. For example, even without word segmentation the search precision for "cheongsam" can still reach 100%, while the search precision for "Aborigines" is 0. It is therefore necessary to carry out an exhaustive survey of all common Chinese words: what is the response characteristic of each word with respect to a Chinese search engine? Is there some simple remedy (for example, "Aborigines" turns up almost exclusively inside the expression "local customs")? Or is there, at the current level of research, simply no solution for now? And so on. Such a survey would be valuable foundational work for designing a new generation of Chinese search engines based on word segmentation technology.
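One measurement such a survey might include is how often a query string occurs only as part of a longer fixed expression. The sketch below counts that ratio. The Chinese strings are my assumption about the untranslated example (土人 for "Aborigines" inside 风土人情 for "local customs"), and the sample texts are invented purely to exercise the function.

```python
def embedded_ratio(query, longer, texts):
    """Share of occurrences of `query` that lie inside an occurrence of `longer`.

    Assumes `query` occurs in `longer`; only its first position there is used.
    """
    offset = longer.find(query)
    if offset == -1:
        raise ValueError("query must occur inside the longer expression")

    total = inside = 0
    for text in texts:
        start = text.find(query)
        while start != -1:
            total += 1
            begin = start - offset            # where `longer` would have to start
            if begin >= 0 and text[begin:begin + len(longer)] == longer:
                inside += 1
            start = text.find(query, start + 1)
    return inside / total if total else 0.0


# Invented sample texts, not survey data.
samples = ["介绍当地风土人情", "土人部落", "风土人情与土人"]
ratio = embedded_ratio("土人", "风土人情", samples)
print(ratio)
```

A ratio near 1 for a given word would suggest that a simple blocking rule on the longer expression removes most false hits for that word.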
