Difficulties in Word Segmentation
Now that mature word segmentation algorithms exist, is the problem of Chinese word segmentation easily solved? Far from it. Chinese is a very complex language, and getting a computer to understand it is harder still. Two major problems in Chinese word segmentation have yet to be fully solved.
1. Ambiguity Identification
Ambiguity means that the same sentence can be segmented in two or more ways. For example, in the phrase 表面的, both 表面 ("surface") and 面的 (a colloquial word for a minivan taxi) are words, so the phrase can be split as 表面 / 的 or as 表 / 面的. This is called cross-ambiguity, and it is very common. The earlier "kimono" (和服) example is in fact an error caused by cross-ambiguity: 化妆和服装 ("makeup and clothing") can be split as 化妆 / 和 / 服装 ("makeup / and / clothing") or as 化妆 / 和服 / 装 ("makeup / kimono / attire"). Without human knowledge to understand the text, a computer has difficulty knowing which split is correct.
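The cross-ambiguity in the "makeup and clothing" phrase (化妆和服装 in Chinese) can be reproduced with a minimal sketch that lists every way to cover a string with dictionary words. The function name `all_segmentations` and the five-word toy dictionary are assumptions made for illustration; a real segmenter would use a full lexicon.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this prefix and segment the remainder recursively.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


# Hypothetical toy dictionary: makeup, and, kimono, clothing, attire.
TOY_DICTIONARY = {"化妆", "和", "和服", "服装", "装"}

segmentations = all_segmentations("化妆和服装", TOY_DICTIONARY)
for words in segmentations:
    print(" / ".join(words))
```

Both candidate splits come back as equally valid dictionary covers, which is why a dictionary alone cannot choose between them.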
Cross-ambiguity is relatively easy to handle compared with combination ambiguity, which can only be resolved by looking at the whole sentence. For example, in the sentence 这个门把手坏了 ("this door handle is broken"), 把手 ("handle") is a word, but in 请把手拿开 ("please move your hand away"), 把手 is not a word: 把 is a particle and 手 means "hand". Likewise, in 将军任命了一名中将 ("the general appointed a lieutenant general"), 中将 ("lieutenant general") is a word, but in 产量三年中将增长两倍 ("output will grow twofold within three years"), 中将 is not a word, because 中 ("within") and 将 ("will") belong to different words. How is a computer to tell these cases apart?
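The "handle" example can be made concrete with forward maximum matching, a common dictionary-based baseline that is used here as an assumption and is not necessarily the algorithm the article has in mind. The sketch below always takes the longest dictionary word at the current position; the function name, the `max_len` parameter, and the small dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary covering both example sentences.
TOY_DICTIONARY = {"这个", "门", "把手", "把", "手", "坏了", "请", "拿开"}

door = forward_max_match("这个门把手坏了", TOY_DICTIONARY)
hand = forward_max_match("请把手拿开", TOY_DICTIONARY)
print(" / ".join(door))  # "this door handle is broken"
print(" / ".join(hand))  # "please move your hand away"
```

The matcher keeps 把手 ("handle") together in both sentences. That is right for the door handle but wrong for the second sentence, where the intended reading is 请 / 把 / 手 / 拿开; only sentence-level context can separate the two cases.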
Even if a computer could resolve both cross-ambiguity and combination ambiguity, one more problem remains: true ambiguity. True ambiguity means that, given a sentence, even a human reader cannot tell which strings should be words and which should not. For example, 乒乓球拍卖完了 can be split as 乒乓 / 球拍 / 卖 / 完 / 了 ("the table tennis paddles are sold out") or as 乒乓球 / 拍卖 / 完 / 了 ("the table tennis ball auction is over"). Without surrounding context, no one can say whether 拍卖 ("auction") counts as a word here.
2. New Word Recognition
New words are known technically as unregistered (out-of-vocabulary) words: words that are not yet in the dictionary but clearly count as words. The most typical case is a person's name. In the sentence 王军虎去广州了 ("Wang Junhu has gone to Guangzhou"), a human easily sees that 王军虎 ("Wang Junhu") is a word because it is a name, but a computer has a hard time recognizing it. Adding 王军虎 to the dictionary as a word does not scale: there are countless names in the world and new ones appear all the time, so recording them all would be a huge project. Even if that work could be completed, problems would remain. For example, in the sentence 王军虎头虎脑的 ("Wang Jun is sturdy and robust-looking"), can 王军虎 still be regarded as a word?
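A rough sketch of why the name in "Wang Junhu has gone to Guangzhou" (王军虎去广州了 in Chinese) is hard for a dictionary-based segmenter: with the name absent from the dictionary, greedy longest matching breaks it into single characters. Collecting runs of unmatched characters is one simple way to surface candidate new words. The function names and the three-word dictionary are assumptions for illustration, not a method taken from the article.

```python
def segment_with_unknowns(text, dictionary, max_len=4):
    """Greedy longest match; each token is tagged (word, found_in_dictionary)."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                match = text[i:i + size]
                break
        if match is None:
            tokens.append((text[i], False))  # unknown single character
            i += 1
        else:
            tokens.append((match, True))
            i += len(match)
    return tokens


def unknown_runs(tokens):
    """Join consecutive unknown characters into candidate new words."""
    runs, current = [], ""
    for word, known in tokens:
        if known:
            if current:
                runs.append(current)
            current = ""
        else:
            current += word
    if current:
        runs.append(current)
    return runs


# Hypothetical toy dictionary: go, Guangzhou, and a sentence-final particle.
TOY_DICTIONARY = {"去", "广州", "了"}

tokens = segment_with_unknowns("王军虎去广州了", TOY_DICTIONARY)
candidates = unknown_runs(tokens)
print([word for word, _ in tokens])
print(candidates)
```

The heuristic only flags a run of unknown characters as a candidate. It cannot decide whether that run is really a name, which is the harder part of new word recognition.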
Besides personal names, organization names, place names, product names, trademarks, abbreviations, and elliptical expressions are all difficult to handle, and these are exactly the words people use most often. For search engines, therefore, new word recognition is a very important part of a word segmentation system. The accuracy of new word recognition has become one of the key indicators for evaluating the quality of a word segmentation system.
Application of Chinese Word Segmentation
In natural language processing today, Chinese processing technology lags well behind the processing of Western languages. Many methods developed for Western languages cannot be applied to Chinese directly, precisely because Chinese text must first go through word segmentation. Chinese word segmentation is the foundation for other Chinese information processing, and search engines are only one of its applications. Others, such as machine translation (MT), speech synthesis, automatic classification, automatic summarization, and automatic proofreading, all require word segmentation. The need for segmentation may hold back some research, but it also creates opportunities for some enterprises, because foreign computer processing technology that wants to enter the Chinese market must first solve the Chinese word segmentation problem. In research on Chinese, Chinese researchers have a clear advantage over foreign ones.
Word segmentation accuracy is very important for search engines. But if segmentation is too slow, even high accuracy is of no use to a search engine, which has to process hundreds of millions of web pages; if segmentation takes too long, the speed at which the search engine updates its content suffers badly. For search engines, therefore, both accuracy and speed must meet high standards. At present, most of the organizations studying Chinese word segmentation are research institutions and universities. Tsinghua University, Peking University, the Chinese Academy of Sciences, Beijing Language Institute, Northeastern University, IBM Research, Microsoft Research China, and others all have their own research teams, yet among commercial companies there is almost none besides Hailiang Technology (海量科技) that truly specializes in Chinese word segmentation. Most of the technology developed by research institutions cannot be turned into products quickly, and the capacity of a single specialized company is limited. Chinese word segmentation technology still seems to have a long way to go before it can serve more products well.
More references
Note: the following references were not published as papers in journals, so their sources cannot be given. To find a download link for an article, search for its title on Google or Baidu.
[1] Unveiling Chinese Search Engine Technology: Web Spiders.
[2] Unveiling Chinese Search Engine Technology: Ranking Technology.
[3] Unveiling Chinese Search Engine Technology: System Architecture.
[4] Robots & Spiders & Crawlers: How Web and Intranet Search Engines Follow Links to Build Indexes. Author: Avi Rappoport, 2001.
[5] Guidelines for Robot Writers. Author: Martijn Koster, 1993.
Transferred from: e800.com.cn