Chinese Word Segmentation Technology for Search Engines

Source: Internet
Author: User

Automatic Chinese word segmentation is the foundation of web page analysis. Chinese and English pages must be processed differently because of one significant difference: English words are separated by spaces, while Chinese text has no delimiter between words. Before a Chinese web page can be analyzed, its sentences must therefore be cut into sequences of words. This is Chinese word segmentation. Automatic Chinese word segmentation draws on many natural language processing techniques and evaluation standards, but search engines care mainly about two things: speed and accuracy. Accuracy is very important for a search engine, yet a segmenter that is too slow is unusable no matter how accurate it is. A search engine has to process hundreds of millions of web pages, and if segmentation takes too long, the engine's update cycle suffers badly. Search engines therefore place high demands on both the accuracy and the speed of word segmentation.

At present, the most widely used approach is mechanical word segmentation based on a segmentation dictionary. It matches the string of Chinese characters under analysis against dictionary entries according to a fixed strategy. Depending on the matching strategy, the mechanical approach includes the forward maximum matching algorithm, the reverse maximum matching algorithm, and the minimum-word-count segmentation algorithm. Its advantages are high speed and dependable accuracy, but it handles unregistered (out-of-vocabulary) words poorly. Reported experimental results put the error rate of forward maximum matching at about 1/169 and that of reverse maximum matching at about 1/245.

The other common approach is statistical word segmentation. It identifies words from how frequently groups of characters occur together in a corpus and needs no segmentation dictionary, so it is also called dictionary-free segmentation. However, it often treats frequent character groups that are not actually words as words, recognizes common words less accurately, and has relatively high time and space costs.

In practice, search engines generally combine the two approaches: they first segment by string matching, then use statistical methods to identify new words that are missing from the dictionary. This keeps the speed and efficiency of matching-based segmentation while exploiting the ability of statistical segmentation to recognize new words and resolve segmentation ambiguity automatically.
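As a rough illustration of the two matching strategies, the Python sketch below implements forward and reverse maximum matching over a toy dictionary. The function names, the `max_len` cap of four characters, the single-character fallback, and the dictionary entries are all assumptions made for this example, not details of any particular search engine. The test string 化妆和服装 ("makeup and clothing") is chosen because the two scan directions disagree on it.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Toy dictionary: hypothetical entries chosen only for this demo.
toy_dict = {"化妆", "和服", "服装", "和"}
print(forward_max_match("化妆和服装", toy_dict))
print(reverse_max_match("化妆和服装", toy_dict))
```

On this input the forward scan produces 化妆 / 和服 / 装, while the reverse scan produces 化妆 / 和 / 服装. A single example proves nothing about overall error rates, but it shows how the scan direction alone can change the result.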

The segmentation dictionary is an important factor in automatic Chinese word segmentation. It generally holds about 60,000 words and should be neither too large nor too small. If the dictionary is too small, some words cannot be segmented; if it is too large, ambiguity increases sharply during segmentation, which also hurts accuracy. Dictionary entries are therefore selected strictly. On the web, where new words emerge constantly, a dictionary of about 60,000 words is not enough, but adding new words to it arbitrarily lowers segmentation accuracy. The usual solution is an auxiliary dictionary of about 500,000 words. Beyond the dictionary, the main difficulties of automatic Chinese word segmentation are resolving segmentation ambiguity and recognizing unregistered words. How to deal with these two problems has long been a hot topic in the field.

1. Ambiguity handling

Ambiguity means that the same text can be segmented in two or more ways. For example, the phrase 表面的 ("of the surface") can be split as 表面 + 的 or as 表 + 面的, because 表面 ("surface") and 面的 ("minibus taxi") are both words. This is called cross ambiguity, and it is very common. Likewise, 化妆和服装 ("makeup and clothing") can be split as 化妆 + 和 + 服装 ("makeup + and + clothing") or as 化妆 + 和服 + 装 ("makeup + kimono + attire"). Without human knowledge to understand the text, it is difficult for a computer to know which split is correct.
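One way to make cross ambiguity concrete is to look for dictionary words whose positions in the text partially overlap. The sketch below does this by brute force; the function name `find_overlaps`, the `max_len` limit, and the toy dictionary are assumptions for illustration only, and a production segmenter would use a more efficient dictionary structure.

```python
def find_overlaps(text, dictionary, max_len=4):
    """Return pairs of dictionary words whose spans in text partially overlap."""
    # Collect (start, end) spans of every multi-character dictionary word.
    spans = [
        (i, j)
        for i in range(len(text))
        for j in range(i + 2, min(len(text), i + max_len) + 1)
        if text[i:j] in dictionary
    ]
    overlaps = []
    for a_start, a_end in spans:
        for b_start, b_end in spans:
            # Crossing: b starts strictly inside a and ends strictly after a.
            if a_start < b_start < a_end < b_end:
                overlaps.append((text[a_start:a_end], text[b_start:b_end]))
    return overlaps


# Toy dictionary: hypothetical entries chosen only for this demo.
toy_dict = {"表面", "面的", "化妆", "和服", "服装", "和"}
print(find_overlaps("表面的", toy_dict))
print(find_overlaps("化妆和服装", toy_dict))
```

Each reported pair marks a place where the segmenter has to choose between two overlapping words, for example 和服 versus 服装 in the second string.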

Cross ambiguity is still relatively easy to handle compared with combination ambiguity, which can only be resolved by looking at the whole sentence.

For example, in the sentence 这个门把手坏了 ("this door handle is broken"), 把手 ("handle") is a word, but in 请把手拿开 ("please take your hand away"), 把手 is not a word: 把 and 手 must be split. Similarly, in 将军任命了一名中将 ("the general appointed a lieutenant general"), 中将 ("lieutenant general") is a word, but in 产量三年中将增长两倍 ("production will increase twofold within three years"), 中将 is no longer a word. How can a computer tell these cases apart?

Even if a computer could resolve both cross ambiguity and combination ambiguity, one more problem would remain: true ambiguity. True ambiguity means that, given only the sentence, even a human cannot decide which strings are words and which are not. For example, 乒乓球拍卖完了 can be split as 乒乓 + 球拍 + 卖 + 完 + 了 ("the table tennis rackets are sold out") or as 乒乓球 + 拍卖 + 完 + 了 ("the table tennis ball auction is over"). Without surrounding context, probably no one can tell whether 拍卖 ("auction") is a word here.

In general, ambiguity resolution is cast as an optimization problem and solved with a dynamic programming algorithm. Auxiliary information such as word frequencies or probabilities is used along the way to find the most likely segmentation. This result is optimal in a certain sense.
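A minimal sketch of this idea, assuming a simple unigram model: each candidate word is scored by the logarithm of its relative frequency, and dynamic programming picks the split with the highest total score. The frequency counts, the penalty for unknown single characters, and the name `best_segmentation` are invented for illustration; real systems estimate such values from large corpora and often use richer models.

```python
import math


def best_segmentation(text, word_freq, max_len=4):
    """Pick the split with the highest sum of log word probabilities."""
    total = sum(word_freq.values())
    # Assumed smoothing: an unknown single character counts as half an occurrence.
    unknown_logp = math.log(0.5 / total)
    # best[i] = (best score for text[:i], start index of the last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_freq:
                logp = math.log(word_freq[piece] / total)
            elif end - start == 1:
                logp = unknown_logp
            else:
                continue  # unknown multi-character strings are not candidates
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words


# Toy frequency table: hypothetical counts chosen only for this demo.
toy_freq = {"化妆": 50, "和": 500, "服装": 80, "和服": 5, "装": 20}
print(best_segmentation("化妆和服装", toy_freq))
```

With these made-up counts the split 化妆 / 和 / 服装 scores higher than 化妆 / 和服 / 装, because 和 and 服装 are assumed to be far more frequent than 和服 and 装. Different counts could of course reverse the outcome.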

2. Unregistered word handling

Unregistered words are words that are not in the segmentation dictionary, also known as new words. The most typical examples are personal names, place names, and technical terms. For example, in the sentence 王军虎去广州了 ("Wang Junhu has gone to Guangzhou"), 王军虎 is a word because it is a person's name, but a computer has a hard time recognizing it. Adding 王军虎 to the dictionary as a word does not scale: there are vast numbers of names in the world and new ones appear all the time, so recording them all would be a huge undertaking. Even if that work could be completed, problems would remain. For example, in the sentence 王军虎头虎脑的 ("Wang Jun looks sturdy and simple-hearted", where 虎头虎脑 is an idiom), can 王军虎 still be regarded as a word?

Besides personal names, there are organization names, place names, product names, trademarks, abbreviations, and elided forms. All of them are hard to handle, and they are exactly the words people use often, so recognizing new words during segmentation is very important for search engines. Currently, unregistered words are generally handled with statistical methods: frequently occurring character groups are first counted over a corpus and then, according to certain rules, added to the auxiliary dictionary as new words.
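The counting step can be sketched as follows. This is only an illustration under simple assumptions: candidates are raw character n-grams of length two to four, the "certain rules" are reduced to a minimum count, and the names `candidate_new_words`, `min_count`, and the three-sentence corpus are hypothetical.

```python
from collections import Counter


def candidate_new_words(corpus, dictionary, min_count=3, max_len=4):
    """List frequent character n-grams that the dictionary does not contain."""
    counts = Counter()
    for sentence in corpus:
        for size in range(2, max_len + 1):
            for i in range(len(sentence) - size + 1):
                counts[sentence[i:i + size]] += 1
    # Keep frequent, unknown n-grams; most frequent first, then by text.
    return sorted(
        (g for g, n in counts.items() if n >= min_count and g not in dictionary),
        key=lambda g: (-counts[g], g),
    )


# Tiny corpus and dictionary: hypothetical data chosen only for this demo.
corpus = ["王军虎去广州了", "王军虎回来了", "我见到王军虎"]
known_words = {"广州", "回来", "见到"}
print(candidate_new_words(corpus, known_words))
```

On this corpus the name 王军虎 is proposed, but so are its fragments 王军 and 军虎, which occur equally often. This matches the weakness of purely statistical methods noted earlier: frequent character groups are not always words, so practical systems apply further filtering rules before extending the auxiliary dictionary.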

Automatic Chinese word segmentation is now widely used in search engines, and segmentation accuracy has reached more than 96%. However, when large volumes of web pages are analyzed and processed, the existing technology still has many shortcomings, such as the ambiguity problem and the unregistered-word problem described above. Research institutions in China and abroad, such as Peking University, Tsinghua University, the Chinese Academy of Sciences, Beijing Language Institute, Northeastern University, IBM Research Institute, and Microsoft China Research Institute, have therefore continued to follow and study automatic Chinese word segmentation. The main reason is that the amount of Chinese information on the Internet keeps growing, and processing it is expected to become a huge industry and a broad market with great business opportunities. Still, to serve the processing of Chinese web information better and to be turned into products, automatic Chinese word segmentation needs much more work in both basic research and system integration.

 
