Unveiling Chinese search engine technology: Chinese word segmentation (3)
Source: Internet
Author: User
Chinese word segmentation technology

Chinese word segmentation belongs to the field of natural language processing. Given a sentence, people can use their own knowledge to tell what is a word and what is not, but how can a computer do the same? The procedure that handles this is the word segmentation algorithm. Existing word segmentation algorithms fall into three categories: string matching-based, comprehension-based, and statistics-based.

1. String matching-based word segmentation

This method, also called mechanical word segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a given strategy. If a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string matching methods divide into forward matching and reverse matching. By which length is tried first, they divide into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with part-of-speech tagging, they divide into pure segmentation methods and integrated methods that combine segmentation with tagging. Several common mechanical word segmentation methods are:

1) forward maximum matching (left to right);
2) reverse maximum matching (right to left);
3) minimum segmentation (cutting each sentence into the fewest possible words).

These methods can also be combined. For example, forward maximum matching and reverse maximum matching together form a bidirectional matching method. Because a single Chinese character can itself be a word, forward minimum matching and reverse minimum matching are rarely used.
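To make forward and reverse maximum matching concrete, here is a minimal sketch in Python. The toy dictionary, the example sentence, the function names, and the `max_len` window are all assumptions chosen for illustration; a real system would use a far larger dictionary and extra disambiguation steps.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


# Hypothetical toy dictionary and sentence, for illustration only.
TOY_DICTIONARY = {"研究", "研究生", "生命", "起源"}
SENTENCE = "研究生命起源"

print(forward_max_match(SENTENCE, TOY_DICTIONARY))  # ['研究生', '命', '起源']
print(reverse_max_match(SENTENCE, TOY_DICTIONARY))  # ['研究', '生命', '起源']
```

On this sentence the forward scan greedily takes "研究生" and strands "命", while the reverse scan recovers the intended reading. A bidirectional method would run both and then choose between the two results by some rule.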
Generally, reverse matching segments slightly more accurately than forward matching and produces fewer ambiguities. Statistics show an error rate of 1/169 for forward matching and 1/245 for reverse matching. Even so, this accuracy falls far short of practical needs. Practical word segmentation systems use mechanical segmentation only as a preliminary segmentation step and draw on other linguistic information to further improve accuracy.

One approach is to improve the scanning method, known as feature scanning or mark segmentation. Words with obvious features are first recognized and cut out of the string to be analyzed. Using these words as breakpoints, the original string is divided into smaller strings, which are then segmented mechanically, reducing the matching error rate. Another approach combines word segmentation with part-of-speech tagging, using rich part-of-speech information to help segmentation decisions. The tagging process in turn verifies and adjusts the segmentation results, which greatly improves segmentation accuracy.

A general model can be established for mechanical word segmentation. There are academic papers on this topic, which are not discussed in detail here.

2. Comprehension-based word segmentation

This method has the computer simulate human understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and to use syntactic and semantic information to resolve ambiguity. It generally consists of three parts: the word segmentation subsystem, the syntax and semantics subsystem, and the general control component.
Under the coordination of the general control component, the word segmentation subsystem can obtain syntactic and semantic information about words and sentences to resolve segmentation ambiguity. In other words, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is broad and complex, it is difficult to organize it into a form that machines can read directly, so comprehension-based word segmentation systems are still at the experimental stage.

3. Statistics-based word segmentation

A word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects fairly well how credible they are as a word. By counting how often each combination of adjacent characters appears in a corpus, their co-occurrence information can be computed. The mutual information of two Chinese characters X and Y is defined from their adjacent co-occurrence probability, and it reflects how tightly the two characters are bound. When it exceeds a threshold, the character group is considered a likely word. This method only needs frequency statistics for character groups in the corpus and no segmentation dictionary, so it is also called dictionary-free word segmentation or statistical word extraction.

This method also has limitations. It often extracts character groups that co-occur frequently but are not words, such as the Chinese groups glossed as "this", "one", "some", "my", and "many". In addition, its recognition accuracy for common words is poor, and its time and space overhead is large.
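The statistical idea can be sketched as follows. The text does not give the formula, so this sketch assumes the common pointwise mutual information definition, log2(P(XY) / (P(X) P(Y))), estimated from raw counts. The toy corpus, the `threshold`, and the `min_count` filter are hypothetical choices for illustration, not values from the article.

```python
import math
from collections import Counter


def mutual_information(corpus, min_count=1):
    """Score each adjacent character pair seen at least min_count times.

    Assumes pointwise mutual information: log2(P(XY) / (P(X) * P(Y))).
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for pair, count in pair_counts.items():
        if count < min_count:
            continue  # very rare pairs give unreliable estimates
        p_xy = count / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


def candidate_words(corpus, threshold, min_count=1):
    """Return the character pairs whose score exceeds the threshold."""
    scores = mutual_information(corpus, min_count)
    return {pair for pair, score in scores.items() if score > threshold}


# Hypothetical toy corpus, for illustration only.
CORPUS = ["北京很大", "我在北京", "北京很美", "我很好"]

print(candidate_words(CORPUS, threshold=2.5, min_count=2))  # {'北京'}
```

The `min_count` filter is there because, on a small corpus, pairs that occur only once can score as high as real words. That is one face of the accuracy problem described above.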
In practice, statistical word segmentation systems all use a basic segmentation dictionary (a dictionary of common words) for string matching, while using statistical methods to identify new words. Combining string frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation, and it also uses the ability of dictionary-free segmentation to recognize new words from context and resolve ambiguity automatically.

Which of these word segmentation algorithms is more accurate remains inconclusive. No mature word segmentation system can rely on a single algorithm; different algorithms need to be integrated. As the author understands it, the word segmentation algorithm of Massive Technologies uses a "compound word segmentation method". "Compound" here is analogous to a compound prescription in Chinese medicine, which combines different medicines to treat a disease. Likewise, recognizing Chinese words requires multiple algorithms.
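One simple way to picture the combination of dictionary matching and statistics is sketched below. It is an illustrative assumption, not the method of any particular system. The dictionary pass uses forward maximum matching, and a second pass merges two leftover single characters into a new-word candidate when that pair is frequent in a corpus. Raw pair frequency stands in for a proper statistical score here, and the dictionary, corpus, and `min_count` are hypothetical.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Dictionary pass: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def count_pairs(corpus):
    """Count every adjacent character pair in the corpus."""
    pairs = Counter()
    for sentence in corpus:
        pairs.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return pairs


def merge_new_words(tokens, pair_counts, min_count=2):
    """Statistical pass: merge two adjacent single characters if the pair is frequent."""
    merged = []
    i = 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and len(tokens[i]) == 1 and len(tokens[i + 1]) == 1
                and pair_counts[tokens[i] + tokens[i + 1]] >= min_count):
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical data: "微博" is missing from the dictionary but common in the corpus.
DICTIONARY = {"我们", "喜欢"}
CORPUS = ["我们喜欢微博", "微博很火", "他用微博"]

tokens = forward_max_match("我们喜欢微博", DICTIONARY)
print(tokens)                                        # ['我们', '喜欢', '微', '博']
print(merge_new_words(tokens, count_pairs(CORPUS)))  # ['我们', '喜欢', '微博']
```

A frequency-only merge like this would also glue together frequent non-words, which is the limitation noted for statistical methods, so a practical system would need a stricter score and further checks.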