Chinese Word segmentation technology

Source: Internet
Author: User

http://blog.csdn.net/u012637501

I. Chinese word segmentation technology

1. Chinese word segmentation

The previous post discussed using statistical language models for natural language processing. Those language models are built on words, because the word is the smallest unit that expresses meaning. For Western alphabetic languages, words are clearly delimited, so collecting statistics and applying a language model is straightforward, as in "I love China very much." In Chinese, however, there is no clear delimiter between words, so a sentence must first be segmented into words before any further natural language processing can be done.

2. Segmentation consistency

Consistency of word segmentation in a corpus has two aspects. Consistency 1: provided the meaning stays the same, a given structure is always segmented the same way throughout the corpus (for example, "pork" is either always kept whole or always split). Consistency 2: all other structures of the same type as that structure are segmented the same way in the corpus (for example, "beef" has the same structure as "pork", so "beef" follows "pork").

3. Granularity and level of words

II. Chinese word segmentation methods

1. Dictionary lookup method

Professor Liang Nanyuan of Beihang University proposed the simplest segmentation method: looking words up in a dictionary. The procedure can be summarized as follows. Scan the sentence once from left to right. Mark every word that is found in the dictionary. For compound words (such as "Hunan University"), take the longest match. When a string is not recognized, split it into single characters. At that point the sentence has been segmented.
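The left-to-right, longest-match scan just described can be sketched in a few lines of Python. This is only a minimal illustration, not the original author's implementation: the function name `longest_match_segment`, the toy dictionary, and the sample strings are all assumptions made for the example.

```python
def longest_match_segment(sentence, dictionary):
    """Greedy left-to-right segmentation that prefers the longest dictionary word.

    `dictionary` is a set of known words (a toy stand-in for a real lexicon).
    """
    # The longest dictionary entry bounds how many characters we need to try.
    max_word_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_len, len(sentence) - i)
        # Try the longest candidate first, then shrink it from the right.
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            # An unknown single character is cut off as a word of its own.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"中国", "航天", "官员", "应邀", "美国", "湖南", "大学", "湖南大学"}
print(longest_match_segment("中国航天官员应邀到美国", toy_dictionary))
print(longest_match_segment("湖南大学", toy_dictionary))
```

With this toy dictionary, the compound "湖南大学" (Hunan University) is kept whole because the longest candidate is tried first, and the character "到", which is not in the dictionary, is emitted as a single-character word.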
As an example, take 中国/航天/官员/应邀/到/美国 (China / aerospace / officials / invited / to / USA). Scanning from left to right, we first meet the character "中" (zhong), which is a word on its own, so we could make a cut here. But when we reach the character "国" (guo), we find that it forms a longer word with the preceding "中", so we move the split point to after "中国" (China). Looking further in the dictionary, we find that "中国" does not form a longer word with the characters that follow, so this split point is final.

This method is also called mechanical word segmentation. Following some strategy, it matches the Chinese character string under analysis against the entries of a "sufficiently large" machine dictionary; if a string is found in the dictionary, the match succeeds. By scanning direction, the methods divide into forward matching and inverse matching; by preferred word length, they divide into maximum matching and minimum matching. Several common dictionary-based methods are described below.

(1) Forward maximum matching algorithm

Idea: scanning from left to right, take the first m characters of the sentence still to be segmented as the matching field, where m is the number of characters in the longest entry of the machine-readable dictionary. Look the field up in the dictionary. If the match succeeds, the matching field is cut off as a word. If the match fails, drop the last character of the matching field and try again with the remaining string as the new matching field. Repeat until all words have been cut out.

(2) Inverse maximum matching algorithm

Idea: this algorithm is the reverse of the forward maximum matching algorithm; it matches the string from right to left.
If the match succeeds, the matching field is cut off as a word. If the match fails, drop the first character of the matching field and try again with the remaining string as the new matching field. Repeat until all words have been cut out. Experimental results show that the inverse maximum matching algorithm performs better than the forward maximum matching algorithm.

(3) Full binary maximum matching algorithm

The full binary maximum matching fast segmentation algorithm is based on a hash table. The result of every matching operation is remembered, so no match ever has to be repeated, and matching is done by binary search, which maximizes segmentation efficiency.

2. Segmentation with a statistical language model

The dictionary method can solve roughly 70 to 80 percent of Chinese segmentation problems, but it cannot cope with slightly more complex cases. Around 1990, Dr. Guo Jin of Tsinghua University used a statistical language model to resolve segmentation ambiguity, reducing the error rate of Chinese word segmentation by an order of magnitude. The method can be described mathematically as follows.

Suppose a sentence S can be segmented in several ways. For simplicity, assume the following three:

A1, A2, A3, ..., Ak
B1, B2, B3, ..., Bm
C1, C2, C3, ..., Cn

Here A1, A2, ..., B1, B2, ..., C1, C2, ... are all Chinese words. Different segmentations may produce different numbers of words, so k, m, and n are the word counts of the respective segmentations. If A1, A2, A3, ..., Ak is the best segmentation, that is, the one under which the sentence has the highest probability, then its probability satisfies:

P(A1, A2, A3, ..., Ak) > P(B1, B2, B3, ..., Bm) and P(A1, A2, A3, ..., Ak) > P(C1, C2, C3, ..., Cn).
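To make the comparison of P(A1, ..., Ak) against the alternatives concrete, here is a minimal Python sketch. It rests on several assumptions that are not in the original text: a unigram model (each word's probability is independent of its neighbours), made-up probabilities in `toy_probs`, a small fallback probability `unknown_prob` for unseen single characters, and a Viterbi-style dynamic program over sentence prefixes rather than explicit enumeration of every segmentation.

```python
import math


def best_segmentation(sentence, word_prob, max_word_len=4, unknown_prob=1e-8):
    """Return the segmentation with the highest unigram probability.

    best[i] holds (log-probability, start index of the last word) for the
    best segmentation of the prefix sentence[:i].
    """
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            if word in word_prob:
                p = word_prob[word]
            elif len(word) == 1:
                p = unknown_prob  # assumed fallback for unseen characters
            else:
                continue  # unknown multi-character strings are not words
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


# Made-up probabilities, for illustration only.
toy_probs = {"研究": 0.01, "研究生": 0.005, "生命": 0.008, "命": 0.0005, "起源": 0.004}
print(best_segmentation("研究生命起源", toy_probs))
```

Under these toy numbers, the segmentation 研究/生命/起源 scores 0.01 × 0.008 × 0.004, which is higher than 研究生/命/起源 at 0.005 × 0.0005 × 0.004, so the first is chosen even though a greedy longest match would pick "研究生" first.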
Therefore, we only need a statistical model to compute the probability of the sentence under each segmentation and pick the segmentation with the highest probability; that gives the best segmentation and the optimal output string.

Note: exhausting all possible segmentations and computing the probability of the sentence under each one would be very expensive. Instead, the task can be treated as a dynamic programming problem, and the Viterbi algorithm can be used to find the best segmentation quickly.

Many statistics-based segmentation algorithms exist today. The more common ones are the probabilistic algorithm based on mutual information, the N-gram model algorithm, and the Chinese segmentation decision algorithm based on combination degree.

(1) Probabilistic statistical algorithm based on mutual information

Mutual information is a statistic that measures the correlation between different strings. For strings x and y, the mutual information is calculated as follows:

$$MI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

where P(x, y) is the probability that strings x and y co-occur, and P(x) and P(y) are the probabilities that x and y occur, respectively. The mutual information MI(x, y) reflects how tightly the pair of strings is bound:
(1) MI(x, y) > 0: there is a credible bond between x and y, and the larger MI(x, y), the tighter the bond.
(2) MI(x, y) = 0: the binding relationship between x and y is unclear.
(3) MI(x, y) < 0: there is essentially no bond between x and y, and the smaller MI(x, y), the weaker the bond.
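The mutual information of a pair can be estimated from raw counts. The Python sketch below is a rough illustration under stated assumptions: x and y are restricted to single characters, probabilities are simple relative frequencies over a tiny made-up corpus string, the logarithm is base 2, and a pair that never occurs is assigned negative infinity. None of these choices come from the original text.

```python
import math
from collections import Counter


def mutual_information(corpus, x, y):
    """Estimate MI(x, y) = log2(P(x, y) / (P(x) * P(y))) for adjacent characters.

    `corpus` is one unsegmented string; x and y are single characters.
    """
    char_counts = Counter(corpus)
    # Count every pair of adjacent characters in the corpus.
    pair_counts = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    p_x = char_counts[x] / len(corpus)
    p_y = char_counts[y] / len(corpus)
    p_xy = pair_counts[x + y] / (len(corpus) - 1)
    if p_xy == 0:
        return -math.inf  # the pair never occurs: no evidence of a bond
    return math.log2(p_xy / (p_x * p_y))


# Tiny made-up corpus, for illustration only.
toy_corpus = "猪肉好吃牛肉好吃猪肉牛奶"
print(mutual_information(toy_corpus, "猪", "肉"))
print(mutual_information(toy_corpus, "吃", "猪"))
```

On a corpus this small nearly every observed pair scores above zero, so the values are only useful relative to each other; a realistic estimate needs a large corpus and usually a threshold chosen empirically.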
(2) N-gram model algorithm

Idea: the occurrence of a word is closely tied to the word sequence in its context. The n-th word is assumed to depend only on the preceding n-1 words and on no other words. Let w1, w2, ..., wn be a string of length n. Predicting the probability of wn exactly would require knowing the probabilities of all the words before it, which is too complex. To simplify the calculation, assume that the word wi depends only on the two preceding words, which gives the trigram probability model:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$$

In other words, an N-gram model assumes that the probability of the current word depends only on the N-1 words before it.

(3) Decision algorithm based on combination degree

Idea: in a text, if Chinese character B immediately follows Chinese character A, then AB is called a combination. Applying the combination-degree formula, compute the combination degree of each candidate: the higher the combination degree, the more likely the candidate is a word, and the lower the combination degree, the less likely. In the formula, H_AB is the combination degree of AB in the text, n is the number of Chinese characters, k is the number of occurrences of the combination AB, n1 is the number of occurrences of A, and n2 is the number of occurrences of B.

3. Rule-based word segmentation algorithm

Rule-based word segmentation recognizes words by simulating how people understand sentences. The basic idea is to perform syntactic and semantic analysis during segmentation, using syntactic and semantic information to segment the text.
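Returning to the trigram assumption in (2) above, the conditional probability of a word given the two words before it can be estimated by counting. The following Python sketch is one simple way to do that and is not taken from the original text: the function names, the `<s>` padding token, the already-segmented toy sentences, and the use of unsmoothed relative frequencies are all assumptions.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts in segmented sentences."""
    trigram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        # Pad with two start tokens so the first words also have a context.
        padded = ["<s>", "<s>"] + list(words)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            context_counts[context] += 1
            trigram_counts[context + (padded[i],)] += 1
    return trigram_counts, context_counts


def trigram_prob(word, prev2, prev1, trigram_counts, context_counts):
    """Relative-frequency estimate of P(word | prev2, prev1), without smoothing."""
    seen = context_counts[(prev2, prev1)]
    if seen == 0:
        return 0.0  # unseen context; a real model would smooth or back off
    return trigram_counts[(prev2, prev1, word)] / seen


# Made-up, already-segmented training sentences, for illustration only.
toy_sentences = [["我", "爱", "中国"], ["我", "爱", "北京"], ["我", "爱", "中国"]]
tri, ctx = train_trigram(toy_sentences)
print(trigram_prob("中国", "我", "爱", tri, ctx))
```

Because the estimate is unsmoothed, any trigram absent from the training data gets probability zero; practical systems add smoothing or back off to bigram and unigram estimates.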
III. Performance analysis of the three methods

1. Dictionary-based word segmentation

The advantage is that it is easy to implement and works well in systems with modest accuracy requirements. The disadvantage is that the dictionary is prepared before segmentation, so its size and content are limited, and adding unregistered (out-of-vocabulary) words is difficult.

2. Statistics-based word segmentation

The advantage is that it can generalize from a large number of existing examples, analyse the regularities within the language, and incorporate them into the statistical model. A simple statistical method needs no dictionary; the statistical model is obtained by iterating over a training corpus. But the statistical method has limitations of its own; in particular, its recognition accuracy for common words is poor.

3. Rule-based word segmentation

The advantage is that it can infer and verify automatically from examples and can automatically supply unregistered words. However, it requires a great deal of linguistic knowledge, and knowledge of the Chinese language is both general and complex, so it is hard to organize all kinds of linguistic information into a form that a machine can read directly. As a result, rule-based segmentation is not yet mature, and it is generally used together with other algorithms.
IV. Current difficulties in Chinese word segmentation

Because Chinese words are not separated by an obvious delimiter as in Western languages, automatic segmentation of Chinese is very difficult. Among existing automatic segmentation methods, dictionary-based methods dominate. The main difficulty of Chinese word segmentation is not matching entries in the dictionary but handling ambiguous segmentation and unregistered words. Neither of these two problems has been fully solved.

1. Ambiguity processing

Ambiguity means that the same sentence can be segmented in two or more ways. There are currently three kinds: intersection-type ambiguity, combinatorial ambiguity, and true ambiguity. Intersection-type ambiguous fields are numerous and are handled by a variety of methods; combinatorial ambiguous fields are fewer and relatively difficult to handle; true ambiguous fields are rarer still and hard to handle. Ambiguity is one of the difficulties of Chinese word segmentation because it comes in many types, and each type calls for a different solution. Besides ambiguities that can only be resolved with contextual, semantic, and pragmatic knowledge, there are true ambiguities that are hard to resolve at all, which makes disambiguation harder still. Ambiguity processing is therefore an important factor in the segmentation accuracy of a system, and it is the hardest and most central problem in the design of an automatic word segmentation system.

2. Unregistered word recognition

New words and specialized terms are called unregistered words, that is, words that are not included in the dictionary.
Unregistered words fall into two categories: proper names and non-proper names. Proper names include Chinese personal names, transliterated foreign names, place names, and so on; non-proper names include new words, abbreviations, dialect words, classical Chinese words, industry terms, and so on. Both categories are hard to handle: they are numerous, they follow no fixed norms, and their number keeps growing as social life changes, all of which makes unregistered words harder to recognize. Unregistered word recognition is therefore another difficulty of Chinese word segmentation.
Reference: Sun Tieli, Liu Yanji, "Research status and difficulties of Chinese word segmentation technology"
