What is Chinese word segmentation
What is word segmentation, and how does Chinese word segmentation differ from segmentation in other languages? Word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to a given standard. In English, words are separated by spaces, which act as natural delimiters. In Chinese, only characters, sentences and paragraphs are marked off by obvious delimiters; words have no formal delimiter. English also has the problem of dividing phrases, but at the word level Chinese is much more complicated and difficult than English.
Current mainstream Chinese word segmentation algorithms fall into the following three categories:
1. Segmentation method based on string matching
This method is also called mechanical segmentation. Following a given strategy, it matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation divides into forward matching and reverse matching. By length preference, it divides into maximum (longest) matching and minimum (shortest) matching. By whether it is combined with part-of-speech (POS) tagging, it divides into pure segmentation and integrated segmentation and tagging. Several commonly used mechanical segmentation methods are as follows:
1) Forward maximum matching (scanning from left to right);
2) Reverse maximum matching (scanning from right to left);
3) Minimum segmentation (minimizing the number of words cut from each sentence).
These methods can also be combined. For example, forward maximum matching and reverse maximum matching can be combined into bidirectional matching. Because a single Chinese character can itself be a word, forward minimum matching and reverse minimum matching are seldom used. Generally speaking, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Statistics show that forward maximum matching used alone has an error rate of 1/169, while reverse maximum matching used alone has an error rate of 1/245. This accuracy is still far from meeting practical needs. Segmentation systems in actual use treat mechanical segmentation as a first pass and rely on various other kinds of language information to improve accuracy further.
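To make forward and reverse maximum matching concrete, here is a minimal Python sketch. The function names, the `max_len` cap and the two-word toy dictionary are assumptions made for illustration, and the test string 王强大小 is an assumed example rather than output from any real system. Characters that match no dictionary entry fall back to single-character words.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


toy_dict = {"强大", "大小"}  # hypothetical dictionary
print(forward_max_match("王强大小", toy_dict))  # ['王', '强大', '小']
print(reverse_max_match("王强大小", toy_dict))  # ['王', '强', '大小']
```

The two directions disagree on this string, which is exactly the kind of ambiguity a bidirectional method has to resolve.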
One approach is to improve the scanning strategy, known as feature scanning or marker-based segmentation. Words with obvious features are identified and cut out of the string first and then used as breakpoints, so that the original string is divided into smaller strings before mechanical segmentation is applied, which reduces the matching error rate. Another approach is to combine segmentation with part-of-speech tagging: the rich part-of-speech information helps segmentation decisions, and the tagging process in turn checks and adjusts the segmentation result, which greatly improves accuracy.
2. Segmentation method based on understanding
This method has the computer simulate a person's understanding of the sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use the syntactic and semantic information to resolve ambiguity. Such a system usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to judge segmentation ambiguities; in other words, it simulates the way a person understands a sentence. This kind of segmentation requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize it into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.
3. Segmentation method based on statistics
In terms of form, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur next to each other therefore reflects fairly well how credible the combination is as a word. One can count the frequency of each combination of adjacent characters in a corpus and compute their mutual information. Define the mutual information of two characters by computing the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly the two characters are bound together. When the tightness exceeds a certain threshold, the character group can be assumed to form a word. This method only needs frequency statistics over the character groups in the corpus and does not need a segmentation dictionary, so it is also called dictionary-free segmentation or the statistical method. It has limitations, however. It often extracts character groups that co-occur frequently but are not words, such as "this one", "one", "some", "my" and "many"; its recognition accuracy for common words is poor; and its time and space overhead is large. Statistical segmentation systems in actual use therefore match strings against a basic dictionary of common words while using statistics to identify new words. Combining string-frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation and adds the advantages of dictionary-free segmentation: recognizing new words from context and resolving ambiguity automatically.
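The text above does not spell out the mutual information formula. The following Python sketch assumes the usual pointwise mutual information, log2(P(xy) / (P(x) P(y))), estimated from raw counts; the function names, the estimation method and the threshold value are all assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(corpus):
    """Return {(x, y): score} for every adjacent character pair in the corpus."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)                     # single characters
        pair_counts.update(zip(sentence, sentence[1:]))  # adjacent pairs
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), n in pair_counts.items():
        p_xy = n / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def candidate_pairs(scores, threshold):
    """Pairs bound tightly enough to be treated as candidate words."""
    return [pair for pair, score in scores.items() if score > threshold]
```

In practice the threshold would have to be tuned on a real corpus. As the text notes, frequent non-word pairs can still score highly, which is why this step is usually paired with dictionary matching.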
Several points to note:
1. The segmentation algorithm must be fast. Web search today has particularly strict real-time requirements, so Chinese word segmentation, as the foundation of Chinese processing, must first of all take as little time as possible.
2. Improving segmentation accuracy does not necessarily improve retrieval performance. Beyond a certain point, its influence on Chinese information retrieval (CIR) is no longer obvious; some effect remains, but it is not the performance bottleneck of CIR. Segmentation algorithms that pursue high accuracy at all costs are therefore not well suited to large-scale Chinese information retrieval. When speed and accuracy cannot both be had, we need to find a suitable balance between them.
3. Segmentation granularity can still follow the longest-word-first principle, but this needs follow-up handling at the query expansion level. In information retrieval, the segmentation algorithm only needs to concentrate on eliminating crossing ambiguity. Covering ambiguity can be handled with a secondary index over the dictionary and with query expansion.
4. For recognition of unregistered (out-of-dictionary) words, precision matters more than recall. As far as possible, make sure that no wrong combinations are produced when recognizing unregistered words, so that no unregistered word is detected in error. If characters are wrongly combined into an unregistered word, the corresponding documents may not be retrieved correctly.
Baidu word segmentation
The query is first split at separator symbols. For example, "Information Retrieval theory Tool" becomes <information retrieval, theory, tool> after segmentation.
Next, Baidu checks for duplicate strings; if there are any, the extra copies are discarded and only one is kept. "Theory tool theory" becomes <tool, theory> after segmentation. Google does not merge duplicates in this way.
Then it checks for English letters or digits; if any are present, each run of English or digits is kept whole and cut apart from the Chinese before and after it. The query "movie bt download" becomes <movie, bt, download> after segmentation.
If a string contains three or fewer Chinese characters, it is left unchanged; when the string is longer than 4 Chinese characters, Baidu's segmenter steps in and cuts the string apart.
Types of word segmentation algorithm include forward maximum matching, reverse maximum matching, bidirectional maximum matching, language-model methods, and shortest-path algorithms. Two points decide whether a segmentation system is good: its ability to resolve ambiguity, and its ability to recognize words not registered in the dictionary, such as personal names, place names and organization names.
Baidu's segmentation uses at least two dictionaries: a common dictionary and a special dictionary (personal names, place names, new words, etc.). The special dictionary is applied first, and the remaining fragments are then handed to the common dictionary for segmentation.
The segmentation algorithm Baidu uses is bidirectional maximum matching.
Example: for the query "Mao Zedong Beijing China Smoke" (毛泽东北京华烟云), Baidu's segmentation result is "Mao Zedong / North / Jinghua Smoke" (毛泽东/北/京华烟云).
Baidu's segmenter can recognize personal names, and it also recognizes "Jinghua Smoke", which indicates that it has some ability to recognize words not registered in the dictionary.
Baidu first looks up the special dictionary (personal names, some place names, etc.) and cuts out the proper names. The remaining part is segmented with a bidirectional strategy: if the two results (forward maximum matching and reverse maximum matching) are the same, there is no ambiguity and the result is output directly.
If they differ, the shortest-path result is output, that is, the fewer fragments the better. For example, between <Cuba, than, ethics> (古巴/比/伦理) and <Ancient Babylon, reason> (古巴比伦/理) the latter is chosen, and between <Beijing, China, smoke> (北京/华/烟云) and <North, Jinghua Smoke> (北/京华烟云) the latter is chosen.
If the lengths are the same, the segmentation with fewer single-character words is chosen. The query "Distant Ancient Babylon" (遥远古古巴比伦) is segmented into <遥远, 古古, 巴比伦> rather than "remote / ancient / Ancient Babylon" (遥远/古/古巴比伦).
If the number of single-character words is also the same, the forward segmentation result is chosen. The query 王强大小 (rendered in translation as "Wang Qiang:") is cut by Baidu into "king / strong / small" (王/强大/小), not the reverse result "king / strong / size" (王/强/大小).
Baidu has long promoted its advantage in Chinese processing. From the analysis above, its segmentation algorithm is nothing special and its disambiguation is not ideal. Even if Baidu uses something more complex than the algorithm described here, it is hard to call that an advantage. If Baidu has an advantage, it is only its very large special dictionary. This dictionary records personal names (for example, "long today"), forms of address (such as "Old Lady"), and some place names (such as the United Arab Emirates). Presumably Baidu uses relatively recent named entity recognition algorithms published by the academic community to keep identifying such words from its corpus and gradually expand the dictionary.
This article was posted on the China SEO Forum. Original post address: http://www.web520.com/bbs/thread-2742-1-1.html
Author information: Lao Chen, founder of China SEO Forum (WWW.WEB520.COM/BBS)