A Brief Introduction to Chinese Word Segmentation Technology

Last Update:2014-12-19 Source: Internet

Author: User

Chinese word segmentation, also called Chinese word cutting or word splitting, is basic knowledge that every SEO practitioner should master. Existing Chinese word segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.

1. Segmentation method based on string matching

This method is also called mechanical segmentation. Following a given strategy, it matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary; if a string is found in the dictionary, the match succeeds and a word is identified. By scanning direction, string matching methods divide into forward matching and reverse matching. By length preference, they divide into maximum (longest) matching and minimum (shortest) matching. By whether segmentation is combined with part-of-speech (POS) tagging, they divide into pure segmentation methods and integrated methods that segment and tag together. Several commonly used mechanical segmentation methods are:

1) Forward maximum matching (scanning from left to right);

2) Reverse maximum matching (scanning from right to left);

3) Minimum segmentation (minimizing the number of words cut from each sentence).
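
The first two methods in the list can be sketched in a few lines of Python. This is only a minimal illustration, not a production segmenter: the function names, the `max_len` parameter, and the toy dictionary are assumptions made for the example, and unknown characters simply fall back to single-character words.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


# Toy dictionary (hypothetical, for illustration only).
TOY_DICT = {"化妆", "和服", "服装", "和"}

fmm_result = forward_max_match("化妆和服装", TOY_DICT)
rmm_result = reverse_max_match("化妆和服装", TOY_DICT)
print(fmm_result)  # ['化妆', '和服', '装']
print(rmm_result)  # ['化妆', '和', '服装']
```

On the phrase 化妆和服装 ("makeup and clothing") the two scanning directions disagree, and with this toy dictionary only the reverse scan yields the intended reading.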

These methods can also be combined; for example, forward maximum matching and reverse maximum matching together form bidirectional matching. Because a single Chinese character can be a word by itself, forward minimum matching and reverse minimum matching are seldom used. Generally speaking, reverse matching segments slightly more accurately than forward matching and produces fewer ambiguities. Reported statistics put the error rate of forward maximum matching used alone at 1/169 and that of reverse maximum matching used alone at 1/245. Even so, this precision falls far short of practical needs. Real segmentation systems therefore use mechanical segmentation only as a first pass and draw on various other kinds of linguistic information to improve accuracy further.
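
Bidirectional matching can be sketched as running both scans and choosing between them when they disagree. The tie-breaking rules below (fewer words, then fewer single-character words, then the reverse result) are a common heuristic assumed for this example; the text above does not prescribe them. The helper names and the toy dictionary are likewise hypothetical.

```python
def max_match(text, dictionary, max_len=4, reverse=False):
    """Maximum matching; scans right to left when reverse=True."""
    if reverse:
        # Reverse the text and every entry, match forward, then undo.
        flipped = {w[::-1] for w in dictionary}
        out = max_match(text[::-1], flipped, max_len)
        return [w[::-1] for w in reversed(out)]
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def bidirectional_match(text, dictionary, max_len=4):
    """Run both scans; if they differ, pick one by a simple heuristic."""
    fwd = max_match(text, dictionary, max_len)
    rev = max_match(text, dictionary, max_len, reverse=True)
    if fwd == rev:
        return fwd

    def cost(words):
        # Assumed heuristic: fewer words, then fewer single characters.
        return (len(words), sum(1 for w in words if len(w) == 1))

    # On a full tie, prefer the reverse scan.
    return fwd if cost(fwd) < cost(rev) else rev


# Toy dictionary (hypothetical, for illustration only).
TOY_DICT = {"化妆", "和服", "服装", "和"}
print(bidirectional_match("化妆和服装", TOY_DICT))
```

Preferring the reverse scan on a tie follows the observation above that reverse matching tends to be slightly more accurate.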

One approach is to improve the scanning procedure, known as feature scanning or mark segmentation: the system first identifies and cuts out words with obvious features in the string to be analyzed and uses them as breakpoints, so that the original string is divided into smaller strings that are then passed to mechanical segmentation, which reduces the matching error rate. Another approach is to combine segmentation with part-of-speech tagging: rich part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation results, which greatly improves accuracy.

2. Segmentation method based on statistics

Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur next to each other therefore reflects fairly well how credible a candidate word is. One can count how often each combination of adjacent characters occurs in a corpus and compute their mutual information: define the mutual information of two characters and compute the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information expresses how tightly the characters are bound together. When the tightness exceeds a certain threshold, the character group can be considered a possible word. Because this method only needs character-group frequencies from the corpus and does not need a segmentation dictionary, it is also called dictionary-free segmentation or statistical word extraction.

The method has limitations, however. It often extracts character groups that co-occur frequently but are not words, such as "这一" (this one), "之一" (one of), "有的" (some), "我的" (my), and "许多的" (many); its recognition accuracy for common words is poor; and its time and space overhead is large. Practical statistical segmentation systems therefore use a basic dictionary of common words for string matching while using statistical methods to identify new words. Combining string-frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation and adds the strengths of dictionary-free segmentation: recognizing new words from context and resolving ambiguity automatically.
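
A common way to write the mutual information of two adjacent characters is MI(X, Y) = log2( P(X, Y) / (P(X) P(Y)) ), where P(X, Y) is the probability that X is immediately followed by Y. The text does not give the exact formula, so this pointwise form is an assumption. The sketch below estimates it from raw counts on a tiny invented corpus; the function name and the corpus are hypothetical.

```python
import math
from collections import Counter


def mutual_information(corpus):
    """Return {(x, y): MI} for every adjacent character pair in the corpus."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)                       # single-character counts
        pairs.update(zip(sentence, sentence[1:]))    # adjacent-pair counts
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[x] / n_chars
        p_y = chars[y] / n_chars
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Tiny invented corpus (hypothetical, for illustration only).
CORPUS = ["我喜欢分词", "分词很难", "我很喜欢"]
mi = mutual_information(CORPUS)
print(mi[("分", "词")], mi[("欢", "分")])
```

In this corpus the pair 分词, which recurs, scores higher than the accidental neighbor pair 欢分. With so little data, a pair containing a rare character can score just as high, which hints at why real systems need a large corpus and a carefully chosen threshold.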

3. Segmentation method based on understanding

This approach has the computer simulate a person's understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis while segmenting and to use the syntactic and semantic information to resolve ambiguity. Such a system usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguities; in other words, it simulates the process by which a person understands a sentence. This kind of segmentation requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize it into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.

Which segmentation algorithm is more accurate has not been settled. No mature segmentation system can rely on a single algorithm; all of them combine several. As far as I know, the segmentation algorithm of 海量科技 (Hailiang Technology) uses "compound segmentation". "Compound" here is analogous to a compound prescription in traditional Chinese medicine, which combines different drugs to treat a disease; likewise, recognizing Chinese words takes several algorithms, each handling a different problem.

Does having a mature segmentation algorithm mean that Chinese word segmentation is easily solved? Far from it. Chinese is a very complex language, and it is even harder to make a computer understand it. Two major problems in Chinese word segmentation have not yet been fully solved.

1. Ambiguity recognition

Ambiguity means that the same sentence can be segmented in two or more ways. Take "表面的" (of the surface): because "表面" (surface) and "面的" (minivan taxi) are both words, the phrase can be split as "表面 的" or as "表 面的". This is called crossing (overlapping) ambiguity, and it is very common. The "kimono" (和服) error is a case in point: "化妆和服装" (makeup and clothing) can be split as "化妆 和 服装" (makeup / and / clothing) or as "化妆 和服 装" (makeup / kimono / outfit). Without human knowledge to understand the text, a computer can hardly tell which segmentation is correct.

Compared with combinatorial ambiguity, crossing ambiguity is relatively easy to handle; combinatorial ambiguity can only be resolved by looking at the whole sentence. For example, in "这个门把手坏了" (this doorknob is broken), "把手" (handle) is a word, but in "请把手拿开" (please take your hand away), "把手" is not a word. Likewise, in "将军任命了一名中将" (the general appointed a lieutenant general), "中将" (lieutenant general) is a word, but in "产量三年中将增长两倍" (output will grow twofold within three years), "中将" is not. How is a computer supposed to recognize these cases?

Even if a computer could resolve both crossing and combinatorial ambiguity, one hard problem would remain: true ambiguity. True ambiguity means that, given the sentence alone, even a person cannot tell which strings should be words. For example, "乒乓球拍卖完了" can be split as "乒乓 球拍 卖 完 了" (the table tennis paddles are sold out) or as "乒乓球 拍卖 完 了" (the table tennis ball auction is over). Without other sentences as context, no one can say whether "拍卖" (auction) counts as a word here.
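
To see how ambiguity arises mechanically, the sketch below enumerates every way a string can be split into dictionary words. It is an illustration under stated assumptions: the function name and the toy dictionary are hypothetical, and the exhaustive recursion is only practical for short strings.

```python
def all_segmentations(text, dictionary):
    """Return every way to split text into words from the dictionary."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this head and segment the remainder recursively.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Toy dictionary (hypothetical, for illustration only).
TOY_DICT = {"乒乓", "乒乓球", "球拍", "拍卖", "卖", "完", "了"}

candidates = all_segmentations("乒乓球拍卖完了", TOY_DICT)
for words in candidates:
    print(" ".join(words))
```

With this dictionary the sentence 乒乓球拍卖完了 yields exactly two complete segmentations, one containing 球拍 (paddle) and one containing 拍卖 (auction). Nothing in the dictionary alone says which is meant.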

2. Recognition of new words

New words are known technically as unregistered (out-of-vocabulary) words: words that are not in the dictionary but are genuinely words. The most typical case is personal names. A person easily understands that in "王军虎去广州了" (Wang Junhu went to Guangzhou), "王军虎" is a word because it is someone's name, but this is hard for a computer to recognize. If "王军虎" were added to the dictionary as a word, then, given how many names exist in the world and how many new ones appear every moment, collecting them all would be a huge project. Even if that work could be completed, problems would remain: in "王军虎头虎脑的" (Wang Jun looks sturdy and lively), should "王军虎" still count as a word?
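
The name problem can be reproduced with a plain dictionary matcher. This is a minimal sketch: the matcher is a simple forward maximum matching routine, and the function name and both toy dictionaries are assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching; unknown characters become single words."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Toy dictionaries (hypothetical, for illustration only).
BASE_DICT = {"王军", "广州", "去", "了", "虎头虎脑", "的"}
WITH_NAME = BASE_DICT | {"王军虎"}

# Without the name in the dictionary, 王军虎 is broken apart.
print(forward_max_match("王军虎去广州了", BASE_DICT))
# Adding the name fixes that sentence ...
print(forward_max_match("王军虎去广州了", WITH_NAME))
# ... but now breaks a sentence where 王军 is the name.
print(forward_max_match("王军虎头虎脑的", WITH_NAME))
```

The third call shows the trade-off described above: once 王军虎 is an entry, the matcher takes it greedily and the idiom 虎头虎脑 falls apart into single characters.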

Besides personal names, new words include organization names, place names, product names, trademarks, abbreviations, and elliptical forms. All of these are hard to handle, yet they are exactly the words people use often, so new word recognition is very important for a search engine's segmentation system. The accuracy of new word recognition has become one of the key indicators for evaluating a segmentation system. Interested readers can experiment with the Baidu search engine: enter different keywords and examine the returned results to get a feel for how Baidu segments words. I think this is the most efficient way to learn.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
