What is Chinese participle

Source: Internet
Author: User
Chinese

What is Chinese participle?

As we all know, English is a unit of words, words and words are separated by space, and the Chinese word is the unit, all the words in the sentence to describe a meaning. For example, English sentence I am a student, in Chinese is: "I am a student." Computer can be very simple to know student is a word, but it is not easy to understand the "learning", "Sheng" two words together to represent a word. The Chinese character sequence is divided into meaningful words, that is, Chinese participle, some people also known as cutting words. I am a student, the result of participle is: I am a student.
The current mainstream Chinese word segmentation algorithms are:

1. Segmentation method based on string matching

This method is also called the mechanical Word segmentation method, it is according to a certain strategy to the analysis of the Chinese character string and a "full large" machine Dictionary of the entry, if found in the dictionary a string, then match success (identify a word). According to the different scanning direction, the string matching segmentation method can be divided into forward matching and reverse matching. According to the case of different length preference, it can be divided into the maximum (longest) matching and the minimum (shortest) matching, according to whether or not the process of POS tagging, but also can be divided into simple word segmentation method and the combination of Word segmentation and annotation integration method. Several commonly used mechanical participle methods are as follows:
1) Forward maximum matching method (from left to right direction);
2 Reverse maximum matching method (from right to left direction);
3) Minimum segmentation (make the number of words cut out in each sentence the smallest). The
can also combine each of these methods, for example, by combining the forward maximum matching method with the reverse maximum matching method to form a bidirectional matching method. Due to the characters of Chinese words, the forward minimum matching and inverse minimum matching are seldom used. Generally speaking, the segmentation precision of reverse matching is slightly higher than that of forward matching, and the ambiguity phenomenon is less. The statistic results show that the error rate of single positive maximum matching is 1/169, and the error rate of simply using reverse maximum matching is 1/245. But this precision is far from satisfying the actual need. The actual use of the word segmentation system, is the mechanical participle as a primary means, but also by using a variety of other language information to further improve the accuracy of segmentation.
One method is to improve the scanning mode, called feature scanning or symbol segmentation, priority in the string to be analyzed to identify and cut out some of the obvious features of the words, as a breakpoint, the original string can be divided into smaller strings and then into the mechanical participle, thereby reducing the matching error rate. Another method is to combine the word segmentation and lexical tagging, use rich parts of speech to help the decision making, and in the process of tagging in turn to the results of the word segmentation test, adjust, so as to greatly improve the accuracy of segmentation.
for mechanical Word segmentation methods, you can establish a general model, in this respect there are professional academic papers, here do not do a detailed discussion.

2, Word segmentation method based on understanding

The method of Word segmentation is to make the computer simulate the people's understanding of the sentence, to achieve the effect of recognizing words. The basic idea is to make syntactic and semantic analysis at the same time, and use syntactic information and semantic information to deal with ambiguity phenomenon. It usually consists of three parts: the segmentation subsystem, the syntactic system, the general control part. Under the coordination of the general control part, the segmentation subsystem can get the syntactic and semantic information about words and sentences to judge the ambiguity of word segmentation, that is, it simulates the process of human understanding of sentences. This kind of word segmentation method needs to use a lot of language knowledge and information. Because of the generality and complexity of Chinese language knowledge, it is difficult to organize various language information into the form of machine direct reading, so the word segmentation system based on understanding is still in the experimental stage.
  
3. Segmentation method based on statistics

In terms of form, words are a combination of stable words, so the more times the adjacent words appear in the context, the more likely they are to form a word. Therefore, the frequency or probability of adjacent words and characters can better reflect the credibility of the word. The frequency of the combination of the adjacent words in the corpus can be counted, and their mutual information is calculated. Define the two-word mutual present information and compute the adjacent probability of two Chinese characters X and Y. The mutual information embodies the close degree of the bond between Chinese characters. When the tightness is higher than a certain threshold, it can be assumed that the word group may constitute a word. This method can only be used to statistics the frequency of the words in the corpus, do not need to cut the dictionary, so it is also called No dictionary segmentation method or statistical method of word-taking. But this method also has certain limitation, will often take out a number of common frequently high, but not the words of the commonly used groups, such as "This One", "one", "some", "my", "many" and so on, and the common word recognition accuracy is poor, time and space overhead. The actual application of the statistical word segmentation system is to use a basic word dictionary (commonly used word dictionary) for string matching participle, at the same time using statistical methods to identify some new words, the serial frequency statistics and string matching, not only to play the matching segmentation speed, high efficiency, but also the use of dictionary segmentation and context to identify words, The advantages of automatically eliminating ambiguity.





Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.