What is Chinese Word Segmentation

Source: Internet
Author: User


As we all know, English text is built from words, and words are separated by spaces. Chinese text is built from characters, and the characters in a sentence run together without separators to express a meaning. For example, the English sentence "I am a student" is written in Chinese as "我是一个学生". A computer can easily tell from the spaces that "student" is a word, but it cannot easily tell that the characters "学" and "生" together form one word. Cutting a sequence of Chinese characters into meaningful words is Chinese word segmentation, which some people also call word cutting. For "我是一个学生", the segmentation result is: 我 / 是 / 一个 / 学生.
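The contrast described above can be seen in a few lines of Python. This is only an illustration: the Chinese string is the usual rendering of "I am a student", and the `expected` list is written out by hand rather than produced by any algorithm.

```python
english = "I am a student"
chinese = "我是一个学生"  # "I am a student" in Chinese, written without spaces

# Whitespace splitting recovers the English words directly.
print(english.split())

# The same call finds no boundaries in the Chinese sentence:
# the whole sentence comes back as a single item.
print(chinese.split())

# The segmentation a word segmenter is expected to produce (hand-written here).
expected = ["我", "是", "一个", "学生"]
print("".join(expected) == chinese)
```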
Currently, mainstream Chinese word segmentation algorithms include:

  1. String Matching-Based Word Segmentation

This method is also called mechanical word segmentation. It matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a given strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string-matching methods are divided into forward matching and reverse matching. By which length is tried first, they are divided into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with part-of-speech tagging, they are divided into pure segmentation methods and integrated methods that combine segmentation with tagging. Several common mechanical word segmentation methods are:
1) Forward maximum matching (scanning from left to right);
2) Reverse maximum matching (scanning from right to left);
3) Minimum segmentation (cutting each sentence into the fewest possible words).
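Minimum segmentation, the third method in the list, can be read as a shortest-path problem over the sentence. The sketch below is one possible reading, not a reference implementation: the function name `fewest_words`, the `max_len` limit, the toy dictionary, and the rule that any single character counts as a word when nothing longer matches are all assumptions made for illustration.

```python
def fewest_words(text, dictionary, max_len=4):
    """Split text into the fewest pieces, where each piece is a dictionary
    word or, as a fallback, a single character."""
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i]: fewest words covering text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            usable = piece in dictionary or len(piece) == 1
            if usable and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j
    # Follow the back pointers from the end of the sentence to the start.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


toy_dictionary = {"我", "是", "一个", "学生"}  # hypothetical mini dictionary
print(fewest_words("我是一个学生", toy_dictionary))
```

With this toy dictionary the sentence is cut into four words; without the entries "一个" and "学生" the same call would fall back to six single characters.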
You can also combine these methods. For example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method. Because a single Chinese character can be a word on its own, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Reported statistics put the error rate of forward matching at 1/169 and that of reverse matching at 1/245. Even so, this accuracy is far from meeting practical needs. Real word segmentation systems use mechanical segmentation as a first-pass segmentation step and then draw on other linguistic information to further improve accuracy.
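The following is a minimal sketch of forward maximum matching, reverse maximum matching, and one way to combine them. The text does not say how the two results should be reconciled, so the tie-breaking rule used here (fewer words, then fewer single-character words, then the reverse result) is an assumed heuristic. The function names, the `max_len` limit, and the toy dictionary and sentence are likewise illustrative.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # single character is the fallback
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def count_singles(words):
    return sum(1 for w in words if len(w) == 1)


def bidirectional_match(text, dictionary, max_len=4):
    """Run both scans and keep one result (assumed tie-breaking heuristic)."""
    fwd = forward_max_match(text, dictionary, max_len)
    rev = reverse_max_match(text, dictionary, max_len)
    if len(fwd) != len(rev):
        return fwd if len(fwd) < len(rev) else rev
    return fwd if count_singles(fwd) < count_singles(rev) else rev


toy_dictionary = {"研究", "研究生", "生命", "起源"}  # hypothetical mini dictionary
sentence = "研究生命的起源"  # "study the origin of life"
print(forward_max_match(sentence, toy_dictionary))
print(reverse_max_match(sentence, toy_dictionary))
print(bidirectional_match(sentence, toy_dictionary))
```

On this sentence the forward scan greedily takes "研究生" (graduate student) and leaves "命" stranded, while the reverse scan recovers "研究" and "生命". Both results have four words, so the heuristic picks the one with fewer single characters.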
One approach is to improve the scanning method, known as feature scanning or marker segmentation: first identify and cut out words with obvious features in the string to be analyzed, then use these words as breakpoints to divide the original string into shorter strings, and apply mechanical segmentation to each of them, which reduces the matching error rate. Another approach is to combine word segmentation with part-of-speech tagging, using the rich part-of-speech information to support segmentation decisions and, in turn, checking and adjusting the segmentation results during tagging, which greatly improves segmentation accuracy.
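Feature scanning can be sketched as a pre-splitting step placed in front of a mechanical segmenter. In the example below, the marker set (the particle "的"), the toy dictionary, and the function names are assumptions chosen to make the effect visible. A real system would need a carefully selected marker list, since a character such as "的" can also occur inside ordinary words.

```python
import re


def forward_max_match(text, dictionary, max_len=4):
    """Plain forward maximum matching, used here as the mechanical step."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def segment_with_markers(text, dictionary, markers, max_len=4):
    """Cut the text at marker words first, then segment each fragment."""
    if not markers:
        return forward_max_match(text, dictionary, max_len)
    # Longest markers first so the regex alternation prefers them.
    ordered = sorted(markers, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(m) for m in ordered) + ")"
    words = []
    for chunk in re.split(pattern, text):
        if not chunk:
            continue
        if chunk in markers:
            words.append(chunk)  # the marker itself is a word and a breakpoint
        else:
            words.extend(forward_max_match(chunk, dictionary, max_len))
    return words


toy_dictionary = {"的确", "确切", "地址"}  # hypothetical mini dictionary
sentence = "他的确切地址"  # "his exact address"
print(forward_max_match(sentence, toy_dictionary))
print(segment_with_markers(sentence, toy_dictionary, {"的"}))
```

Without markers, the forward scan wrongly takes "的确" (indeed) and strands "切". Splitting at "的" first leaves the fragment "确切地址", which is then segmented into "确切" and "地址".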
A general model can be built for mechanical word segmentation. Academic papers cover this topic, so it is not discussed in detail here.
  
  2. Comprehension-Based Word Segmentation

This method has the computer simulate a person's understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. Such a system usually has three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control component. Coordinated by the general control component, the word segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguities; that is, it simulates how a person understands a sentence. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize it into a form that machines can read directly, so comprehension-based word segmentation systems are still at the experimental stage.
  
  3. Statistics-Based Word Segmentation

In form, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects fairly well how credible a candidate word is. By counting how often each combination of adjacent characters occurs in a corpus, their mutual information can be computed: define the mutual information of two characters and compute the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly two characters are bound together. When this tightness exceeds a threshold, the character group is considered likely to form a word. This method only needs frequency counts of character groups in the corpus and does not need a segmentation dictionary, so it is also called dictionary-free word segmentation or statistical word extraction. However, the method has limitations. It often extracts character groups that co-occur frequently but are not words, such as the Chinese equivalents of "this", "one", "some", "my", and "many". In addition, its recognition accuracy for common words is poor, and its time and space overhead is large. In practice, statistical word segmentation systems use a basic segmentation dictionary (a dictionary of common words) for string-matching segmentation and use statistical methods to identify new words. Combining string frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation while also using dictionary-free segmentation's ability to identify new words from context and to resolve ambiguity automatically.
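The text does not give the mutual information formula, so the sketch below assumes the standard pointwise form, M(X, Y) = log2( P(X, Y) / (P(X) P(Y)) ), where P(X, Y) is the probability that X is immediately followed by Y. The function names, the three-sentence toy corpus, and the threshold value are all illustrative assumptions; a real system would estimate the probabilities from a large corpus.

```python
import math
from collections import Counter


def mutual_information(corpus):
    """Score every adjacent character pair in the corpus by pointwise
    mutual information (assumed formula, see the note above)."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)                     # single-character counts
        pairs.update(zip(sentence, sentence[1:]))  # adjacent-pair counts
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[x] / n_chars
        p_y = chars[y] / n_chars
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores


def candidate_words(scores, threshold):
    """Keep the pairs whose score reaches the threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


toy_corpus = ["我是学生", "他是学生", "学生学习"]  # hypothetical tiny corpus
scores = mutual_information(toy_corpus)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
print(candidate_words(scores, threshold=1.0))
```

Even on this tiny corpus the limitation noted above shows up: pairs such as "我是" and "是学" score as high as or higher than the real word "学生", because rare characters and frequent function characters distort the estimate. This is one reason practical systems pair the statistics with a dictionary.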

 
