Chinese word segmentation is a branch of natural language processing. Given a sentence, a person can draw on their own knowledge to tell which character sequences are words and which are not, but how can a computer do the same? The procedure it uses is a word segmentation algorithm.
Existing word segmentation algorithms fall into three categories: string-matching-based segmentation, comprehension-based (understanding-based) segmentation, and statistics-based segmentation.
1. String Matching-Based Word Segmentation
This approach is also called mechanical word segmentation. It matches the Chinese character string under analysis against the entries of a "sufficiently large" machine dictionary according to some strategy; if a substring is found in the dictionary, the match succeeds and a word is recognized. String-matching methods can be classified along three axes. By scanning direction, they are either forward or reverse matching. By which length is preferred, they are either maximum (longest) or minimum (shortest) matching. By whether they are combined with part-of-speech tagging, they are either pure segmentation methods or integrated methods that segment and tag together. Several common mechanical word segmentation methods are as follows:
1) Forward maximum matching (scanning from left to right);
2) Reverse maximum matching (scanning from right to left);
3) Minimum segmentation (producing the fewest words for each sentence).
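The three methods listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `max_len` window, the toy dictionary `VOCAB`, and the sample string are all hypothetical, and unmatched characters are simply emitted as single-character words.

```python
def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, vocab, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


def min_word_segment(text, vocab, max_len=4):
    """Dynamic programming: a segmentation with the fewest words."""
    # best[i] holds the shortest word list found so far covering text[:i].
    best = [[]] + [None] * len(text)
    for i in range(1, len(text) + 1):
        for size in range(1, min(max_len, i) + 1):
            piece = text[i - size:i]
            if size == 1 or piece in vocab:
                candidate = best[i - size] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(text)]


# Hypothetical toy dictionary and sample string, for illustration only.
VOCAB = {"研究", "研究生", "生命", "起源"}
SAMPLE = "研究生命起源"

print(forward_max_match(SAMPLE, VOCAB))  # ['研究生', '命', '起源']
print(reverse_max_match(SAMPLE, VOCAB))  # ['研究', '生命', '起源']
print(len(min_word_segment(SAMPLE, VOCAB)))  # 3
```

On this sample the two scanning directions disagree, which is the kind of case where the choice of direction matters.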
These methods can also be combined. For example, forward maximum matching and reverse maximum matching together form a bidirectional matching method. Because a single Chinese character can itself be a word, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Reported statistics give an error rate of 1/169 for forward matching and 1/245 for reverse matching. Even so, this accuracy falls far short of practical needs. Real segmentation systems use mechanical segmentation only as a first pass and rely on additional linguistic information to improve accuracy further.
One approach is to improve the scanning method, known as feature scanning or mark segmentation. Words with obvious features are first recognized and cut out of the string under analysis. Using these words as breakpoints, the original string is divided into shorter strings, which are then segmented mechanically; this lowers the matching error rate. Another approach is to combine segmentation with part-of-speech tagging: the rich part-of-speech information helps segmentation decisions, and the tagging process in turn checks and adjusts the segmentation result, which greatly improves accuracy.
A general model can be built for mechanical word segmentation. Academic papers cover this topic, so it is not discussed in detail here.
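A bidirectional matcher can be sketched as follows. The text does not say how the two results should be reconciled, so the tie-breaking rules here (fewer words, then fewer single-character words, then prefer the reverse result) are assumptions chosen for illustration. The names `max_match` and `bidirectional_match` and the toy dictionary are hypothetical as well.

```python
def max_match(text, vocab, max_len=4, reverse=False):
    """Greedy longest-match segmentation in either scanning direction."""
    words = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                text = text[:-size] if reverse else text[size:]
                break
    return words[::-1] if reverse else words


def count_singles(words):
    """Number of single-character words in a segmentation."""
    return sum(len(w) == 1 for w in words)


def bidirectional_match(text, vocab, max_len=4):
    """Run both directions and choose one result with simple heuristics."""
    fwd = max_match(text, vocab, max_len)
    bwd = max_match(text, vocab, max_len, reverse=True)
    if fwd == bwd:
        return fwd
    # Assumed heuristic 1: the result with fewer words wins.
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)
    # Assumed heuristic 2: fewer single-character words wins; ties go to
    # the reverse result, which the text reports as slightly more accurate.
    return fwd if count_singles(fwd) < count_singles(bwd) else bwd


# Hypothetical toy dictionary, for illustration only.
vocab = {"研究", "研究生", "生命", "起源"}
print(bidirectional_match("研究生命起源", vocab))  # ['研究', '生命', '起源']
```

Both directions yield three words for this string, so the single-character count decides in favour of the reverse result.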
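The breakpoint idea can be sketched as below. The text does not define which words count as having "obvious features", so this sketch assumes a deliberately conservative marker set: punctuation plus runs of digits or Latin letters. `BREAKPOINT`, `split_at_breakpoints`, and `segment_with_breakpoints` are hypothetical names, and `segment_chunk` stands for whatever mechanical segmenter is applied to each shorter string.

```python
import re

# Assumed breakpoints: punctuation, plus runs of digits or Latin letters.
BREAKPOINT = re.compile(r"([，。！？；、,.!?;]|[0-9A-Za-z]+)")


def split_at_breakpoints(text):
    """Return (piece, is_breakpoint) pairs in their original order."""
    parts = []
    # With one capture group, re.split puts the matches at odd indexes.
    for index, piece in enumerate(BREAKPOINT.split(text)):
        if piece:
            parts.append((piece, index % 2 == 1))
    return parts


def segment_with_breakpoints(text, segment_chunk):
    """Keep breakpoints whole; hand every other chunk to segment_chunk."""
    words = []
    for piece, is_breakpoint in split_at_breakpoints(text):
        words.extend([piece] if is_breakpoint else segment_chunk(piece))
    return words


sentence = "我在2008年来到北京，学习中文。"
for piece, is_breakpoint in split_at_breakpoints(sentence):
    print(piece, is_breakpoint)

# `list` is only a placeholder segmenter (one character per word).
print(segment_with_breakpoints(sentence, list))
```

Each chunk between breakpoints is shorter than the original string, so a matching error in one chunk cannot propagate into the next.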
2. Comprehension-Based Word Segmentation
This approach has the computer simulate a person's understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use the syntactic and semantic information to resolve ambiguity. Such a system generally has three parts: a word segmentation subsystem, a syntactic and semantic subsystem, and a general control component. Coordinated by the general control component, the segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to judge segmentation ambiguities; in other words, it simulates the way a person understands a sentence. The approach requires a large amount of linguistic knowledge and information. Because knowledge of the Chinese language is so broad and complex, it is difficult to organize all of it into a form that machines can read directly. Comprehension-based word segmentation systems are therefore still at the experimental stage.
3. Statistics-Based Word Segmentation
In terms of form, a word is a stable combination of characters. The more often adjacent characters appear together in context, the more likely they are to form a word, so the frequency or probability with which characters co-occur adjacently is a good indicator of how credible a candidate word is. One can count how often each combination of adjacent characters occurs in a corpus and compute their mutual information, which is defined for two Chinese characters X and Y from their adjacent co-occurrence probability. Mutual information reflects how tightly two characters are bound together. When this tightness exceeds a threshold, the character group is considered a possible word. The method needs only frequency counts of character groups in the corpus and no segmentation dictionary, so it is also called dictionary-free word segmentation or statistical word extraction. It has limitations, however. It often extracts character groups that co-occur frequently but are not words, such as the groups glossed as "this", "one", "some", "my", and "many". It also recognizes common words with poor accuracy and has a large time and space overhead. In practice, statistical segmentation systems use a basic dictionary of common words for string-matching segmentation and use statistical methods to identify new words. Combining string frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation while using dictionary-free segmentation to recognize new words in context and to resolve ambiguity automatically.
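The statistic described above can be sketched with pointwise mutual information, PMI(X, Y) = log2(P(XY) / (P(X) P(Y))). The text does not give a formula, so this particular definition is an assumption, as are the function names, the threshold, the minimum count, and the four-line toy corpus.

```python
import math
from collections import Counter


def pair_statistics(corpus):
    """Map each adjacent character pair to (count, PMI)."""
    chars, pairs = Counter(), Counter()
    for line in corpus:
        chars.update(line)                  # single-character counts
        pairs.update(zip(line, line[1:]))   # adjacent-pair counts
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    stats = {}
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs
        p_x, p_y = chars[x] / n_chars, chars[y] / n_chars
        stats[x + y] = (count, math.log2(p_xy / (p_x * p_y)))
    return stats


def candidate_words(corpus, threshold=2.0, min_count=2):
    """Pairs seen at least min_count times whose PMI exceeds threshold."""
    stats = pair_statistics(corpus)
    return sorted(
        pair for pair, (count, pmi) in stats.items()
        if count >= min_count and pmi > threshold
    )


# Hypothetical toy corpus; a real system would use a large one.
corpus = ["我喜欢北京", "北京欢迎你", "你喜欢我", "我去北京"]
print(candidate_words(corpus))
```

The `min_count` filter is there because PMI is unreliable for pairs seen only once; it does not address the limitation noted above, where frequent non-word character groups also score highly.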
Which of these segmentation approaches is most accurate is still an open question. No mature segmentation system can rely on a single algorithm; different algorithms have to be integrated. As far as the author knows, the segmentation algorithm of Massive Technologies uses a "compound word segmentation method". The term is borrowed from compound prescriptions in traditional Chinese medicine, where several medicinal herbs are combined to treat a disease. In the same way, several algorithms are combined to recognize Chinese words.
Difficulties in Word Segmentation
With mature segmentation algorithms available, is Chinese word segmentation easily solved? Far from it. Chinese is a very complex language, and it is even harder for computers to understand than for people. Two major problems in Chinese word segmentation have still not been fully solved.
1. Ambiguity Identification
Ambiguity means that the same sentence can be segmented in two or more ways. Take the phrase 表面的: because both 表面 ("surface") and 面的 (a colloquial word for a minivan taxi) are words, the phrase can be split as 表面 / 的 or as 表 / 面的. This is called cross ambiguity, and it is very common. The earlier 和服 ("kimono") example is in fact an error caused by cross ambiguity: 化妆和服装 ("makeup and clothing") can be split as 化妆 / 和 / 服装 ("makeup / and / clothing") or as 化妆 / 和服 / 装 ("makeup / kimono / dress"). Without human knowledge to draw on, a computer has difficulty telling which split is correct.
Cross ambiguity is still relatively easy to handle compared with combination ambiguity, which can only be resolved by looking at the whole sentence. For example, in 这个门把手坏了 ("this door handle is broken"), 把手 ("handle") is a word, but in 请把手拿开 ("please take your hand away"), 把 and 手 are separate words and 把手 is not a word. Likewise, in 将军任命了一名中将 ("the general appointed a lieutenant general"), 中将 is a word, but in 产量三年中将增长两倍 ("output will grow twofold within three years"), 中将 is not a word. How can a computer recognize these cases?
Even if a computer could resolve both cross ambiguity and combination ambiguity, one more problem remains: true ambiguity. True ambiguity means that, given a sentence, even a person cannot decide which strings are words and which are not. For example, 乒乓球拍卖完了 can be split as 乒乓 / 球拍 / 卖 / 完 / 了 ("the table tennis rackets are sold out") or as 乒乓球 / 拍卖 / 完 / 了 ("the table tennis ball auction is over"). Without other sentences as context, probably no one can say whether 拍卖 ("auction") counts as a word here.
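Ambiguity of this kind can be made concrete by listing every segmentation that a dictionary permits. The sketch below is illustrative only: it assumes the table tennis example corresponds to the Chinese string 乒乓球拍卖完了, and both the function name `all_segmentations` and the toy dictionary are hypothetical.

```python
def all_segmentations(text, vocab):
    """Every way to split text into dictionary words (no fallback)."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for size in range(1, len(text) + 1):
        head = text[:size]
        if head in vocab:
            # Extend each segmentation of the remainder with this head word.
            for rest in all_segmentations(text[size:], vocab):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary, for illustration only.
vocab = {"乒乓", "乒乓球", "球拍", "拍卖", "卖", "完", "了"}

for words in all_segmentations("乒乓球拍卖完了", vocab):
    print(" / ".join(words))
# 乒乓 / 球拍 / 卖 / 完 / 了
# 乒乓球 / 拍卖 / 完 / 了
```

Both outputs consist entirely of dictionary words, so string matching alone gives no basis for preferring one over the other; the choice has to come from context or other linguistic information.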