[TSE Learning Notes: the Peking University Skynet (Tianwang) Search Engine] Section 7: Chinese Word Segmentation


This section describes the third step of the search entry program TSESearch.cpp: Chinese word segmentation.

(1) Introduction

Chinese word segmentation methods fall mainly into two categories: string-matching-based segmentation and statistics-based segmentation. The string-matching method, also called mechanical segmentation, matches the string under analysis against the entries of a sufficiently large dictionary according to certain rules; if a substring is found in the dictionary, the match succeeds (a word is recognized). The method is therefore dictionary-based and requires a large, accurate dictionary. TSE uses string-matching segmentation. Statistics-based segmentation is not described here; interested readers can consult the relevant literature.

Mechanical segmentation comes in several variants, differing in their matching rules and strategies. Common variants include forward maximum matching, backward (reverse) maximum matching, minimum segmentation, and bidirectional maximum matching. TSE uses forward maximum matching. The algorithm itself is not described in detail here; the book "Search" gives a very thorough introduction, and there is plenty of material online. A minimal sketch of the idea follows, after which we analyze the source code.
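To make the idea concrete before diving into the TSE sources, here is a minimal sketch of forward maximum matching. It is independent of the TSE code: the std::unordered_set dictionary, the function name FmmSegment, and the single-byte "characters" are all my illustrative assumptions (the real code works on double-byte GBK units, with the dictionary in an STL map).

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <unordered_set>

    // A minimal sketch of forward maximum matching (FMM), not TSE code.
    std::string FmmSegment(const std::unordered_set<std::string>& dict,
                           std::string s, std::size_t maxLen)
    {
        std::string out;
        while (!s.empty()) {
            std::size_t len = std::min(maxLen, s.size());
            std::string w = s.substr(0, len);
            // Shrink the candidate from the right until it is a dictionary
            // word or only a single unit remains.
            while (len > 1 && dict.count(w) == 0) {
                --len;
                w = s.substr(0, len);
            }
            out += w + "/";          // '/' as the separator, as in TSE
            s = s.substr(w.size());
        }
        return out;
    }

    int main()
    {
        std::unordered_set<std::string> dict = {"ab", "abc", "d"};
        // Greedy longest prefix wins: "abcd" -> "abc/d/"
        std::cout << FmmSegment(dict, "abcd", 3) << std::endl;
    }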

(2) Code Analysis

The Chinese word segmentation part of the main function in Section 4 defines a CHzSeg class object and calls its SegmentSentenceMM member function. The program and data files for Chinese word segmentation live in the ./ChSeg directory. The file ./ChSeg/HzSeg.cpp contains two key functions, SegmentSentenceMM and SegmentHzStrMM. The original string may contain English (ASCII) characters, special GBK symbols, and Chinese punctuation marks. SegmentSentenceMM first filters these out, and the remaining runs of Chinese characters are handed to SegmentHzStrMM for segmentation. SegmentSentenceMM is thus the calling interface, or pre-processing function, for word segmentation, while SegmentHzStrMM implements the segmentation algorithm itself. The branch dispatch hinges on the lead byte of each GBK unit, summarized in the sketch below; after it, the source code of the two functions follows, with detailed annotations (annotations beginning with "lb_c" were added by me).
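The byte ranges tested throughout SegmentSentenceMM can be summarized in a small helper (an illustrative sketch of my own, not code from TSE):

    // Illustrative sketch (not from TSE): classify the lead byte of a GBK
    // unit the way SegmentSentenceMM's branches do.
    enum class GbkClass { Ascii, SymbolOrPunct, Hanzi };

    GbkClass ClassifyLeadByte(unsigned char ch)
    {
        if (ch < 128) return GbkClass::Ascii;          // single-byte ASCII
        if (ch < 176) return GbkClass::SymbolOrPunct;  // symbol/punctuation rows
        return GbkClass::Hanzi;                        // hanzi lead bytes start at 176
    }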

    // lb_c: Pre-processing function for Chinese word segmentation. It filters out
    // English (ASCII) characters, special symbols, and Chinese punctuation marks,
    // and hands the remaining runs of Chinese characters to SegmentHzStrMM for
    // segmentation. dict is the dictionary object; s1 is the string to process.
    // process a sentence before segmentation
    string CHzSeg::SegmentSentenceMM(CDict &dict, string s1) const
    {
        string s2 = "";          // lb_c: the processed (segmented) string
        unsigned int i, len;

        // lb_c: consume bytes of s1 until it is empty
        while (!s1.empty())
        {
            // lb_c: take the first byte of s1. Note that ch is declared unsigned,
            // because every byte of a GBK Chinese character is greater than 128.
            unsigned char ch = (unsigned char) s1[0];

            // lb_c: ch < 128 means ch is an ASCII character; this branch filters
            // ASCII characters.
            if (ch < 128)
            { // deal with ASCII
                i = 1;
                len = s1.size();
                // lb_c: while s1[i] is an ASCII character other than line feed (LF)
                // or carriage return (CR), advance i; i.e., i counts the run of
                // consecutive non-LF, non-CR ASCII characters.
                while (i < len && ((unsigned char) s1[i] < 128) && (s1[i] != 10) && (s1[i] != 13)) // LF, CR
                    i++;
                // lb_c: if ch is not LF, CR, or SP (space), copy the first i bytes
                // and append the separator (defined as "/" in the source). This
                // case is handled correctly; I return to it in the analysis below.
                if ((ch != 32) && (ch != 10) && (ch != 13)) // SP, LF, CR
                {
                    s2 += s1.substr(0, i) + SEPARATOR;
                }
                else
                {
                    // lb_c: if ch is LF or CR, copy the i bytes starting at s1[0]
                    // into s2 without inserting a separator. There is an obvious
                    // fault here: when ch == SP nothing is done at all, and the
                    // scanned characters are discarded. Analyzed below.
                    if (ch == 10 || ch == 13)
                    {
                        s2 += s1.substr(0, i);
                    }
                }
                // lb_c: if s1 is not fully consumed, keep the remaining suffix
                if (i <= s1.size()) // added by yhf
                    s1 = s1.substr(i);
                else
                    break; // yhf
                continue;
            }
            else
            // lb_c: here ch is a non-ASCII byte, i.e., the lead byte of a GBK
            // character. (This could equally have been written "else if (ch < 176)".)
            {
                // lb_c: ch < 176 means Chinese punctuation or another special
                // symbol. GBK hanzi lead bytes start at 176 (check the GBK code
                // table), so this branch handles Chinese punctuation marks and
                // special symbols.
                if (ch < 176)
                {
                    i = 0;
                    len = s1.length();
                    // lb_c: a GBK lead byte in [161, 176) denotes Chinese
                    // punctuation or another special symbol. The while loop below
                    // skips over runs of special symbols and stops at a Chinese
                    // punctuation mark: special symbols are passed through as one
                    // block, while punctuation marks must be split off. The numbers
                    // in the condition are the GBK code values of these punctuation
                    // marks; compare them against the GBK code table.
                    while (i < len && ((unsigned char) s1[i] < 176) && ((unsigned char) s1[i] >= 161)
                        // lb_c: s1[i] is not one of the punctuation marks 、。·ˉˇ¨〃
                        && (!((unsigned char) s1[i] == 161 && (unsigned char) s1[i+1] >= 162 && (unsigned char) s1[i+1] <= 168))
                        // lb_c: s1[i] is not one of ~‖…''""〔〕〈〉《》「」『』〖〗【】
                        && (!((unsigned char) s1[i] == 161 && (unsigned char) s1[i+1] >= 171 && (unsigned char) s1[i+1] <= 191))
                        // lb_c: s1[i] is not one of the full-width marks ，！（）：；？
                        && (!((unsigned char) s1[i] == 163 && ((unsigned char) s1[i+1] == 172 || (unsigned char) s1[i+1] == 161
                            || (unsigned char) s1[i+1] == 168 || (unsigned char) s1[i+1] == 169 || (unsigned char) s1[i+1] == 186
                            || (unsigned char) s1[i+1] == 187 || (unsigned char) s1[i+1] == 191))))
                    {
                        i = i + 2; // assume there are no half Chinese characters
                    }
                    if (i == 0) i = i + 2;
                    // lb_c: if s1 does not start with a full-width Chinese space,
                    // copy the scanned bytes and append a separator; if it does,
                    // nothing is copied at all. This is also a problem, analyzed
                    // below.
                    if (!(ch == 161 && (unsigned char) s1[1] == 161))
                    {
                        if (i <= s1.size()) // yhf
                            // other non-Chinese double-byte characters may consecutively output
                            s2 += s1.substr(0, i) + SEPARATOR;
                        else
                            break; // yhf
                    }
                    // lb_c: if s1 is not fully consumed, keep the remaining suffix
                    if (i <= s1.size()) // yhf
                        s1 = s1.substr(i);
                    else
                        break; // yhf
                    continue;
                }
            }

            // lb_c: reaching here means ch >= 176, i.e., the lead byte of a hanzi;
            // this part handles runs of Chinese characters.
            i = 2;
            len = s1.length();
            // lb_c: scan the run of consecutive hanzi; the loop stops at the first
            // pair of bytes that is not a hanzi.
            while (i < len && (unsigned char) s1[i] >= 176)
                i += 2;
            // lb_c: the previous step found the hanzi run s1[0, i); call the
            // Chinese word segmentation function SegmentHzStrMM on it.
            s2 += SegmentHzStrMM(dict, s1.substr(0, i));
            // lb_c: if s1 is not fully consumed, keep the remaining suffix
            if (i <= len) // yhf
                s1 = s1.substr(i);
            else
                break; // yhf
        }
        // lb_c: when processing finishes, s2 holds the segmentation result, with
        // the words separated by the separator.
        return s2;
    }

    // lb_c: Chinese word segmentation by forward maximum matching. dict is the
    // dictionary object consulted during segmentation; s1 is a Chinese string.
    // using max matching method to segment a character string.
    string CHzSeg::SegmentHzStrMM(CDict &dict, string s1) const
    {
        string s2 = ""; // store segment result
        while (!s1.empty())
        {
            unsigned int len = s1.size();
            // lb_c: MAX_WORD_LENGTH is the maximum word length, 8 bytes in TSE,
            // i.e., four Chinese characters.
            if (len > MAX_WORD_LENGTH) len = MAX_WORD_LENGTH;
            // lb_c: take the maximum-word-length prefix of s1 as the candidate
            // (backward maximum matching would take it from the end of s1 instead).
            string w = s1.substr(0, len); // the candidate word
            // lb_c: CDict::IsWord checks whether w is in the dictionary. With a
            // large dictionary, lookup efficiency is critical and is one of the
            // factors that determine system performance. As described in Section 5,
            // TSE stores its dictionary in an STL map, i.e., a red-black tree; some
            // segmenters use a hash table instead to speed up lookup.
            bool isw = dict.IsWord(w);
            // lb_c: if w is not in the dictionary, drop the last character and try
            // again, until w is a single character.
            while (len > 2 && isw == false)
            { // if not a word
                len -= 2; // cut a word
                w = w.substr(0, len);
                isw = dict.IsWord(w);
            }
            // lb_c: a word w has been matched; append it with the separator.
            s2 += w + SEPARATOR;
            s1 = s1.substr(w.size());
        }
        return s2;
    }
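For reference, a hypothetical call site, modeled on the Section 4 main function (dictionary-loading details are omitted; the names follow the code above, but this snippet is my sketch, not verbatim TSE code):

    CDict dict;   // dictionary object; see Section 5 for how it is loaded
    CHzSeg seg;
    string result = seg.SegmentSentenceMM(dict, "我爱搜索引擎");
    // result holds the words separated by '/', e.g. "我/爱/搜索/引擎/"
    // (the exact split depends on the dictionary contents).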
(3) Problem Analysis

Several problems were flagged in the comments above; we analyze them below.

First, consider the ASCII branch of SegmentSentenceMM. If ch is not LF, CR, or SP, the scanned bytes are copied into s2 with a separator appended; if ch is LF or CR, they are copied without a separator. But what if ch == SP (a space)? Apparently nothing is done at all, and the scanned portion of the string (s1[0]..s1[i-1]) is lost. For example, with s1 = " love搜索" (note the space character before "love"; 搜索 is the Chinese word for "search"), the segmentation result is s2 = "搜索/" and "love" is lost.
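The loss is easy to reproduce by extracting just the ASCII branch into a standalone function (a reduction of my own; the names are not from TSE, and the hanzi path is replaced by a "<hz>" placeholder):

    #include <iostream>
    #include <string>

    static const std::string SEPARATOR = "/";

    // Standalone reduction of SegmentSentenceMM's ASCII branch.
    std::string AsciiBranchOnly(std::string s1)
    {
        std::string s2;
        while (!s1.empty()) {
            unsigned char ch = (unsigned char) s1[0];
            if (ch >= 128) { s2 += "<hz>"; break; }  // hanzi path, not modeled here
            unsigned int i = 1, len = s1.size();
            while (i < len && (unsigned char) s1[i] < 128 && s1[i] != 10 && s1[i] != 13)
                i++;
            if (ch != 32 && ch != 10 && ch != 13)
                s2 += s1.substr(0, i) + SEPARATOR;
            else if (ch == 10 || ch == 13)
                s2 += s1.substr(0, i);
            // ch == 32: neither branch runs, so the scanned bytes are dropped
            s1 = s1.substr(i);
        }
        return s2;
    }

    int main()
    {
        // "\xB0\xA1" stands in for a trailing hanzi; any bytes >= 128 behave the same here.
        std::cout << AsciiBranchOnly("love\xB0\xA1") << "\n";   // prints "love/<hz>"
        std::cout << AsciiBranchOnly(" love\xB0\xA1") << "\n";  // prints "<hz>": " love" is dropped
    }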

Next, consider the Chinese-punctuation branch. This code means: if s1[0]s1[1] is not a full-width Chinese space and the run has been scanned, a separator is inserted after byte i; if s1[0]s1[1] is a full-width Chinese space, nothing is copied, so the scanned portion (s1[0]..s1[i-1]) is lost. For example, take s1 = "　ㄝ搜索" (GBK-encoded: a full-width Chinese space at the start, then the special symbol ㄝ, then the hanzi 搜索): the segmentation result is s2 = "搜索/". But with s1 = "ㄝ搜索", the first two bytes are not a Chinese space, so a separator is inserted and the result is s2 = "ㄝ/搜索/".

In general, when the search string starts with a space, the run of non-hanzi characters following the space in the same byte class (ASCII characters after an ASCII space, GBK symbols after a full-width space) is lost, as with "love" and "ㄝ" in the examples above. This obviously does not meet users' needs. For example, if a user wants to search for "love清华大学" ("love Tsinghua University") but types a leading space, the segmentation result is "清华大学/": only 清华大学 is queried, and "love" is ignored. Figure 1 shows the TSE search result.

Figure 1

The full-width-space example above gives the result shown in Figure 2.

Figure 2

One might think the fix is simply to treat SP the same way as LF and CR. That is actually not correct either, and the problem would be even worse: if, when ch == 32, the i bytes starting at s1[0] were copied into s2 without splitting, then for s1 = " love搜索" (again with a space before "love") the segmentation result would be s2 = " love搜索/". When the inverted index is searched, the key looked up would be "love搜索" as a single term, and the probability of a hit is very low, because "love" and "搜索" are treated as one word. The TSE search result is 0 hits, as shown in Figure 3.

Figure 3

The handling here is another of my doubts. When the scan of s1 stops at byte i, why should the decision to insert a separator depend on the value of s1[0]? For example, when analyzing s1 = " love搜索" and the scan stops after the final 'e', why must we check whether s1[0] is LF, CR, or SP to decide whether to add a separator? A separator could simply be inserted directly. I do not understand the author's true intention here; if you do, please explain!
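For what it is worth, the repair that matches this reading — a sketch of my own understanding, not the TSE authors' intent — is to keep the scanned run regardless of s1[0], discarding only the leading whitespace itself. Inside the ASCII branch of SegmentSentenceMM, the if/else would become:

    // Hypothetical fix (mine, not TSE's): never discard scanned ASCII bytes.
    // Whitespace still acts as a boundary, but the text after it is kept.
    if ((ch != 32) && (ch != 10) && (ch != 13)) {
        s2 += s1.substr(0, i) + SEPARATOR;     // unchanged: ordinary ASCII run
    } else {
        string run = s1.substr(0, i);
        // strip leading whitespace; keep whatever follows it
        // (note: the scan only stops at LF/CR/non-ASCII, so spaces inside
        // the run are left untouched by this sketch)
        string::size_type start = run.find_first_not_of(" \n\r");
        if (start != string::npos)
            s2 += run.substr(start) + SEPARATOR;
    }

With this change, the earlier example " love搜索" would segment to "love/搜索/", and "love" would be preserved.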
