This article looks at one important piece of the search engine puzzle: Chinese word segmentation. It covers the principles behind Chinese word segmentation and how they are implemented, and compares several popular open-source word segmentation frameworks written in Java.

Any full-text search engine has to perform one important preprocessing step on its data before indexing: word segmentation. Segmentation makes it easier for the machine to "understand" human language, so that the search engine can return what we are actually looking for. For search purposes, segmenting languages such as English or Russian is comparatively easy, because every meaningful word in the text is separated by a space or punctuation, so splitting on spaces already gives a basic tokenizer. Chinese (and likewise Japanese and Korean, which is why you will often see the abbreviation CJK) is not so simple. Chinese text has no spaces between words, and the language itself is rich and subtle, so Chinese word segmentation has to be solved as a problem in its own right.

If you had to design a Chinese word segmentation algorithm right now, bigram splitting (the "dichotomy" method) is probably the first idea that comes to mind, and also the least sophisticated one. It is simple, but its accuracy is low. Roughly, the idea is to cut a phrase such as "Taobao was dismantled" into overlapping two-character pieces and then filter those pieces against a lexicon. The advantage of this approach is that it is simple and easy to implement; the disadvantages are that it needs a large lexicon and that ambiguity is hard to resolve. For example, "淘宝贝" ends up split into "淘宝" (Taobao) and "贝", whereas the correct segmentation is "淘" and "宝贝" (baby). Like the dictionary lookup methods, it depends on a "dictionary" data structure to do the segmentation.

Dictionary-based Chinese word segmentation was first proposed by Professor Liang Nanyuan of Beihang University, and the idea later evolved into many variants. One is minimum-word segmentation, in which a sentence should be cut into the word string with the fewest words. This is what we usually refer to as the forward maximum matching algorithm (FMM). In outline: take up to L characters from the current position, where L is the maximum word length; if that string is in the dictionary, emit it as a word; otherwise drop its last character and try again, until a match is found or only one character is left; then move past the emitted word and repeat. The flowchart of this algorithm is as follows:

Of course, this is only the simplest implementation. Because the window size L is fixed, it still has plenty of problems with accuracy and ambiguity. The algorithm can be refined from this starting point; interested readers can look up the related optimizations online.
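To make the forward maximum matching flow concrete, here is a minimal Java sketch. It is not taken from any of the frameworks discussed later: the class name ForwardMaxMatch, the toy dictionary, and the window size of 5 are assumptions made for illustration, and Latin letters stand in for Chinese characters so that the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal forward maximum matching (FMM) sketch.
// Class name, dictionary and window size are illustrative assumptions.
class ForwardMaxMatch {
    private final Set<String> dictionary;
    private final int maxWordLength; // the fixed window size "L"

    ForwardMaxMatch(Set<String> dictionary, int maxWordLength) {
        this.dictionary = dictionary;
        this.maxWordLength = maxWordLength;
    }

    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + maxWordLength, text.length());
            // Shrink the window from the right until it is a dictionary
            // word or only a single character is left.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "the", "theta", "table", "bled", "own", "down", "there"));
        ForwardMaxMatch fmm = new ForwardMaxMatch(dict, 5);
        System.out.println(fmm.segment("thetabledownthere"));
    }
}
```

With this toy dictionary the program prints [theta, bled, own, there]. The greedy choice of "theta" at the start forces a poor split of the rest, which is exactly the kind of ambiguity error described above.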

There is also the reverse maximum matching algorithm (RMM). The only difference from forward maximum matching is that it starts scanning from the end of the sentence. Both algorithms share a serious problem: when the text is ambiguous, the result may not be accurate. Practice shows that reverse maximum matching produces far fewer ambiguity errors than forward matching, but they still occur. To reduce the errors caused by ambiguity, some people scan twice: they run forward maximum matching and reverse maximum matching once each and then analyze the two results. This is called bidirectional matching. It gives good results and improves segmentation accuracy, but performance clearly drops a lot. Incidentally, some SEO practitioners who have studied Google and Baidu report that both use maximum matching for Chinese word segmentation, but that Baidu has put more work into its dictionaries (which are split into specialized dictionaries and general dictionaries) and therefore segments slightly better than Google. This cannot be confirmed, so take it with a grain of salt.

Now that the algorithm flow has been covered, let us introduce a data structure for implementing maximum matching, the trie:
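The following Java sketch illustrates reverse maximum matching and one possible way of combining the two scans. The text does not say how the two results are analyzed, so the tie-breaking rule used here (prefer fewer words, then fewer single-character words, then the reverse result) is an assumption, as are the class name BidirectionalMatch and the toy dictionary. Latin letters again stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of reverse and bidirectional maximum matching.
// The comparison rule in segment() is an assumed heuristic.
class BidirectionalMatch {
    private final Set<String> dictionary;
    private final int maxWordLength;

    BidirectionalMatch(Set<String> dictionary, int maxWordLength) {
        this.dictionary = dictionary;
        this.maxWordLength = maxWordLength;
    }

    // Forward scan: shrink the window from the right.
    List<String> forward(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + maxWordLength, text.length());
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    // Reverse scan: start at the end of the sentence and shrink the
    // window from the left.
    List<String> reverse(String text) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(end - maxWordLength, 0);
            while (start < end - 1 && !dictionary.contains(text.substring(start, end))) {
                start++;
            }
            words.add(text.substring(start, end));
            end = start;
        }
        Collections.reverse(words); // words were collected back to front
        return words;
    }

    // Assumed heuristic: fewer words wins, then fewer single-character
    // words, and on a complete tie the reverse result is preferred.
    List<String> segment(String text) {
        List<String> f = forward(text);
        List<String> r = reverse(text);
        if (f.size() != r.size()) {
            return f.size() < r.size() ? f : r;
        }
        return singleChars(f) < singleChars(r) ? f : r;
    }

    private static int singleChars(List<String> words) {
        int count = 0;
        for (String w : words) {
            if (w.length() == 1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "the", "theta", "table", "bled", "own", "down", "there"));
        BidirectionalMatch m = new BidirectionalMatch(dict, 5);
        System.out.println(m.forward("thetabledownthere"));
        System.out.println(m.reverse("thetabledownthere"));
    }
}
```

On this input the forward scan yields [theta, bled, own, there] and the reverse scan yields [the, table, down, there]. Note that both scans always run, which is where the extra cost of bidirectional matching comes from.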

As the two figures below show, a trie is a dictionary tree used to store a large number of strings and support fast pattern matching, which makes it a very good choice for implementing maximum matching. Its defining property is that all strings sharing a common prefix hang under the same node of the tree, and no string may be a prefix of another string (this can be arranged by appending a special character to the end of every string). Its query time complexity is O(n × d), where n is the height of the tree and d is the size of the alphabet, that is, the number of distinct characters a node may branch on.
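As a rough illustration, a standard trie can be sketched in Java as follows. The class name DictionaryTrie and its methods are hypothetical. Instead of appending a special end character to every string, this sketch marks word ends with a boolean flag, which serves the same purpose; child lookup uses a hash map rather than scanning up to d children.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie sketch for dictionary lookup (names are illustrative).
class DictionaryTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // plays the role of the special end character
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.endOfWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }

    // Length of the longest dictionary word that starts at text[start],
    // or 0 if no word starts there. One walk down the trie is enough,
    // which is why a trie suits maximum matching.
    int longestMatch(String text, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.endOfWord) {
                best = i - start + 1;
            }
        }
        return best;
    }
}
```

A forward maximum matching segmenter built on this structure would call longestMatch at each position instead of repeatedly shrinking a window and probing a hash set.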

The above is the standard trie structure. There are also compressed tries and suffix tries. Let's look at the suffix trie first:

A suffix trie is a trie built from all the suffixes of a given string. To build the suffix trie of "minimize", first take its suffix set {minimize, inimize, nimize, imize, mize, ize, ze, e}. The construction process of the suffix trie for "minimize" is as follows:

One rule for building a suffix trie: when a new string is inserted and it shares a common prefix with the string held by an existing leaf node, that leaf node is split into two leaf nodes.

The algorithm for finding a substring P using the suffix trie is:

1. Starting from the root node root, traverse all of its child nodes;

2. If no child node's key begins with the first character of P, the match fails and the search ends;

3. If the key of a child node N begins with the first character of P:

A. if N.length >= P.length: the match succeeds if N.sub(0, P.length - 1) = P;

B. if N.length < P.length: if P.sub(0, N.length - 1) = N, then set P1 = P.substring(N.length) and root = N, and continue from step 1 with P1 in place of P.

If a hash is used to locate the child node directly, the time complexity is O(P.length), which shows how efficient the query is;
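The construction rule and the lookup steps above can be sketched in Java roughly as follows. The class name SuffixTrie and its methods are hypothetical, each node stores a multi-character key, and children are located through a hash map keyed by the first character. Building the structure by inserting every suffix in turn is the simplest possible construction, not an efficient one.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a suffix trie with multi-character node keys.
// Names are illustrative; construction is naive (quadratic in the text length).
class SuffixTrie {
    private static final class Node {
        String key;
        Map<Character, Node> children = new HashMap<>();

        Node(String key) {
            this.key = key;
        }
    }

    private final Node root = new Node("");

    SuffixTrie(String text) {
        for (int i = 0; i < text.length(); i++) {
            insert(text.substring(i));
        }
    }

    private void insert(String s) {
        Node node = root;
        while (!s.isEmpty()) {
            Node child = node.children.get(s.charAt(0));
            if (child == null) {
                node.children.put(s.charAt(0), new Node(s));
                return;
            }
            int k = commonPrefixLength(child.key, s);
            if (k < child.key.length()) {
                // Split: the shared prefix stays in child, the remainder of
                // the old key moves into a new node below it.
                Node rest = new Node(child.key.substring(k));
                rest.children = child.children;
                child.key = child.key.substring(0, k);
                child.children = new HashMap<>();
                child.children.put(rest.key.charAt(0), rest);
            }
            s = s.substring(k);
            node = child;
        }
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int k = 0;
        while (k < n && a.charAt(k) == b.charAt(k)) {
            k++;
        }
        return k;
    }

    // Follows the lookup steps: pick the child by first character, then
    // compare the node key with the remaining pattern (cases A and B).
    boolean containsSubstring(String p) {
        Node node = root;
        while (!p.isEmpty()) {
            Node child = node.children.get(p.charAt(0));
            if (child == null) {
                return false;                    // step 2: no matching child
            }
            if (child.key.length() >= p.length()) {
                return child.key.startsWith(p);  // case A
            }
            if (!p.startsWith(child.key)) {
                return false;
            }
            p = p.substring(child.key.length()); // case B: continue below child
            node = child;
        }
        return true;
    }
}
```

For example, new SuffixTrie("minimize").containsSubstring("nimi") returns true, while containsSubstring("zim") returns false.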

Compressed trie:

A compressed trie is similar to a standard trie and can also find a prefix string quickly, but it guarantees that every internal node in the trie has at least two child nodes (except the root node). It enforces this rule by compressing chains of single-child nodes into single edges. If a non-root internal node V of T has only one child node, we call V redundant, and a run of connected redundant nodes forms a redundant chain; such a chain can be replaced with a single edge, as shown in the following figure:

One great advantage of this compressed representation is that no matter how long the string a node needs to store, it can be represented by a triple, and the space occupied by a triple is fixed and bounded. As shown in the following illustration:

So far we have covered the principles of forward maximum matching, reverse maximum matching, and bidirectional maximum matching, along with their algorithm flow, strengths and weaknesses, performance, and suitable data structures. It has to be said, though, that what really solves the ambiguity problem in word segmentation is language modeling over corpora, that is, using a statistical language model for segmentation. Its accuracy is a whole order of magnitude better than that of dictionary-based segmentation algorithms. There is material online about how Google builds its lexicon based on statistical ideas.

When it comes to statistics, we have to mention the great mathematician Bayes: it is his naive Bayesian theory that provides a good theoretical basis for segmentation technology. Many Google applications are based on this model, such as Google Translate, which we use often. The statistical language model looks roughly like this:

Suppose a sentence can be segmented in n different ways:

(1) Segmentation 1: A1, A2, A3, ..., Aj;

(2) Segmentation 2: B1, B2, B3, ..., Bj;

......

(n) Segmentation n: N1, N2, N3, ..., Nj;

Among these n segmentations, the one with the highest probability is taken to be the most accurate. In mathematical terms: P(Y|X) ∝ P(Y) * P(X|Y), where X is the character string (the sentence) and Y is a word string (one particular segmentation hypothesis). We just need to find the Y that maximizes P(Y|X).

Expanding the joint probability:

P(Y) = P(W1, W2, W3, ...) = P(W1) * P(W2|W1) * P(W3|W1, W2) * ...

We assume that the probability of a word occurring in a sentence depends only on a limited number k of words before it. If it depends only on the one preceding word, this is a bigram language model (2-gram); 3-gram and so on are defined in the same way. Of course, exhaustively enumerating all possible segmentations and computing the probability of each one is computationally very expensive. This is where more practical and efficient algorithms, such as dynamic programming, come in.
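To show how dynamic programming avoids enumerating every segmentation, here is a minimal Java sketch. To keep it short it assumes the simplest possible model, a unigram model in which each word is independent of its predecessors, so P(Y) is just the product of the word probabilities; a bigram model would additionally condition each word on the previous one. The class name UnigramSegmenter, the probability table, and the fallback probability for unknown single characters are all made-up assumptions, and Latin letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Unigram segmentation by dynamic programming (illustrative sketch).
class UnigramSegmenter {
    private final Map<String, Double> wordProbability;
    private final int maxWordLength;
    private final double unknownCharProbability; // assumed fallback value

    UnigramSegmenter(Map<String, Double> wordProbability, int maxWordLength,
                     double unknownCharProbability) {
        this.wordProbability = wordProbability;
        this.maxWordLength = maxWordLength;
        this.unknownCharProbability = unknownCharProbability;
    }

    List<String> segment(String text) {
        int n = text.length();
        // best[i] = highest log-probability of any segmentation of text[0, i)
        double[] best = new double[n + 1];
        int[] prev = new int[n + 1]; // start index of the last word ending at i
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = Math.max(0, i - maxWordLength); j < i; j++) {
                Double p = wordProbability.get(text.substring(j, i));
                if (p == null) {
                    if (i - j > 1) {
                        continue; // unknown multi-character strings are not words
                    }
                    p = unknownCharProbability;
                }
                double score = best[j] + Math.log(p);
                if (score > best[i]) {
                    best[i] = score;
                    prev[i] = j;
                }
            }
        }
        // Walk the back pointers from the end to recover the best word string.
        List<String> words = new ArrayList<>();
        for (int i = n; i > 0; i = prev[i]) {
            words.add(text.substring(prev[i], i));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new HashMap<>();
        probs.put("the", 0.05);
        probs.put("theta", 0.0001);
        probs.put("table", 0.002);
        probs.put("bled", 0.0001);
        probs.put("own", 0.002);
        probs.put("down", 0.004);
        probs.put("there", 0.01);
        UnigramSegmenter seg = new UnigramSegmenter(probs, 5, 1e-8);
        System.out.println(seg.segment("thetabledownthere"));
    }
}
```

With these made-up probabilities the program prints [the, table, down, there]: the split through "theta" and "bled" is also possible, but its probability is several orders of magnitude lower. The table is filled in O(n × L) substring lookups for a text of n characters and a maximum word length L, instead of scoring every possible segmentation.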

If you want the segmentation to be even more accurate, you can add another optimization on top of the statistical probabilities: run searches with the segmentation results and use the search results as feedback on whether a given segmentation is reasonable or common.

Returning to the topic of word segmentation in search engines, one point needs to be made: segmentation is not just the act of cutting text; a lot of other work is involved.

Pre-processing: before segmentation, sentences also go through encoding conversion, whitespace removal, normalization, number recognition, English recognition, person-name recognition, place-name recognition, and a series of similar conversions. Search engines such as Baidu and Taobao all need this step, and its output makes the segmentation itself work better.

Post-processing: after segmentation, we still need word merging, suffix handling, compound-word handling for two-character words, phrase correction, multi-output part-of-speech tagging, restoring whitespace, encoding conversion, and so on.

The above was mostly theory, so let's look at some practical examples. In the Java open-source community, Lucene is updated constantly and keeps getting faster, and enterprise use of Lucene and Solr keeps growing, so Java word segmenters have been springing up as a result. The most memorable ones are probably IK and Paoding. IK has been covered extensively on ITeye, and for regular Lucene users Paoding is probably the most familiar.

Let's look at Paoding first. Performance: on a PIII personal machine with 1 GB of memory, it can accurately segment 1 million Chinese characters per second. It segments text using dictionary files with no limit on the number of entries. Looking at the Paoding code: it has a Beef class, which is the "ox", and the ox can then be cut up with any number of Knife implementations, that is, "butchered":

All that is left is to specify in the configuration file which kind of Knife to use for segmentation. You can also cut with several knives at once: the KnifeBox and SmartKnifeBox classes support cutting with multiple Knife instances, and the box's main job is to decide, for each piece of text it encounters, which specific Knife should cut it. Take CJKKnife as an example: its algorithm is forward maximum matching, starting from the beginning of the string and looking for the longest match in the dictionary;

Paoding defines a set of Dictionary classes for lookups, including a DictionaryDelegate class that serves as an extension point for implementing our own dictionaries;

Let's look at the HashBinaryDictionary class:

private Word[] ascWords;

This field means that all the words are loaded into memory.

/**
 * First character to sub-dictionary mapping
 */
private Map /* <Object, SubDictionaryWrap> */ subs;

This field holds the mapping from first character to sub-dictionary. That is, when the dictionary is small, a single BinaryDictionary can hold all of its entries directly; but when the dictionary is large, it is split by first character into multiple sub-dictionaries, so that each lookup can hash straight to the right sub-dictionary and then search within that small sub-dictionary. This greatly improves lookup efficiency during segmentation.
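The idea of hashing on the first character and then searching a small sorted sub-dictionary can be sketched as follows. This is not Paoding's actual code: the class FirstCharDictionary and its methods are hypothetical, and a plain sorted String array with Arrays.binarySearch stands in for the sub-dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a "hash by first character, then binary search" dictionary.
// Illustrative only; not the Paoding implementation.
class FirstCharDictionary {
    // first character -> sorted array of all words starting with it
    private final Map<Character, String[]> subs = new HashMap<>();

    FirstCharDictionary(List<String> words) {
        Map<Character, List<String>> groups = new HashMap<>();
        for (String w : words) {
            if (w.isEmpty()) {
                continue;
            }
            groups.computeIfAbsent(w.charAt(0), c -> new ArrayList<>()).add(w);
        }
        for (Map.Entry<Character, List<String>> e : groups.entrySet()) {
            String[] sorted = e.getValue().toArray(new String[0]);
            Arrays.sort(sorted); // binary search requires sorted input
            subs.put(e.getKey(), sorted);
        }
    }

    boolean contains(String word) {
        if (word.isEmpty()) {
            return false;
        }
        // One hash lookup narrows the search to a small sub-dictionary...
        String[] sub = subs.get(word.charAt(0));
        // ...and a binary search finishes the job inside it.
        return sub != null && Arrays.binarySearch(sub, word) >= 0;
    }

    public static void main(String[] args) {
        FirstCharDictionary dict = new FirstCharDictionary(
                Arrays.asList("the", "theta", "table", "down", "there"));
        System.out.println(dict.contains("table")); // true
        System.out.println(dict.contains("tab"));   // false
    }
}
```

Compared with one binary search over the whole word list, the hash step cuts the search range down to the words that share a first character, which matters when the dictionary holds hundreds of thousands of entries.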

When a word that should be cut out is found, the collect method of the Collector class is called to save the segmentation result:

Paoding also supports adding a filter dictionary: if the word being analyzed is in the filter list, it is not emitted as a segment.

Paoding is now widely used; Taobao's "final search" Java search server application also uses the Paoding segmentation framework.

IKAnalyzer started out as a Chinese word segmentation component built around the open-source project Lucene, combining dictionary-based segmentation with grammar analysis. It uses a "forward iteration, finest-granularity segmentation algorithm" and supports two segmentation modes, fine-grained and maximum word length, with a high-speed processing capacity of 830,000 characters per second (1600 KB/s). It uses a multi-sub-processor analysis mode and supports segmentation of English letters, numbers, and Chinese words, and is compatible with Korean and Japanese characters. Its dictionary storage is optimized for a smaller memory footprint, and it supports user dictionary extensions. IKQueryParser arranges the ambiguous segmentation results into non-conflicting combinations for querying, and Lucene 3.0 is well supported. It is worth trying; the implementation details are not covered here, so please refer to the relevant documentation.

There is one more Java Chinese word segmentation framework that has to be mentioned: mmseg4j, a Java implementation of MMSEG for Chinese word segmentation. The MMSEG algorithm, whose original version was implemented in C, has two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four rules. The officially reported correct recognition rate is 98.41%, and mmseg4j implements both methods. Versions 1.5 to 1.6 use about 10 MB of memory; the Simple algorithm segments at 1.1 MB/s and the Complex algorithm at 700 KB/s. Version 1.7 uses 50 MB of memory, with Complex at 1.2 MB/s and Simple at 1.9 MB/s. mmseg4j also supports Lucene and Solr well. There is a corresponding C++ version as well, libmmseg, which is used for segmentation in Coreseek, an open-source search engine developed on top of Sphinxsearch. In fact, Sphinxsearch itself already supports the MMSEG segmentation algorithm. Let's look at how the MMSEG algorithm works.

As mentioned above, the MMSEG algorithm is also based on forward maximum matching, but it can be quite accurate because it adds four rules. The rules rely on one concept, the chunk: a chunk is one way of segmenting a sentence (a candidate segmentation result). For each chunk, the following attributes are defined: length, average length, variance (the square of the standard deviation), and degree of morphemic freedom:

Attribute                      Meaning
Length                         The sum of the lengths of the words in the chunk
Average length                 Length / number of words
Variance                       The square of the standard deviation, as defined in mathematics
Degree of morphemic freedom    The sum of the logarithms of the frequencies of the single-character words

The main rules are as follows:

Rule 1: Maximum matching. Take the chunks with the greatest length.

Rule 2: Largest average word length. Take the chunks with the greatest average length.

Rule 3: Smallest variance of word lengths. Take the chunks whose word lengths have the smallest variance.

Rule 4: Largest sum of degree of morphemic freedom of one-character words. Take the chunks whose single-character words have the largest sum of log frequencies. This rule uses a word frequency dictionary. For example, the frequency of "的" ("of") is very high, so we tend to treat it as a word on its own, which means that a sentence where it is part of a longer word (the original example is a word glossed as "true") is not necessarily segmented correctly.

The chunk that survives these rules is the final segmentation result (the rules are applied starting from rule 1 until only one chunk remains). I had intended to close by walking through the core code of the C implementation of MMSEG, but the code is too complex, so I gave up on that. Readers who want a challenge can go and read it. Since the segmentation algorithms in popular use today are dictionary-based, the algorithms based on statistical probability are not covered any further here.
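The rule cascade can be illustrated with a short Java sketch. It is not the C implementation mentioned above: the class MmsegRules, the representation of a chunk as a list of words, and the frequency table are assumptions made for illustration, and the step that generates candidate chunks by matching against the dictionary is left out. Latin letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Sketch of the four MMSEG filtering rules over candidate chunks.
// Assumes a non-empty list of non-empty chunks; names are illustrative.
class MmsegRules {
    private final Map<String, Integer> charFrequency; // single-character word frequencies

    MmsegRules(Map<String, Integer> charFrequency) {
        this.charFrequency = charFrequency;
    }

    static double length(List<String> chunk) {
        double sum = 0;
        for (String w : chunk) {
            sum += w.length();
        }
        return sum;
    }

    static double averageLength(List<String> chunk) {
        return length(chunk) / chunk.size();
    }

    static double variance(List<String> chunk) {
        double mean = averageLength(chunk);
        double sum = 0;
        for (String w : chunk) {
            sum += (w.length() - mean) * (w.length() - mean);
        }
        return sum / chunk.size();
    }

    // Sum of log frequencies of the single-character words; characters
    // missing from the table count as frequency 1 (log 1 = 0), an assumption.
    double morphemicFreedom(List<String> chunk) {
        double sum = 0;
        for (String w : chunk) {
            if (w.length() == 1) {
                sum += Math.log(charFrequency.getOrDefault(w, 1));
            }
        }
        return sum;
    }

    // Keeps every chunk whose score equals the best score (within a tolerance).
    private static List<List<String>> keepBest(List<List<String>> chunks,
                                               ToDoubleFunction<List<String>> score) {
        double best = Double.NEGATIVE_INFINITY;
        for (List<String> c : chunks) {
            best = Math.max(best, score.applyAsDouble(c));
        }
        List<List<String>> kept = new ArrayList<>();
        for (List<String> c : chunks) {
            if (best - score.applyAsDouble(c) < 1e-9) {
                kept.add(c);
            }
        }
        return kept;
    }

    // Applies rules 1 to 4 in order, stopping once a single chunk is left.
    List<String> choose(List<List<String>> chunks) {
        List<List<String>> kept = keepBest(chunks, MmsegRules::length);
        if (kept.size() > 1) kept = keepBest(kept, MmsegRules::averageLength);
        if (kept.size() > 1) kept = keepBest(kept, c -> -variance(c));
        if (kept.size() > 1) kept = keepBest(kept, this::morphemicFreedom);
        return kept.get(0);
    }

    public static void main(String[] args) {
        MmsegRules rules = new MmsegRules(new HashMap<>());
        System.out.println(rules.choose(Arrays.asList(
                Arrays.asList("ab", "cd", "ef"),
                Arrays.asList("abc", "d", "ef"),
                Arrays.asList("ab", "c", "d"))));
    }
}
```

In this example rule 1 removes the third chunk (length 4 against 6), rule 2 cannot separate the remaining two (both average 2), and rule 3 picks [ab, cd, ef] because its word lengths have zero variance.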