Framework and implementation of a word segmentation system: an article for readers with a good grasp of search engine concepts (original)
Keywords: search engine, word segmentation, Lucene
Vertical e-commerce and information-sharing applications in China are growing rapidly, and with them the demand for fast content search. Search engine solutions tailored to a specific business have therefore become increasingly important. At the same time, there are more and more open source search engine frameworks and solutions to choose from, such as the well-known Lucene, Solr, and Elasticsearch. Building a search engine that fits your business needs well is a challenging task. Whether you choose a framework such as Lucene or Solr or write a search engine yourself, you face a common core problem: how to build a word segmenter (tokenizer) suited to your business.
I. Principles of word segmentation
A classic joke: a nurse sees a cirrhosis patient secretly drinking in the ward, walks over and warns him: "小心肝!" ("Mind your liver!"). The patient smiles and replies: "小宝贝!" ("Little darling!"). The phrase "小心肝" is ambiguous. The nurse means 小心/肝 ("careful / liver"), while the patient hears 小/心肝 ("little / sweetheart"). Chinese is full of situations like this, where accurate word segmentation is needed to resolve ambiguity.
Word segmentation is extremely important for a search engine: it affects indexing performance and directly affects the accuracy of search results and the computation of relevance.
1.2 English word segmentation
English (like all Latin-script and similar languages) naturally uses the word as its basic unit, so segmentation is relatively easy. The basic steps are as follows:
- Initial segmentation on spaces, symbols, and paragraph breaks to obtain basic words
- Stop word filtering
- Stemming
- Lemmatization
The first step is straightforward and needs no further explanation.
The stop words in the second step are high-frequency words such as a/an/and/are. They carry little meaning for search but strongly distort formulas based on term frequency, so they need to be filtered out.
The third step, stemming, is a treatment specific to western languages. English words have variants such as singular and plural forms and -ing and -ed forms, but these should count as the same word when relevance is calculated. For example, apple and apples, or doing and done, are the same word, and the purpose of stemming is to merge such variants. This better matches users' search habits and their expectations of the returned content.
Lucene's English analysis comes with 3 commonly used stemming algorithms, including:
- English Minimal Stemmer
- The well-known Porter stemmer
Stemming is not a complex algorithm. It is basically a set of rules or mapping tables and is easy to program, but writing it requires an expert in the language who knows its word formation very well.
The fourth step, lemmatization, is basically done through dictionary mapping, for example restoring drove to drive. Stemming, by contrast, usually shortens a word, such as doing to do. In practice, stemming already solves most word-form conversion problems in western languages, and lemmatization further improves the search experience.
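The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word set, the lemma table, and the suffix rules are tiny hypothetical samples, and the toy `stem` function is not the Porter algorithm or any Lucene stemmer.

```python
import re

# Tiny illustrative resources (hypothetical); real systems use much larger lists.
STOP_WORDS = {"a", "an", "and", "are", "the", "of", "is"}
LEMMAS = {"drove": "drive", "done": "do"}  # irregular form -> base form


def stem(word):
    """Strip a few common English suffixes (toy rules, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def analyze(text):
    # Step 1: initial split on anything that is not a letter or digit.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    # Step 2: stop word filtering.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Steps 3 and 4: dictionary lookup for irregular forms, otherwise stemming.
    return [LEMMAS.get(t, stem(t)) for t in tokens]


print(analyze("The apples are doing well and he drove"))
```

With these sample rules, apples is reduced to apple by the suffix rule and drove is mapped to drive by the dictionary, which mirrors the difference between stemming and lemmatization described above.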
1.3 Number segmentation
Regular search engines such as Lucene treat numbers as plain strings by default, so they are segmented in much the same way as English. The token for a value such as 3.1415926 is the value itself, and a search must enter the full 3.1415926 to hit it exactly.
Full-string matching of this kind is not very friendly to users. Custom search engines often optimize it to support searches that are more aware of numeric values. For example, if the numbers are extracted as numeric values, range searches and other numeric searches become possible (discussed further in the section on differences in vertical-domain segmentation systems).
1.4 Chinese word segmentation
A perfect Chinese word segmentation system is a world-class problem. It must understand not only Chinese grammatical structure but also Chinese semantics, and it must adapt to context. Moreover, new Chinese words appear quickly and their meanings keep changing. So far, no person or organization has claimed that their Chinese word segmenter is 100% accurate. Perhaps more mature artificial intelligence will bring the accuracy of Chinese segmentation close to 100%.
Existing segmentation schemes:
- Unigram and bigram segmentation
- Dictionary-based longest match
- Segmentation based on a word graph
- Segmentation based on a probabilistic language model
Unigram and bigram segmentation
Unigram segmentation makes each Chinese character its own term. Search results are then very poor: a normal search for 上海 ("Shanghai") also returns completely unrelated documents that merely contain 海 ("sea"). What Lucene provides by default is bigram segmentation, which splits 上海人 ("Shanghai people") into 上海/海人 ("Shanghai" / "sea people"). The advantage is that it avoids the completely meaningless results of unigram segmentation. However, bigram segmentation still produces meaningless terms such as 海人 ("sea people"), and these are searchable, so the effect remains unsatisfactory.
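As a minimal sketch of the two schemes (plain Python, not Lucene's implementation), the function names `unigrams` and `bigrams` below are chosen only for illustration:

```python
def unigrams(text):
    """Unigram segmentation: each character becomes its own term."""
    return list(text)


def bigrams(text):
    """Bigram segmentation: every pair of adjacent characters becomes a term."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(unigrams("上海"))   # the term 海 alone would match unrelated documents
print(bigrams("上海人"))  # includes the meaningless term 海人
```

The overlapping window is what creates terms such as 海人 ("sea people") that no user intends to search for.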
Dictionary-based longest match
One dictionary-based method builds the dictionary into a trie (prefix tree). Each node holds one character, and the node where a word ends also stores that word's information, such as part of speech and weight. The trie can be implemented as multiple layers of hash tables, so a lookup at each layer takes O(1) time and matching is very fast.
The following shows the trie subtree generated from the set of phrases <10,000, 10,000, 10,000 yuan, one morning, one afternoon, all of a sudden>:
(Trie tree example)
The text is matched against the trie layer by layer until the trie has no further level or the next character of the text matches nothing at that level. The segmentation obtained this way is the dictionary-based longest match.
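The layer-by-layer matching can be sketched with nested Python dictionaries, where each dictionary level plays the role of one hash-table layer of the trie. The sample dictionary, its weights, and the `END` marker are assumptions made for this illustration only.

```python
END = "$end"  # marker key; cannot collide with a single-character key


def build_trie(words):
    """Build a nested-dict trie; each level maps one character to a child node."""
    root = {}
    for word, info in words.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = info  # word information (here: a weight) is kept in the node
    return root


def longest_match(trie, text, start):
    """Return the longest dictionary word beginning at text[start], or None."""
    node, best = trie, None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            best = text[start:i + 1]
    return best


def segment(trie, text):
    """Forward longest-match segmentation; unknown characters pass through alone."""
    out, i = [], 0
    while i < len(text):
        word = longest_match(trie, text, i) or text[i]
        out.append(word)
        i += len(word)
    return out


# Hypothetical sample dictionary: word -> weight.
trie = build_trie({"一万": 1, "一万元": 1, "一上午": 1, "一下午": 1, "一下子": 1})
print(segment(trie, "一万元一下子"))
```

Because the walk keeps going after the first complete word, 一万元 wins over its prefix 一万.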
Segmentation based on a word graph
This method also depends on a dictionary. To eliminate ambiguity and improve accuracy, we find every possible word in a passage and build a full segmentation word graph.
(Chinese word segmentation path)
For the phrase 有意见分歧 ("there is a disagreement"), the graph contains two possible paths:
- Path 1: 0-1-3-5, corresponding to 有/意见/分歧 (have / opinion / disagreement)
- Path 2: 0-2-3-5, corresponding to 有意/见/分歧 (intentionally / see / disagreement)
The final segmentation can be obtained with a dynamic programming algorithm, with the result decided by combining factors such as part of speech and weight.
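A small sketch of how such a word graph could be built and its paths enumerated is shown below. The five-word dictionary is a hypothetical sample, and the exhaustive enumeration in `all_paths` is used only to make the two paths visible; a real segmenter would score paths with dynamic programming instead of listing them all.

```python
def build_word_graph(text, dictionary):
    """Return edges[i] = end positions j such that text[i:j] is a dictionary word."""
    edges = {i: [] for i in range(len(text))}
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in dictionary:
                edges[i].append(j)
    return edges


def all_paths(text, edges, start=0):
    """Enumerate every segmentation that covers the text from start to the end."""
    if start == len(text):
        return [[]]
    paths = []
    for end in edges[start]:
        for rest in all_paths(text, edges, end):
            paths.append([text[start:end]] + rest)
    return paths


dictionary = {"有", "有意", "意见", "见", "分歧"}  # hypothetical sample
text = "有意见分歧"
edges = build_word_graph(text, dictionary)
for path in all_paths(text, edges):
    print("/".join(path))
```

The edge lists correspond to the node numbers in the paths above: position 0 has edges to 1 and 2, positions 1 and 2 both lead to 3, and 3 leads to 5.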
Segmentation based on a probabilistic language model
From a statistical point of view, segmentation can be framed as follows: the input is a character string C = c1, c2, c3, ..., cn and the output is a word string S = w1, w2, w3, ..., wm (m <= n). A given string C may have several candidate segmentations S. Segmentation means choosing the S with the highest probability, that is, cutting the input string into its most likely word sequence.
For the input string C = 有意见分歧 ("there is a disagreement"), two segmentations S1 and S2 are possible:
- S1: 有/意见/分歧 (have / opinion / disagreement)
- S2: 有意/见/分歧 (intentionally / see / disagreement)
Compute the conditional probabilities P(S1|C) and P(S2|C) and choose the segmentation with the larger value. By Bayes' formula, P(S|C) = P(C|S) * P(S) / P(C), where P(C) is the probability of the string appearing in the corpus and is a constant for a given input. There is only one way to recover the character string from a word string, so P(C|S) = 1. Comparing P(S1|C) with P(S2|C) therefore reduces to comparing P(S1) with P(S2). For the further derivation, see chapter 4 of "Decryption Search Engine Technology Combat: Lucene & Java" (Essence Edition, 2nd edition).
Seen another way, finding the maximum probability is equivalent to finding the shortest path through the segmentation word graph (for example, when each edge is weighted by the negative logarithm of its word's probability), and the shortest path can be found by dynamic programming.
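The equivalence between maximum probability and shortest path can be illustrated with a short dynamic programming sketch. It assumes a simple unigram model in which P(S) is the product of the individual word probabilities; the probabilities in `word_prob` are invented for the example and do not come from any real corpus.

```python
import math


def best_segmentation(text, word_prob):
    """Dynamic programming over the word graph.

    cost[j] is the smallest sum of -log P(w) over segmentations of text[:j],
    so the lowest-cost path is the highest-probability segmentation.
    """
    n = len(text)
    cost = [0.0] + [math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            word = text[i:j]
            if word in word_prob:
                candidate = cost[i] - math.log(word_prob[word])
                if candidate < cost[j]:
                    cost[j], back[j] = candidate, i
    if math.isinf(cost[n]):
        return None  # the dictionary cannot cover the whole text
    words, j = [], n
    while j > 0:
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]


# Hypothetical unigram probabilities.
word_prob = {"有": 0.018, "有意": 0.0005, "意见": 0.001, "见": 0.0002, "分歧": 0.0001}
print(best_segmentation("有意见分歧", word_prob))
```

With these sample numbers, P(S1) = 0.018 * 0.001 * 0.0001 is larger than P(S2) = 0.0005 * 0.0002 * 0.0001, so the sketch selects S1.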
Probability-based segmentation relies on a sufficiently large corpus and on statistical analysis of that corpus, so it belongs to the class of methods that learn in advance. At present, some of the highest-quality segmentation methods are based on probability statistics.
II. Differences in word segmentation systems for vertical domains
Compared with the complex corpus of a general-purpose search engine, the corpus data of a vertical search engine is usually cleaned before it is indexed, which reduces its complexity. A vertical domain has a great deal of specialized terminology, and user search habits differ from those of a general search engine. A vertical search engine therefore needs to appear smarter and to understand the user's search intent better, for example by automatically correcting misspelled professional vocabulary or suggesting more categorized result sets in the input box. These features depend on accurate segmentation in the search engine and on recommendation algorithms built on that segmentation.
2.1 English word segmentation
In a vertical domain, the traditional English segmentation steps are usually reduced. Depending on the corpus, stemming is sometimes omitted and stop word filtering is sometimes omitted, and lemmatization is generally not performed. What to do depends mainly on the corpus being indexed. In the steel information industry, for example, there is a very large amount of steel-related terminology, and much of the English consists of abbreviations, so stemming and lemmatization may be counterproductive. In addition, because industry information is highly concentrated and both segmentation and indexing must be fast, the English segmentation steps are kept as concise as possible without sacrificing accuracy.
Vertical search engines also support more advanced query syntax to help users find results when they are unsure of the exact words; this is described in the Chinese word segmentation section below.
2.2 Number segmentation
Numbers are an outlier in a text search engine: by default, numbers that appear in text are treated as text. Users must search for a run of digits as if it were a word, and are confused if they try to search for it as a number. Two questions are typical. First, how do you search for a range of values in text? Second, how do you search for a long number? For example, suppose 1.5, 2.0, and 3.0 appear in a document: can searching for the numeric range 2.5-4.0 match it? And how can a long value such as 3.1415926 be found?
Lucene has the concept of a NumericField, which essentially converts numeric text into numeric values to support exact and range searches. To use a NumericField, the values must first be parsed out of the text and indexed into a separate NumericField.
In practice, vertical search engines tend to clean text fields that contain key values, extract those values, and index them separately. When searching for numeric values, the NumericField must be mapped to its corresponding text field. In the original text field, numbers are also split by length, for example splitting 3.1415926 into 4-character pieces: 3.14/1592/6. This lets users find the number with a shorter string such as 3.14. Entering the full string naturally gives a better match.
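Both ideas, extracting values for range search and splitting long numbers by length, can be sketched in plain Python. This does not use Lucene's NumericField; the helper names `extract_values`, `in_range`, and `chunk_number` are hypothetical and only illustrate the processing described above.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integers and simple decimals


def extract_values(text):
    """Parse numbers out of text so they can be indexed as numeric values."""
    return [float(m) for m in NUMBER.findall(text)]


def in_range(values, low, high):
    """Range query over the extracted values."""
    return [v for v in values if low <= v <= high]


def chunk_number(number_text, size=4):
    """Split a long number string into fixed-length pieces."""
    return [number_text[i:i + size] for i in range(0, len(number_text), size)]


values = extract_values("1.5,2.0,3.0")
print(in_range(values, 2.5, 4.0))   # only 3.0 falls inside the range
print(chunk_number("3.1415926"))    # pieces 3.14, 1592 and 6
```

Keeping the numeric values separate from the text pieces means a range query and a short string query such as 3.14 can both hit the same document.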
2.3 Chinese word segmentation
As with English, the complexity of Chinese word segmentation in a vertical search engine depends on the corpus. Cleaning the Chinese corpus can essentially remove Chinese text that is unrelated to the business or meaningless; the remainder is then segmented at high speed.
Because no segmentation method is 100% accurate, advanced search supports additional query syntax. Typical advanced searches are:
- Wildcard query (wildcard search)
Wildcard search handles queries where only part of the content is known. For example, searching for "中*" can return documents containing words such as "中间" (middle) and "中国" (China).
- Fuzzy query (fuzzy search)
Fuzzy queries let users find the desired results even when the input contains an error. For example, a fuzzy search for "Bao Gang Co., Ltd." can still return "Baosteel Co., Ltd.".
Advanced search relates to segmentation because a user's search input is often compiled into a complex query. The segmenter works together with the query syntax parser to cut out the correct query terms and execute the final query.
The two advanced search features above also apply to searches in English.
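The two query types can be illustrated over a small list of indexed terms. This is a sketch under assumptions rather than how any particular engine implements them: wildcard matching uses Python's standard `fnmatch.fnmatchcase` (which also gives `?` and `[...]` special meaning), fuzzy matching uses Levenshtein edit distance as one common notion of "close enough", and the term lists are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_search(terms, pattern):
    """Return indexed terms matching a pattern where * matches any run of characters."""
    return [t for t in terms if fnmatchcase(t, pattern)]


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def fuzzy_search(terms, query, max_edits=1):
    """Return indexed terms within max_edits edits of the query."""
    return [t for t in terms if edit_distance(t, query) <= max_edits]


print(wildcard_search(["中间", "中国", "上海"], "中*"))
print(fuzzy_search(["baosteel", "shanghai"], "baostel"))
```

In both cases the query is expanded against the terms the segmenter produced at index time, which is why segmentation quality and advanced search are closely linked.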
III. Implementation of the Zhaogang (找钢网) word segmentation system
The Zhaogang search business has the following characteristics:
- Text content is generated across a wide span of time but is concentrated in certain periods, so indexing speed must be guaranteed
- Once content has been indexed, it needs to be searchable as soon as possible
We built a complete word segmentation system around these business and corpus characteristics.
3.1 Segmentation framework and process
The first step is to obtain the base tokens. The corpus content is split by language into English words, numbers, Chinese passages, reserved words, and useless words. Each language's segmenter then splits its part further: for example, Chinese passages are segmented into Chinese words, and numbers are split into 4-character pieces.
In the second step, the base tokens from the first step are combined according to custom requirements. For example, two numbers with a decimal point between them are combined into one token, and in a search statement a wildcard character is combined with its neighboring term into a single term.
The third step filters the tokens, keeping the useful ones and discarding the useless ones to produce the final tokens.
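The three steps can be sketched as a small pipeline. The regular expression, the `USELESS` set, and the function names are assumptions made for this example; the sketch leaves each run of Chinese characters as one token, whereas the system described above would hand it to the Chinese segmenter for further splitting.

```python
import re

# Step 1 pattern: English words, digit runs, CJK runs, or any other single character.
BASE = re.compile(r"[A-Za-z]+|\d+|[\u4e00-\u9fff]+|\S")
USELESS = {",", ".", ";", "!", "?", "(", ")"}  # hypothetical list of discarded tokens


def base_tokens(text):
    """Step 1: split the text into base tokens by character class."""
    return BASE.findall(text)


def combine(tokens):
    """Step 2: merge the sequence digits, '.', digits into one decimal token."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 2 < len(tokens) and tokens[i].isdigit()
                and tokens[i + 1] == "." and tokens[i + 2].isdigit()):
            out.append("".join(tokens[i:i + 3]))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def final_tokens(text):
    """Step 3: discard useless tokens and keep the rest."""
    return [t for t in combine(base_tokens(text)) if t not in USELESS]


print(final_tokens("steel 3.14 mm, 钢材价格"))
```

Combining before filtering matters here: the decimal point is merged into 3.14 in step 2, so the filter in step 3 only removes stray punctuation.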
3.2 Chinese, Japanese, and Korean word segmentation
Chinese, Japanese, and Korean share some overlapping characters, and all three can be segmented with the same dictionary-based method. This section mainly describes Chinese word segmentation.
Chinese word segmentation uses the dictionary-based longest match method. To eliminate some basic ambiguities, we run both forward and reverse dictionary matching, and then select the better result according to the weights of the segmented words.
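A minimal sketch of bidirectional longest matching follows. The dictionary, its weights, the maximum word length of 4, and the scoring rule (sum of word weights, ties going to the forward result) are all assumptions for illustration; the source does not specify how the weights are combined.

```python
def forward_match(text, weights, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in weights:  # fall back to a single character
                out.append(word)
                i += size
                break
    return out


def backward_match(text, weights, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in weights:
                out.append(word)
                j -= size
                break
    return out[::-1]


def score(words, weights):
    """Total weight of dictionary words; unknown single characters score 0."""
    return sum(weights.get(w, 0) for w in words)


def segment(text, weights):
    fwd, bwd = forward_match(text, weights), backward_match(text, weights)
    return fwd if score(fwd, weights) >= score(bwd, weights) else bwd


# Hypothetical dictionary with weights.
weights = {"研究": 5, "研究生": 3, "生命": 5, "起源": 4}
print(segment("研究生命起源", weights))
```

For this sample input, forward matching yields 研究生/命/起源 while reverse matching yields 研究/生命/起源, and the weights favor the reverse result.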
After a series of tests and comparisons, with the existing dictionary and corpus, the current Chinese segmentation method reaches an accuracy of more than 85%; combined with advanced search, search coverage can approach 100%. Segmentation speed exceeds 200K terms per minute, which is fully acceptable.
As business requirements are refined and more corpus accumulates, a Chinese segmentation method based on n-gram segmentation and probability statistics will be implemented next, to improve segmentation accuracy and the search engine user experience.
References:
- "Decryption Search Engine Technology Combat: Lucene & Java" (Essence Edition, 2nd edition)
- "Lucene in Action", 2nd edition
- "Managing Gigabytes"