Original: http://3dobe.com/archives/44/

Introduction
It is impossible to work on search technology without touching word segmentation. There are two main reasons a search engine cannot simply be replaced by a database. First, query speed at scale: once a single database table grows large it becomes a headache, and fuzzy queries over large data volumes are hard for a database to handle well (prefix matching fares much better), so designs should avoid them. Second, a search engine understands the user better than a database does. This is certainly not achieved by plain matching alone; many layers of processing can be added, up to and including advanced natural language processing, but the more general and basic method is to rely on the word segmenter, which is a relatively simple and efficient approach.
Word segmentation is a cornerstone of search technology. Many people have used it, and if you just want to build a search engine quickly, you really don't need to know much about it. But once quality becomes a concern, there is a lot you can do at the segmenter level. For example, in our e-commerce search work, category prediction depends heavily on segmentation, at the very least requiring the ability to add rules to the segmenter dynamically. As a simpler example, if your optimization method is to weight different terms and boost certain key words, you will need to rely on and understand the segmenter. This article analyzes the implementation of the IK Analyzer based on its source code, focusing on three points: 1, construction of the dictionary tree, i.e. loading the dictionaries into an in-memory structure; 2, word matching, i.e. generating the lexeme segmentations of a sentence; 3, ambiguity judgment, i.e. deciding which of several possible segmentations is the most reasonable. The original code is at [https://code.google.com/p/ik-analyzer/](https://code.google.com/p/ik-analyzer/); I have uploaded it to GitHub, accessible at [https://github.com/quentinxxz/Search/tree/master/ikanalyzer2012ff_hf1_source/](https://github.com/quentinxxz/Search/tree/master/ikanalyzer2012ff_hf1_source/)
Dictionary
For any backend data work, the data source is where everything starts. The IK segmenter draws its words from three dictionaries: 1, the main dictionary main2012.dic; 2, the quantifier dictionary quantifier.dic; 3, the stopword dictionary stopword.dic.
The Dictionary class is the dictionary manager; it loads each dictionary into its in-memory structure. The core dictionary code lives in org.wltea.analyzer.dic.DictSegment. This class implements the segmenter's central data structure, the Trie tree.
A Trie tree (dictionary tree) is a fairly simple structure for building dictionaries. Words are matched by comparing characters one at a time along a path from the root, which makes lookup fast; because common prefixes are shared, it is also called a prefix tree. A concrete example follows.
Figure 1
Reading from the left, ABC, ABCD, ABD, B, BCD ... are the words stored in the tree. Chinese characters can of course be handled the same way, but there are far more Chinese characters than the 26 letters of English, so a character can no longer be encoded by its position in a fixed-size array (in English, each node could simply hold an array of length 26). A naive trie over Chinese would become quite sparse and memory-hungry, which is why a trie variant exists, the ternary search tree, which uses less memory. IK does not use a ternary tree, so for that please refer to http://www.cnblogs.com/rush/archive/2012/12/30/2839996.html
IK uses a simple implementation. First look at the members of the DictSegment class:
```java
class DictSegment implements Comparable<DictSegment> {

    // Shared dictionary table, storing Chinese characters
    private static final Map<Character, Character> charMap = new HashMap<Character, Character>(16, 0.95f);
    // Upper limit on the array size
    private static final int ARRAY_LENGTH_LIMIT = 3;

    // Map storage structure
    private Map<Character, DictSegment> childrenMap;
    // Array storage structure
    private DictSegment[] childrenArray;

    // The character stored on the current node
    private Character nodeChar;
    // Number of segments stored by the current node
    private int storeSize = 0;
    // State of the current DictSegment: default 0; 1 means the path from the root to this node is a word
    private int nodeState = 0;
    ...
```
There are two storage modes, switched on the ARRAY_LENGTH_LIMIT threshold: while the number of child nodes is below the threshold, children are stored in the array childrenArray; once the count exceeds the threshold, they are stored in childrenMap, implemented with a HashMap. The benefit is memory savings. A HashMap must pre-allocate memory and may waste it, but if everything were kept in sorted arrays (searched by binary search), lookups would no longer be O(1). So the two modes are combined: when a node has few children, the array is used; when it has many, everything is moved into a HashMap. During construction, each word is added to the dictionary tree one character at a time, as a recursive process:
```java
/**
 * Load and fill a dictionary fragment
 * @param charArray
 * @param begin
 * @param length
 * @param enabled
 */
private synchronized void fillSegment(char[] charArray, int begin, int length, int enabled) {
    ...
    // Search the current node's storage for keyChar; if absent, create it
    DictSegment ds = lookforSegment(keyChar, enabled);
    if (ds != null) {
        // Process the segment corresponding to keyChar
        if (length > 1) {
            // The lexeme has not been fully added to the dictionary tree yet
            ds.fillSegment(charArray, begin + 1, length - 1, enabled);
        } else if (length == 1) {
            // Already at the last char of the lexeme: set this node's state to enabled;
            // enabled=1 marks a complete word, enabled=0 masks the current word from the dictionary
            ds.nodeState = enabled;
        }
    }
}
```
Here lookforSegment searches among the current node's children: if the child count is below the ARRAY_LENGTH_LIMIT threshold, storage is the array and a binary search is used; if it is above the threshold, storage is the HashMap and the lookup is direct.
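To make this hybrid storage concrete, here is a minimal sketch of the lookup side of such a node. It is not IK's actual code: the class name TrieNode and the methods findChild, isWordEnd and hasChildren are made up for illustration; only the ARRAY_LENGTH_LIMIT idea mirrors the real class.

```java
import java.util.Arrays;
import java.util.Map;

// Illustrative sketch of DictSegment's hybrid child storage (not IK's code).
class TrieNode implements Comparable<TrieNode> {
    private static final int ARRAY_LENGTH_LIMIT = 3;

    private final Character nodeChar;             // character stored on this node
    private TrieNode[] childrenArray;             // kept sorted while child count <= limit
    private Map<Character, TrieNode> childrenMap; // used once child count > limit
    private int storeSize = 0;                    // number of children currently stored
    private int nodeState = 0;                    // 1 if a word ends at this node

    TrieNode(Character ch) { this.nodeChar = ch; }

    // Children are ordered by their character, which is what binary search relies on.
    public int compareTo(TrieNode other) { return this.nodeChar.compareTo(other.nodeChar); }

    boolean isWordEnd() { return nodeState == 1; }
    boolean hasChildren() { return storeSize > 0; }

    // Look up the child holding keyChar: O(log n) binary search in the small-array
    // case, O(1) hash lookup once the node has migrated to a HashMap.
    TrieNode findChild(char keyChar) {
        if (childrenMap != null) {
            return childrenMap.get(Character.valueOf(keyChar));
        }
        if (childrenArray != null) {
            int idx = Arrays.binarySearch(childrenArray, 0, storeSize, new TrieNode(keyChar));
            return idx >= 0 ? childrenArray[idx] : null;
        }
        return null;
    }
}
```

Insertion (not shown) would grow the sorted array until it reaches ARRAY_LENGTH_LIMIT children and then copy everything into the map, which is exactly the trade-off described above.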
Word segmentation
The IK segmenter basically has two modes: smart mode and non-smart mode. Take this text as an example:
张三说的确实在理 (what Zhang San said is indeed reasonable)

The segmentation result in smart mode is:

张三 | 说的 | 确实 | 在理

while the result in non-smart mode is:

张三 | 三 | 说的 | 的确 | 的 | 确实 | 实在 | 在理
Non-smart mode simply outputs every word that can be split out; in smart mode, the IK segmenter applies its built-in rules to output the single segmentation it considers most reasonable, which involves ambiguity judgment.
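To see the two modes side by side, the segmenter can be driven directly. The sketch below assumes the IKSegmenter class from this source tree (org.wltea.analyzer.core.IKSegmenter), whose constructor flag selects smart mode, and that the dictionaries are on the classpath; treat it as a usage sketch rather than canonical example code.

```java
import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKModeDemo {

    // Prints the segmentation of text: useSmart=true for smart mode,
    // useSmart=false for the exhaustive (non-smart) mode.
    static void segment(String text, boolean useSmart) throws IOException {
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        StringBuilder sb = new StringBuilder();
        Lexeme lex;
        while ((lex = ik.next()) != null) {
            if (sb.length() > 0) sb.append(" | ");
            sb.append(lex.getLexemeText());
        }
        System.out.println(sb);
    }

    public static void main(String[] args) throws IOException {
        String text = "张三说的确实在理";
        segment(text, true);   // one "best" segmentation
        segment(text, false);  // all words found in the dictionary
    }
}
```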
First look at the most basic building block, the Lexeme class:
```java
public class Lexeme implements Comparable<Lexeme> {
    ...
    // The lexeme's starting offset
    private int offset;
    // The lexeme's relative starting position
    private int begin;
    // The lexeme's length
    private int length;
    // The lexeme's text
    private String lexemeText;
    // The lexeme's type
    private int lexemeType;
    ...
```
A Lexeme can be understood as one word or one character of a segmentation, and begin is its position in the input text. Note that Lexeme implements Comparable: earlier starting position takes priority, and on equal starts the longer lexeme takes priority. This ordering determines a lexeme's place in the chain that represents a segmentation result, and it is what puts the words of the example above in order.
```java
/*
 * Comparison algorithm for lexemes in a sorted set
 * @see java.lang.Comparable#compareTo(java.lang.Object)
 */
public int compareTo(Lexeme other) {
    // Starting position first
    if (this.begin < other.getBegin()) {
        return -1;
    } else if (this.begin == other.getBegin()) {
        // Lexeme length takes priority
        if (this.length > other.getLength()) {
            return -1;
        } else if (this.length == other.getLength()) {
            return 0;
        } else { // this.length < other.getLength()
            return 1;
        }
    } else { // this.begin > other.getBegin()
        return 1;
    }
}
```
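For intuition, here is a self-contained toy (not IK code; Span is a made-up stand-in that copies the (begin, length) rules above) showing the order this comparison produces:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy stand-in for Lexeme reproducing its compareTo rules:
// smaller begin first; on ties, the longer span first.
class Span implements Comparable<Span> {
    final int begin, length;
    Span(int begin, int length) { this.begin = begin; this.length = length; }

    public int compareTo(Span o) {
        if (this.begin != o.begin) return this.begin < o.begin ? -1 : 1;
        if (this.length != o.length) return this.length > o.length ? -1 : 1;
        return 0;
    }

    public String toString() { return "(" + begin + "," + length + ")"; }
}

public class OrderDemo {
    public static void main(String[] args) {
        List<Span> spans = new ArrayList<Span>();
        spans.add(new Span(1, 2));
        spans.add(new Span(0, 1));
        spans.add(new Span(0, 2));
        Collections.sort(spans);
        // Prints [(0,2), (0,1), (1,2)]: earlier start first, longer first on ties.
        System.out.println(spans);
    }
}
```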
Another important structure is the lexeme chain, declared as follows:
```java
/**
 * Lexeme chain (path)
 */
class LexemePath extends QuickSortSet implements Comparable<LexemePath>
```
A LexemePath can be thought of as one segmentation result in the sense above: lexemes linked in order of position. Notice that it extends QuickSortSet, so as lexemes are added it sorts them internally, maintaining an ordered chain; the ordering rule is exactly the Lexeme compareTo method implemented above. You will also notice that LexemePath itself implements the Comparable interface, which is used in the subsequent ambiguity analysis, described in the next section.
Another important structure is AnalyzeContext, which mainly stores the input text, the read cursor, the segmented LexemePaths, the segmentation results, and other related context.
By default IK uses three sub-segmenters: LetterSegmenter (letters), CN_QuantifierSegmenter (quantifiers), and CJKSegmenter (CJK characters). Segmentation passes through all three; here we focus on CJKSegmenter. Its core is the analyze method.
```java
public void analyze(AnalyzeContext context) {
    ...
    // First handle the hits already pending in tmpHits
    if (!this.tmpHits.isEmpty()) {
        // Process the word-segment queue
        Hit[] tmpArray = this.tmpHits.toArray(new Hit[this.tmpHits.size()]);
        for (Hit hit : tmpArray) {
            hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor(), hit);
            if (hit.isMatch()) {
                // Output the current word
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(),
                        context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_CNWORD);
                context.addLexeme(newLexeme);
                if (!hit.isPrefix()) {
                    // Not a word prefix: the hit needs no further matching, remove it
                    this.tmpHits.remove(hit);
                }
            } else if (hit.isUnmatch()) {
                // The hit is not a word, remove it
                this.tmpHits.remove(hit);
            }
        }
    }

    //*********************************
    // Then do a single-character match on the character at the current cursor
    Hit singleCharHit = Dictionary.getSingleton().matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);
    if (singleCharHit.isMatch()) { // the single character is a word
        // Output the current character
        Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_CNWORD);
        context.addLexeme(newLexeme);
        // It may also be a word prefix at the same time
        if (singleCharHit.isPrefix()) {
            // A prefix match goes into the hit list
            this.tmpHits.add(singleCharHit);
        }
    } else if (singleCharHit.isPrefix()) { // the single character is a word prefix
        // A prefix match goes into the hit list
        this.tmpHits.add(singleCharHit);
    }
    ...
}
```
Looking at the second half of the code first, matchInMainDict matches the current character against the main dictionary, which has already been loaded into a dictionary tree, so the whole process is a descent from the root, one level per character; only a single character is handled here, with no recursion over the word. A match result is one of three states: UNMATCH (no match), MATCH (full match), and PREFIX (prefix match). MATCH means the path reached a node where a word ends; PREFIX means the path traversed so far exists in the tree but does not end a word at this node. A string can be both MATCH and PREFIX at once, like ABC in Figure 1. Prefix matches are stored in tmpHits, while complete matches are saved into the context.
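Reusing the TrieNode sketch from the dictionary section (again illustrative only: IK's real logic lives in DictSegment's match method and carries its state in a Hit object), the three-way classification could be expressed like this:

```java
// Illustrative only: classify a char range against the trie the way the
// main-dictionary match does, building on the TrieNode sketch shown earlier.
class MainDictMatcher {

    enum MatchState { UNMATCH, PREFIX, MATCH, MATCH_AND_PREFIX }

    static MatchState match(TrieNode root, char[] buf, int begin, int length) {
        TrieNode node = root;
        for (int i = begin; i < begin + length; i++) {
            node = node.findChild(buf[i]);
            if (node == null) {
                return MatchState.UNMATCH;  // the path fell off the tree
            }
        }
        boolean isWord = node.isWordEnd();        // nodeState == 1: a word ends here
        boolean extendable = node.hasChildren();  // longer words may still follow
        if (isWord && extendable) return MatchState.MATCH_AND_PREFIX; // e.g. ABC in Figure 1
        if (isWord) return MatchState.MATCH;
        return extendable ? MatchState.PREFIX : MatchState.UNMATCH;
    }
}
```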
Now look at the first half of the code. A prefix match should not end there, because a longer word may still be completed, so this part keeps extending the pending matches: matchWithHit continues matching from the position recorded in the current hit. Whenever that produces a full match, a new lexeme is added to the context.
In this way we obtain all the candidate words, which is at least enough for non-smart mode.
Ambiguity judgment
IKArbitrator (the ambiguity arbiter) is the main class that handles ambiguity.
If my explanation is not clear enough, you can also refer to this blog post: http://fay19880111-yeah-net.iteye.com/blog/1523740
In the previous section, we mentioned that LexemePath implements the Comparable interface.
```java
public int compareTo(LexemePath o) {
    // Compare effective text length
    if (this.payloadLength > o.payloadLength) {
        return -1;
    } else if (this.payloadLength < o.payloadLength) {
        return 1;
    } else {
        // Compare the number of lexemes: fewer is better
        if (this.size() < o.size()) {
            return -1;
        } else if (this.size() > o.size()) {
            return 1;
        } else {
            // A larger path span is better
            if (this.getPathLength() > o.getPathLength()) {
                return -1;
            } else if (this.getPathLength() < o.getPathLength()) {
                return 1;
            } else {
                // Statistically, backward segmentation is more often correct than
                // forward, so a later end position takes priority
                if (this.pathEnd > o.pathEnd) {
                    return -1;
                } else if (this.pathEnd < o.pathEnd) {
                    return 1;
                } else {
                    // More even word lengths are better
                    if (this.getXWeight() > o.getXWeight()) {
                        return -1;
                    } else if (this.getXWeight() < o.getXWeight()) {
                        return 1;
                    } else {
                        // Compare lexeme position weight
                        if (this.getPWeight() > o.getPWeight()) {
                            return -1;
                        } else if (this.getPWeight() < o.getPWeight()) {
                            return 1;
                        }
                    }
                }
            }
        }
    }
    return 0;
}
```

Clearly the author has hard-coded a fixed cascade of rules here, comparing in turn effective text length, number of lexemes, path span, and so on.
IKArbitrator has a judge method that compares the different paths.
```java
private LexemePath judge(QuickSortSet.Cell lexemeCell, int fullTextLength) {
    // Candidate path collection
    TreeSet<LexemePath> pathOptions = new TreeSet<LexemePath>();
    // Candidate result path
    LexemePath option = new LexemePath();

    // Traverse the crossPath once, returning a stack of conflicting lexemes
    Stack<QuickSortSet.Cell> lexemeStack = this.forwardPath(lexemeCell, option);

    // The current lexeme chain is not necessarily optimal; add it to the candidates
    pathOptions.add(option.copy());

    // Ambiguous lexemes exist; process them
    QuickSortSet.Cell c = null;
    while (!lexemeStack.isEmpty()) {
        c = lexemeStack.pop();
        // Roll back the lexeme chain
        this.backPath(c.getLexeme(), option);
        // Starting from the ambiguous lexeme, recurse and generate an alternative
        this.forwardPath(c, option);
        pathOptions.add(option.copy());
    }

    // Return the best option in the collection
    return pathOptions.first();
}
```
The core idea: starting from the first lexeme, traverse the various possible paths, add each to a TreeSet, which sorts them, and take the first one.
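As a toy illustration of that selection step (not IK's code; PathChoice is a made-up stand-in implementing only the first two comparison rules), a TreeSet of Comparable paths makes "take the best" a single call to first():

```java
import java.util.TreeSet;

// Toy illustration: once candidate paths are Comparable, a TreeSet keeps
// them ordered and first() is the arbitration result.
class PathChoice implements Comparable<PathChoice> {
    final String label;
    final int payloadLength; // effective text length covered
    final int size;          // number of lexemes in the path

    PathChoice(String label, int payloadLength, int size) {
        this.label = label;
        this.payloadLength = payloadLength;
        this.size = size;
    }

    // First two of IK's rules: longer coverage wins, then fewer lexemes wins.
    public int compareTo(PathChoice o) {
        if (this.payloadLength != o.payloadLength)
            return this.payloadLength > o.payloadLength ? -1 : 1;
        if (this.size != o.size)
            return this.size < o.size ? -1 : 1;
        return this.label.compareTo(o.label); // tie-break so the set keeps both
    }

    public static void main(String[] args) {
        TreeSet<PathChoice> options = new TreeSet<PathChoice>();
        options.add(new PathChoice("A", 8, 5));
        options.add(new PathChoice("B", 8, 4)); // same coverage, fewer words
        System.out.println(options.first().label); // prints "B"
    }
}
```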
Other notes
1. Stop words are removed at the final output stage (AnalyzeContext.getNextLexeme), not during analysis; removing them earlier would be risky.
2. As seen from LexemePath's compareTo method, IK's ranking is particularly crude: if the comparison finds that path1 has fewer lexemes than path2, it immediately decides path1 is better. Rules like this do not weigh all the real strengths and weaknesses of each segmentation; we may instead want to score each word by statistical frequency and other signals to produce a more comprehensive ranking, in which case IK's original comparison method no longer works.
For ideas on how to modify this, see another blog post, which describes a shortest-path approach: http://www.hankcs.com/nlp/segment/n-shortest-path-to-the-java-implementation-and-application-segmentation.html
3. Unmatched characters, whether or not smart mode is on, are ultimately output; they are handled at the final output stage, in AnalyzeContext's outputToResult method.
First published on iteye: http://quentinxxz.iteye.com/blog/2180215
Site Link: http://3dobe.com/archives/44/
IK Analyzer: principles and source code analysis