Smartcn is the Chinese word segmentation tool that ships with Lucene. It is derived from ICTCLAS, the Chinese word segmentation system developed at the Chinese Academy of Sciences. For more information about the ICTCLAS algorithms, see here. To analyze the behavior of SmartChineseAnalyzer, you can start from either the reusableTokenStream or the tokenStream method. The former is meant for repeated use and performs better (for example, some instances are kept in a ThreadLocal so that they do not have to be rebuilt on the next call). The following are my analysis notes, starting from reusableTokenStream.
The reusableTokenStream method first checks whether a SavedStreams instance already exists. If not, it creates one; otherwise it resets some state and returns the existing SavedStreams directly. I focus only on the creation path, which takes just a few statements:
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new SentenceTokenizer(reader);
streams.filteredTokenStream = new WordTokenFilter(streams.tokenStream);
streams.filteredTokenStream = new PorterStemFilter(streams.filteredTokenStream);
if (!stopWords.isEmpty()) {
  streams.filteredTokenStream = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
      streams.filteredTokenStream, stopWords, false);
}
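The create-or-reuse idea described above can be sketched in plain Java without Lucene. This is only an illustration of the pattern: ReusableChain and ReuseDemo are hypothetical names, and the real analyzer stores its SavedStreams through setPreviousTokenStream rather than through a field like the one below.

```java
/** Hypothetical stand-in for SavedStreams: costly to build, cheap to reset. */
final class ReusableChain {
    String input;

    /** Points the existing chain at new input instead of rebuilding it. */
    void reset(String newInput) {
        this.input = newInput;
    }
}

final class ReuseDemo {
    // One cached chain per thread (an assumption of this sketch).
    private static final ThreadLocal<ReusableChain> SAVED = new ThreadLocal<>();

    static ReusableChain reusableChain(String input) {
        ReusableChain chain = SAVED.get();
        if (chain == null) {
            // First call on this thread: build the chain and remember it.
            chain = new ReusableChain();
            SAVED.set(chain);
        }
        // Every call: reset the state so the chain reads the new input.
        chain.reset(input);
        return chain;
    }
}
```

Calling ReuseDemo.reusableChain twice on the same thread returns the same object, with only its input replaced.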
reusableTokenStream returns a TokenStream instance, and you obtain the segmented words one at a time by calling its incrementToken() method. Tracing incrementToken leads to PorterStemFilter.incrementToken(), which in turn calls WordTokenFilter.incrementToken().
WordTokenFilter.incrementToken() calls SentenceTokenizer.incrementToken() to cut the text into sentences. A sentence ends at a sentence-ending punctuation mark (defined in Utility.PUNCTION) or at the end of the text. Characters in Utility.SPACES are ignored.
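To make the sentence-cutting rule concrete, here is a minimal sketch in plain Java. It is not the SentenceTokenizer code: the ENDERS set is an assumed stand-in for Utility.PUNCTION, whitespace is simply dropped, and the ending mark is kept with its sentence as a choice of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceSplitSketch {
    // Assumed sentence-ending marks: U+3002, U+FF01, U+FF1F, U+FF1B, U+FF0C
    // (full-width period, exclamation, question, semicolon, comma) plus ASCII forms.
    private static final String ENDERS = "\u3002\uFF01\uFF1F\uFF1B\uFF0C!?;,";

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                continue; // space-like characters are ignored
            }
            buf.append(c);
            if (ENDERS.indexOf(c) >= 0) {
                // An ending mark closes the current sentence.
                sentences.add(buf.toString());
                buf.setLength(0);
            }
        }
        if (buf.length() > 0) {
            sentences.add(buf.toString()); // the end of the text also ends a sentence
        }
        return sentences;
    }
}
```

Each returned string is what would then be handed to the word segmentation step.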
After a sentence has been cut, segmentSentence() is called to split it into words. The key step in segmentSentence() is the call to HHMMSegmenter.process(), which is the core of the algorithm. As the class name suggests, it is based on the Hidden Markov Model (HMM). Let's take a look at this method:
/**
 * Return a list of {@link SegToken} representing the best segmentation of a sentence
 * @param sentence input sentence
 * @return best segmentation as a {@link List}
 */
public List<SegToken> process(String sentence) {
  SegGraph segGraph = createSegGraph(sentence);
  BiSegGraph biSegGraph = new BiSegGraph(segGraph);
  List<SegToken> shortPath = biSegGraph.getShortPath();
  return shortPath;
}
There are three steps here: build the SegGraph, build the BiSegGraph, and find the shortest path, which is the result. For the two graphs, refer to the relevant sections of the ICTCLAS algorithm introduction. Roughly, the first graph holds all possible words, and the second holds the possible transitions between adjacent words (the edges of a directed graph). The nodes of both graphs carry a weight attribute. In the second graph, the weight is computed from the frequency of the two-word combination by a smoothing algorithm. Finding the shortest path therefore amounts to finding the best combination of words. The code below is annotated with my comments.
Step 1: org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.createSegGraph(String). The main field of SegGraph is a Map whose key is an integer and whose value is a list of SegToken:
private Map<Integer, ArrayList<SegToken>> tokenListTable = new HashMap<Integer, ArrayList<SegToken>>();
The key is the start position of the candidate word in the source string. Another field, maxStart (initially -1), records the largest start position.
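A stripped-down version of this data structure might look like the following. MiniSegGraph and its Token class are hypothetical names for this sketch; only the idea of a map from start position to token list plus a maxStart field comes from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MiniSegGraph {
    /** A candidate word covering the range [start, end) of the sentence. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Key: start offset in the sentence. Value: every candidate starting there.
    private final Map<Integer, List<Token>> tokensByStart = new HashMap<>();
    private int maxStart = -1; // largest start offset seen so far

    void addToken(Token t) {
        tokensByStart.computeIfAbsent(t.start, k -> new ArrayList<>()).add(t);
        if (t.start > maxStart) {
            maxStart = t.start;
        }
    }

    boolean isStartExist(int start) {
        return tokensByStart.containsKey(start);
    }

    List<Token> getStartList(int start) {
        return tokensByStart.get(start);
    }

    int getMaxStart() {
        return maxStart;
    }
}
```

Grouping by start offset is what later lets the second graph ask "which words can begin where this word ends?" with a single lookup.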
/**
 * Create the {@link SegGraph} for a sentence.
 *
 * @param sentence
 *          input sentence, without start and end markers
 * @return {@link SegGraph} corresponding to the input sentence.
 */
private SegGraph createSegGraph(String sentence) {
  int i = 0, j;
  int length = sentence.length();
  int foundIndex;
  int[] charTypeArray = getCharTypes(sentence);
  StringBuilder wordBuf = new StringBuilder();
  SegToken token;
  int frequency = 0; // the number of times the word appears
  boolean hasFullWidth;
  int wordType;
  char[] charArray;
  SegGraph segGraph = new SegGraph();
  /* Process from start to end */
  while (i < length) {
    hasFullWidth = false;
    switch (charTypeArray[i]) {
      case CharType.SPACE_LIKE:
        i++;
        /* Skip spaces */
        break;
      case CharType.HANZI:
        /* Process Chinese characters */
        j = i + 1;
        wordBuf.delete(0, wordBuf.length());
        // It doesn't matter if a single Chinese character (Hanzi) can
        // form a phrase or not,
        // it will store that single Chinese character (Hanzi) in
        // SegGraph. Otherwise, word division
        // would break.
        wordBuf.append(sentence.charAt(i));
        charArray = new char[] { sentence.charAt(i) };
        frequency = wordDict.getFrequency(charArray);
        token = new SegToken(charArray, i, j, WordType.CHINESE_WORD,
            frequency);
        /* Add the single Chinese character */
        segGraph.addToken(token);
        foundIndex = wordDict.getPrefixMatch(charArray);
        /* Extend to the right and add every word that can be formed to the graph; foundIndex != -1 means a word is still possible */
        while (j <= length && foundIndex != -1) {
          /* If a complete word has been formed (it might also be only the prefix of a longer word) */
          if (wordDict.isEqual(charArray, foundIndex)
              && charArray.length > 1) {
            // It is the phrase we are looking for; in other words,
            // we have found a phrase SegToken
            // from i to j. It is not a monosyllabic word (single
            // word).
            frequency = wordDict.getFrequency(charArray);
            token = new SegToken(charArray, i, j,
                WordType.CHINESE_WORD, frequency);
            segGraph.addToken(token);
          }
          /* Spaces are skipped while extending (so smartcn can still recognize a word that contains spaces) */
          while (j < length
              && charTypeArray[j] == CharType.SPACE_LIKE)
            j++;
          /* If the next character is a Chinese character, append it and test whether a word is formed */
          if (j < length && charTypeArray[j] == CharType.HANZI) {
            wordBuf.append(sentence.charAt(j));
            charArray = new char[wordBuf.length()];
            wordBuf.getChars(0, charArray.length, charArray, 0);
            // idArray has been found (foundWordIndex != -1) as a
            // prefix before.
            // Therefore, idArray after it has been lengthened can
            // only appear after foundWordIndex.
            // So start searching after foundWordIndex.
            foundIndex = wordDict.getPrefixMatch(charArray,
                foundIndex);
            j++;
          } else {
            break; /* end of the sentence, or not a Chinese character */
          }
        }
        i++;
        break;
      /* The remaining cases handle other character types and are not analyzed here */
      case CharType.FULLWIDTH_LETTER:
        hasFullWidth = true;
        // fall through
      case CharType.LETTER:
        j = i + 1;
        while (j < length
            && (charTypeArray[j] == CharType.LETTER || charTypeArray[j] == CharType.FULLWIDTH_LETTER)) {
          if (charTypeArray[j] == CharType.FULLWIDTH_LETTER)
            hasFullWidth = true;
          j++;
        }
        // Found a Token from i to j. Type is LETTER char string.
        charArray = Utility.STRING_CHAR_ARRAY;
        frequency = wordDict.getFrequency(charArray);
        wordType = hasFullWidth ? WordType.FULLWIDTH_STRING
            : WordType.STRING;
        token = new SegToken(charArray, i, j, wordType, frequency);
        segGraph.addToken(token);
        i = j;
        break;
      case CharType.FULLWIDTH_DIGIT:
        hasFullWidth = true;
        // fall through
      case CharType.DIGIT:
        j = i + 1;
        while (j < length
            && (charTypeArray[j] == CharType.DIGIT || charTypeArray[j] == CharType.FULLWIDTH_DIGIT)) {
          if (charTypeArray[j] == CharType.FULLWIDTH_DIGIT)
            hasFullWidth = true;
          j++;
        }
        // Found a Token from i to j. Type is NUMBER char string.
        charArray = Utility.NUMBER_CHAR_ARRAY;
        frequency = wordDict.getFrequency(charArray);
        wordType = hasFullWidth ? WordType.FULLWIDTH_NUMBER
            : WordType.NUMBER;
        token = new SegToken(charArray, i, j, wordType, frequency);
        segGraph.addToken(token);
        i = j;
        break;
      case CharType.DELIMITER:
        j = i + 1;
        // No need to search the weight for the punctuation. Picking
        // the highest frequency will work.
        frequency = Utility.MAX_FREQUENCE;
        charArray = new char[] { sentence.charAt(i) };
        token = new SegToken(charArray, i, j, WordType.DELIMITER,
            frequency);
        segGraph.addToken(token);
        i = j;
        break;
      default:
        j = i + 1;
        // Treat the unrecognized char symbol as unknown string.
        // For example, any symbol not in GB2312 is treated as one of
        // these.
        charArray = Utility.STRING_CHAR_ARRAY;
        frequency = wordDict.getFrequency(charArray);
        token = new SegToken(charArray, i, j, WordType.STRING,
            frequency);
        segGraph.addToken(token);
        i = j;
        break;
    }
  }
  // Add two more Tokens: "beginning xx beginning"
  /* Add a start marker at the beginning to simplify later processing (see the ICTCLAS algorithm) */
  charArray = Utility.START_CHAR_ARRAY;
  frequency = wordDict.getFrequency(charArray);
  token = new SegToken(charArray, -1, 0, WordType.SENTENCE_BEGIN,
      frequency);
  segGraph.addToken(token);
  // "end xx end"
  /* Likewise, add an end marker */
  charArray = Utility.END_CHAR_ARRAY;
  frequency = wordDict.getFrequency(charArray);
  token = new SegToken(charArray, length, length + 1,
      WordType.SENTENCE_END, frequency);
  segGraph.addToken(token);
  return segGraph;
}
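The Hanzi branch is the interesting part: every single character becomes a candidate, and the candidate is then extended to the right for as long as some dictionary word still starts with it. The sketch below reproduces that idea with a plain Set of strings as the dictionary. CandidateSketch, the linear isPrefix scan, and the "start-end:text" output format are all assumptions of this sketch; the real code uses WordDictionary.getPrefixMatch and also skips spaces while extending, which is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class CandidateSketch {
    /** Returns candidates as "start-end:text" strings, in discovery order. */
    static List<String> candidates(String sentence, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sentence.length(); i++) {
            // A single character is always added, so every position is covered.
            out.add(i + "-" + (i + 1) + ":" + sentence.charAt(i));
            for (int j = i + 2; j <= sentence.length(); j++) {
                String piece = sentence.substring(i, j);
                if (!isPrefix(piece, dict)) {
                    break; // no dictionary word starts with this piece
                }
                if (dict.contains(piece)) {
                    out.add(i + "-" + j + ":" + piece); // a complete multi-character word
                }
            }
        }
        return out;
    }

    // Linear scan for clarity; a sorted dictionary would allow a faster lookup.
    private static boolean isPrefix(String piece, Set<String> dict) {
        for (String word : dict) {
            if (word.startsWith(piece)) {
                return true;
            }
        }
        return false;
    }
}
```

With Latin letters standing in for Hanzi, the sentence "abcd" and the dictionary {ab, abc, cd} yield the single characters plus ab, abc and cd, each tagged with its start and end offsets.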
Step 2: org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.generateBiSegGraph(SegGraph) creates the BiSegGraph.
In its constructor, BiSegGraph first builds an index for segGraph, that is, it assigns an index number to every word for later use. It then calls generateBiSegGraph() to generate the path graph.
/*
 * Generate a BiSegGraph based upon a SegGraph
 */
private void generateBiSegGraph(SegGraph segGraph) {
  double smooth = 0.1;
  int wordPairFreq = 0;
  int maxStart = segGraph.getMaxStart();
  double oneWordFreq, weight, tinyDouble = 1.0 / Utility.MAX_FREQUENCE;
  int next;
  char[] idBuffer;
  // Get the list of tokens ordered and indexed
  segTokenList = segGraph.makeIndex(); /* Is this redundant? The constructor has already called it. */
  // Because the beginning position of startToken is -1,
  // startToken can be obtained when key = -1
  int key = -1;
  List<SegToken> nextTokens = null;
  /* Process from start to end */
  while (key < maxStart) {
    /* Check whether any token starts at this position (see the source code above) */
    if (segGraph.isStartExist(key)) {
      List<SegToken> tokenList = segGraph.getStartList(key);
      /**
       * Process every token that starts at the given key (the start position): walk the tokens
       * that can directly follow it and store each adjacency (token pair) in the graph.
       * A token pair is in effect an edge between two tokens.
       */
      // Calculate all tokens for a given key.
      for (SegToken t1 : tokenList) {
        oneWordFreq = t1.weight;
        next = t1.endOffset;
        nextTokens = null;
        // Find the next corresponding Token.
        // For example: "Sunny seashore", the present Token is
        // "Sunny", next one should be "sea" or "seashore".
        // If we cannot find the next Token, then go to the end and
        // repeat the same cycle.
        /* Find the start position of the adjacent tokens */
        while (next <= maxStart) {
          // Because the beginning position of endToken is
          // sentenceLen, so equal to sentenceLen can find
          // endToken.
          if (segGraph.isStartExist(next)) {
            nextTokens = segGraph.getStartList(next);
            break;
          }
          next++;
        }
        if (nextTokens == null) {
          break;
        }
        /* Traverse the adjacent tokens */
        for (SegToken t2 : nextTokens) {
          idBuffer = new char[t1.charArray.length
              + t2.charArray.length + 1];
          System.arraycopy(t1.charArray, 0, idBuffer, 0,
              t1.charArray.length);
          idBuffer[t1.charArray.length] = BigramDictionary.WORD_SEGMENT_CHAR;
          System.arraycopy(t2.charArray, 0, idBuffer,
              t1.charArray.length + 1, t2.charArray.length);
          // Two linked Words frequency
          wordPairFreq = bigramDict.getFrequency(idBuffer);
          /* I have not studied the smoothing used in the weight calculation yet; my math is rusty. To be revisited. */
          // Smoothing
          // -log{a*P(Ci-1)+(1-a)P(Ci|Ci-1)} Note 0<a<1
          weight = -Math.log(smooth
              * (1.0 + oneWordFreq)
              / (Utility.MAX_FREQUENCE + 0.0)
              + (1.0 - smooth)
              * ((1.0 - tinyDouble) * wordPairFreq
              / (1.0 + oneWordFreq) + tinyDouble));
          SegTokenPair tokenPair = new SegTokenPair(idBuffer,
              t1.index, t2.index, weight);
          this.addSegTokenPair(tokenPair);
        }
      }
    }
    key++;
  }
}
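The weight expression is easier to read once it is pulled out into a function of its own. The sketch below restates it as weight = -log(a * P(w1) + (1 - a) * P(w2 | w1)), where a is the smooth constant, P(w1) is estimated from the frequency of the first word and P(w2 | w1) from the frequency of the pair. BigramWeightSketch is a hypothetical name, and maxFreq is passed in as a parameter rather than read from Utility.MAX_FREQUENCE.

```java
final class BigramWeightSketch {
    /**
     * @param oneWordFreq  frequency of the first word of the pair
     * @param wordPairFreq frequency of the two words appearing together
     * @param smooth       interpolation factor a, with 0 < a < 1
     * @param maxFreq      normalizing constant (assumed; stands in for Utility.MAX_FREQUENCE)
     */
    static double weight(double oneWordFreq, int wordPairFreq, double smooth, double maxFreq) {
        double tiny = 1.0 / maxFreq;
        // Add-one estimate of P(w1).
        double unigram = (1.0 + oneWordFreq) / maxFreq;
        // Estimate of P(w2 | w1); the tiny term keeps it above zero for unseen pairs.
        double bigram = (1.0 - tiny) * wordPairFreq / (1.0 + oneWordFreq) + tiny;
        // Negative log turns a high probability into a low cost.
        return -Math.log(smooth * unigram + (1.0 - smooth) * bigram);
    }
}
```

Because the probability sits inside a negative logarithm, a frequent word pair gets a small weight and an unseen pair gets a large but finite one, which is why the best segmentation corresponds to the shortest path.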
Step 3: org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath() finds the optimal path. This part is a dynamic programming algorithm; any data structures and algorithms textbook covers it.
/**
 * Find the shortest path with the Viterbi algorithm.
 *
 * @return {@link List}
 */
public List<SegToken> getShortPath() {
  int current;
  int nodeCount = getToCount();
  List<PathNode> path = new ArrayList<PathNode>();
  PathNode zeroPath = new PathNode();
  zeroPath.weight = 0;
  zeroPath.preNode = 0;
  path.add(zeroPath); /* insert the head node as the starting "origin" */
  /**
   * From start to end, compute the optimal path to each node and save the result. Later steps reuse the earlier results,
   * so once this loop finishes the optimal path can be read back from the last node.
   */
  for (current = 1; current <= nodeCount; current++) {
    double weight;
    List<SegTokenPair> edges = getToList(current);
    double minWeight = Double.MAX_VALUE;
    SegTokenPair minEdge = null;
    for (SegTokenPair edge : edges) {
      weight = edge.weight;
      PathNode preNode = path.get(edge.from);
      if (preNode.weight + weight < minWeight) {
        minWeight = preNode.weight + weight;
        minEdge = edge;
      }
    }
    PathNode newNode = new PathNode();
    newNode.weight = minWeight;
    newNode.preNode = minEdge.from;
    path.add(newNode);
  }
  // Calculate PathNodes
  int preNode, lastNode;
  lastNode = path.size() - 1;
  current = lastNode;
  List<Integer> rpath = new ArrayList<Integer>(); /* I think this is a bit redundant */
  List<SegToken> resultPath = new ArrayList<SegToken>();
  rpath.add(current);
  while (current != 0) {
    PathNode currentPathNode = path.get(current);
    preNode = currentPathNode.preNode;
    rpath.add(Integer.valueOf(preNode));
    current = preNode;
  }
  /* Why not build resultPath directly in the previous step? Is this redundant? */
  for (int j = rpath.size() - 1; j >= 0; j--) {
    Integer idInteger = (Integer) rpath.get(j);
    int id = idInteger.intValue();
    SegToken t = segTokenList.get(id);
    resultPath.add(t);
  }
  return resultPath;
}
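The same dynamic programming idea can be tried on a tiny hand-built graph. The sketch below is not the BiSegGraph code: ShortestPathSketch and Edge are hypothetical names, nodes are plain integers, and it assumes that every edge goes from a lower index to a higher one and that the last node is reachable from node 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ShortestPathSketch {
    /** Directed edge between two node indices with a cost. */
    static final class Edge {
        final int from;
        final int to;
        final double weight;

        Edge(int from, int to, double weight) {
            this.from = from;
            this.to = to;
            this.weight = weight;
        }
    }

    /** Returns the node indices of the cheapest path from node 0 to node n - 1. */
    static List<Integer> shortestPath(int n, List<Edge> edges) {
        double[] best = new double[n]; // cheapest known cost to reach each node
        int[] prev = new int[n];       // predecessor on that cheapest path
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;
        // Forward pass: nodes are visited in index order, so best[e.from] is already final.
        for (int node = 1; node < n; node++) {
            for (Edge e : edges) {
                if (e.to == node && best[e.from] + e.weight < best[node]) {
                    best[node] = best[e.from] + e.weight;
                    prev[node] = e.from;
                }
            }
        }
        // Backward pass: follow predecessors from the end, then reverse the list.
        List<Integer> path = new ArrayList<>();
        for (int node = n - 1; node != 0; node = prev[node]) {
            path.add(node);
        }
        path.add(0);
        Collections.reverse(path);
        return path;
    }
}
```

The path can only be read backwards from the last node, so it has to be reversed at some point; this sketch collects it into one list and reverses it in place, whereas the listing above walks a temporary rpath list from its tail.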
That is the main flow of the smartcn algorithm. HMM algorithms are said to be academic and complex, but having gone through it today, the main flow turns out to be quite understandable. Of course, credit goes to the ICTCLAS introduction for explaining the principles; without it the code would be very hard to read. The next step is to study its smoothing algorithm and to analyze its dictionary (org.apache.lucene.analysis.cn.smart.hhmm.WordDictionary) and its bigram dictionary (org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary).