Full-Text Search, Data Mining, and Recommendation Engine Series (4): Removing Stop Words and Adding Synonyms

Source: Internet
Author: User

Lucene parses text as a pre-processing step for full-text indexing and full-text retrieval. In general Lucene documentation, this part is not treated as important and is often passed over quickly. For building a text-based content recommendation engine, however, it is a critical step, so it is worth studying Lucene's text-parsing process carefully.
Lucene delegates text parsing to subclasses of Analyzer. Lucene has several built-in subclasses; for English, StandardAnalyzer is the most common and handles general English text well. For Chinese, Lucene provides two extension packages, CJKAnalyzer and SmartChineseAnalyzer. SmartChineseAnalyzer is well suited to Chinese word segmentation, but unfortunately it bundles its dictionary into the algorithm together with a hidden Markov model. The advantage is a smaller footprint and easy installation; the drawback is that adding new words to the dictionary requires retraining, which is inconvenient. We therefore chose mmseg4j, an open-source Chinese word segmentation module. Its biggest advantage is that users can extend its dictionary, which is very convenient; its disadvantage is that its larger volume makes it slow to load.
First, let's look at how the Chinese word segmenter is used through a simple program:
Analyzer analyzer = null;
// analyzer = new StandardAnalyzer(Version.LUCENE_33);
// analyzer = new SimpleAnalyzer(Version.LUCENE_33);
analyzer = new MMSegAnalyzer();
// "examples" holds the text to be analyzed
TokenStream tokenStrm = analyzer.tokenStream("content", new StringReader(examples));
OffsetAttribute offsetAttr = tokenStrm.getAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttr = tokenStrm.getAttribute(CharTermAttribute.class);
PositionIncrementAttribute posIncrAttr =
        tokenStrm.addAttribute(PositionIncrementAttribute.class);
TypeAttribute typeAttr = tokenStrm.addAttribute(TypeAttribute.class);
String term = null;
int i = 0;
int len = 0;
char[] charBuf = null;
int termPos = 0;
int termIncr = 0;
try {
    while (tokenStrm.incrementToken()) {
        charBuf = charTermAttr.buffer();
        termIncr = posIncrAttr.getPositionIncrement();
        if (termIncr > 0) {
            termPos += termIncr;
        }
        // find the length of the valid content in the term buffer
        for (i = charBuf.length - 1; i >= 0; i--) {
            if (charBuf[i] > 0) {
                len = i + 1;
                break;
            }
        }
        // term = new String(charBuf, offsetAttr.startOffset(), offsetAttr.endOffset());
        term = new String(charBuf, 0, offsetAttr.endOffset() - offsetAttr.startOffset());
        System.out.print("[" + term + ":" + termPos + "/" + termIncr + ":" +
                typeAttr.type() + ";" + offsetAttr.startOffset() + "-" + offsetAttr.endOffset() + "]");
    }
} catch (IOException e) {
    e.printStackTrace();
}
Note the following:
TermAttribute has been deprecated in newer Lucene versions, so the program uses CharTermAttribute to extract the information of each token.
In English, the segmentation produced by MMSegAnalyzer is essentially the same as that of Lucene's built-in StandardAnalyzer.
Once basic Chinese word segmentation works, we still need to remove stop words, such as 的, 地, 得, and 啊, and to add synonyms. Synonyms fall into several categories: first, synonyms in the full sense, such as 手机 and 移动电话 (both "mobile phone"); second, abbreviations and full names, such as 中国 and 中华人民共和国 (China and the People's Republic of China); third, Chinese and English equivalents, such as 计算机 and PC; fourth, synonyms among specialized terms, such as drug trade names and their scientific names; and finally, Internet slang, such as 神马 for 什么 ("what").
Within the Lucene architecture there are two ways to implement this. The first is to write a TokenFilter class that performs the removal and insertion; the second is to integrate these functions directly into the corresponding Analyzer. For open-source software like Lucene, which emphasizes extensibility, an independent TokenFilter is the better choice. For our own project, however, integrating into the Analyzer is preferable because it improves execution efficiency: a TokenFilter must iterate over all the tokens a second time, whereas integration into the Analyzer performs stop word removal and synonym insertion during segmentation itself.
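To make the filter-chain alternative concrete, here is a minimal, Lucene-free sketch of the TokenFilter idea: each filter wraps an upstream token source and transforms the stream one token at a time. The class name and the stop word list here are illustrative, not part of Lucene or mmseg4j.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// A stop-word filter in the TokenFilter style: it wraps an upstream token
// iterator (a tokenizer or another filter) and skips tokens in the stop set.
public class StopFilterSketch implements Iterator<String> {
    private final Iterator<String> input; // upstream tokenizer or filter
    private final Set<String> stopWords;
    private String next;

    public StopFilterSketch(Iterator<String> input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
        advance();
    }

    // Pull tokens from upstream until one survives the stop list.
    private void advance() {
        next = null;
        while (input.hasNext()) {
            String tok = input.next();
            if (!stopWords.contains(tok)) {
                next = tok;
                break;
            }
        }
    }

    public boolean hasNext() { return next != null; }

    public String next() {
        String cur = next;
        advance();
        return cur;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<String>(Arrays.asList("的", "啊"));
        Iterator<String> filtered = new StopFilterSketch(
                Arrays.asList("咬", "死", "猎人", "的", "狗").iterator(), stops);
        StringBuilder sb = new StringBuilder();
        while (filtered.hasNext()) sb.append(filtered.next()).append(' ');
        System.out.println(sb.toString().trim()); // 咬 死 猎人 狗
    }
}
```

The cost noted above is visible here: the filter re-walks the token stream that the tokenizer has already produced, which is the extra pass that integration into the Analyzer avoids.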
When parsing text, Lucene first calls the Analyzer's Tokenizer to split the text into its most basic units: words in English, words or phrases in Chinese. We can therefore hook stop word removal and synonym insertion into the Tokenizer so that each newly split token is processed. In the mmseg4j module we selected, this is done in the incrementToken method of the MMSegTokenizer class:
public boolean incrementToken() throws IOException {
    if (0 == synonymCnt) {
        clearAttributes();
        Word word = mmSeg.next();
        currWord = word;
        if (word != null) {
            // Remove stop words such as 的, 地, 得, 啊
            String wordStr = word.getString();
            if (stopWords.contains(wordStr)) {
                return incrementToken();
            }
            if (synonymKeyDict.get(wordStr) != null) { // if the word has synonyms, emit the word itself first, then its synonyms
                synonymCnt = synonymDict.get(synonymKeyDict.get(wordStr)).size(); // number of synonyms, used as the loop end condition
            }
            // termAtt.setTermBuffer(word.getSen(), word.getWordOffset(), word.getLength());
            offsetAtt.setOffset(word.getStartOffset(), word.getEndOffset());
            charTermAttr.copyBuffer(word.getSen(), word.getWordOffset(), word.getLength());
            posIncrAttr.setPositionIncrement(1);
            typeAtt.setType(word.getType());
            return true;
        } else {
            end();
            return false;
        }
    } else {
        char[] charArray = null;
        String orgWord = currWord.getString();
        Vector<String> synonyms = (Vector<String>) synonymDict.get(synonymKeyDict.get(orgWord));
        if (orgWord.equals(synonyms.elementAt(synonymCnt - 1))) { // skip the synonym identical to the word in the original text
            synonymCnt--;
            return incrementToken();
        }

        // Add the synonym at the same position as the original word
        charArray = synonyms.elementAt(synonymCnt - 1).toCharArray(); // termAtt.setTermBuffer(t1, 0, t1.length);
        offsetAtt.setOffset(currWord.getStartOffset(), currWord.getStartOffset() + charArray.length); // currWord.getEndOffset());
        typeAtt.setType(currWord.getType());
        charTermAttr.copyBuffer(charArray, 0, charArray.length);
        posIncrAttr.setPositionIncrement(0);
        synonymCnt--;

        return true;
    }
}

The stop word list is implemented as follows:

private static String[] stopWordsArray = {"的", "地", "得", "啊",
        "a", "the", "in", "on"};

It is initialized in the constructor:

if (null == stopWords) {
    int i = 0;
    stopWords = new Vector<String>();
    for (i = 0; i < stopWordsArray.length; i++) {
        stopWords.add(stopWordsArray[i]);
    }
}
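A side note on this data structure: stopWords.contains() is called once for every token, and on a Vector that is a linear scan. A hash-based set makes the lookup constant-time. The sketch below is a suggested variant, not the article's original code; the class name is illustrative and the word list is the same one assumed above.

```java
import java.util.HashSet;
import java.util.Set;

// Lazily build a HashSet of stop words so per-token lookups are O(1)
// instead of Vector's O(n) linear scan.
public class StopWordSet {
    private static String[] stopWordsArray = {"的", "地", "得", "啊",
            "a", "the", "in", "on"};
    private static Set<String> stopWords = null;

    static synchronized Set<String> getStopWords() {
        if (stopWords == null) {
            stopWords = new HashSet<String>();
            for (String w : stopWordsArray) {
                stopWords.add(w);
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        System.out.println(getStopWords().contains("的"));  // true
        System.out.println(getStopWords().contains("猎人")); // false
    }
}
```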

Synonym implementation:
private static Collection<String> stopWords = null;
private static Hashtable<String, String> synonymKeyDict = null;
private static Hashtable<String, Collection<String>> synonymDict = null;

Initialization is likewise performed in the constructor. Note that this is just a simple initialization example:

// First look up the key of a word's synonym group, then use that key
// to fetch the group. The final version will initialize these tables from a database.
if (null == synonymDict) {
    synonymKeyDict = new Hashtable<String, String>();
    synonymDict = new Hashtable<String, Collection<String>>();
    synonymKeyDict.put("猎人", "0");
    synonymKeyDict.put("猎手", "0");
    synonymKeyDict.put("猎户", "0");
    synonymKeyDict.put("狩猎者", "0");
    Collection<String> syn1 = new Vector<String>();
    syn1.add("猎人");
    syn1.add("猎手");
    syn1.add("猎户");
    syn1.add("狩猎者");
    synonymDict.put("0", syn1);
    // Add 狗 and 犬 (both "dog")
    synonymKeyDict.put("狗", "1");
    synonymKeyDict.put("犬", "1");
    Collection<String> syn2 = new Vector<String>();
    syn2.add("狗");
    syn2.add("犬");
    synonymDict.put("1", syn2);
}
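The two-table structure above (word to group key, group key to the full synonym list) can be exercised on its own, independent of the tokenizer. The sketch below is an illustrative stand-alone version, assuming the same groups as above; it also shows the position-increment convention used in incrementToken, where the original token advances the position by 1 and each synonym is emitted with increment 0 so that it shares the original token's position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of the two-table synonym lookup: word -> group key,
// group key -> synonym group. Synonyms get position increment 0.
public class SynonymExpander {
    private final Map<String, String> keyDict = new Hashtable<String, String>();
    private final Map<String, List<String>> groupDict = new Hashtable<String, List<String>>();

    public void addGroup(String key, List<String> words) {
        groupDict.put(key, words);
        for (String w : words) {
            keyDict.put(w, key);
        }
    }

    // Returns "word/positionIncrement" pairs for one input token.
    public List<String> expand(String word) {
        List<String> out = new ArrayList<String>();
        out.add(word + "/1"); // the original token advances the position
        String key = keyDict.get(word);
        if (key != null) {
            for (String syn : groupDict.get(key)) {
                if (!syn.equals(word)) {
                    out.add(syn + "/0"); // synonym shares the same position
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SynonymExpander ex = new SynonymExpander();
        ex.addGroup("0", Arrays.asList("猎人", "猎手", "猎户", "狩猎者"));
        ex.addGroup("1", Arrays.asList("狗", "犬"));
        System.out.println(ex.expand("猎人")); // [猎人/1, 猎手/0, 猎户/0, 狩猎者/0]
        System.out.println(ex.expand("死"));   // [死/1]
    }
}
```

Increment 0 is what lets a query for any word in the group match the same span of the original text.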

After the above processing, we parse the Chinese phrase 咬死猎人的狗 ("the dog that bit the hunter to death").

Parsing result:

[咬:1/1:word;0-1] [死:2/1:word;1-2] [猎人:3/1:word;2-4] [狩猎者:3/0:word;2-5] [猎户:3/0:word;2-4] [猎手:3/0:word;2-4] [狗:4/1:word;5-6] [犬:4/0:word;5-6]

From this result we can see that the synonyms of 猎人 (hunter) and 狗 (dog) have been successfully added to the segmentation output at the same positions as the original words, and that the stop word 的 (offsets 4-5) has been removed. This tool will serve as the basis for the recommendation engine implemented later in this series.