Full-Text Search, Data Mining, and Recommendation Engine Series (4): Removing Stop Words and Adding Synonyms

Source: Internet
Author: User

Lucene parses text as a pre-processing step for full-text indexing and full-text retrieval. In general Lucene documentation, this part is not treated as important and is often passed over quickly. For building a text-based content recommendation engine, however, it is a critical step, so it is worth studying Lucene's text-parsing process carefully.
Lucene delegates text parsing to subclasses of Analyzer. Lucene has several built-in subclasses; for English, StandardAnalyzer is the most common and handles general English text well. For Chinese, Lucene provides two extension packages, CJKAnalyzer and SmartChineseAnalyzer. SmartChineseAnalyzer is well suited to Chinese word segmentation, but unfortunately it bundles its dictionary into the algorithm together with a hidden Markov model. The advantage is a smaller footprint and easy installation; the drawback is that adding new words to the dictionary requires retraining, which is inconvenient. We therefore chose mmseg4j, an open-source Chinese word segmentation module. Its biggest advantage is that users can extend its dictionary, which is very convenient; its disadvantage is that its larger volume makes it slow to load.
First, let's look at how the Chinese word segmenter is used through a simple program:
Analyzer analyzer = null;
// analyzer = new StandardAnalyzer(Version.LUCENE_33);
// analyzer = new SimpleAnalyzer(Version.LUCENE_33);
analyzer = new MMSegAnalyzer();
// "examples" holds the text to be analyzed
TokenStream tokenStrm = analyzer.tokenStream("content", new StringReader(examples));
OffsetAttribute offsetAttr = tokenStrm.getAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttr = tokenStrm.getAttribute(CharTermAttribute.class);
PositionIncrementAttribute posIncrAttr =
        tokenStrm.addAttribute(PositionIncrementAttribute.class);
TypeAttribute typeAttr = tokenStrm.addAttribute(TypeAttribute.class);
String term = null;
int i = 0;
int len = 0;
char[] charBuf = null;
int termPos = 0;
int termIncr = 0;
try {
    while (tokenStrm.incrementToken()) {
        charBuf = charTermAttr.buffer();
        termIncr = posIncrAttr.getPositionIncrement();
        if (termIncr > 0) {
            termPos += termIncr;
        }
        // find the length of the valid content in the term buffer
        for (i = charBuf.length - 1; i >= 0; i--) {
            if (charBuf[i] > 0) {
                len = i + 1;
                break;
            }
        }
        // term = new String(charBuf, offsetAttr.startOffset(), offsetAttr.endOffset());
        term = new String(charBuf, 0, offsetAttr.endOffset() - offsetAttr.startOffset());
        System.out.print("[" + term + ":" + termPos + "/" + termIncr + ":" +
                typeAttr.type() + ";" + offsetAttr.startOffset() + "-" + offsetAttr.endOffset() + "]");
    }
} catch (IOException e) {
    e.printStackTrace();
}
Note the following:
TermAttribute has been deprecated in newer Lucene versions, so the program uses CharTermAttribute to extract the information of each token.
In English, the segmentation produced by MMSegAnalyzer is essentially the same as that of Lucene's built-in StandardAnalyzer.
Once basic Chinese word segmentation works, we still need to remove stop words, such as 的, 地, 得, and 啊, and to add synonyms. Synonyms fall into several categories: first, synonyms in the full sense, such as 手机 and 移动电话 (both "mobile phone"); second, abbreviations and full names, such as 中国 and 中华人民共和国 (China and the People's Republic of China); third, Chinese and English equivalents, such as 计算机 and PC; fourth, synonyms among specialized terms, such as drug trade names and their scientific names; and finally, Internet slang, such as 神马 for 什么 ("what").
Within the Lucene architecture there are two ways to implement this. The first is to write a TokenFilter class that performs the removal and insertion; the second is to integrate these functions directly into the corresponding Analyzer. For open-source software like Lucene, which emphasizes extensibility, an independent TokenFilter is the better choice. For our own project, however, integrating into the Analyzer is preferable because it improves execution efficiency: a TokenFilter must iterate over all the tokens a second time, whereas integration into the Analyzer performs stop word removal and synonym insertion during segmentation itself.
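To make the filter-chain alternative concrete, here is a minimal, Lucene-free sketch of the TokenFilter idea: each filter wraps an upstream token source and transforms the stream one token at a time. The class name and the stop word list here are illustrative, not part of Lucene or mmseg4j.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// A stop-word filter in the TokenFilter style: it wraps an upstream token
// iterator (a tokenizer or another filter) and skips tokens in the stop set.
public class StopFilterSketch implements Iterator<String> {
    private final Iterator<String> input; // upstream tokenizer or filter
    private final Set<String> stopWords;
    private String next;

    public StopFilterSketch(Iterator<String> input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
        advance();
    }

    // Pull tokens from upstream until one survives the stop list.
    private void advance() {
        next = null;
        while (input.hasNext()) {
            String tok = input.next();
            if (!stopWords.contains(tok)) {
                next = tok;
                break;
            }
        }
    }

    public boolean hasNext() { return next != null; }

    public String next() {
        String cur = next;
        advance();
        return cur;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<String>(Arrays.asList("的", "啊"));
        Iterator<String> filtered = new StopFilterSketch(
                Arrays.asList("咬", "死", "猎人", "的", "狗").iterator(), stops);
        StringBuilder sb = new StringBuilder();
        while (filtered.hasNext()) sb.append(filtered.next()).append(' ');
        System.out.println(sb.toString().trim()); // 咬 死 猎人 狗
    }
}
```

The cost noted above is visible here: the filter re-walks the token stream that the tokenizer has already produced, which is the extra pass that integration into the Analyzer avoids.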
When parsing text, Lucene first calls the Analyzer's Tokenizer to split the text into its most basic units: words in English, words or phrases in Chinese. We can therefore hook stop word removal and synonym insertion into the Tokenizer so that each newly split token is processed. In the mmseg4j module we selected, this is done in the incrementToken method of the MMSegTokenizer class:
public boolean incrementToken() throws IOException {
    if (0 == synonymCnt) {
        clearAttributes();
        Word word = mmSeg.next();
        currWord = word;
        if (word != null) {
            // Remove stop words such as 的, 地, 得, 啊
            String wordStr = word.getString();
            if (stopWords.contains(wordStr)) {
                return incrementToken();
            }
            if (synonymKeyDict.get(wordStr) != null) { // if the word has synonyms, emit the word itself first, then its synonyms
                synonymCnt = synonymDict.get(synonymKeyDict.get(wordStr)).size(); // number of synonyms, used as the loop end condition
            }
            // termAtt.setTermBuffer(word.getSen(), word.getWordOffset(), word.getLength());
            offsetAtt.setOffset(word.getStartOffset(), word.getEndOffset());
            charTermAttr.copyBuffer(word.getSen(), word.getWordOffset(), word.getLength());
            posIncrAttr.setPositionIncrement(1);
            typeAtt.setType(word.getType());
            return true;
        } else {
            end();
            return false;
        }
    } else {
        char[] charArray = null;
        String orgWord = currWord.getString();
        Vector<String> synonyms = (Vector<String>) synonymDict.get(synonymKeyDict.get(orgWord));
        if (orgWord.equals(synonyms.elementAt(synonymCnt - 1))) { // skip the synonym identical to the word in the original text
            synonymCnt--;
            return incrementToken();
        }

        // Add the synonym at the same position as the original word
        charArray = synonyms.elementAt(synonymCnt - 1).toCharArray(); // termAtt.setTermBuffer(t1, 0, t1.length);
        offsetAtt.setOffset(currWord.getStartOffset(), currWord.getStartOffset() + charArray.length); // currWord.getEndOffset());
        typeAtt.setType(currWord.getType());
        charTermAttr.copyBuffer(charArray, 0, charArray.length);
        posIncrAttr.setPositionIncrement(0);
        synonymCnt--;

        return true;
    }
}

The stop word list is implemented as follows:

private static String[] stopWordsArray = {"的", "地", "得", "啊",
        "a", "the", "in", "on"};

It is initialized in the constructor:

if (null == stopWords) {
    int i = 0;
    stopWords = new Vector<String>();
    for (i = 0; i < stopWordsArray.length; i++) {
        stopWords.add(stopWordsArray[i]);
    }
}
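A side note on this data structure: stopWords.contains() is called once for every token, and on a Vector that is a linear scan. A hash-based set makes the lookup constant-time. The sketch below is a suggested variant, not the article's original code; the class name is illustrative and the word list is the same one assumed above.

```java
import java.util.HashSet;
import java.util.Set;

// Lazily build a HashSet of stop words so per-token lookups are O(1)
// instead of Vector's O(n) linear scan.
public class StopWordSet {
    private static String[] stopWordsArray = {"的", "地", "得", "啊",
            "a", "the", "in", "on"};
    private static Set<String> stopWords = null;

    static synchronized Set<String> getStopWords() {
        if (stopWords == null) {
            stopWords = new HashSet<String>();
            for (String w : stopWordsArray) {
                stopWords.add(w);
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        System.out.println(getStopWords().contains("的"));  // true
        System.out.println(getStopWords().contains("猎人")); // false
    }
}
```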

Synonym implementation:
private static Collection<String> stopWords = null;
private static Hashtable<String, String> synonymKeyDict = null;
private static Hashtable<String, Collection<String>> synonymDict = null;

Initialization is likewise performed in the constructor. Note that this is just a simple initialization example:

// First look up the key of a word's synonym group, then use that key
// to fetch the group. The final version will initialize these tables from a database.
if (null == synonymDict) {
    synonymKeyDict = new Hashtable<String, String>();
    synonymDict = new Hashtable<String, Collection<String>>();
    synonymKeyDict.put("猎人", "0");
    synonymKeyDict.put("猎手", "0");
    synonymKeyDict.put("猎户", "0");
    synonymKeyDict.put("狩猎者", "0");
    Collection<String> syn1 = new Vector<String>();
    syn1.add("猎人");
    syn1.add("猎手");
    syn1.add("猎户");
    syn1.add("狩猎者");
    synonymDict.put("0", syn1);
    // Add 狗 and 犬 (both "dog")
    synonymKeyDict.put("狗", "1");
    synonymKeyDict.put("犬", "1");
    Collection<String> syn2 = new Vector<String>();
    syn2.add("狗");
    syn2.add("犬");
    synonymDict.put("1", syn2);
}
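The two-table structure above (word to group key, group key to the full synonym list) can be exercised on its own, independent of the tokenizer. The sketch below is an illustrative stand-alone version, assuming the same groups as above; it also shows the position-increment convention used in incrementToken, where the original token advances the position by 1 and each synonym is emitted with increment 0 so that it shares the original token's position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of the two-table synonym lookup: word -> group key,
// group key -> synonym group. Synonyms get position increment 0.
public class SynonymExpander {
    private final Map<String, String> keyDict = new Hashtable<String, String>();
    private final Map<String, List<String>> groupDict = new Hashtable<String, List<String>>();

    public void addGroup(String key, List<String> words) {
        groupDict.put(key, words);
        for (String w : words) {
            keyDict.put(w, key);
        }
    }

    // Returns "word/positionIncrement" pairs for one input token.
    public List<String> expand(String word) {
        List<String> out = new ArrayList<String>();
        out.add(word + "/1"); // the original token advances the position
        String key = keyDict.get(word);
        if (key != null) {
            for (String syn : groupDict.get(key)) {
                if (!syn.equals(word)) {
                    out.add(syn + "/0"); // synonym shares the same position
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SynonymExpander ex = new SynonymExpander();
        ex.addGroup("0", Arrays.asList("猎人", "猎手", "猎户", "狩猎者"));
        ex.addGroup("1", Arrays.asList("狗", "犬"));
        System.out.println(ex.expand("猎人")); // [猎人/1, 猎手/0, 猎户/0, 狩猎者/0]
        System.out.println(ex.expand("死"));   // [死/1]
    }
}
```

Increment 0 is what lets a query for any word in the group match the same span of the original text.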

After the above processing, we parse the Chinese phrase 咬死猎人的狗 ("the dog that bit the hunter to death").

Parsing result:

[咬:1/1:word;0-1] [死:2/1:word;1-2] [猎人:3/1:word;2-4] [狩猎者:3/0:word;2-5] [猎户:3/0:word;2-4] [猎手:3/0:word;2-4] [狗:4/1:word;5-6] [犬:4/0:word;5-6]

From this result we can see that the synonyms of 猎人 (hunter) and 狗 (dog) have been successfully added to the segmentation output at the same positions as the original words, and that the stop word 的 (offsets 4-5) has been removed. This tool will serve as the basis for the recommendation engine implemented later in this series.