The previous article, on Lucene's tokenization process, walked through how analysis works, so we now have a preliminary understanding of it: an analyzer is made up of a Tokenizer and any number of TokenFilters. In this article we use these two building blocks to implement a simple synonym analyzer of our own. If you spot any mistakes, please point them out.
(i) Analysis
How do we implement synonyms? For example, Chongqing is also known as Mountain City, so a search for "Mountain City" should also find articles that contain the word "Chongqing". To get there we need to understand what Lucene does with the text of a document. The previous article explained that Lucene describes each term in a piece of text with three attribute classes. These classes record the offsets, the position increment, and so on for each term. Lucene uses the position increment to work out where a term sits, which gives us our idea: we only have to add our synonyms at the same position as the original term.
(ii) Implementation
First we implement an Analyzer of our own and override its tokenStream method (we want to plug in a custom filter, so the stream it returns has to change).
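To make the position idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. The Token class and the termsByPosition helper are hypothetical names invented for this illustration; they only model the rule that a term's position is the running sum of the position increments, so a term with increment 0 lands on the same position as the term before it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a token stream; none of these types are Lucene classes. */
final class PositionIncrementSketch {

    /** Hypothetical stand-in for a token: a term plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Groups terms by absolute position, i.e. the running sum of the increments. */
    static Map<Integer, List<String>> termsByPosition(List<Token> tokens) {
        Map<Integer, List<String>> result = new LinkedHashMap<Integer, List<String>>();
        int position = -1; // a first token with increment 1 lands on position 0
        for (Token t : tokens) {
            position += t.positionIncrement;
            List<String> terms = result.get(position);
            if (terms == null) {
                terms = new ArrayList<String>();
                result.put(position, terms);
            }
            terms.add(t.term);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<Token>();
        tokens.add(new Token("Chongqing", 1));
        tokens.add(new Token("Mountain City", 0)); // synonym: increment 0, same position
        tokens.add(new Token("travel", 1));
        System.out.println(termsByPosition(tokens));
    }
}
```

Running main prints {0=[Chongqing, Mountain City], 1=[travel]}: the synonym shares position 0 with the original term, which is the effect our filter has to produce.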
package org.xiezhaodong.lucene;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class MySameAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Wrap the tokenizer in our own filter; a whitespace tokenizer is enough here
        return new MySameFilter(new WhitespaceTokenizer(Version.LUCENE_35, reader));
    }
}

Now look at our filter. The key points are in the comments.
package org.xiezhaodong.lucene;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class MySameFilter extends TokenFilter {
    private CharTermAttribute cta = null;         // gives us the term text, as described in the previous article
    private Stack<String> wordStack = null;       // stack of synonyms waiting to be emitted
    private AttributeSource.State current = null; // captured state of the term that has synonyms
    private PositionIncrementAttribute pia;       // used to set the position increment

    protected MySameFilter(TokenStream input) {
        super(input);
        wordStack = new Stack<String>();
        // Get the two key attributes from the token stream
        pia = this.addAttribute(PositionIncrementAttribute.class);
        cta = this.addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Emit one pending synonym per call before reading the next term
        if (!wordStack.isEmpty()) {
            String word = wordStack.pop();
            restoreState(current);       // restore the state of the original term
            cta.setEmpty();
            cta.append(word);
            pia.setPositionIncrement(0); // increment 0: same position as the original term
            return true;
        }
        if (!this.input.incrementToken()) {
            return false;
        }
        if (isHaveSameWords(cta.toString())) { // the term has synonyms
            // Capture the current state of the stream, as described in the previous article
            current = captureState();
        }
        return true;
    }

    // Checks whether the word has synonyms and, if so, pushes them onto the stack.
    // The synonyms are hard-coded here; I made them up for the example.
    public boolean isHaveSameWords(String word) {
        Map<String, String[]> map = new HashMap<String, String[]>();
        map.put("how", new String[]{"what", "which"});
        map.put("thank", new String[]{"like", "love"});
        String[] sws = map.get(word);
        if (sws != null) { // synonyms exist
            for (String s : sws) {
                wordStack.push(s);
            }
            return true;
        }
        return false;
    }
}

Test utility class
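To see the filter's control flow in isolation, here is a minimal Lucene-free sketch. Everything in it (the class name, the expand method, the "term/increment" output format) is hypothetical and only models the logic of the filter above: emit the upstream term with increment 1, then pop each pending synonym with increment 0 before moving on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Lucene-free model of the synonym filter's control flow; all names are hypothetical. */
final class SynonymFlowSketch {
    private static final Map<String, String[]> SYNONYMS = new HashMap<String, String[]>();
    static {
        // Same made-up synonyms as in the filter
        SYNONYMS.put("how", new String[] {"what", "which"});
        SYNONYMS.put("thank", new String[] {"like", "love"});
    }

    /** Returns "term/positionIncrement" strings in the order the filter would emit them. */
    static List<String> expand(List<String> input) {
        List<String> out = new ArrayList<String>();
        Deque<String> pending = new ArrayDeque<String>(); // plays the role of wordStack
        Iterator<String> upstream = input.iterator();
        while (true) {
            if (!pending.isEmpty()) {
                out.add(pending.pop() + "/0"); // synonym: same position as the original term
                continue;
            }
            if (!upstream.hasNext()) {
                break; // upstream exhausted and no synonyms left
            }
            String term = upstream.next();
            out.add(term + "/1");
            String[] synonyms = SYNONYMS.get(term);
            if (synonyms != null) {
                for (String s : synonyms) {
                    pending.push(s); // last pushed is emitted first
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("how are you thank you".split(" "))));
    }
}
```

Because the synonyms go through a stack, they come out in reverse order of insertion: "which" before "what", and "love" before "like".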
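import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class AnalyzerUtils {
    // Prints every term produced by the analyzer together with its start and end offsets
    public static void displayAllTokenInfo(String str, Analyzer a) {
        try {
            TokenStream tokenStream = a.tokenStream("content", new StringReader(str));
            PositionIncrementAttribute positionIncrementAttribute =
                    tokenStream.addAttribute(PositionIncrementAttribute.class);
            OffsetAttribute oa = tokenStream.addAttribute(OffsetAttribute.class);
            CharTermAttribute cta = tokenStream.addAttribute(CharTermAttribute.class);
            TypeAttribute ta = tokenStream.addAttribute(TypeAttribute.class);
            while (tokenStream.incrementToken()) {
                // System.out.print(positionIncrementAttribute.getPositionIncrement());
                System.out.print(cta + "{" + oa.startOffset() + "-" + oa.endOffset() + "}");
            }
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}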
@Test
public void test04() {
    // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
    // Analyzer analyzer2 = new StopAnalyzer(Version.LUCENE_35);
    // Analyzer analyzer3 = new SimpleAnalyzer(Version.LUCENE_35);
    // Analyzer analyzer4 = new WhitespaceAnalyzer(Version.LUCENE_35);
    // Analyzer analyzer = new MyStopAnalyzer(new String[]{"I", "you"});
    Analyzer analyzer = new MySameAnalyzer();
    String txt = "how are you thank you";
    // AnalyzerUtils.displayToken(txt, analyzer);
    AnalyzerUtils.displayAllTokenInfo(txt, analyzer);
}
Output
how{0-3}which{0-3}what{0-3}are{4-7}you{8-11}thank{12-17}love{12-17}like{12-17}you{18-21}
As you can see, the synonyms have been added. Each synonym carries the same offsets as the original term, and because its position increment is 0 it sits at the same position, so a search for "what" will now also find the document. I will not demonstrate the search here; the attachment is linked below, so download it and try for yourself. Synonyms are only one use case, of course: once we know how to write our own filters and tokenizers, we can do a great deal more.
If you repost this article, please cite the source: http://blog.csdn.net/a837199685/article/
Attachment link http://pan.baidu.com/s/1o6n2O9k
Writing your own synonym analyzer in Lucene