I have been learning Lucene 3.5 recently, and I find the ideas behind it really elegant. Today I want to share how to implement our own synonym analyzer.
An analyzer is composed of a Tokenizer and a number of TokenFilters. This article shows how to use these two building blocks to implement a simple synonym analyzer of our own; please point out anything I get wrong.
First, the design idea
What does synonym search mean? For example, when we index the word "China", we should also be able to find the same documents by searching for "mainland": a search for "China" returns articles containing "China", and a search for "mainland" returns those same articles. To do this we first have to understand how Lucene processes our documents, and in particular these three classes:
PositionIncrementAttribute (the position increment, i.e. the distance between token units)
OffsetAttribute (the character offsets of each token unit)
CharTermAttribute (the text of each token unit, i.e. the term itself)
As shown in the figure:
These attributes are all managed by a class called AttributeSource, which stores their values. AttributeSource has a static inner class called State, and its captureState() method lets us capture the current state so that we can restore it later in the filtering process:
/**
 * Captures the state of all Attributes. The return value can be passed to
 * {@link #restoreState} to restore the state of this or another AttributeSource.
 */
public State captureState() {
    final State state = this.getCurrentState();
    return (state == null) ? null : (State) state.clone();
}
Lucene determines a term's position from the position increment, so all we have to do is add our synonyms at the same position as the original term, i.e. with a position increment of 0.
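To see these three attributes in action, here is a minimal sketch of my own (not from the original post) that prints them for every token an analyzer produces; the position listing shown after the test at the end of this article was presumably produced by a helper of this kind. It assumes the Lucene 3.5 API; the class name TokenInfoPrinter and the use of StandardAnalyzer are just for illustration, any analyzer can be passed in.

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class TokenInfoPrinter {

    // prints "<positionIncrement>: <term> [<start>-<end>] --> <type>" for each token
    public static void display(Analyzer analyzer, String text) throws Exception {
        TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
        PositionIncrementAttribute pia = stream.addAttribute(PositionIncrementAttribute.class);
        OffsetAttribute oa = stream.addAttribute(OffsetAttribute.class);
        CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
        TypeAttribute ta = stream.addAttribute(TypeAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(pia.getPositionIncrement() + ": " + cta
                    + " [" + oa.startOffset() + "-" + oa.endOffset() + "] --> " + ta.type());
        }
    }

    public static void main(String[] args) throws Exception {
        display(new StandardAnalyzer(Version.LUCENE_35), "hello lucene world");
    }
}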
Second, the implementation
1. First we define a synonym interface (to make our program more extensible):
package com.dhb.util;

public interface SameWordContext {
    public String[] getSameWords(String name);
}
2. Next we implement this interface, keeping the synonyms in a map:
package com.dhb.util;

import java.util.HashMap;
import java.util.Map;

public class SimpleSameWordContext implements SameWordContext {

    Map<String, String[]> maps = new HashMap<String, String[]>();

    public SimpleSameWordContext() {
        maps.put("China", new String[] {"Celestial", "mainland"});
        maps.put("I", new String[] {"me", "we"});
    }

    @Override
    public String[] getSameWords(String name) {
        return maps.get(name);
    }
}
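As a quick sanity check (this snippet is mine, not from the original post), the context can be queried directly:

SameWordContext ctx = new SimpleSameWordContext();
String[] sws = ctx.getSameWords("China");       // {"Celestial", "mainland"}
String[] none = ctx.getSameWords("Chongqing");  // null: no synonyms registered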
3. Now we implement our own Analyzer, overriding the tokenStream method (because we want to plug our custom TokenFilter into the stream):
package com.dhb.util;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.analysis.MMSegTokenizer;

public class MySameAnalyzer extends Analyzer {

    private SameWordContext sameWordContext;

    public MySameAnalyzer(SameWordContext sameWordContext) {
        this.sameWordContext = sameWordContext;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // load the mmseg4j dictionary from the local data directory
        Dictionary dic = Dictionary.getInstance("f:\\Deng Haibo jar\\mmseg4j\\mmseg4j-1.8.5\\data");
        return new MySameTokenFilter(
                new MMSegTokenizer(new MaxWordSeg(dic), reader), sameWordContext);
    }
}
4. Now look at our filter; the key points are explained in the inline comments:
package com.dhb.util;

import java.io.IOException;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class MySameTokenFilter extends TokenFilter {

    private CharTermAttribute cta = null;
    private PositionIncrementAttribute pia = null;
    private AttributeSource.State current;
    private Stack<String> sames = null;
    private SameWordContext sameWordContext;

    protected MySameTokenFilter(TokenStream input, SameWordContext sameWordContext) {
        super(input);
        cta = this.addAttribute(CharTermAttribute.class);
        pia = this.addAttribute(PositionIncrementAttribute.class);
        sames = new Stack<String>();
        this.sameWordContext = sameWordContext;
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (sames.size() > 0) {
            // take a synonym off the stack
            String str = sames.pop();
            // restore the state captured for the original token
            restoreState(current);
            cta.setEmpty();
            cta.append(str);
            // position increment 0: the synonym shares the original token's position
            pia.setPositionIncrement(0);
            // must return true here, otherwise the synonym would be overwritten by the next token
            return true;
        }

        if (!input.incrementToken()) return false;

        // this check cannot go at the top of the method: the original token has to be
        // emitted first, then its synonyms are emitted on the following calls
        if (addSames(cta.toString())) {
            // a synonym was found, so capture the current state first
            current = captureState();
        }
        return true;
    }

    // pushes the synonyms of the given term (if any) onto the stack; an earlier version
    // hard-coded the map here, now it is delegated to the SameWordContext
    private boolean addSames(String name) {
        String[] sws = sameWordContext.getSameWords(name);
        if (sws != null) {
            for (String s : sws) {
                sames.push(s);
            }
            return true;
        }
        return false;
    }
}
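The filter does not care which tokenizer feeds it. As a small sketch of my own (not in the original article, and independent of mmseg4j), it can be exercised directly on top of Lucene's WhitespaceTokenizer; the class name FilterQuickCheck is made up, and it sits in the same package because the filter's constructor is protected.

package com.dhb.util;

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class FilterQuickCheck {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new MySameTokenFilter(
                new WhitespaceTokenizer(Version.LUCENE_35, new StringReader("I love China")),
                new SimpleSameWordContext());
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute inc = ts.addAttribute(PositionIncrementAttribute.class);
        // expected: "I" followed by its synonyms at increment 0, then "love",
        // then "China" followed by its synonyms at increment 0
        while (ts.incrementToken()) {
            System.out.println(inc.getPositionIncrement() + ": " + term);
        }
    }
}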
5. Finally, here is one of the methods from my test class:
@Test
public void test06() {
    Analyzer a = new MySameAnalyzer(new SimpleSameWordContext());
    String txt = "I come from Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nanan District, Chongqing, China";
    Directory dir = new RAMDirectory();
    try {
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_35, a));
        Document doc = new Document();
        doc.add(new Field("content", txt, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        // searching for either "mainland" or "China" finds the document
        TopDocs tds = searcher.search(new TermQuery(new Term("content", "mainland")), 10);
        Document d = searcher.doc(tds.scoreDocs[0].doc);
        System.out.println(d.get("content"));
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (LockObtainFailedException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The position information for this text looks like this:
1: I [0-1] --> word
0: we [0-1] --> word
0: me [0-1] --> word
1: come from [1-3] --> word
1: China [3-5] --> word
0: mainland [3-5] --> word
0: Celestial [3-5] --> word
1: Chongqing [5-7] --> word
1: Nanan [7-9] --> word
1: District [9-10] --> word
1: Chongwen [10-12] --> word
1: Road [12-13] --> word
1: 2 [13-14] --> digit
1: No. [14-15] --> word
1: Chongqing [15-17] --> word
1: Posts and Telecommunications [17-19] --> word
1: tvu [18-20] --> word
1: University [19-21] --> word
You can see that the synonyms have been added: they share the same offsets as the original term and have a position increment of 0, so searching for "mainland" (or "China") now finds the document.
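Because each synonym is indexed with a position increment of 0, it occupies exactly the same position as the original term. One practical consequence (my own illustration, not part of the original post) is that phrase queries treat the synonym as if it were the original word; reusing the searcher from the test above, and assuming the analyzer segments "China" and "Chongqing" as adjacent tokens as in the listing, something like the following should match as well:

PhraseQuery pq = new PhraseQuery();
pq.add(new Term("content", "mainland"));
pq.add(new Term("content", "Chongqing"));
// should match the same document as the phrase "China Chongqing",
// because "mainland" was indexed at the same position as "China"
TopDocs hits = searcher.search(pq, 10);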
Here is my mind map: