I have been learning Lucene 3.5 recently, and I find the ideas behind it really elegant. Today I want to share how to implement our own synonym analyzer.
An analyzer is composed of a Tokenizer and a number of TokenFilters. This article shows how to use these two building blocks to implement a simple synonym analyzer of our own; please point out anything I get wrong.
First, the design idea
What does synonym search mean? For example, when we index the word "China", we should also be able to find the same documents by searching for "mainland": a search for "China" returns articles containing "China", and a search for "mainland" returns those same articles. To do this we first have to understand how Lucene processes our documents, and in particular these three classes:
PositionIncrementAttribute (the position increment, i.e. the distance between token units)
OffsetAttribute (the character offsets of each token unit)
CharTermAttribute (the text of each token unit, i.e. the term itself)
As shown in the figure:
These attributes are all managed by a class called AttributeSource, which stores their values. AttributeSource has a static inner class called State, and its captureState() method lets us capture the current state so that we can restore it later in the filtering process:
/**
 * Captures the state of all Attributes. The return value can be passed to
 * {@link #restoreState} to restore the state of this or another AttributeSource.
 */
public State captureState() {
    final State state = this.getCurrentState();
    return (state == null) ? null : (State) state.clone();
}
Lucene determines a term's position from the position increment, so all we have to do is add our synonyms at the same position as the original term, i.e. with a position increment of 0.
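To see these three attributes in action, here is a minimal sketch of my own (not from the original post) that prints them for every token an analyzer produces; the position listing shown after the test at the end of this article was presumably produced by a helper of this kind. It assumes the Lucene 3.5 API; the class name TokenInfoPrinter and the use of StandardAnalyzer are just for illustration, any analyzer can be passed in.

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class TokenInfoPrinter {

    // prints "<positionIncrement>: <term> [<start>-<end>] --> <type>" for each token
    public static void display(Analyzer analyzer, String text) throws Exception {
        TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
        PositionIncrementAttribute pia = stream.addAttribute(PositionIncrementAttribute.class);
        OffsetAttribute oa = stream.addAttribute(OffsetAttribute.class);
        CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
        TypeAttribute ta = stream.addAttribute(TypeAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(pia.getPositionIncrement() + ": " + cta
                    + " [" + oa.startOffset() + "-" + oa.endOffset() + "] --> " + ta.type());
        }
    }

    public static void main(String[] args) throws Exception {
        display(new StandardAnalyzer(Version.LUCENE_35), "hello lucene world");
    }
}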
Second, the implementation
1. First we define a synonym interface (to make our program more extensible):
package com.dhb.util;

public interface SameWordContext {
    public String[] getSameWords(String name);
}
2. Next we implement this interface, keeping the synonyms in a map:
package com.dhb.util;

import java.util.HashMap;
import java.util.Map;

public class SimpleSameWordContext implements SameWordContext {

    Map<String, String[]> maps = new HashMap<String, String[]>();

    public SimpleSameWordContext() {
        maps.put("China", new String[] {"Celestial", "mainland"});
        maps.put("I", new String[] {"me", "we"});
    }

    @Override
    public String[] getSameWords(String name) {
        return maps.get(name);
    }
}
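As a quick sanity check (this snippet is mine, not from the original post), the context can be queried directly:

SameWordContext ctx = new SimpleSameWordContext();
String[] sws = ctx.getSameWords("China");       // {"Celestial", "mainland"}
String[] none = ctx.getSameWords("Chongqing");  // null: no synonyms registered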
3. Now we implement our own Analyzer, overriding the tokenStream method (because we want to plug our custom TokenFilter into the stream):
package com.dhb.util;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.analysis.MMSegTokenizer;

public class MySameAnalyzer extends Analyzer {

    private SameWordContext sameWordContext;

    public MySameAnalyzer(SameWordContext sameWordContext) {
        this.sameWordContext = sameWordContext;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // load the mmseg4j dictionary from the local data directory
        Dictionary dic = Dictionary.getInstance("f:\\Deng Haibo jar\\mmseg4j\\mmseg4j-1.8.5\\data");
        return new MySameTokenFilter(
                new MMSegTokenizer(new MaxWordSeg(dic), reader), sameWordContext);
    }
}
4. Now look at our filter; the key points are explained in the inline comments:
package com.dhb.util;

import java.io.IOException;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class MySameTokenFilter extends TokenFilter {

    private CharTermAttribute cta = null;
    private PositionIncrementAttribute pia = null;
    private AttributeSource.State current;
    private Stack<String> sames = null;
    private SameWordContext sameWordContext;

    protected MySameTokenFilter(TokenStream input, SameWordContext sameWordContext) {
        super(input);
        cta = this.addAttribute(CharTermAttribute.class);
        pia = this.addAttribute(PositionIncrementAttribute.class);
        sames = new Stack<String>();
        this.sameWordContext = sameWordContext;
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (sames.size() > 0) {
            // take a synonym off the stack
            String str = sames.pop();
            // restore the state captured for the original token
            restoreState(current);
            cta.setEmpty();
            cta.append(str);
            // position increment 0: the synonym shares the original token's position
            pia.setPositionIncrement(0);
            // must return true here, otherwise the synonym would be overwritten by the next token
            return true;
        }

        if (!input.incrementToken()) return false;

        // this check cannot go at the top of the method: the original token has to be
        // emitted first, then its synonyms are emitted on the following calls
        if (addSames(cta.toString())) {
            // a synonym was found, so capture the current state first
            current = captureState();
        }
        return true;
    }

    // pushes the synonyms of the given term (if any) onto the stack; an earlier version
    // hard-coded the map here, now it is delegated to the SameWordContext
    private boolean addSames(String name) {
        String[] sws = sameWordContext.getSameWords(name);
        if (sws != null) {
            for (String s : sws) {
                sames.push(s);
            }
            return true;
        }
        return false;
    }
}
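The filter does not care which tokenizer feeds it. As a small sketch of my own (not in the original article, and independent of mmseg4j), it can be exercised directly on top of Lucene's WhitespaceTokenizer; the class name FilterQuickCheck is made up, and it sits in the same package because the filter's constructor is protected.

package com.dhb.util;

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class FilterQuickCheck {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new MySameTokenFilter(
                new WhitespaceTokenizer(Version.LUCENE_35, new StringReader("I love China")),
                new SimpleSameWordContext());
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute inc = ts.addAttribute(PositionIncrementAttribute.class);
        // expected: "I" followed by its synonyms at increment 0, then "love",
        // then "China" followed by its synonyms at increment 0
        while (ts.incrementToken()) {
            System.out.println(inc.getPositionIncrement() + ": " + term);
        }
    }
}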
5. Finally, here is one of the methods from my test class:
@Test
public void test06() {
    Analyzer a = new MySameAnalyzer(new SimpleSameWordContext());
    String txt = "I come from Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nanan District, Chongqing, China";
    Directory dir = new RAMDirectory();
    try {
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_35, a));
        Document doc = new Document();
        doc.add(new Field("content", txt, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        // searching for either "mainland" or "China" finds the document
        TopDocs tds = searcher.search(new TermQuery(new Term("content", "mainland")), 10);
        Document d = searcher.doc(tds.scoreDocs[0].doc);
        System.out.println(d.get("content"));
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (LockObtainFailedException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The position information for this text looks like this:
1: I [0-1] --> word
0: we [0-1] --> word
0: me [0-1] --> word
1: come from [1-3] --> word
1: China [3-5] --> word
0: mainland [3-5] --> word
0: Celestial [3-5] --> word
1: Chongqing [5-7] --> word
1: Nanan [7-9] --> word
1: District [9-10] --> word
1: Chongwen [10-12] --> word
1: Road [12-13] --> word
1: 2 [13-14] --> digit
1: No. [14-15] --> word
1: Chongqing [15-17] --> word
1: Posts and Telecommunications [17-19] --> word
1: tvu [18-20] --> word
1: University [19-21] --> word
You can see that the synonyms have been added: they share the same offsets as the original term and have a position increment of 0, so searching for "mainland" (or "China") now finds the document.
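Because each synonym is indexed with a position increment of 0, it occupies exactly the same position as the original term. One practical consequence (my own illustration, not part of the original post) is that phrase queries treat the synonym as if it were the original word; reusing the searcher from the test above, and assuming the analyzer segments "China" and "Chongqing" as adjacent tokens as in the listing, something like the following should match as well:

PhraseQuery pq = new PhraseQuery();
pq.add(new Term("content", "mainland"));
pq.add(new Term("content", "Chongqing"));
// should match the same document as the phrase "China Chongqing",
// because "mainland" was indexed at the same position as "China"
TopDocs hits = searcher.search(pq, 10);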
Here is my mind map: