Lucene 3.5: Implementing a custom synonym analyzer

Source: Internet
Author: User

I have been learning Lucene 3.5 recently, and I find the ideas in it really interesting. Today I will share how to implement a synonym analyzer of our own.

An analyzer is built from a Tokenizer and a number of TokenFilters. This article shows how to use these two components to implement a simple synonym analyzer of our own; please point out anything I got wrong.

First, design ideas

What is synonym search? For example, when we search for the word "China", we also want to find documents that contain the word "mainland": a search for "China" returns articles containing "mainland", and a search for "mainland" returns articles containing "China". To understand how Lucene handles our documents, we first need to know these three classes:

PositionIncrementAttribute (stores the position increment, i.e. the distance between token units)

OffsetAttribute (the start and end character offsets of each token unit)

CharTermAttribute (stores the text of each token unit)

As shown in the figure:


These attributes are managed by a class called AttributeSource, which holds their state. AttributeSource contains a static inner class called State, and we can capture the current state at any point during processing with the captureState() method:

  /**
   * Captures the state of all Attributes. The return value can be passed to
   * {@link #restoreState} to restore the state of this or another AttributeSource.
   */
  public State captureState() {
    final State state = this.getCurrentState();
    return (state == null) ? null : (State) state.clone();
  }
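To make the role of State concrete, here is a minimal, Lucene-free sketch of what capturing and restoring amounts to: taking a cloned snapshot of a mutable attribute before overwriting it, and swapping the snapshot back later. The TermState class here is hypothetical, invented purely for illustration.

```java
public class CaptureStateDemo {

	// Hypothetical stand-in for an attribute's mutable state
	static class TermState {
		StringBuilder term = new StringBuilder();

		public TermState clone() {
			TermState copy = new TermState();
			copy.term.append(this.term);
			return copy;
		}
	}

	public static void main(String[] args) {
		TermState current = new TermState();
		current.term.append("China");

		TermState saved = current.clone();  // like captureState()

		current.term.setLength(0);          // like cta.setEmpty()
		current.term.append("mainland");    // like cta.append(...)
		System.out.println(current.term);   // prints "mainland"

		current = saved;                    // like restoreState()
		System.out.println(current.term);   // prints "China"
	}
}
```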
  
Lucene determines a token's position from its position increment, so we only need to add our synonyms at the corresponding position, with an increment of 0.
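As a quick illustration (plain Java, no Lucene classes; the token list is made up), this is how position increments accumulate into absolute positions, and why an increment of 0 puts a synonym on the same position as the token before it:

```java
import java.util.Arrays;
import java.util.List;

public class PositionIncrementDemo {
	public static void main(String[] args) {
		// Each pair is (term, positionIncrement). A synonym carries
		// an increment of 0, so it lands on the previous term's position.
		List<String[]> tokens = Arrays.asList(
				new String[] {"China", "1"},
				new String[] {"mainland", "0"},  // synonym of "China"
				new String[] {"Chongqing", "1"});

		int position = -1; // positions start before the first token
		for (String[] t : tokens) {
			position += Integer.parseInt(t[1]);
			System.out.println(position + ": " + t[0]);
		}
		// Output:
		// 0: China
		// 0: mainland
		// 1: Chongqing
	}
}
```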

Second, the implementation

1. First we define an interface for synonyms (to improve the extensibility of our program):

package com.dhb.util;

public interface SameWordContext {
	String[] getSameWords(String name);
}

2. We implement this interface (putting our synonyms in a map):

package com.dhb.util;

import java.util.HashMap;
import java.util.Map;

public class SimpleSameWordContext implements SameWordContext {

	Map<String, String[]> maps = new HashMap<String, String[]>();

	public SimpleSameWordContext() {
		maps.put("China", new String[] {"Celestial", "mainland"});
		maps.put("I", new String[] {"I", "we"});
	}

	@Override
	public String[] getSameWords(String name) {
		return maps.get(name);
	}
}


3. We implement our own Analyzer, overriding the tokenStream method (because we want to insert a custom TokenFilter into the stream):

package com.dhb.util;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.analysis.MMSegTokenizer;

public class MySameAnalyzer extends Analyzer {

	private SameWordContext sameWordContext;

	public MySameAnalyzer(SameWordContext sameWordContext) {
		this.sameWordContext = sameWordContext;
	}

	@Override
	public TokenStream tokenStream(String fieldName, Reader reader) {
		Dictionary dic = Dictionary.getInstance("f:\\Deng Haibo jar\\mmseg4j\\mmseg4j-1.8.5\\data");
		return new MySameTokenFilter(new MMSegTokenizer(new MaxWordSeg(dic), reader), sameWordContext);
	}
}

4. Now look at our filter. Pay close attention to the comments; the key points are explained there:

package com.dhb.util;

import java.io.IOException;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class MySameTokenFilter extends TokenFilter {

	private CharTermAttribute cta = null;
	private PositionIncrementAttribute pia = null;
	private AttributeSource.State current;
	private Stack<String> sames = null;
	private SameWordContext sameWordContext;

	protected MySameTokenFilter(TokenStream input, SameWordContext sameWordContext) {
		super(input);
		cta = this.addAttribute(CharTermAttribute.class);
		pia = this.addAttribute(PositionIncrementAttribute.class);
		sames = new Stack<String>();
		this.sameWordContext = sameWordContext;
	}

	@Override
	public boolean incrementToken() throws IOException {
		while (sames.size() > 0) {
			// Take a synonym off the stack
			String str = sames.pop();
			// Restore the state saved for the original token
			restoreState(current);
			cta.setEmpty();
			cta.append(str);
			// Set the position increment to 0 so the synonym
			// shares the original token's position
			pia.setPositionIncrement(0);
			// If we do not return true here, the token would be
			// overwritten by the next one
			return true;
		}

		if (!input.incrementToken()) {
			return false;
		}

		// This cannot be done before consuming the token, otherwise the
		// synonym would not line up with the token it belongs to.
		if (addSames(cta.toString())) {
			// A word with synonyms was found, so save the current state first
			current = captureState();
		}
		return true;
	}

	private boolean addSames(String name) {
		String[] sws = sameWordContext.getSameWords(name);
		if (sws != null) {
			for (String s : sws) {
				sames.push(s);
			}
			return true;
		}
		return false;
	}
}
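To see the control flow of incrementToken() without any Lucene dependencies, here is a simplified, hypothetical model of the same stack-based logic (the class and method names are my own, not Lucene's): synonyms on the stack take priority, and after emitting a token that has synonyms, the stack is drained before the next input token is pulled.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Stack;

public class SynonymFlowDemo {
	public static void main(String[] args) {
		Iterator<String> input = Arrays.asList("I", "from", "China").iterator();
		Stack<String> sames = new Stack<String>();

		String token;
		while ((token = nextToken(input, sames)) != null) {
			System.out.println(token);
		}
		// Output (one per line): I, from, China, mainland, Celestial
	}

	// Mirrors incrementToken(): drain the synonym stack first.
	static String nextToken(Iterator<String> input, Stack<String> sames) {
		if (!sames.isEmpty()) {
			// The real filter would restoreState() and
			// setPositionIncrement(0) here
			return sames.pop();
		}
		if (!input.hasNext()) {
			return null;
		}
		String t = input.next();
		if (t.equals("China")) { // stands in for sameWordContext.getSameWords(t)
			sames.push("Celestial");
			sames.push("mainland");
			// The real filter would captureState() here
		}
		return t;
	}
}
```

Because the synonyms are pushed and then popped, they come out in reverse order ("mainland" before "Celestial"), which matches the position listing shown later.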

5. Finally, here is the relevant method from our test utility class:

	@Test
	public void test06() {
		Analyzer a = new MySameAnalyzer(new SimpleSameWordContext());
		String txt = "I come from Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nanan District, Chongqing, China";
		Directory dir = new RAMDirectory();
		try {
			IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_35, a));
			Document doc = new Document();
			doc.add(new Field("content", txt, Field.Store.YES, Field.Index.ANALYZED));
			writer.addDocument(doc);
			writer.close();

			IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
			// Searching for either "mainland" or "China" finds the document
			TopDocs tds = searcher.search(new TermQuery(new Term("content", "mainland")), 10);
			Document d = searcher.doc(tds.scoreDocs[0].doc);
			System.out.println(d.get("content"));
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (LockObtainFailedException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

Here is the position information printed for each token:

1: I [0-1] --> word
0: we [0-1] --> word
0: I [0-1] --> word
1: from [1-3] --> word
1: China [3-5] --> word
0: mainland [3-5] --> word
0: Celestial [3-5] --> word
1: Chongqing [5-7] --> word
1: Nanan [7-9] --> word
1: District [9-10] --> word
1: Chongwen [10-12] --> word
1: Road [12-13] --> word
1: 2 [13-14] --> digit
1: No. [14-15] --> word
1: Chongqing [15-17] --> word
1: Posts [17-19] --> word
1: tvu [18-20] --> word
1: University [19-21] --> word

We can see that the synonyms have been added with the same offsets as the original token and a position increment of 0, so a search for any of these words will find the document.

Here is my mind map:


