Writing your own synonym analyzer in Lucene

Source: Internet
Author: User

The previous article, on Lucene's analysis process, walked through how text is tokenized, so we already have a basic picture of it: an analyzer is built from a Tokenizer followed by a chain of TokenFilters. This article uses those two building blocks to implement a simple synonym analyzer of our own. Please point out any mistakes.

(i) Analysis

How do we implement synonyms? For example, Chongqing is also known as the Mountain City, so a search for "Mountain City" should also return articles that contain the word "Chongqing". To get there we need to understand what Lucene does with the text of a document. As the previous article described, Lucene provides three attribute classes that describe each token in a piece of text. They record, among other things, the offsets and the position increment of each token. Lucene uses the position increment to work out where each token sits, and that gives us the idea: we only have to insert each synonym at the same position as the original word, that is, with a position increment of 0.
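To make the position idea concrete, here is a minimal plain-Java sketch with no Lucene dependency. The class `PositionDemo` and its method `absolutePositions` are hypothetical names used only for illustration. The sketch models how increments add up into positions, and how an increment of 0 places a token on top of the previous one.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: these names are not part of the Lucene API.
class PositionDemo {

    // Turns per-token position increments into absolute positions.
    // Counting starts at -1, so a first increment of 1 yields position 0.
    static List<Integer> absolutePositions(int[] increments) {
        List<Integer> positions = new ArrayList<Integer>();
        int position = -1;
        for (int inc : increments) {
            position += inc;
            positions.add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] terms = {"Chongqing", "MountainCity", "is", "hot"};
        int[] increments = {1, 0, 1, 1}; // 0 = same position as the previous token
        List<Integer> positions = absolutePositions(increments);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " @ position " + positions.get(i));
        }
    }
}
```

In this model the synonym shares position 0 with the original word, so anything that looks up position 0 sees both terms.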

(ii) Implementation

First we implement an Analyzer of our own and override its tokenStream method, because the token stream has to pass through the custom filter we are about to write.

package org.xiezhaodong.lucene;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Declared final: Lucene 3.x asserts (when assertions are enabled) that
// Analyzer subclasses, or at least their tokenStream methods, are final.
public final class MySameAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace, then pass the tokens through our own synonym filter.
        return new MySameFilter(new WhitespaceTokenizer(Version.LUCENE_35, reader));
    }
}


Now look at the filter. The key points are in the comments.

package org.xiezhaodong.lucene;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Declared final for the same reason as the analyzer: Lucene 3.x asserts that
// TokenStream subclasses, or at least their incrementToken methods, are final.
public final class MySameFilter extends TokenFilter {

    // Hard-coded sample synonyms; a real filter would load these from elsewhere.
    private static final Map<String, String[]> SYNONYMS = new HashMap<String, String[]>();
    static {
        SYNONYMS.put("how", new String[] {"what", "which"});
        SYNONYMS.put("thank", new String[] {"like", "love"});
    }

    private CharTermAttribute cta = null;          // term text, covered in the previous article
    private PositionIncrementAttribute pia = null; // position increment of the current token
    private Stack<String> wordStack = null;        // synonyms waiting to be emitted
    private AttributeSource.State current = null;  // snapshot of the token that has synonyms

    protected MySameFilter(TokenStream input) {
        super(input);
        wordStack = new Stack<String>();
        // Obtain the two attributes we need from the token stream.
        cta = this.addAttribute(CharTermAttribute.class);
        pia = this.addAttribute(PositionIncrementAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (wordStack.size() > 0) {
            // A synonym is pending: emit it in place of a new input token.
            String word = wordStack.pop();
            restoreState(current);       // restore the original token's attributes (offsets etc.)
            cta.setEmpty();
            cta.append(word);            // replace the term text with the synonym
            pia.setPositionIncrement(0); // increment 0 = same position as the original word
            return true;
        }
        if (!this.input.incrementToken()) {
            return false; // input is exhausted
        }
        if (isHaveSameWords(cta.toString())) {
            // The word has synonyms: capture the current state of the stream
            // so that each synonym can reuse it on the following calls.
            current = captureState();
        }
        return true;
    }

    // Checks whether the word has synonyms and, if so, pushes them onto the stack.
    public boolean isHaveSameWords(String word) {
        String[] sws = SYNONYMS.get(word);
        if (sws != null) {
            for (String s : sws) {
                wordStack.push(s);
            }
            return true;
        }
        return false;
    }
}
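The control flow of the filter can be hard to follow because the synonyms are emitted on later calls. The following is a Lucene-free sketch of the same idea, assuming whitespace tokenization and the two sample synonym entries from the filter. `SynonymFlowDemo` and `analyze` are hypothetical names, and the `+1` / `+0` suffix is just this sketch's way of showing the position increment.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the filter's behaviour; it does not use Lucene classes.
class SynonymFlowDemo {

    static final Map<String, String[]> SYNONYMS = new HashMap<String, String[]>();
    static {
        SYNONYMS.put("how", new String[] {"what", "which"});
        SYNONYMS.put("thank", new String[] {"like", "love"});
    }

    // Returns every emitted token as "term{start-end}+increment".
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<String>();
        Deque<String> pending = new ArrayDeque<String>(); // plays the role of the synonym stack
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) == ' ') {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') {
                i++;
            }
            String term = text.substring(start, i);
            out.add(term + "{" + start + "-" + i + "}+1"); // original token advances the position
            String[] same = SYNONYMS.get(term);
            if (same != null) {
                for (String s : same) {
                    pending.push(s);
                }
                // The real filter returns one synonym per call; draining the
                // stack here produces the tokens in the same order.
                while (!pending.isEmpty()) {
                    out.add(pending.pop() + "{" + start + "-" + i + "}+0"); // same offsets, increment 0
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : analyze("how are you thank you")) {
            System.out.println(token);
        }
    }
}
```

Because the synonyms go through a stack, they come out in the reverse of the order in which they were pushed, which is why "which" appears before "what".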

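Test utility class

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class AnalyzerUtils {

    // Prints every token produced by the analyzer together with its offsets.
    public static void displayAllTokenInfo(String str, Analyzer a) {
        try {
            TokenStream tokenStream = a.tokenStream("content", new StringReader(str));
            PositionIncrementAttribute positionIncrementAttribute =
                    tokenStream.addAttribute(PositionIncrementAttribute.class);
            OffsetAttribute oa = tokenStream.addAttribute(OffsetAttribute.class);
            CharTermAttribute cta = tokenStream.addAttribute(CharTermAttribute.class);
            TypeAttribute ta = tokenStream.addAttribute(TypeAttribute.class);
            while (tokenStream.incrementToken()) {
                // System.out.print(positionIncrementAttribute.getPositionIncrement());
                System.out.print(cta + "{" + oa.startOffset() + "-" + oa.endOffset() + "}");
            }
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}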

@Test
public void test04() {
    // Alternative analyzers, left commented out:
    // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
    // Analyzer analyzer2 = new StopAnalyzer(Version.LUCENE_35);
    // Analyzer analyzer3 = new SimpleAnalyzer(Version.LUCENE_35);
    // Analyzer analyzer4 = new WhitespaceAnalyzer(Version.LUCENE_35);
    // Analyzer analyzer = new MyStopAnalyzer(new String[] {"I", "You"});
    Analyzer analyzer = new MySameAnalyzer();
    String txt = "how are you thank you";
    // AnalyzerUtils.displayToken(txt, analyzer);
    AnalyzerUtils.displayAllTokenInfo(txt, analyzer);
}


Output

how{0-3}which{0-3}what{0-3}are{4-7}you{8-11}thank{12-17}love{12-17}like{12-17}you{18-21}

As you can see, the synonyms have been added. Each one has the same offsets as the original word and, because its position increment is 0, the same position. As a result, a search for "what" will now find this document. The search itself is not demonstrated here; the code is in the attachment below, so download it and try it out. Synonyms are only one application, of course: once we know how to write our own filters and tokenizers, we can do a great deal more.
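The claim that a search for a synonym finds the document can be checked with a small model of a positional index. This is a plain-Java sketch under stated assumptions, not Lucene's index or query code: `SynonymSearchDemo`, `index` and `matchesPhrase` are hypothetical names, and the token list is the one from the output above with an increment of 0 for each synonym.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy positional index for a single document; illustrative only.
class SynonymSearchDemo {

    // Maps each term to the list of positions at which it occurs.
    static Map<String, List<Integer>> index(String[] terms, int[] increments) {
        Map<String, List<Integer>> idx = new HashMap<String, List<Integer>>();
        int position = -1;
        for (int i = 0; i < terms.length; i++) {
            position += increments[i];
            List<Integer> positions = idx.get(terms[i]);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                idx.put(terms[i], positions);
            }
            positions.add(position);
        }
        return idx;
    }

    // True if the words occur at consecutive positions; one word is a plain term lookup.
    static boolean matchesPhrase(Map<String, List<Integer>> idx, String... phrase) {
        List<Integer> starts = idx.get(phrase[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int k = 1; k < phrase.length && ok; k++) {
                List<Integer> positions = idx.get(phrase[k]);
                ok = positions != null && positions.contains(start + k);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"how", "which", "what", "are", "you", "thank", "love", "like", "you"};
        int[] increments = {1, 0, 0, 1, 1, 1, 0, 0, 1};
        Map<String, List<Integer>> idx = index(terms, increments);
        System.out.println("what: " + matchesPhrase(idx, "what"));
        System.out.println("what are you: " + matchesPhrase(idx, "what", "are", "you"));
        System.out.println("what you: " + matchesPhrase(idx, "what", "you"));
    }
}
```

In this model the synonym sits at the same position as "how", so it matches both on its own and as the first word of a phrase, while words that are not adjacent in the text still do not match as a phrase.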

When reprinting, please credit http://blog.csdn.net/a837199685/article/



Attachment link http://pan.baidu.com/s/1o6n2O9k
