How to write your own synonym analyzer in Lucene

Last Update:2015-02-03 Source: Internet

Author: User

The previous article, on Lucene's analysis process, walked through how text is tokenized and showed that an analyzer is built from a Tokenizer and a chain of TokenFilters. This article uses those two building blocks to implement a simple synonym analyzer of our own. If you spot a mistake, please point it out.

(i) Analysis

How do we implement synonyms? For example, Chongqing is also known as the Mountain City, so a search for "Mountain City" should also return articles that contain the word "Chongqing". To get there we need to understand what Lucene does with a document's text. As the previous article explained, Lucene describes each token in a piece of text with three attribute classes, which record the token's offsets, its position increment, and so on. Lucene uses the position increment to work out each token's position. That suggests the approach: add each synonym to the token stream at the same position as the original word, that is, with a position increment of 0.
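To make the role of the position increment concrete, here is a minimal sketch in plain Java, with no Lucene dependency. The class and method names are hypothetical, and the sketch assumes the first token ends up at position 0. It only shows that a token with increment 0 shares the position of the token before it.

```java
import java.util.ArrayList;
import java.util.List;

// Lucene-free illustration; PositionDemo and positions() are hypothetical names.
class PositionDemo {
    // Turns (term, positionIncrement) pairs into "term@absolutePosition" strings.
    static List<String> positions(String[] terms, int[] increments) {
        List<String> out = new ArrayList<String>();
        int position = -1; // assumption: start at -1 so the first token lands on 0
        for (int i = 0; i < terms.length; i++) {
            position += increments[i]; // increment 0 keeps the previous position
            out.add(terms[i] + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"how", "which", "what", "are", "you"};
        int[] increments = {1, 0, 0, 1, 1};
        // prints [how@0, which@0, what@0, are@1, you@2]
        System.out.println(positions(terms, increments));
    }
}
```

Because "which" and "what" sit on the same position as "how", a query for any of the three words can match the same place in the text.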

(ii) Implementation

First we implement an Analyzer of our own and override its tokenStream method, because we want the token stream to pass through our custom filter.

    package org.xiezhaodong.lucene;

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.util.Version;

    public class MySameAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String arg0, Reader arg1) {
            // Split on whitespace, then pass the stream through our own filter.
            return new MySameFilter(new WhitespaceTokenizer(Version.LUCENE_35, arg1));
        }
    }

Now the filter itself. The key points are explained in the comments.

    package org.xiezhaodong.lucene;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Stack;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    public class MySameFilter extends TokenFilter {
        private CharTermAttribute cta = null;          // term text, described in the previous article
        private Stack<String> wordStack = null;        // synonyms waiting to be emitted
        private AttributeSource.State current = null;  // captured state of the original token
        private PositionIncrementAttribute pia;        // used to set the position increment

        protected MySameFilter(TokenStream input) {
            super(input);
            wordStack = new Stack<String>();
            // Get the two key attributes from the token stream.
            pia = this.addAttribute(PositionIncrementAttribute.class);
            cta = this.addAttribute(CharTermAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            // Emit pending synonyms, one per call, before reading the next token.
            if (wordStack.size() > 0) {
                String word = wordStack.pop();
                restoreState(current);        // revert to the original token's state
                cta.setEmpty();
                cta.append(word);             // replace the term text with the synonym
                pia.setPositionIncrement(0);  // increment 0: same position as the original word
                return true;
            }
            if (!this.input.incrementToken()) return false;
            if (isHaveSameWords(cta.toString())) {
                // The word has synonyms: capture the current state of the stream,
                // as described in the previous article.
                current = captureState();
            }
            return true;
        }

        // Checks whether the word has synonyms and, if so, pushes them onto the stack.
        // The synonyms are hard-coded here and made up for the example.
        public boolean isHaveSameWords(String word) {
            Map<String, String[]> map = new HashMap<String, String[]>();
            map.put("how", new String[]{"what", "which"});
            map.put("thank", new String[]{"like", "love"});
            String[] sws = map.get(word);
            if (sws != null) { // synonyms exist
                for (String s : sws) {
                    wordStack.push(s);
                }
                return true;
            }
            return false;
        }
    }
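The control flow of the filter can be easier to follow without the Lucene plumbing. The following is a minimal plain-Java model of it, not the filter itself: every name in it is hypothetical, the synonym table is the same made-up one, and the text is simply split on whitespace. It also builds the synonym map once, unlike the filter above, which rebuilds its map for every token.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Lucene-free model of the filter's control flow; all names are hypothetical.
class SynonymExpansionDemo {
    // Built once and shared by every call.
    static final Map<String, String[]> SYNONYMS = new HashMap<String, String[]>();
    static {
        SYNONYMS.put("how", new String[]{"what", "which"});
        SYNONYMS.put("thank", new String[]{"like", "love"});
    }

    // Returns "term:increment" entries in the order the tokens would be emitted.
    static List<String> expand(String text) {
        List<String> out = new ArrayList<String>();
        Deque<String> pending = new ArrayDeque<String>(); // plays the role of wordStack
        for (String token : text.trim().split("\\s+")) {
            out.add(token + ":1"); // the original token advances the position
            String[] same = SYNONYMS.get(token);
            if (same != null) {
                for (String s : same) {
                    pending.push(s);
                }
            }
            while (!pending.isEmpty()) {
                out.add(pending.pop() + ":0"); // a synonym stays on the same position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [how:1, which:0, what:0, are:1, you:1, thank:1, love:0, like:0, you:1]
        System.out.println(expand("how are you thank you"));
    }
}
```

The synonyms come out in reverse order of insertion because a stack is last in, first out, which is why "which" appears before "what".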

Test utility class

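    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public class AnalyzerUtils {
        // Prints every token of str as term{startOffset-endOffset}.
        public static void displayAllTokenInfo(String str, Analyzer a) {
            try {
                TokenStream tokenStream = a.tokenStream("content", new StringReader(str));
                PositionIncrementAttribute positionIncrementAttribute =
                        tokenStream.addAttribute(PositionIncrementAttribute.class);
                OffsetAttribute oa = tokenStream.addAttribute(OffsetAttribute.class);
                CharTermAttribute cta = tokenStream.addAttribute(CharTermAttribute.class);
                TypeAttribute ta = tokenStream.addAttribute(TypeAttribute.class);
                while (tokenStream.incrementToken()) {
                    // System.out.print(positionIncrementAttribute.getPositionIncrement());
                    System.out.print(cta + "{" + oa.startOffset() + "-" + oa.endOffset() + "}");
                }
                System.out.println();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }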

    @Test
    public void test04() {
        // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        // Analyzer analyzer2 = new StopAnalyzer(Version.LUCENE_35);
        // Analyzer analyzer3 = new SimpleAnalyzer(Version.LUCENE_35);
        // Analyzer analyzer4 = new WhitespaceAnalyzer(Version.LUCENE_35);
        // Analyzer analyzer = new MyStopAnalyzer(new String[]{"I", "You"});
        Analyzer analyzer = new MySameAnalyzer();
        String txt = "how are you thank you";
        // AnalyzerUtils.displayToken(txt, analyzer);
        AnalyzerUtils.displayAllTokenInfo(txt, analyzer);
    }

Output

    how{0-3}which{0-3}what{0-3}are{4-7}you{8-11}thank{12-17}love{12-17}like{12-17}you{18-21}

As you can see, the synonyms have been added, and each one shares the offsets and the position of the original word, so a search for "what" will now find this document. I won't demonstrate the search here; the source code is attached below if you want to download it and try. Synonyms are only one application: once you know how to write your own TokenFilter and Tokenizer, you can do a great deal more.
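Since the search itself is not demonstrated, here is a toy illustration of why the extra tokens make the document findable. It is a plain-Java stand-in for an inverted index, not Lucene's index or search API, and all names in it are hypothetical. Document 1 is indexed with the expanded token stream, document 2 with the original words only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; TinyIndexDemo is a hypothetical name, not a Lucene class.
class TinyIndexDemo {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    void add(int docId, List<String> terms) {
        for (String term : terms) {
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    Set<Integer> search(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    public static void main(String[] args) {
        TinyIndexDemo index = new TinyIndexDemo();
        // "how are you thank you" after synonym expansion
        index.add(1, Arrays.asList("how", "which", "what", "are", "you",
                "thank", "love", "like", "you"));
        // the same text without the synonym filter
        index.add(2, Arrays.asList("how", "are", "you", "thank", "you"));
        System.out.println(index.search("what")); // prints [1]
        System.out.println(index.search("how"));  // prints [1, 2]
    }
}
```

Only the document indexed with synonyms is returned for "what", while both are returned for "how".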

When reprinting, please credit the source: http://blog.csdn.net/a837199685/article/

Attachment link: http://pan.baidu.com/s/1o6n2O9k
