Lucene word breaker for exact extraction of user-defined keywords (Lucene version 3.6)


This post targets Lucene 3.6.0. If your Lucene version is 5.x, see this post instead: http://blog.csdn.net/echoyy/article/details/78468225


During word segmentation it is sometimes necessary to extract only the custom keywords defined in the dictionary, but the traditional word breaker (IKAnalyzer) does not seem to support this directly.


A CSDN forum thread suggested a solution: index the dictionary and look entries up with Lucene's TermQuery.


However, when I called TermQuery, nothing was retrieved.


It turned out that the problem was in how the index was built.


At first I read the dictionary file through a stream, like this: doc.add(new Field("contents", new FileReader(file)));


With an index built that way, TermQuery finds nothing. The reason is that this form of the Field constructor analyzes (tokenizes) the text at indexing time. For example, if a dictionary entry is the Chinese word for "hello", it is split into its two single characters in the index, so a TermQuery for the whole word never matches anything.


The solution is to read the dictionary with a BufferedReader, store its lines in a List&lt;String&gt;, and then loop over the list, calling doc.add(new Field("contents", content, Field.Store.YES, Field.Index.NOT_ANALYZED)); for each entry.


Note that the field index option (Field.Index.*) here is NOT_ANALYZED: this option indexes the field value as a single token while still making it searchable. It is suited to field values that must not be decomposed, such as URLs, file paths, dates, names, and phone numbers.
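To illustrate the difference, here is a minimal sketch in plain Java (no Lucene; the class and method names are mine, not Lucene API) that simulates what ANALYZED versus NOT_ANALYZED means for an exact term lookup:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class FieldIndexingDemo {

    // Simulates an ANALYZED field: the value is tokenized,
    // so the index stores each token as a separate term.
    public static Set<String> analyzed(String value) {
        return new HashSet<String>(Arrays.asList(value.split("\\s+")));
    }

    // Simulates a NOT_ANALYZED field: the whole value
    // becomes one single term in the index.
    public static Set<String> notAnalyzed(String value) {
        return Collections.singleton(value);
    }

    public static void main(String[] args) {
        String entry = "hello world"; // a dictionary entry
        String query = "hello world"; // the exact term a TermQuery looks up

        // ANALYZED: the index holds "hello" and "world", so the full term misses.
        System.out.println(analyzed(entry).contains(query));    // false
        // NOT_ANALYZED: the index holds "hello world" as one term, so it matches.
        System.out.println(notAnalyzed(entry).contains(query)); // true
    }
}
```

This is exactly why the FileReader-based field failed the TermQuery lookup while the NOT_ANALYZED field succeeds.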


With that change, everything works.


Below is the indexing and retrieval code.

1. Building an Index

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Builds the keyword index: each line of the dictionary file is stored
 * as a single, un-analyzed term so that TermQuery can match it exactly.
 */
public class Indexer {

	// Writes the index to the specified directory
	private IndexWriter writer;

	/** Constructor: instantiates the IndexWriter. */
	public Indexer(String indexDir) throws Exception {
		// Open the directory that will hold the index
		Directory dir = FSDirectory.open(new File(indexDir));
		// Instantiate the analyzer
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
		// Instantiate the IndexWriterConfig
		IndexWriterConfig con = new IndexWriterConfig(Version.LUCENE_35, analyzer);
		// Instantiate the IndexWriter
		writer = new IndexWriter(dir, con);
	}

	/** Closes the index writer. */
	public void close() throws Exception {
		writer.close();
	}

	/** Indexes all files in the given directory; returns the document count. */
	public int index(String dataDir) throws Exception {
		// Loop over the files to be indexed
		File[] files = new File(dataDir).listFiles();
		for (File file : files) {
			indexFile(file);
		}
		// Return how many documents have been indexed
		return writer.numDocs();
	}

	/** Indexes a single file. */
	private void indexFile(File file) throws Exception {
		System.out.println("Index file: " + file.getCanonicalPath());
		// Build a document from the file and write it into the index
		Document doc = getDocument(file);
		writer.addDocument(doc);
	}

	/**
	 * Builds the document: each line of the dictionary becomes one
	 * "contents" field, stored and NOT analyzed.
	 */
	private Document getDocument(File file) throws Exception {
		Document doc = new Document();
		List<String> contents = this.getContent(file);
		for (String content : contents) {
			doc.add(new Field("contents", content, Field.Store.YES, Field.Index.NOT_ANALYZED));
		}
		return doc;
	}

	/** Reads the dictionary file line by line into a List<String>. */
	private List<String> getContent(File file) {
		List<String> strList = new ArrayList<String>();
		try {
			InputStream stream = new FileInputStream(file);
			String code = "UTF-8";
			BufferedReader br = new BufferedReader(new InputStreamReader(stream, code));
			String str = br.readLine();
			while (str != null) {
				strList.add(str);
				str = br.readLine();
			}
			br.close();
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
		return strList;
	}

	// Quick test: build the index
	public static void main(String[] args) {
		// Path where the index will be written
		String indexDir = "./index/keyword";
		// Path of the dictionary file to be indexed
		File dataDir = new File("D:\\workspace\\iktest\\src\\ext.dic");
		Indexer indexer = null;
		try {
			indexer = new Indexer(indexDir);
			indexer.indexFile(dataDir);
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			try {
				indexer.close();
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
	}
}


2. Searching for Keywords

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Looks a token up in the keyword index by exact term match.
 */
public class SearchKeyword {

	private static final String indexDir = "./index/keyword";

	public boolean search(String keyword) throws Exception {
		// Open the directory that holds the index
		Directory dir = FSDirectory.open(new File(indexDir));
		// Open a reader on the index
		IndexReader reader = IndexReader.open(dir);
		// Build the index searcher
		IndexSearcher is = new IndexSearcher(reader);
		// Use an exact TermQuery on the un-analyzed "contents" field.
		// A QueryParser is deliberately not used here, because it would
		// analyze the input before searching:
		// Query query = parser.parse(keyword);
		TermQuery query = new TermQuery(new Term("contents", keyword));
		// First argument: the query; second argument: the maximum number of hits
		TopDocs hits = is.search(query, 10);
		boolean flag = false;
		if (hits.totalHits > 0) {
			flag = true;
		}
		reader.close();
		return flag;
	}
}


3. Main function

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import org.wltea.analyzer.lucene.IKAnalyzer;

public class Main {

	public static void main(String[] args) {
		String keyWord = "This is an entire keyword Demacia";
		// true = IK smart segmentation mode
		IKAnalyzer analyzer = new IKAnalyzer(true);
		try {
			printAnalysisResult(analyzer, keyWord);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	/**
	 * Prints the tokens produced by the given word breaker, keeping only
	 * those that exactly match a dictionary entry.
	 *
	 * @param analyzer the word breaker
	 * @param keyWord  the text to segment
	 * @throws Exception
	 */
	private static void printAnalysisResult(Analyzer analyzer, String keyWord) throws Exception {
		System.out.println("[" + keyWord + "] segmentation result:");
		TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(keyWord));
		tokenStream.addAttribute(CharTermAttribute.class);
		SearchKeyword sk = new SearchKeyword();
		while (tokenStream.incrementToken()) {
			CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
			if (sk.search(charTermAttribute.toString())) {
				System.out.println(charTermAttribute.toString());
			}
		}
	}
}


4. User-defined Keywords



5. Operation Result


As you can see, "This is an entire keyword" is indexed and matched exactly, while the "Demacia" keyword is not extracted: the dictionary stores it as "* demacia *", which does not satisfy an exact match, so it is not cut out.

