Lucene + tokenizer: exact extraction of user-defined keywords (Lucene 3.6)


This post uses Lucene 3.6.0. If you are using Lucene 5.x, refer to the author's separate post for that version.

During word segmentation it is sometimes necessary to extract only the custom keywords listed in a dictionary, but the usual tokenizer (IKAnalyzer) does not seem to support this.

A solution suggested on the CSDN forum is to index the dictionary and look each token up with Lucene's TermQuery.

But when I called TermQuery, nothing was retrieved.

It eventually turned out that the problem was in how the index was built.

At first I read the dictionary file as a stream, like this: doc.add(new Field("contents", new FileReader(files)));

After that, TermQuery returned nothing. The likely explanation is that a field indexed this way is analyzed, so its text is split into separate terms. For example, if the dictionary contains "你好" ("hello"), it is indexed as the two terms "你" ("you") and "好" ("good"), so a TermQuery for "你好" finds no match.
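To make the mismatch concrete, here is a minimal sketch that imitates it with plain Java collections instead of Lucene. The class TermLookupSketch and its methods are hypothetical names invented for this illustration, and the one-term-per-character split is only an assumption standing in for whatever the real analyzer does to the text. The dictionary entry is written with the Unicode escapes for the two-character word "hello".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of exact term lookup. It does not use Lucene. */
final class TermLookupSketch {

    /** Hypothetical stand-in for an analyzer: emits one term per character. */
    static List<String> splitIntoCharacters(String text) {
        List<String> terms = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            terms.add(String.valueOf(text.charAt(i)));
        }
        return terms;
    }

    /** Term set of an analyzed field: every entry is split before it is stored. */
    static Set<String> analyzedTerms(List<String> entries) {
        Set<String> terms = new HashSet<String>();
        for (String entry : entries) {
            terms.addAll(splitIntoCharacters(entry));
        }
        return terms;
    }

    /** Term set of a non-analyzed field: every entry is stored as one term. */
    static Set<String> wholeTerms(List<String> entries) {
        return new HashSet<String>(entries);
    }

    public static void main(String[] args) {
        String entry = "\u4f60\u597d"; // "hello", two characters
        List<String> dictionary = Arrays.asList(entry);

        Set<String> analyzed = analyzedTerms(dictionary);
        Set<String> whole = wholeTerms(dictionary);

        // An exact lookup of the full entry misses the split terms...
        System.out.println(analyzed.contains(entry)); // false
        // ...but hits when the entry was kept as a single term.
        System.out.println(whole.contains(entry));    // true
    }
}
```

An exact lookup only succeeds when the index holds the entry as one whole term, which is why the way the dictionary is indexed matters.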

The solution is to read the dictionary text with a BufferedReader, store the lines in a List<String>, and then loop over the list, calling doc.add(new Field("contents", content, Field.Store.YES, Field.Index.NOT_ANALYZED)); for each value.

Note that the field index option (Field.Index.*) used here is Field.Index.NOT_ANALYZED. With this option the field value is indexed as a single term, without being analyzed, and can still be searched. It suits values that must not be split, such as URLs, file paths, dates, names, and phone numbers.
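The dictionary-reading step can be sketched on its own, without Lucene. This is only an illustration under stated assumptions: the class DictionaryLinesSketch and the method readEntries are hypothetical names, the trimming and blank-line skipping are additions not present in the original code, and the try-with-resources syntax assumes Java 7 or later. Trimming is included because a value indexed as a single term must match exactly, so a stray trailing space in a dictionary line would become part of the term.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Reads dictionary entries line by line. It does not use Lucene. */
final class DictionaryLinesSketch {

    /** Returns one entry per non-blank line, with edge whitespace removed. */
    static List<String> readEntries(Reader source) {
        List<String> entries = new ArrayList<String>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String entry = line.trim();
                if (entry.length() > 0) {
                    entries.add(entry);
                }
            }
        } catch (IOException e) {
            throw new IllegalStateException("could not read dictionary", e);
        }
        return entries;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the dictionary file in this sketch.
        Reader sample = new StringReader("alpha\n  beta  \n\ngamma\n");
        System.out.println(readEntries(sample)); // prints [alpha, beta, gamma]
    }
}
```

For a real dictionary file, the Reader would typically be an InputStreamReader over a FileInputStream with an explicit charset such as UTF-8, as in the indexing code of this post.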

With that change, the problem was solved.

Finally, here is the code for building the index and for searching it.

1. Building an Index

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Class introduction: (1) writes documents into the index; (2) reads
 * documents according to the index; (3) uses the path to find the
 * indexed documents and returns the result.
 */
public class Indexer {
    // Writes the index to the specified directory
    private IndexWriter writer;

    /**
     * Constructor: instantiates the IndexWriter.
     */
    public Indexer(String indexDir) throws Exception {
        // Open the directory where the index is stored
        Directory dir = FSDirectory.open(new File(indexDir));
        // Instantiate the analyzer
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        // Instantiate the IndexWriterConfig
        IndexWriterConfig con = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        // Instantiate the IndexWriter
        writer = new IndexWriter(dir, con);
    }

    /**
     * Closes the index writer.
     *
     * @throws Exception
     */
    public void close() throws Exception {
        writer.close();
    }

    /**
     * Indexes all files in the specified directory.
     *
     * @throws Exception
     */
    public int index(String dataDir) throws Exception {
        // The files to be indexed
        File[] file = new File(dataDir).listFiles();
        for (File files : file) {
            // Index each file in turn
            indexFile(files);
        }
        // Return the number of documents in the index
        return writer.numDocs();
    }

    /**
     * Indexes the specified file.
     *
     * @throws Exception
     */
    private void indexFile(File files) throws Exception {
        System.out.println("Index file: " + files.getCanonicalPath());
        // Build a document that holds every line of the file
        Document document = getDocument(files);
        // Write the document into the index
        writer.addDocument(document);
    }

    /**
     * Builds the document. A document is comparable to a row in a
     * database; here it gets one "contents" field per dictionary line.
     *
     * @throws Exception
     */
    private Document getDocument(File files) throws Exception {
        Document doc = new Document();
        List<String> contents = this.getContent(files);
        for (String content : contents) {
            // NOT_ANALYZED: index each line as a single, unsplit term
            doc.add(new Field("contents", content, Field.Store.YES, Field.Index.NOT_ANALYZED));
        }
        return doc;
    }

    /**
     * Reads the file line by line as UTF-8 text.
     */
    private List<String> getContent(File files) {
        List<String> strList = new ArrayList<String>();
        try {
            InputStream stream = new FileInputStream(files);
            String code = "UTF-8";
            BufferedReader br = new BufferedReader(new InputStreamReader(stream, code));
            String str = br.readLine();
            while (str != null) {
                strList.add(str);
                str = br.readLine();
            }
            br.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return strList;
    }

    // Test: write the index
    public static void main(String[] args) {
        // Path where the index is stored
        String indexDir = "./index/keyword";
        // Path of the dictionary file to be indexed
        File dataDir = new File("D:\\workspace\\iktest\\src\\ext.dic");
        Indexer indexer = null;
        try {
            // Create the indexer for the given index path
            indexer = new Indexer(indexDir);
            // Index the dictionary file
            indexer.indexFile(dataDir);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                // The constructor may have failed, leaving indexer null
                if (indexer != null) {
                    indexer.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

2. Searching for Keywords

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Looks up a keyword in the indexed field.
 */
public class SearchKeyword {
    private static final String indexDir = "./index/keyword";

    public boolean search(String keyword) throws Exception {
        // Open the directory that holds the index
        Directory dir = FSDirectory.open(new File(indexDir));
        // Read the index files under that path
        IndexReader reader = IndexReader.open(dir);
        // Build the index searcher
        IndexSearcher is = new IndexSearcher(reader);
        // The analyzer and query parser are only needed for the
        // commented-out QueryParser alternative.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        // First parameter after the version: the field to query;
        // second parameter: the analyzer
        QueryParser parser = new QueryParser(Version.LUCENE_35, "contents", analyzer);
        // Query query = parser.parse(keyword);
        // Exact, un-analyzed lookup of the keyword
        TermQuery query = new TermQuery(new Term("contents", keyword));
        // First parameter: the query; second parameter: the maximum
        // number of hits to return
        TopDocs hits = is.search(query, 10);
        boolean flag = false;
        if (hits.totalHits > 0) {
            flag = true;
        }
        reader.close();
        return flag;
    }
}

3. Main function

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class Main {
    public static void main(String[] args) {
        String keyWord = "This is an entire keyword Demacia";
        IKAnalyzer analyzer = new IKAnalyzer(true);
        try {
            printAnalysisResult(analyzer, keyWord);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Prints the tokens produced by the given analyzer that also
     * appear in the keyword index.
     *
     * @param analyzer the tokenizer to use
     * @param keyWord  the text to segment
     * @throws Exception
     */
    private static void printAnalysisResult(Analyzer analyzer, String keyWord) throws Exception {
        System.out.println("[" + keyWord + "] segmentation result:");
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(keyWord));
        tokenStream.addAttribute(CharTermAttribute.class);
        SearchKeyword sk = new SearchKeyword();
        while (tokenStream.incrementToken()) {
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            // Print only the tokens that match a dictionary entry exactly
            if (sk.search(charTermAttribute.toString())) {
                System.out.println(charTermAttribute.toString());
            }
        }
    }
}


4. User-Defined Keywords

5. Results

As you can see, "This is an entire keyword" is matched exactly and extracted as a whole.

The keyword "Demacia", however, is not extracted: the dictionary stores it as "* demacia *", which does not satisfy the exact match.
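The exact-match behavior in this result can be reproduced in a small sketch that swaps the Lucene index for an in-memory HashSet. This is a different technique from the TermQuery lookup used in the post, shown only because it has the same all-or-nothing matching. The class ExactKeywordFilterSketch, the method keepDictionaryTokens, and the sample tokens and dictionary entries are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Filters tokens by exact dictionary membership. It does not use Lucene. */
final class ExactKeywordFilterSketch {

    /** Keeps, in order, only the tokens that equal a dictionary entry exactly. */
    static List<String> keepDictionaryTokens(List<String> tokens, Set<String> dictionary) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (dictionary.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: the second entry carries literal asterisks.
        Set<String> dictionary =
                new HashSet<String>(Arrays.asList("whole keyword", "*demacia*"));
        // Hypothetical tokenizer output.
        List<String> tokens = Arrays.asList("whole keyword", "demacia");

        // "demacia" is dropped because it differs from the stored "*demacia*".
        System.out.println(keepDictionaryTokens(tokens, dictionary)); // prints [whole keyword]
    }
}
```

Any extra character in a dictionary entry, such as a surrounding asterisk or a space, is enough to make an exact lookup fail, so the dictionary should store keywords exactly as the tokenizer emits them.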
