Full-Text Retrieval with Lucene: Index Configuration and Creation

Source: Internet
Author: User

Lucene

Lucene is an open-source full-text search toolkit. It is not a complete full-text search engine, but rather a full-text search architecture: it provides a complete query engine and index engine, plus some text-analysis engines (for two Western languages, English and German). Lucene aims to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.
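Lucene's index engine is built around an inverted index. To get a rough feel for the idea (a toy sketch, not Lucene's actual implementation), a minimal inverted index in plain Java maps each term to the sorted set of document ids that contain it:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted set of document ids containing that term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Index one document: split on whitespace and record each term
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the documents that contain the given term
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(1, "Lucene is a search library");
        idx.addDocument(2, "full text search with Lucene");
        System.out.println(idx.search("search")); // both documents contain "search"
    }
}
```

Real Lucene additionally stores term frequencies, positions, and norms per posting and writes the structure to segment files on disk, but the lookup principle is the same.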

Advantages

 

(1) The index file format is independent of the application platform. Lucene defines an 8-byte-based index file format, so applications on different systems or platforms can share the same index files.

(2) On top of the traditional inverted index, Lucene implements segmented indexing: it can create small index files for newly added documents to keep indexing fast, and later merge them with the existing index as an optimization.

(3) The well-designed object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new functionality.

(4) The text-analysis interface is independent of language and file format. The indexer builds the index from a stream of tokens, so supporting a new language or file format only requires implementing the text-analysis interface.

(5) A powerful query engine is provided out of the box. By default, Lucene supports Boolean queries, fuzzy queries [11], and grouped queries, with no extra code needed.

Concept

 

First, take a look at this figure; it has been around for a long time. My understanding of it: on the left, various kinds of data are collected (web data, text files, databases) and fed into the index that Lucene creates; on the right, the user runs searches against that index and gets results back.

Lucene Configuration

 

Setup is very simple: just add a few jar packages. I used the latest version, 6.6.0; the core package is lucene-core-6.6.0.jar, which you can download from the official website. Now for the first piece of code. Following the Lucene concept diagram above, we first need to create an index. For brevity, I simply declare the exceptions here instead of handling them properly.
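The article adds the jar by hand; if your project uses Maven, the core dependency can be declared instead. These are the official org.apache.lucene coordinates; the IK Analyzer jar used below is generally not published on Maven Central under official coordinates, so it is often installed manually or taken from a community fork:

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>6.6.0</version>
</dependency>
```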
public static void createIndex() throws Exception {
    // Create a file-system index directory under the project directory
    Directory dir = FSDirectory.open(FileSystems.getDefault()
            .getPath(System.getProperty("user.dir") + "/index"));
    // Analyzer is the abstract base class; IKAnalyzer does Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();
    // Create the IndexWriterConfig object
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    // Create the IndexWriter object
    IndexWriter iWriter = new IndexWriter(dir, config);
    // Clear any previous index
    iWriter.deleteAll();
    // Create the Document object
    Document doc = new Document();
    // Add a stored text field to the document
    doc.add(new Field("fieldname",
            "Stick to the end. When reprinting this blog post, please credit the source.",
            TextField.TYPE_STORED));
    // Add the document to the index
    iWriter.addDocument(doc);
    // Close the writer to commit the index files
    iWriter.close();
}

After running this, you can see that the index files have been created in your index directory.

The index has been created. Next, query the index: pass in the word to search for.

public static void search(String string) throws Exception {
    // Open the same index directory that createIndex() wrote to
    Directory dir = FSDirectory.open(FileSystems.getDefault()
            .getPath(System.getProperty("user.dir") + "/index"));
    // Open a reader on the index directory
    DirectoryReader dReader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(dReader);
    // First argument is the field name, second is the string to look up
    Term t = new Term("fieldname", string);
    // Wrap the term in a Query that Lucene can execute
    Query query = new TermQuery(t);
    // Search, returning at most 10 results
    TopDocs top = searcher.search(query, 10);
    // Number of hits
    System.out.println("hits: " + top.totalHits);
    // Walk the returned doc array and print the stored field of each hit
    ScoreDoc[] sDocs = top.scoreDocs;
    for (ScoreDoc scoreDoc : sDocs) {
        System.out.println(searcher.doc(scoreDoc.doc).get("fieldname"));
    }
}

With that, a working full-text search test is in place. Let's summarize and then extend it a bit.

One more piece of code helps with understanding how the analyzer segments text:

public static void main(String[] args) throws Exception {
    String chString = "Stick to the end. When reprinting, please credit the source.";
    Analyzer analyzer = new IKAnalyzer();
    // Get the token stream for the text
    TokenStream stream = analyzer.tokenStream("word", chString);
    stream.reset();
    // CharTermAttribute exposes the text of the current token
    CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
        System.out.println(cta.toString());
    }
    stream.close();
}

The output prints each segmented token on its own line.

You can also add the following configuration files. Make sure they are saved with the correct encoding.

First: ext.dic, the extension dictionary, for words that must be kept together during segmentation. For example, the segmenter might split the phrase "stick to the end" into "stick to" and "the end"; adding "stick to the end" to this file lets it be indexed as a single term.

Second: stopword.dic, the extension stop-word dictionary, for words you do not want to appear in the segmentation output. Put any word you don't want tokenized into this file, and it will not show up at retrieval time.

Third: the configuration file that registers the two extension dictionaries above.
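A typical IKAnalyzer.cfg.xml, placed at the root of the classpath, looks roughly like this (the ext_dict / ext_stopwords entry keys are the ones commonly used by IK Analyzer releases; verify them against the version you actually use):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary: words that must be kept together -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- extension stop-word dictionary: words to drop from the token stream -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```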

These are just the basics; there are many word-segmentation algorithms worth exploring further.
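To get a feel for what a dictionary-driven segmenter does with files like ext.dic, here is a toy forward-maximum-matching segmenter in plain Java (illustrative only; IK Analyzer's actual algorithm is considerably more sophisticated):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchSegmenter {
    // Greedy forward maximum matching against a user dictionary
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = null;
            // Try the longest candidate first, shrinking until the dictionary hits
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) {
                    match = cand;
                    break;
                }
            }
            // Fall back to a single character when nothing matches
            if (match == null) {
                match = text.substring(i, i + 1);
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("full", "text", "fulltext", "search"));
        System.out.println(segment("fulltextsearch", dict, 8)); // [fulltext, search]
    }
}
```

Greedy longest-match is exactly why adding a long phrase to the extension dictionary keeps the segmenter from splitting it into shorter pieces.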

 

[Version Declaration] This article is an original post by the blogger. Please credit the source when reprinting.
