Full-Text Retrieval with Lucene: Index Configuration and Creation

Source: Internet
Author: User

Lucene

Lucene is an open-source full-text search toolkit. It is not a complete full-text search engine, but rather a full-text search architecture: it provides a complete query engine and index engine, plus some text-analysis engines (for two Western languages, English and German). Lucene aims to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.
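Lucene's index engine is built around an inverted index. To get a rough feel for the idea (a toy sketch, not Lucene's actual implementation), a minimal inverted index in plain Java maps each term to the sorted set of document ids that contain it:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted set of document ids containing that term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Index one document: split on whitespace and record each term
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the documents that contain the given term
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(1, "Lucene is a search library");
        idx.addDocument(2, "full text search with Lucene");
        System.out.println(idx.search("search")); // both documents contain "search"
    }
}
```

Real Lucene additionally stores term frequencies, positions, and norms per posting and writes the structure to segment files on disk, but the lookup principle is the same.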

Advantages

 

(1) The index file format is independent of the application platform. Lucene defines an 8-byte-based index file format, so applications on different systems or platforms can share the same index files.

(2) On top of the traditional inverted index, Lucene implements segmented indexing: it can create small index files for newly added documents to keep indexing fast, and later merge them with the existing index as an optimization.

(3) The well-designed object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new functionality.

(4) The text-analysis interface is independent of language and file format. The indexer builds the index from a stream of tokens, so supporting a new language or file format only requires implementing the text-analysis interface.

(5) A powerful query engine is provided out of the box. By default, Lucene supports Boolean queries, fuzzy queries [11], and grouped queries, with no extra code needed.

Concept

 

First, take a look at this figure; it has been around for a long time. My understanding of it: on the left, various kinds of data are collected (web data, text files, databases) and fed into the index that Lucene creates; on the right, the user runs searches against that index and gets results back.

Lucene Configuration

 

Setup is very simple: just add a few jar packages. I used the latest version, 6.6.0; the core package is lucene-core-6.6.0.jar, which you can download from the official website. Now for the first piece of code. Following the Lucene concept diagram above, we first need to create an index. For brevity, I simply declare the exceptions here instead of handling them properly.
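The article adds the jar by hand; if your project uses Maven, the core dependency can be declared instead. These are the official org.apache.lucene coordinates; the IK Analyzer jar used below is generally not published on Maven Central under official coordinates, so it is often installed manually or taken from a community fork:

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>6.6.0</version>
</dependency>
```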
public static void createIndex() throws Exception {
    // Create a file-system index directory under the project directory
    Directory dir = FSDirectory.open(FileSystems.getDefault()
            .getPath(System.getProperty("user.dir") + "/index"));
    // Analyzer is the abstract base class; IKAnalyzer does Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();
    // Create the IndexWriterConfig object
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    // Create the IndexWriter object
    IndexWriter iWriter = new IndexWriter(dir, config);
    // Clear any previous index
    iWriter.deleteAll();
    // Create the Document object
    Document doc = new Document();
    // Add a stored text field to the document
    doc.add(new Field("fieldname",
            "Stick to the end. When reprinting this blog post, please credit the source.",
            TextField.TYPE_STORED));
    // Add the document to the index
    iWriter.addDocument(doc);
    // Close the writer to commit the index files
    iWriter.close();
}

After running this, you can see that the index files have been created in your index directory.

The index has been created. Next, query the index: pass in the word to search for.

public static void search(String string) throws Exception {
    // Open the same index directory that createIndex() wrote to
    Directory dir = FSDirectory.open(FileSystems.getDefault()
            .getPath(System.getProperty("user.dir") + "/index"));
    // Open a reader on the index directory
    DirectoryReader dReader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(dReader);
    // First argument is the field name, second is the string to look up
    Term t = new Term("fieldname", string);
    // Wrap the term in a Query that Lucene can execute
    Query query = new TermQuery(t);
    // Search, returning at most 10 results
    TopDocs top = searcher.search(query, 10);
    // Number of hits
    System.out.println("hits: " + top.totalHits);
    // Walk the returned doc array and print the stored field of each hit
    ScoreDoc[] sDocs = top.scoreDocs;
    for (ScoreDoc scoreDoc : sDocs) {
        System.out.println(searcher.doc(scoreDoc.doc).get("fieldname"));
    }
}

With that, a working full-text search test is in place. Let's summarize and then extend it a bit.

One more piece of code helps with understanding how the analyzer segments text:

public static void main(String[] args) throws Exception {
    String chString = "Stick to the end. When reprinting, please credit the source.";
    Analyzer analyzer = new IKAnalyzer();
    // Get the token stream for the text
    TokenStream stream = analyzer.tokenStream("word", chString);
    stream.reset();
    // CharTermAttribute exposes the text of the current token
    CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
        System.out.println(cta.toString());
    }
    stream.close();
}

The output prints each segmented token on its own line.

You can also add the following configuration files. Make sure they are saved with the correct encoding.

First: ext.dic, the extension dictionary, for words that must be kept together during segmentation. For example, the segmenter might split the phrase "stick to the end" into "stick to" and "the end"; adding "stick to the end" to this file lets it be indexed as a single term.

Second: stopword.dic, the extension stop-word dictionary, for words you do not want to appear in the segmentation output. Put any word you don't want tokenized into this file, and it will not show up at retrieval time.

Third: the configuration file that registers the two extension dictionaries above.
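A typical IKAnalyzer.cfg.xml, placed at the root of the classpath, looks roughly like this (the ext_dict / ext_stopwords entry keys are the ones commonly used by IK Analyzer releases; verify them against the version you actually use):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary: words that must be kept together -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- extension stop-word dictionary: words to drop from the token stream -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```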

These are just the basics; there are many word-segmentation algorithms worth exploring further.
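To get a feel for what a dictionary-driven segmenter does with files like ext.dic, here is a toy forward-maximum-matching segmenter in plain Java (illustrative only; IK Analyzer's actual algorithm is considerably more sophisticated):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchSegmenter {
    // Greedy forward maximum matching against a user dictionary
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = null;
            // Try the longest candidate first, shrinking until the dictionary hits
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) {
                    match = cand;
                    break;
                }
            }
            // Fall back to a single character when nothing matches
            if (match == null) {
                match = text.substring(i, i + 1);
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("full", "text", "fulltext", "search"));
        System.out.println(segment("fulltextsearch", dict, 8)); // [fulltext, search]
    }
}
```

Greedy longest-match is exactly why adding a long phrase to the extension dictionary keeps the segmenter from splitting it into shorter pieces.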

 

[Version Declaration] This article is an original post by the blogger. Please credit the source when reprinting.
