[Java Web] Lucene, a full-text search class library for Java

Source: Internet
Author: User

Lucene began as a subproject of the Apache Software Foundation's Jakarta project group. It is an open source full-text search toolkit: not a complete full-text search engine, but a full-text search engine architecture that provides a complete query engine and index engine, plus part of a text analysis engine (for two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.


Lucene is implemented on top of an inverted index. For example, suppose 2 documents need to be indexed, "Lucene learning" and "Lucene Hadoop". The resulting index maps each term to the documents that contain it: "Lucene" points to both documents, while "learning" points only to the first and "Hadoop" only to the second.
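As a minimal sketch of the inverted index idea for these two documents, the plain-Java class below builds a map from each term to the ids of the documents containing it. This is not Lucene code: the class name InvertedIndexSketch, the build method, the 0-based document ids, and the lowercase-and-split-on-whitespace rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of an inverted index; not Lucene's implementation.
class InvertedIndexSketch {

    // Maps each lowercased term to the ids (0-based) of the documents containing it.
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase().trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // an empty document yields one empty token; skip it
                }
                List<Integer> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
                // Do not list the same document twice when a term repeats in it.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
                build(Arrays.asList("Lucene learning", "Lucene Hadoop"));
        index.forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

Running main prints one line per term in alphabetical order: hadoop -> [1], learning -> [0], lucene -> [0, 1]. Looking up a term is then a single map access, which is why queries stay cheap once the index exists.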

The advantage of an index is that, although building it can be time-consuming, once it is built it supports efficient queries from then on.


The core classes of Lucene are:

Directory: the location where the index is stored. The two commonly used implementations are RAMDirectory, which represents an index held in memory, and FSDirectory, which represents an index stored in local files.

Field: a field, consisting of a name and a value. Commonly used fields fall into three types:

    1. StoredField: a field that only stores data; the data in this field is not indexed.
    2. StringField, LongField, and so on: fields whose value is indexed as a whole, as a single term.
    3. TextField: a field whose data is split into terms by the word breaker and then indexed.

Document: the unit stored in the index; a document consists of multiple fields.

Term: a lexical element, equivalent to a key in the index; each term points to a list of the documents that contain it.

IndexReader: the read stream of the index, used to read the required documents out of the index.

IndexWriter: the output stream of the index, which writes documents to the index.

IndexSearcher: provides the search functionality over the index.

Query: the search criteria.

Analyzer: a word breaker that splits the data in a field into multiple terms.
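To make the three field types and the role of the Analyzer more concrete, here is a small plain-Java sketch of which terms each kind of field would contribute to the index. It is a toy model, not Lucene code: the Kind enum, the analyze and indexedTerms helpers, and the lowercase-and-split rule are hypothetical stand-ins, and a real Analyzer applies its own rules.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical toy model of field types; not Lucene's implementation.
class FieldKindsSketch {

    // STORED_ONLY ~ StoredField, WHOLE_VALUE ~ StringField, TOKENIZED ~ TextField
    enum Kind { STORED_ONLY, WHOLE_VALUE, TOKENIZED }

    // Stand-in for an Analyzer: lowercase the text and split on non-word characters.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Terms that a field of the given kind would add to the index for this value.
    static List<String> indexedTerms(Kind kind, String value) {
        switch (kind) {
            case STORED_ONLY:
                return Collections.emptyList();          // stored, never searchable
            case WHOLE_VALUE:
                return Collections.singletonList(value); // one term: the exact value
            default:
                return analyze(value);                   // one term per token
        }
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.println(kind + " -> " + indexedTerms(kind, "Cui Wei Road"));
        }
    }
}
```

Running main prints STORED_ONLY -> [], WHOLE_VALUE -> [Cui Wei Road], and TOKENIZED -> [cui, wei, road]. In this model a whole-value field matches only a query for the exact value, which suits identifiers, while a tokenized field matches a query for any single word.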


The process of Lucene:


Write: convert the data source into Document objects, then use the IndexWriter addDocument method to add each document to the index.

Query: open an IndexReader and use it to create an IndexSearcher, build a Query object that specifies the search criteria, then call the IndexSearcher search method to get the matching documents.
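The two paths can be sketched end to end in plain Java. The TinyIndex class below is a hypothetical toy, not Lucene: its addDocument and search methods only mirror the shape of the write and query steps, and ranking hits by the number of matched query terms is a simplification assumed here, not Lucene's scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index illustrating the write path and the query path.
class TinyIndex {
    private final List<String> stored = new ArrayList<>();               // docId -> text
    private final Map<String, Set<Integer>> postings = new HashMap<>();  // term -> docIds

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\W+");
    }

    // Write path: store the text and record every term under the new docId.
    int addDocument(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String term : tokenize(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Query path: up to topN docIds, best first (most query terms matched,
    // ties broken by lower docId).
    List<Integer> search(String queryText, int topN) {
        Map<Integer, Integer> matched = new HashMap<>();
        for (String term : new LinkedHashSet<>(Arrays.asList(tokenize(queryText)))) {
            for (int docId : postings.getOrDefault(term, Collections.emptySet())) {
                matched.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(matched.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(matched.get(b), matched.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topN, ranked.size())));
    }

    // Fetch the stored text for a hit.
    String doc(int docId) {
        return stored.get(docId);
    }
}
```

A caller would add "Lucene learning" and "Lucene Hadoop" with addDocument, then call search("lucene hadoop", 2) and pass each returned id to doc to read the stored text back.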


Code examples:

To add a document to the index:

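// Imports used by the examples in this article (Lucene 4.10)
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public static void init() throws IOException {
    // Index file directory
    Directory dir = FSDirectory.open(new File("E:/index"));
    // Index configuration (specifies the Lucene version and the word breaker)
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_0, new StandardAnalyzer());
    // Get the output stream; try-with-resources closes it even if indexing fails
    try (IndexWriter writer = new IndexWriter(dir, config)) {
        // Create the document
        Document doc = new Document();
        // id is indexed as a whole, as a single term, and is stored in the document
        doc.add(new StringField("id", "123", Field.Store.YES));
        // content is split into terms by the word breaker and indexed, but not stored in the document
        doc.add(new TextField("content",
                "Lucene is a subproject of the Apache Software Foundation's Jakarta "
                + "project group. It is an open source full-text search toolkit, that is, not a complete full-text search engine "
                + "but a full-text search engine architecture that provides a complete query engine and index engine "
                + "and part of a text analysis engine (for English and German, two Western languages). Lucene aims to give software "
                + "developers a simple, easy-to-use toolkit for adding "
                + "full-text retrieval to a target system, or for building a complete full-text search engine on this basis.",
                Field.Store.NO));
        // address is split into terms by the word breaker, indexed, and stored in the document
        doc.add(new TextField("address", "Cui Wei Road, Haidian District, Beijing", Field.Store.YES));
        // Add the document to the index
        writer.addDocument(doc);
        // Print the number of documents currently in the index
        System.out.println("Number of documents indexed: " + writer.numDocs());
    }
}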

Traverse all the documents in the index:

public static void ergodic() throws IOException {
    // Index file directory
    Directory dir = FSDirectory.open(new File("E:/index"));
    // Get the input stream; try-with-resources closes the reader when done
    try (IndexReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        for (int i = 0; i < reader.maxDoc(); i++) {
            Document doc = searcher.doc(i);
            // doc.get returns null for a field that was not stored (Field.Store.NO)
            System.out.println("Doc [" + i + "]: ID: " + doc.get("id") + ", desc: " + doc.get("content"));
        }
    }
}


Search for documents:

public static void search() throws IOException, ParseException {
    // Index file directory (the same directory the index was written to)
    Directory dir = FSDirectory.open(new File("E:/index"));
    // Input stream; try-with-resources closes the reader when done
    try (IndexReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // Search criteria: parse the query text against the indexed "content" field
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("index file");
        // Search results: at most the 2 best hits
        TopDocs hits = searcher.search(query, 2);
        System.out.println("Total number of hits: " + hits.totalHits);
        // Get the documents for the search results
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            // doc.get returns null for a field that was not stored (Field.Store.NO)
            System.out.println("Doc [" + scoreDoc.doc + "]: ID: " + doc.get("id") + ", desc: " + doc.get("content"));
        }
    }
}

Some details:

    1. An index can have only one output stream (IndexWriter) operating on it at a time.
    2. To avoid frequent I/O operations, you can first build the index in a RAMDirectory and then write it to local files in one go.
    3. For Chinese word segmentation, the ANSJ and IK libraries are relatively easy to use; search online for details.
    4. Lucene does not support deduplication well. DuplicateFilter deduplicates over the whole index first and only then applies the query conditions, which can cause matching documents to be lost from the results.
    5. An update to the index first deletes the original documents that match a given term and then adds the new document; updates are more efficient when done in bulk.
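The delete-then-add update in point 5 can be modelled with a toy index in plain Java. The UpdateSketch class and its method names are hypothetical and only imitate the behaviour described above (Lucene's IndexWriter offers an updateDocument method that takes a Term and the new document). It indexes each field value as a whole, the way the id field is indexed in the example above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of "update = delete by term, then add"; not Lucene code.
class UpdateSketch {
    private final Map<Integer, Map<String, String>> liveDocs = new LinkedHashMap<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // "field:value" -> docIds
    private int nextDocId = 0;

    private static String key(String field, String value) {
        return field + ":" + value;
    }

    // Adds a document under a fresh docId and indexes each field value as one term.
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        liveDocs.put(docId, new HashMap<>(fields));
        fields.forEach((f, v) ->
                postings.computeIfAbsent(key(f, v), k -> new HashSet<>()).add(docId));
        return docId;
    }

    // Deletes every document containing the term field:value; returns how many.
    int deleteDocuments(String field, String value) {
        Set<Integer> matches = postings.getOrDefault(key(field, value), new HashSet<>());
        int deleted = 0;
        for (int docId : new HashSet<>(matches)) { // copy: postings change while we loop
            Map<String, String> fields = liveDocs.remove(docId);
            if (fields == null) {
                continue;
            }
            fields.forEach((f, v) -> postings.get(key(f, v)).remove(docId));
            deleted++;
        }
        return deleted;
    }

    // Update: remove the old version(s) found via the term, then add the new one.
    int updateDocument(String field, String value, Map<String, String> fields) {
        deleteDocuments(field, value);
        return addDocument(fields);
    }

    int numDocs() {
        return liveDocs.size();
    }

    Map<String, String> doc(int docId) {
        return liveDocs.get(docId);
    }
}
```

In this model the updated document receives a new document id and the old id no longer resolves, so callers should locate documents through a stable field such as id rather than through the internal document number.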

