Lucene is a subproject of the Apache Software Foundation's Jakarta project: an open-source full-text search toolkit. It is not a complete full-text search engine, but rather a full-text search engine architecture that provides a complete query engine and index engine, along with part of a text-analysis engine (for the two Western languages English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on that basis.
Lucene is implemented on top of an inverted index. For example, suppose two documents need to be indexed, "Lucene learning" and "Lucene Hadoop". The resulting index maps each term to the documents that contain it:

- Lucene → doc1, doc2
- learning → doc1
- Hadoop → doc2

The trade-off of an index is that building it can be time-consuming, but once it is established it always supports efficient queries.
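The inverted index described above can be sketched in plain Java as a map from each term to the list of document IDs that contain it. This is a minimal conceptual illustration, not Lucene's actual implementation:

```java
import java.util.*;

public class InvertedIndexSketch {
    // Build a term -> document-ID postings map from an array of documents
    public static Map<String, List<Integer>> build(String[] docs) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String term : docs[id].split("\\s+")) {
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
                build(new String[]{"Lucene learning", "Lucene Hadoop"});
        // "Lucene" occurs in both documents; "Hadoop" only in the second
        System.out.println(index.get("Lucene")); // [0, 1]
        System.out.println(index.get("Hadoop")); // [1]
    }
}
```

A query for a term is then just a map lookup, which is why the index, once built, answers queries efficiently.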
The core classes of Lucene are:
Directory: the storage location of the index. The two commonly used implementations are RAMDirectory and FSDirectory, which represent an in-memory index and a local file-based index, respectively.
Field: a field, consisting of a name and a value. Commonly used fields fall into three types:
- StoredField: a field that only stores data; the field's data is not indexed.
- StringField, LongField, and so on: fields whose data is indexed as a single term.
- TextField: a field whose data is tokenized and then indexed.
Document: the structure of a document stored in the index; a document consists of multiple fields.
Term: a lexical element, equivalent to a key in the index; each term points to a linked list of the documents that contain it.
IndexReader: the read stream of the index, used to fetch the required documents from the index.
IndexWriter: the output stream of the index, which writes documents into the index.
IndexSearcher: provides the search functionality over the index.
Query: the search criteria.
Analyzer: a tokenizer that breaks a field's data into multiple lexical elements.
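To illustrate what an Analyzer does conceptually, here is a stdlib-only sketch that lowercases a field value and splits it into terms. This is an assumption-level stand-in, not Lucene's StandardAnalyzer, which performs considerably more sophisticated tokenization:

```java
import java.util.*;

public class SimpleAnalyzerSketch {
    // Break a field value into lexical elements: lowercase, split on non-alphanumerics
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene is a full-text search toolkit"));
        // [lucene, is, a, full, text, search, toolkit]
    }
}
```

The terms produced this way are what end up as keys in the inverted index; this is also why a TextField can match a query on any of its words, while a StringField only matches on the whole value.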
Lucene's workflow:
Writing: convert the data source into a Document, then add the document to the index with IndexWriter's addDocument method.
Querying: use an IndexReader to obtain an IndexSearcher, create a Query object specifying the search criteria, then fetch the matching documents with IndexSearcher's search method.
Code examples:
Adding documents to the index:
```java
public static void init() throws IOException {
    // Index file directory
    Directory dir = FSDirectory.open(new File("E:/index"));
    // Index configuration (specify the Lucene version and the analyzer)
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_0, new StandardAnalyzer());
    // Open the output stream
    IndexWriter writer = new IndexWriter(dir, config);
    // Create a document
    Document doc = new Document();
    // "id" is indexed as a single term and stored in the document
    doc.add(new StringField("id", "123", Field.Store.YES));
    // "content" is tokenized by the analyzer and indexed, but not stored
    doc.add(new TextField("content",
            "Lucene is a subproject of the Apache Software Foundation's Jakarta project, "
            + "an open-source full-text search toolkit: not a complete full-text search engine, "
            + "but a full-text search engine architecture that provides a complete query engine "
            + "and index engine, along with part of a text-analysis engine.",
            Field.Store.NO));
    // "address" is tokenized by the analyzer, indexed, and stored
    doc.add(new TextField("address", "Cui Wei Road, Haidian District, Beijing", Field.Store.YES));
    // Add the document to the index
    writer.addDocument(doc);
    // Print the number of documents currently in the index
    System.out.println("Number of documents indexed: " + writer.numDocs());
    writer.close();
}
```
Traverse all the documents in the index:
```java
public static void ergodic() throws IOException {
    // Index file directory
    Directory dir = FSDirectory.open(new File("E:/index"));
    // Open the input stream
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    for (int i = 0; i < reader.maxDoc(); i++) {
        Document doc = searcher.doc(i);
        // "content" was indexed with Field.Store.NO, so only stored fields can be read back
        System.out.println("doc[" + i + "]: id: " + doc.get("id")
                + ", address: " + doc.get("address"));
    }
    reader.close();
}
```
Search for documents:
```java
public static void search() throws IOException, ParseException {
    // Index file directory
    Directory dir = FSDirectory.open(new File("E:/index"));
    // Open the input stream
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    // Search criteria: parse the query string against the "content" field
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Query query = parser.parse("index file");
    // Search, keeping the top 2 results
    TopDocs hits = searcher.search(query, 2);
    System.out.println("Total number of hits: " + hits.totalHits);
    // Read back each matching document
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.println("doc[" + scoreDoc.doc + "]: id: " + doc.get("id")
                + ", address: " + doc.get("address"));
    }
    reader.close();
}
```
Some details:
- For a given index, only one output stream (IndexWriter) can operate on it at a time.
- To avoid frequent I/O, you can first build the index in a RAMDirectory and then write it out to local files in a single pass.
- For Chinese word segmentation, the ANSJ and IK libraries are relatively easy to use; look them up for details.
- Lucene's support for deduplication is weak: DuplicateFilter deduplicates over the whole index before the query condition is applied, which can cause matching results to be lost.
- An index update first deletes the original document by its term and then adds the new document; performing such updates in bulk makes them efficient.
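The delete-then-add update pattern can be sketched with a plain in-memory store keyed by an id term. This is a conceptual illustration only; in Lucene the corresponding calls are IndexWriter's deleteDocuments(Term) followed by addDocument, or the combined updateDocument:

```java
import java.util.*;

public class UpdateSketch {
    // id term -> document content, standing in for the index
    private final Map<String, String> store = new LinkedHashMap<>();

    // Update = delete the old document by its id term, then add the replacement
    public void update(String idTerm, String newContent) {
        store.remove(idTerm);          // delete by term
        store.put(idTerm, newContent); // add the new document
    }

    public String get(String idTerm) {
        return store.get(idTerm);
    }

    public static void main(String[] args) {
        UpdateSketch idx = new UpdateSketch();
        idx.update("123", "Lucene learning");
        idx.update("123", "Lucene Hadoop"); // replaces the earlier document
        System.out.println(idx.get("123")); // Lucene Hadoop
    }
}
```

Because each update is really a delete plus an add, batching many updates before committing avoids repeatedly rewriting index files and is noticeably more efficient.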