Preface: How does Lucene relate to information retrieval, and what makes it distinctive? This section covers the basics of Lucene.
What is Lucene?
Lucene is a free, open-source full-text search toolkit written in pure Java and maintained by the Apache Software Foundation. It started as a subproject of the Apache Jakarta Project and later became a top-level Apache project. Doug Cutting, its original author, is a veteran full-text search expert who has led search engine research and development work.
The benefits of Lucene
Lucene's popularity is inseparable from the following advantages:
Cross-platform index file format: Lucene defines an index file format based on 8-bit bytes, so applications on different platforms can share the same index files.
Block (segment) indexing on top of the inverted index: new documents are first written to small index files, which speeds up indexing, and these are later merged with the existing index to optimize it.
Object-oriented system architecture: Lucene is written in Java, a language with a large user base, so the barrier to entry is relatively low.
Language-independent analysis interface: to support a new language, you only need to implement your own text analyzer, which makes extending Lucene to new languages straightforward.
Default query engine implementations: Lucene ships with Boolean queries, fuzzy queries, grouping queries, and more, so users can use these query features with little effort.
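The block indexing idea above can be sketched in a few lines of plain Java. This is only an illustration and not Lucene's actual merge code: the class name SegmentMergeSketch, the map-based "segment" representation, and the assumption that document ids are already unique across segments are all made up for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: a "segment" here is just a map from term to the doc ids containing it. */
class SegmentMergeSketch {

    /**
     * Merge two small inverted indexes into one new index.
     * Assumption: doc ids are globally unique, so posting lists can simply be unioned.
     * The input segments are left unmodified.
     */
    public static Map<String, SortedSet<Integer>> merge(
            Map<String, SortedSet<Integer>> first,
            Map<String, SortedSet<Integer>> second) {
        Map<String, SortedSet<Integer>> merged = new TreeMap<>();
        for (Map<String, SortedSet<Integer>> segment : Arrays.asList(first, second)) {
            for (Map.Entry<String, SortedSet<Integer>> entry : segment.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Existing index: "tom" appears in documents 0 and 1.
        Map<String, SortedSet<Integer>> existing = new HashMap<>();
        existing.put("tom", new TreeSet<>(Arrays.asList(0, 1)));

        // Small index built quickly for a newly added document (id 2).
        Map<String, SortedSet<Integer>> fresh = new HashMap<>();
        fresh.put("tom", new TreeSet<>(Arrays.asList(2)));
        fresh.put("jerry", new TreeSet<>(Arrays.asList(2)));

        System.out.println(merge(existing, fresh)); // {jerry=[2], tom=[0, 1, 2]}
    }
}
```

The point of the design is that writing the small index is cheap, while the more expensive merge can be deferred and batched.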
Related products of Lucene
Compass
Compass is an enhancement of the Lucene search engine for enterprise applications. It integrates with popular frameworks such as Hibernate and Spring, making it easier to use a search engine in Java programs.
Nutch
Nutch is a web search implementation built on the Lucene core. It adds a web crawler and the related interface components to Lucene, supplying the pieces needed for a complete, working search engine.
Solr
Solr is a high-performance, Lucene-based full-text search server. Queries are issued through HTTP requests, so it is relatively easy to get started with; its retrieval performance is excellent, and it is widely used in enterprises.
Elasticsearch
Elasticsearch is a Lucene-based search server. Its distributed design makes it more stable and reliable, and it is currently a popular enterprise-grade search engine.
Lucene System Structure
The Lucene system consists of three parts: basic encapsulation classes, core search classes, and external interfaces.
At the time of writing, the latest Lucene release line is 7.x, but most companies are still using lower versions. Taking the Lucene 6.0.0 version we use as an example, Lucene has the following packages:
Package Name | Description
org.apache.lucene.analysis | Language analyzers, mainly used for tokenization (word segmentation)
org.apache.lucene.codecs | Abstractions for encoding and decoding the inverted index structure
org.apache.lucene.document | Document structures used when storing content in the index
org.apache.lucene.index | Responsible for creating and deleting indexes, among other index operations
org.apache.lucene.search | Processes users' query requests
org.apache.lucene.store | Provides support for index storage
org.apache.lucene.util | Contains a number of useful data structures and utility classes
Lucene main logic diagram
Lucene provides powerful full-text search, which can be divided into two parts:
Text indexing: an analyzer (tokenizer) splits the text resources into terms, which are then stored in the index.
Conditional query: based on the user's query request, Lucene searches the index for the information that matches the conditions.
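The two parts can be illustrated with a toy inverted index in plain Java. This is a minimal sketch, not Lucene's API: the class ToyInvertedIndex, its methods, and the simple lowercase-and-split tokenizer are hypothetical stand-ins for what Lucene's analysis, index, and search packages do in a far more sophisticated way.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the sorted set of ids of documents containing it. */
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Stand-in for an analyzer: lowercase the text and split on anything that is not a letter or digit. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Text indexing: tokenize the text and record the new doc id under every term. */
    public int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Conditional query: return a copy of the ids of all documents containing the term. */
    public SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<>() : new TreeSet<>(hits);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Tom chases Jerry");   // doc 0
        index.addDocument("Jerry eats cheese");  // doc 1
        System.out.println(index.search("jerry")); // [0, 1]
        System.out.println(index.search("tom"));   // [0]
    }
}
```

Because the lookup goes from term to documents rather than scanning each document, query cost depends on the number of matching terms, not on the total amount of text.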
Lucene's query logic
AND
Both query conditions must be satisfied for a document to match. For example, Tom AND Jerry means that only documents containing both Tom and Jerry are returned.
OR
A document matches if it satisfies at least one of the two query conditions. For example, Tom OR Jerry means that the document contains Tom, or Jerry, or both.
NOT
Excludes a query condition. For example, Tom NOT Jerry means that the document must contain Tom and must not contain Jerry; documents containing Jerry are not returned.
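One way to picture these three operators is as set operations on the lists of document ids that contain each term. The sketch below uses plain Java collections; the class BooleanSetLogic and the sample document ids are hypothetical, and this is not how Lucene's query classes are actually implemented.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Boolean query logic expressed as set operations on document id sets. */
class BooleanSetLogic {

    /** AND: documents that appear in both sets (intersection). */
    public static SortedSet<Integer> and(Set<Integer> left, Set<Integer> right) {
        SortedSet<Integer> result = new TreeSet<>(left);
        result.retainAll(right);
        return result;
    }

    /** OR: documents that appear in either set (union). */
    public static SortedSet<Integer> or(Set<Integer> left, Set<Integer> right) {
        SortedSet<Integer> result = new TreeSet<>(left);
        result.addAll(right);
        return result;
    }

    /** NOT: documents in the left set that are absent from the right set (difference). */
    public static SortedSet<Integer> not(Set<Integer> left, Set<Integer> right) {
        SortedSet<Integer> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: ids of the documents that contain each term.
        Set<Integer> tom = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> jerry = new TreeSet<>(Arrays.asList(2, 3, 4));

        System.out.println(and(tom, jerry)); // Tom AND Jerry -> [2, 3]
        System.out.println(or(tom, jerry));  // Tom OR Jerry  -> [1, 2, 3, 4]
        System.out.println(not(tom, jerry)); // Tom NOT Jerry -> [1]
    }
}
```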
Wildcard characters ? and *
Lucene supports simple wildcard queries: ? matches exactly one character, and * matches zero or more characters.
For example, to? matches 'too', 'tom', and so on, while to* matches 'to', 'tomas', and so on.
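The matching rules for ? and * can be checked with a small sketch that translates a wildcard pattern into a regular expression. The class WildcardSketch and its methods are hypothetical helpers built on java.util.regex for illustration; Lucene's own wildcard queries are implemented differently.

```java
import java.util.regex.Pattern;

/** Illustrative wildcard matcher: '?' is one character, '*' is zero or more characters. */
class WildcardSketch {

    /** Translate a wildcard pattern into an equivalent regular expression. */
    public static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');        // exactly one character
            } else if (c == '*') {
                regex.append(".*");       // zero or more characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the whole term matches the wildcard pattern. */
    public static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("to?", "tom"));   // true
        System.out.println(matches("to?", "to"));    // false: '?' needs one character
        System.out.println(matches("to*", "to"));    // true: '*' may match nothing
        System.out.println(matches("to*", "tomas")); // true
    }
}
```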
Lucene's query logic looks simple and is easy to understand. Only part of it is listed here; in real applications, several kinds of query logic are usually combined. Lucene's default analysis is geared toward English-style text, so the examples here are in English; a Chinese tokenizer is used later to enable Lucene to support Chinese.