Learning Lucene in Depth (2): About Lucene

Last Update:2018-07-26 Source: Internet

Author: User

Preface: What is the relationship between Lucene and information retrieval, and what is unique about it? This section highlights the basics of Lucene.

What is Lucene?

Lucene is an open-source, free full-text search toolkit written in pure Java. It began as a subproject of the Apache Software Foundation's Jakarta project and is now a top-level Apache project. Doug Cutting, its original author, is a veteran full-text search expert who has led the research and development of search engines.

The benefits of Lucene

Lucene owes its popularity to several advantages:

Cross-platform index file format: Lucene defines an index file format based on 8-bit bytes, so applications on different platforms can share the same index files.
Block index on top of the inverted index: new documents are first written to small index files, which speeds up indexing; these are later merged with the existing index to optimize it.
Object-oriented system architecture: Lucene is written in Java, a language with a large user base, so the barrier to entry is relatively low.
Language-independent analysis interface: for any natural language, you only need to implement your own text analyzer, which makes it easy to add support for new languages.
Default query engine implementation: Lucene ships with Boolean queries, fuzzy queries, grouping queries, and more, so users can use these query features right away.
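The block-index idea can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions, not Lucene's actual segment format or merge policy: each "segment" is a plain map from term to document ids, and the names `SegmentMergeSketch`, `docs`, and `merge` are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: a "segment" here is just term -> sorted document ids.
class SegmentMergeSketch {
    // Convenience helper for building a sorted set of document ids.
    static SortedSet<Integer> docs(Integer... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    // Copy every posting of one segment into the target map.
    private static void addAll(Map<String, SortedSet<Integer>> target,
                               Map<String, SortedSet<Integer>> segment) {
        for (Map.Entry<String, SortedSet<Integer>> entry : segment.entrySet()) {
            target.computeIfAbsent(entry.getKey(), term -> new TreeSet<>())
                  .addAll(entry.getValue());
        }
    }

    // Merge two segments into a new one; the inputs are left untouched.
    static Map<String, SortedSet<Integer>> merge(Map<String, SortedSet<Integer>> existing,
                                                 Map<String, SortedSet<Integer>> fresh) {
        Map<String, SortedSet<Integer>> merged = new TreeMap<>();
        addAll(merged, existing);
        addAll(merged, fresh);
        return merged;
    }

    public static void main(String[] args) {
        // Existing index covering documents 0 and 1.
        Map<String, SortedSet<Integer>> existing = new TreeMap<>();
        existing.put("tom", docs(0, 1));
        existing.put("jerry", docs(1));

        // Small index built quickly for the newly added document 2.
        Map<String, SortedSet<Integer>> fresh = new TreeMap<>();
        fresh.put("jerry", docs(2));
        fresh.put("spike", docs(2));

        System.out.println(merge(existing, fresh));
        // prints {jerry=[1, 2], spike=[2], tom=[0, 1]}
    }
}
```

The point of the design is that adding a document only requires writing a small new map; the more expensive merge can happen later.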

Related products of Lucene

Compass

Compass is an enhancement of the Lucene search engine for enterprise applications. It integrates with popular frameworks such as Hibernate and Spring, making it easier to use a search engine in Java programs.

Nutch

Nutch is a web search implementation built on the Lucene core. It adds a web crawler and related interface components to Lucene, forming a complete, working search engine.

Solr

Solr is a high-performance full-text search server based on Lucene. Queries are sent as HTTP requests, it is easy to get started with, and its retrieval performance is excellent, so it is widely used in enterprises.
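As a rough illustration of querying over HTTP, the sketch below only builds the URL for a request to Solr's `select` handler; it does not send it. The host `localhost:8983`, the core name `articles`, the field names `title` and `body`, and the helper `SolrQueryUrl.selectUrl` are all assumptions made for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: assemble a Solr select URL; core and field names are hypothetical.
class SolrQueryUrl {
    static String selectUrl(String baseUrl, String core, String query, int rows) {
        try {
            // q carries the query string, rows limits the number of results.
            return baseUrl + "/" + core + "/select?q="
                    + URLEncoder.encode(query, "UTF-8") + "&rows=" + rows;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr", "articles",
                "title:lucene AND body:index", 10));
        // prints http://localhost:8983/solr/articles/select?q=title%3Alucene+AND+body%3Aindex&rows=10
    }
}
```

The resulting URL can be opened in a browser or fetched with any HTTP client, which is a large part of why Solr is straightforward to integrate.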

Elasticsearch

Elasticsearch is a Lucene-based search server. Its distributed capabilities make it more stable and reliable, and it is currently a popular enterprise-grade search engine.

Lucene System Structure

The Lucene system consists of three parts: basic encapsulation classes, core search classes, and external interfaces.

At the time of writing, the latest Lucene release line is 7.x, but many companies still use older versions. Taking Lucene 6.0.0, the version used here, as an example, Lucene has the following packages:

Package Name	Description
org.apache.lucene.analysis	Language analyzers, mainly used for tokenization (word segmentation)
org.apache.lucene.codecs	Provides an abstraction over the encoding and decoding of the inverted index structure
org.apache.lucene.document	Manages the document structure used when storing the index
org.apache.lucene.index	Responsible for creating and deleting indexes, etc.
org.apache.lucene.search	Processes users' query requests
org.apache.lucene.store	Provides support for index storage
org.apache.lucene.util	Contains a number of useful data structures and utility classes

Lucene main logic diagram

Lucene provides powerful full-text search, which can be divided into the following two parts:

Text indexing: an analyzer (tokenizer) splits the text into terms, which are then stored in the index.
Conditional query: given the user's query request, Lucene searches the index for the information that matches the conditions.
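The two parts can be sketched with a toy inverted index. This is a minimal illustration of the idea, not Lucene's implementation or API: the class `ToyInvertedIndex`, its methods, and the simplistic lowercase-and-split tokenizer are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only: this is NOT Lucene's implementation or API.
class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    // Stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Text indexing: tokenize the document and record it under each term.
    int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Conditional query: look up the documents recorded for a single term.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Tom chases Jerry");   // document 0
        index.addDocument("Jerry eats cheese");  // document 1
        System.out.println(index.search("jerry")); // prints [0, 1]
        System.out.println(index.search("tom"));   // prints [0]
    }
}
```

Because the lookup goes from term to documents, a query never has to scan the document texts themselves.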

Lucene's query logic

and

A document matches only when both query conditions are met. For example, Tom AND Jerry means that a document is returned only if it contains both Tom and Jerry.

or

A document matches as long as either query condition is met. For example, Tom OR Jerry means that the document contains Tom, or Jerry, or both.

not

Excludes a query condition. For example, Tom NOT Jerry means that the document contains Tom but not Jerry; documents that contain Jerry are not returned.
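One way to picture these three operators is as set operations on the ids of matching documents. The sketch below assumes hypothetical posting sets for "tom" and "jerry" and uses made-up helper names; it illustrates the logic only and is not how Lucene evaluates queries internally.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration: AND / OR / NOT as set operations on document ids.
class BooleanPostings {
    // AND: keep only the ids present in both sets.
    static Set<Integer> and(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.retainAll(right);
        return result;
    }

    // OR: keep the ids present in either set.
    static Set<Integer> or(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.addAll(right);
        return result;
    }

    // NOT: keep the ids in the left set that are absent from the right set.
    static Set<Integer> not(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical postings: "tom" occurs in docs 0, 1, 2; "jerry" in docs 1, 2, 3.
        Set<Integer> tom = new TreeSet<>(Arrays.asList(0, 1, 2));
        Set<Integer> jerry = new TreeSet<>(Arrays.asList(1, 2, 3));

        System.out.println(and(tom, jerry)); // Tom AND Jerry -> [1, 2]
        System.out.println(or(tom, jerry));  // Tom OR Jerry  -> [0, 1, 2, 3]
        System.out.println(not(tom, jerry)); // Tom NOT Jerry -> [0]
    }
}
```

Since every operator returns a new set, the results can be fed into further operators to express combined conditions.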

wildcard characters ? and *

Lucene supports simple wildcard queries: ? matches exactly one character, and * matches zero or more characters.
For example, to? matches 'too', 'tom', and so on, while to* matches 'to', 'tomas', and so on.
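The wildcard semantics can be checked with a small sketch that translates a wildcard pattern into a Java regular expression. This is only an approximation for illustration; `WildcardMatcher` is a hypothetical helper, and Lucene's own wildcard matching is implemented differently.

```java
import java.util.regex.Pattern;

// Sketch: emulate ? and * wildcards with java.util.regex.
class WildcardMatcher {
    // '?' becomes "exactly one character", '*' becomes "zero or more characters";
    // every other character is quoted so that it matches literally.
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // The whole term must match the pattern, not just a part of it.
    static boolean matches(String wildcard, String term) {
        return compile(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("to?", "tom"));   // true
        System.out.println(matches("to?", "to"));    // false: ? needs one character
        System.out.println(matches("to*", "to"));    // true: * may match nothing
        System.out.println(matches("to*", "tomas")); // true
    }
}
```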

Lucene's query logic is simple and easy to understand. Only some of the operators are listed here; real applications often combine several of them. Out of the box, Lucene's analysis works best for English, so the examples here are in English; later, a Chinese tokenizer will be used to make Lucene support Chinese.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.