Learning Lucene in Depth (2): About Lucene

Source: Internet
Author: User


Preface: How does Lucene relate to information retrieval, and what makes it distinctive? This section introduces the basics of Lucene.

What is Lucene?



Lucene is an open source, free, full-text search toolkit written entirely in Java. It began as a subproject of the Apache Software Foundation's Jakarta project and has since become a top-level Apache project. Doug Cutting, its original author, is a veteran full-text search expert who has led the research and development of search engines.

The benefits of Lucene



Lucene's popularity is inseparable from the following advantages:

Cross-platform index file format: Lucene defines an index file format based on 8-bit bytes, so applications on different platforms remain compatible with one another's indexes.

Segmented (block) indexing on top of an inverted index: new documents are first written to small index files, which speeds up indexing; these small indexes are then merged with the existing index to optimize it.

Object-oriented system architecture: Lucene is written in Java, a language with a large user base, so the barrier to entry is relatively low.

Language-independent analysis interface: whatever the language of the text, you only need to implement your own text analyzer to use Lucene, which makes it easy to add support for new languages.

Default query engine implementations: Lucene ships with implementations of Boolean queries, fuzzy queries, grouping queries, and more, so users can use these query features with little effort.
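The idea of indexing new files into a small index and later merging it with the existing one can be sketched in plain Java. This is only a toy model, not Lucene's actual segment format or merge policy: the class `SegmentMergeSketch`, the map-based "segment", and the simple regex tokenizer are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model (hypothetical): a "segment" maps each term to the sorted set
// of IDs of the documents that contain it, i.e. a tiny inverted index.
final class SegmentMergeSketch {

    // Build one small segment from a batch of documents (docId -> text).
    static Map<String, TreeSet<Integer>> buildSegment(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> segment = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Naive tokenizer: lowercase, then split on non-word characters.
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                segment.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return segment;
    }

    // Merge two segments by taking the union of the posting sets of each term.
    static Map<String, TreeSet<Integer>> merge(Map<String, TreeSet<Integer>> older,
                                               Map<String, TreeSet<Integer>> newer) {
        Map<String, TreeSet<Integer>> merged = new TreeMap<>();
        for (Map<String, TreeSet<Integer>> segment : Arrays.asList(older, newer)) {
            for (Map.Entry<String, TreeSet<Integer>> e : segment.entrySet()) {
                merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, String> firstBatch = new TreeMap<>();
        firstBatch.put(1, "Tom chases Jerry");
        Map<Integer, String> secondBatch = new TreeMap<>();
        secondBatch.put(2, "Jerry hides");

        Map<String, TreeSet<Integer>> merged =
                merge(buildSegment(firstBatch), buildSegment(secondBatch));
        System.out.println(merged.get("jerry")); // prints [1, 2]
    }
}
```

Because each batch is indexed independently, adding new documents never requires rewriting the existing index; the cost of combining is deferred to the merge step.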

Related products of Lucene

Compass



Compass enhances the Lucene search engine for enterprise applications. It integrates with popular frameworks such as Hibernate and Spring, making it easier to use a search engine in Java programs.

Nutch



Nutch is a web search implementation built on the Lucene core. It adds a web crawler and the related interface components to Lucene, which together form a complete, working search engine.

Solr



Solr is a high-performance, Lucene-based full-text search server. Queries are issued through HTTP requests, it is not difficult to get started with, its retrieval performance is excellent, and it is widely used in enterprises.
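To illustrate what querying over HTTP looks like, here is a minimal sketch that only builds the URL of a request to Solr's `/select` handler with a `q` parameter; it does not send anything. The host and port `localhost:8983`, the core name `films`, and the field name `title` are assumptions for the example, not values taken from this article.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the URL for a Solr select request.
final class SolrQueryUrlSketch {

    static String selectUrl(String baseUrl, String core, String query) {
        try {
            // The query string must be URL-encoded before it goes into the q parameter.
            String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8.name());
            return baseUrl + "/solr/" + core + "/select?q=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed values: a local Solr instance and a core named "films".
        System.out.println(selectUrl("http://localhost:8983", "films", "title:Tom AND Jerry"));
        // prints http://localhost:8983/solr/films/select?q=title%3ATom+AND+Jerry
    }
}
```

Any HTTP client, or simply a browser, could then issue a GET request to the resulting URL.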

Elasticsearch



Elasticsearch is a Lucene-based search server. It adds distributed capabilities that make it more stable and reliable, and it is currently a popular enterprise-grade search engine.

Lucene System Structure



The Lucene system consists of three parts: basic encapsulation classes, core search classes, and external interfaces.






At the time of writing, the latest Lucene release line is 7.x, but most companies are still using older versions. Taking Lucene 6.0.0, the version we use, as an example, Lucene contains the following packages:


Package name                   Description
org.apache.lucene.analysis     Language analyzers, mainly used for tokenization (word segmentation)
org.apache.lucene.codecs       Provides an abstraction over the encoding and decoding of the inverted index structure
org.apache.lucene.document     Manages the document structure used when content is stored in the index
org.apache.lucene.index        Responsible for creating, deleting, and otherwise maintaining indexes
org.apache.lucene.search       Processes users' query requests
org.apache.lucene.store        Provides support for index storage
org.apache.lucene.util         Contains a number of useful data structures and utility classes

Lucene main logic diagram


Lucene provides powerful full-text search, which can be divided into two parts:

Text indexing: an analyzer (tokenizer) splits the text resources into terms, which are then stored in the index.

Conditional query: based on the user's query request, Lucene searches the index for the content that matches the conditions.
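The two steps, indexing and querying, can be sketched without the Lucene library at all. The following is a simplified illustration, not Lucene's API: the class `IndexAndSearchSketch`, its methods, and the naive English analyzer are hypothetical names introduced for this example. The analyzer is passed in as a function to mirror the idea that the analysis step is pluggable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical toy index: term -> sorted IDs of documents containing the term.
final class IndexAndSearchSketch {
    private final Function<String, List<String>> analyzer;
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    IndexAndSearchSketch(Function<String, List<String>> analyzer) {
        this.analyzer = analyzer;
    }

    // A naive analyzer for English text: lowercase, split on non-word characters.
    static List<String> simpleEnglishAnalyzer(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Step 1, text indexing: analyze the text and record each term's document ID.
    void addDocument(int docId, String text) {
        for (String term : analyzer.apply(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Step 2, conditional query: analyze the query word the same way and look it up.
    TreeSet<Integer> search(String word) {
        List<String> terms = analyzer.apply(word);
        if (terms.isEmpty()) {
            return new TreeSet<>();
        }
        return new TreeSet<>(postings.getOrDefault(terms.get(0), new TreeSet<>()));
    }

    public static void main(String[] args) {
        IndexAndSearchSketch index =
                new IndexAndSearchSketch(IndexAndSearchSketch::simpleEnglishAnalyzer);
        index.addDocument(1, "Tom chases Jerry");
        index.addDocument(2, "Jerry hides");
        index.addDocument(3, "Tom sleeps");
        System.out.println(index.search("Tom")); // prints [1, 3]
    }
}
```

Running the query through the same analyzer as the documents is what lets "Tom" find terms that were stored in lowercase.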



Lucene's query logic 


AND

A document matches only when both query conditions are satisfied. For example, Tom AND Jerry means that a document is returned only if it contains both Tom and Jerry; a document that lacks either one is not returned.

OR



A document matches as long as at least one of the two query conditions is satisfied. For example, Tom OR Jerry means that the document contains Tom, or Jerry, or both.

NOT



Excludes a query condition. For example, Tom NOT Jerry matches documents that contain Tom but do not contain Jerry; a document that contains Jerry is not returned.
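On an inverted index, the AND, OR, and NOT logic described above comes down to set operations on the document IDs matched by each term. The sketch below assumes the posting sets for Tom and Jerry have already been looked up; the class `BooleanLogicSketch` and the document IDs are made up for illustration, and this is not how Lucene implements Boolean queries internally.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: Boolean query logic as set operations on document IDs.
final class BooleanLogicSketch {

    // AND: documents present in both sets (intersection).
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR: documents present in at least one set (union).
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT: documents in the first set that are absent from the second (difference).
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Assumed posting sets: IDs of documents containing each term.
        Set<Integer> tom = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> jerry = new TreeSet<>(Arrays.asList(2, 3, 4));

        System.out.println(and(tom, jerry)); // prints [2, 3]
        System.out.println(or(tom, jerry));  // prints [1, 2, 3, 4]
        System.out.println(not(tom, jerry)); // prints [1]
    }
}
```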

Wildcard characters ? and *



Lucene supports simple wildcard queries: ? matches exactly one character, and * matches zero or more characters.
For example, to? matches 'Too', 'Tom', and so on, while to* matches 'to', 'Tomas', and so on.
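The matching rules for ? and * can be illustrated by translating a wildcard pattern into a regular expression. This is a sketch of the semantics only, not Lucene's implementation; the class `WildcardSketch` is hypothetical, and matching is made case-insensitive here as an assumption so that the capitalized examples above match.

```java
import java.util.regex.Pattern;

// Hypothetical helper: translates ? and * wildcards into a regular expression.
final class WildcardSketch {

    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');      // exactly one character
            } else if (c == '*') {
                regex.append(".*");     // zero or more characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        // Assumption: ignore case, so "to?" also matches "Tom".
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("to?", "Tom"));   // prints true
        System.out.println(matches("to?", "to"));    // prints false
        System.out.println(matches("to*", "to"));    // prints true
        System.out.println(matches("to*", "Tomas")); // prints true
    }
}
```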



Lucene's query logic is simple and easy to understand. Only part of it is covered here; in real applications, several kinds of query logic are often combined. By default, Lucene only supports English, so the examples in this section are in English. A Chinese tokenizer will be used later to enable Lucene to support Chinese.

