Lucene: some concepts and the index-building process

Source: Internet
Author: User

Before document content can be searched, documents from a variety of sources (web pages, databases, e-mails, etc.) must be indexed. Indexing means extracting the content, normalizing it (by modeling the content), and storing it.

Indexing involves a few basic concepts, described here according to my own understanding:

Document:

A document is the basic unit of indexing and searching (similar to a record in a relational database table). If we index and search web page content, every page crawled from the Internet is eventually analyzed, its meaningful parts (such as the page title, URL, keywords it contains, release date, etc.) are extracted, and the result is stored as a document. At search time, queries are matched against this content to find matching documents, and the required content is then read back from those documents and returned.

Field:

A field is the part of a document that is actually matched against during a search. A document is made up of one or more fields. This is analogous to a relational table, where a record consists of multiple columns, each with its type and corresponding value: a Lucene document is made up of fields, each with its name, type, and value. For the field options in Lucene, refer to the previous article I wrote: Field options in Lucene
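The document/field relationship can be sketched with a small model. This is a minimal illustration in plain Python, not Lucene's API: the `Field` and `Document` classes, the `kind` attribute (standing in for the field type), and the `page_to_document` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Field:
    """One named, typed value inside a document (toy class, not Lucene's)."""
    name: str
    kind: str   # assumed stand-in for the field type, e.g. "text" or "string"
    value: Any


@dataclass
class Document:
    """A document is simply a collection of fields."""
    fields: List[Field] = field(default_factory=list)

    def add(self, name: str, kind: str, value: Any) -> "Document":
        self.fields.append(Field(name, kind, value))
        return self

    def get(self, name: str) -> Any:
        # Return the value of the first field with this name, or None.
        for f in self.fields:
            if f.name == name:
                return f.value
        return None


def page_to_document(title: str, url: str, published: str) -> Document:
    """Turn the meaningful parts of a crawled page into one document."""
    return (
        Document()
        .add("title", "text", title)
        .add("url", "string", url)
        .add("published", "string", published)
    )
```

Calling `page_to_document("Lucene basics", "http://example.com/lucene", "2015-01-01")` yields a document with three fields, and `doc.get("title")` reads one of them back, much as a column is read from a database record.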

Analyzer/Term:

The analyzer is also used in both indexing and searching. It breaks the original document (or the user's input) into individual words, called terms. Lucene's index is an inverted index, which stores the mapping from terms to documents. At index time, the analyzer converts the original document into terms, and the relationship between each term and the document is stored in the index. At search time, the analyzer converts the user's input into terms, which are then looked up in the index to find the matching documents.
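The idea of an analyzer feeding an inverted index can be sketched in a few lines. This is a toy model in plain Python rather than Lucene code: `analyze`, `build_inverted_index`, and `search` are hypothetical helpers, and the analyzer here only lowercases and splits on non-alphanumeric characters, which is far simpler than what a real Lucene analyzer does.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def analyze(text: str) -> List[str]:
    """Toy analyzer: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of ids of the documents that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every term of the query."""
    terms = analyze(query)  # the same analyzer is applied to the user input
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Using the same `analyze` function for both the documents and the query is the important design point: a query term can only match if it was normalized in the same way as the indexed text.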

Lucene's indexing process is divided into three main steps:

1. Convert documents from their various source formats to text

2. Analyze the text with an analyzer

3. Save the analyzed text to the index
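The three steps above can be sketched end to end. This is a minimal illustration using only the Python standard library, with HTML standing in for "a source format"; `html_to_text`, `analyze`, and `TinyIndex` are hypothetical names for this example and do not correspond to Lucene classes.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    # Step 1: convert a source format (here HTML) to plain text.
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(c.strip() for c in parser.chunks if c.strip())


def analyze(text):
    # Step 2: break the text into normalized terms (toy analyzer).
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """In-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Step 3: save the analyzed text to the index.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def lookup(self, term):
        # Copy the posting set so callers cannot mutate the index.
        return set(self.postings.get(term, set()))
```

A web page would go through all three steps (`index.add(1, html_to_text(page))`), while a source that is already plain text can skip the first one and be passed to `add` directly.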

Here is a picture I found on the Internet that explains the Lucene indexing process (and the search process) very well:

[Figure: the Lucene indexing and search process]
