Before document content can be searched, documents from a variety of sources (web pages, databases, e-mails, etc.) must be indexed. Indexing extracts the content, normalizes it (by modeling it as documents and fields), and stores it.
Indexing involves a few basic concepts, which I describe below according to my own understanding:
Document:
Documents are used in both indexing and searching; a document is the basic unit of indexing and search (similar to a record in a relational database table). If we index and search web page content, every page crawled from the Internet is eventually analyzed: the meaningful parts (such as the page title, URL, keywords, and release date) are extracted and stored as a document. At search time, queries are matched against this content to find the matching documents, and the required content is then retrieved from those documents and returned.
Field:
A field is what is actually matched against within a document. A document is made up of one or more fields. This is analogous to a relational table, where a record consists of multiple fields, each with its type and corresponding value; likewise, a Lucene document is made up of fields, each with a name, type, and value. For the field options in Lucene, refer to the previous article I wrote: Field options in Lucene
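To make the document/field relationship concrete, here is a minimal sketch in Python. It is a toy model and not Lucene's actual API: the class names, the `kind` labels ("text", "keyword", "date"), and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field as dc_field
from typing import List, Optional


@dataclass
class Field:
    name: str   # e.g. "title" or "url"
    kind: str   # hypothetical type label, e.g. "text", "keyword", "date"
    value: str


@dataclass
class Document:
    fields: List[Field] = dc_field(default_factory=list)

    def add(self, name: str, kind: str, value: str) -> "Document":
        """Append a field and return the document so calls can be chained."""
        self.fields.append(Field(name, kind, value))
        return self

    def get(self, name: str) -> Optional[str]:
        """Return the value of the first field with this name, or None."""
        for f in self.fields:
            if f.name == name:
                return f.value
        return None


# A crawled web page modeled as one document with three fields (sample data).
page = (
    Document()
    .add("title", "text", "Getting started with Lucene")
    .add("url", "keyword", "http://example.com/lucene")
    .add("published", "date", "2015-01-01")
)
```

Calling `page.get("title")` returns the stored title, while a name that was never added yields `None`.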
Analyzer / Term:
The analyzer is also used in both indexing and searching. It breaks the original document (or the user's input) into individual words, called terms. Lucene's index is an inverted index, which stores the mapping from terms to documents. During indexing, the analyzer converts the original document into terms, and the relationship between each term and its documents is stored in the index. During searching, the analyzer converts the user's input into terms, which are then looked up in the index to find the matching documents.
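The idea of an inverted index can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how Lucene is implemented: the `analyze` function is a toy analyzer that lowercases and splits on non-alphanumeric characters, and the sample documents are made up.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Sample documents keyed by a hypothetical document id.
docs = {
    1: "Lucene builds an inverted index",
    2: "An index maps terms to documents",
}

# Inverted index: term -> set of ids of the documents containing that term.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        inverted[term].add(doc_id)


def search(query):
    """Return the ids of documents that contain every term of the query."""
    postings = [inverted.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()
```

Note that the same `analyze` function is applied to both the documents and the query, which is why a query typed as "INDEX" still matches text stored as "index".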
Lucene's indexing process is divided into three main steps:
1. Convert documents from their various source formats to text
2. Analyze the text with an analyzer
3. Save the analyzed text to the index
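The three steps can be sketched end to end as follows. This is a simplified illustration in Python rather than Lucene code: it assumes the source documents are HTML pages, uses the standard library's `html.parser` for text extraction, and the helper names (`to_text`, `analyze`, `add_to_index`) and sample pages are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_text(html):
    """Step 1: convert a source document (here, HTML) to plain text."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


def analyze(text):
    """Step 2: break the text into lowercase terms (toy analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_to_index(index, doc_id, terms):
    """Step 3: record each term -> document relationship in the index."""
    for term in terms:
        index.setdefault(term, set()).add(doc_id)


index = {}
pages = {
    "page-1": "<html><body><h1>Lucene</h1><p>Indexing basics</p></body></html>",
    "page-2": "<html><body><p>Searching the index</p></body></html>",
}
for doc_id, html in pages.items():
    text = to_text(html)                 # step 1
    terms = analyze(text)                # step 2
    add_to_index(index, doc_id, terms)   # step 3
```

Only step 1 depends on the source format; a database row or an e-mail would need a different `to_text`, while steps 2 and 3 would stay the same.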
Here is a picture I found on the Internet that explains the Lucene indexing process (and the search process) very well:
Some Lucene concepts and the index-building process