1. System Structure Organization
Lucene is an excellent full-text search engine whose system structure is strongly object-oriented. First, it defines an index file format that is independent of the platform. Second, it designs the core components of the system as abstract classes and the platform-specific parts as implementations of those abstract classes; in addition, the parts tied to a specific platform, such as file storage, are encapsulated in classes of their own. After this layer-by-layer object-oriented treatment, the result is a retrieval engine with low coupling, high efficiency, and easy secondary development.
The following describes the structure of the Lucene system and provides the system structure and source code organization diagram:
We can clearly see that Lucene's system consists of three parts: basic structure encapsulation, the index core, and the external interfaces. The index core, which operates directly on the index files, is the focus of the system. Lucene divides all of its source code into seven modules (each represented by a package in Java), and the diagram shows which part of the system each module belongs to. Note that org.apache.lucene.queryParser exists as the syntax parser for org.apache.lucene.search and is not actually called from outside the system. It is therefore not regarded as an external interface but is treated as an independent part.
From an object-oriented perspective, Lucene applies one of the most basic programming principles: introducing additional abstraction layers to reduce coupling. First, org.apache.lucene.store encapsulates storage; the implementation of the index core is then built on top of it in org.apache.lucene.index. Based on the index core, the external interfaces org.apache.lucene.search and org.apache.lucene.analysis are designed. Lucene applies this principle down to the details, including some common data structures and algorithms. Because it is grounded so thoroughly in object-oriented design, Lucene is easy to understand and extend.
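The layering described above can be sketched in plain Java. The class names below (SketchDirectory, RamSketchDirectory, SketchIndexWriter) are hypothetical and are not Lucene's own classes; this is only a minimal illustration, assuming an in-memory store, of how index code can depend on an abstract storage layer rather than on a concrete file system.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage abstraction: index code sees only this class,
// never a concrete file system.
abstract class SketchDirectory {
    abstract void writeFile(String name, byte[] data);
    abstract byte[] readFile(String name);
}

// One concrete binding of the abstraction: keeps "files" in memory.
class RamSketchDirectory extends SketchDirectory {
    private final Map<String, byte[]> files = new HashMap<>();

    @Override
    void writeFile(String name, byte[] data) {
        files.put(name, data.clone());
    }

    @Override
    byte[] readFile(String name) {
        byte[] data = files.get(name);
        if (data == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return data.clone();
    }
}

// Hypothetical index-layer class that depends only on the abstraction.
class SketchIndexWriter {
    private final SketchDirectory directory;

    SketchIndexWriter(SketchDirectory directory) {
        this.directory = directory;
    }

    // Stores the document text as UTF-8 bytes under the given name.
    void addDocument(String name, String text) {
        directory.writeFile(name, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Swapping RamSketchDirectory for a disk-backed subclass would not require any change to SketchIndexWriter, which is the reduction in coupling the extra layer is meant to buy.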
Lucene introduces an application structure different from the traditional client/server structure. Lucene can be included in the application itself as a runtime library, rather than running as a separate index server. This is naturally inseparable from Lucene's open-source nature, but it also reflects Lucene's original intention: to provide a full-text indexing engine architecture, rather than a finished implementation.
2. Data Stream Analysis
Another way to understand the Lucene system structure is to follow the direction of the data streams through the system and so understand the calling sequence inside Lucene. On this basis, we can gain a deeper understanding of how the system is organized, which helps with later development work on Lucene. This analysis is the key to going deep into the Lucene system and is also the basis for modifying it.
Let's take a look at the main data streams in the Lucene system and their relationships:
Figure 2.2 shows how Lucene's internal data streams are organized; following the direction of the data streams also gives a clear picture of Lucene's internal execution sequence. We now describe the types of stream that appear in the figure and the part of the system that each kind of logic corresponds to.
The figure contains four types of data stream: the text stream, the token stream, the byte stream, and the query statement object stream. The text stream abstracts both the indexing target and interaction with the user: a text stream represents a file to be indexed, and text streams also carry output information to the user. In the actual implementation, text streams in Lucene use UCS-2 [19] as their encoding in order to handle multilingual text. The token stream is a concept used inside Lucene. It is an abstraction of the traditional notion of a word and is the smallest unit that Lucene can process directly when indexing; in short, a token is a combination of a word and the field it belongs to. Tokens come up again in the description of the file format, so they are not detailed here. The byte stream embodies direct operations on the file abstraction. A byte stream is processed in fixed-length bytes (Lucene defines a byte as 8 bits long, as detailed in the file format description below), which makes file operations independent of the platform's file system. The query statement object stream is used only when a query statement is parsed. It abstracts the query statement, reflects its structure through a class inheritance hierarchy, and is passed to the search logic to perform the search.
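To make the byte stream idea concrete, the following is a minimal sketch of writing a 32-bit integer as four 8-bit bytes in a fixed order, so that the result does not depend on the byte order of the host platform. The class name ByteStreamSketch is hypothetical, and the choice of writing the high-order byte first is an assumption made for this sketch; the file format description is the authority on the actual layout.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of Lucene.
class ByteStreamSketch {
    // Writes a 32-bit int as four 8-bit bytes, high-order byte first.
    // ByteArrayOutputStream.write(int) keeps only the low 8 bits.
    static void writeInt(ByteArrayOutputStream out, int value) {
        out.write(value >>> 24);
        out.write(value >>> 16);
        out.write(value >>> 8);
        out.write(value);
    }

    // Reads the int back from four bytes starting at the given offset.
    static int readInt(byte[] bytes, int offset) {
        return ((bytes[offset] & 0xFF) << 24)
             | ((bytes[offset + 1] & 0xFF) << 16)
             | ((bytes[offset + 2] & 0xFF) << 8)
             | (bytes[offset + 3] & 0xFF);
    }
}
```

Because every value is reduced to a sequence of 8-bit bytes in a defined order, a file written on one platform can be read back unchanged on another.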
The figure involves several kinds of logic. Each corresponds more or less directly to one module of the system, but there are also cross-module calls. This is because Lucene reuses code heavily: many implementations call earlier work directly. That does increase the coupling between modules to some extent, but it is a compromise that keeps the system from growing too large and avoids unnecessary duplicated design. The lexical analysis logic corresponds to org.apache.lucene.analysis. The query statement syntax analysis logic corresponds to org.apache.lucene.queryParser and calls code in org.apache.lucene.analysis. Once a search completes, its results are passed as a token stream to the scoring and sorting logic, which outputs the final results as a text stream. This part of the implementation is also contained in org.apache.lucene.search. The indexing logic corresponds to org.apache.lucene.index. The index search logic lives mainly in org.apache.lucene.search, but it also makes extensive use of the code and interface definitions in org.apache.lucene.index. The storage abstraction corresponds to org.apache.lucene.store. The modules not mentioned here serve as the common infrastructure of the system.
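The flow just described (text in, tokens out of lexical analysis, tokens into the index, a query through the same analysis, then scoring and sorting) can be traced in a toy program. Everything in this sketch is hypothetical: the class DataFlowSketch is not a Lucene class, the index is a plain in-memory map, and the score is simply a sum of term frequencies standing in for Lucene's real relevance computation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy pipeline, not Lucene's implementation.
class DataFlowSketch {
    // Inverted index: term -> (document id -> term frequency).
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Lexical analysis: a text stream goes in, a token stream comes out.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Indexing logic: consumes the token stream of one document.
    void index(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Search plus scoring: the query passes through the same analysis,
    // and documents are ranked by summed term frequency (ties by id).
    List<Integer> search(String query) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String token : analyze(query)) {
            Map<Integer, Integer> docs =
                    postings.getOrDefault(token, Collections.<Integer, Integer>emptyMap());
            docs.forEach((doc, tf) -> scores.merge(doc, tf, Integer::sum));
        }
        List<Integer> hits = new ArrayList<>(scores.keySet());
        hits.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return hits;
    }
}
```

Note that search calls analyze just as index does, which mirrors the cross-module reuse mentioned above: query parsing relies on the same lexical analysis as indexing.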
3. Lucene-based Application Development
Through the system structure analysis and data stream analysis above, we now have a clear understanding of the structural characteristics of Lucene's system. On this basis, we can build a complete full-text search engine by extending the Lucene system, and then build various application systems on top of that search engine. Since that is not the purpose of this article, the following only briefly describes the relevant steps to give some ideas for application development.
First, we need to construct lexical analysis logic for the lexical structure of the target language by implementing the interfaces Lucene defines in org.apache.lucene.analysis, which gives Lucene the language processing capability that the target system needs. By default, Lucene already implements simple lexical analysis for English and German (splitting words on whitespace and filtering common function words, such as is, am, and are in English). The interfaces to implement are defined in Analyzer.java and Tokenizer.java in org.apache.lucene.analysis; Lucene also provides many sample implementations for English that can serve as a reference. Second, based on the Document class defined in org.apache.lucene.document, define your own HTMLDocument class, which can then be handed to the org.apache.lucene.index module to be written to the index file. After completing these two steps, the Lucene full-text search engine is basically complete. This process can be expressed as follows:
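As a rough illustration of the first step, the sketch below splits text on whitespace and filters a few common English function words, producing tokens that pair each word with its field. The types SketchToken, SketchAnalyzer, and StopWordSketchAnalyzer are hypothetical stand-ins invented for this example; they are not the Analyzer and Tokenizer types from org.apache.lucene.analysis, whose actual signatures should be taken from the Lucene source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical token: a word plus the field it came from.
final class SketchToken {
    final String field;
    final String text;

    SketchToken(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

// Hypothetical analysis interface; support for a new language plugs in here.
interface SketchAnalyzer {
    List<SketchToken> tokenize(String field, String text);
}

// Splits on whitespace and drops a small set of English function words.
class StopWordSketchAnalyzer implements SketchAnalyzer {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "am", "are"));

    @Override
    public List<SketchToken> tokenize(String field, String text) {
        List<SketchToken> tokens = new ArrayList<>();
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(new SketchToken(field, word));
            }
        }
        return tokens;
    }
}
```

A language without whitespace between words would need a different implementation of the same interface, which is exactly the extension point this step is about.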
Of course, the above shows only the basic process of extending Lucene, which turns Lucene from an incomplete system into a complete one (especially for retrieval in languages other than English). Beyond that, Lucene can be modified in many other ways. The first is to sort the returned query results by a field indexed in the document, such as the title or the author. This requires changing Lucene's scoring and sorting logic. By default, Lucene uses its own internal relevance method to score and sort results, and we can change it as needed. Unfortunately, this part of Lucene is not as cleanly designed for extension as lexical analysis and document types are, and no good interface is provided. You therefore need to analyze its source code carefully and extend it yourself. Other aspects, such as improving indexing efficiency and improving the buffering mechanism used when returning results, all concern strengthening the Lucene system and are not described here.
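The difference between the default relevance ordering and ordering by an indexed field can be shown with a small, self-contained sketch. SketchHit and FieldSortSketch are hypothetical names, and the code only sorts an in-memory list of results; it does not show how to hook into Lucene's own scoring and sorting logic, which, as noted above, requires working with the source code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical search result: a relevance score plus stored field values.
final class SketchHit {
    final double score;
    final Map<String, String> fields;

    SketchHit(double score, Map<String, String> fields) {
        this.score = score;
        this.fields = fields;
    }
}

class FieldSortSketch {
    // Relevance ordering: highest score first.
    static List<SketchHit> byScore(List<SketchHit> hits) {
        List<SketchHit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((SketchHit h) -> h.score).reversed());
        return sorted;
    }

    // Field ordering: ascending by the named field; hits lacking it go last.
    static List<SketchHit> byField(List<SketchHit> hits, String field) {
        List<SketchHit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing(
                (SketchHit h) -> h.fields.get(field),
                Comparator.nullsLast(Comparator.<String>naturalOrder())));
        return sorted;
    }
}
```

Calling byField(hits, "title") returns the same hits ordered alphabetically by title, independent of their relevance scores.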
Once the Lucene system is complete, you can start to consider developing an application system on top of it. If the application system is also written in Java, the Lucene system can be conveniently embedded in it and called as a set of APIs. This process is very simple. The following is an example program, which is easy to follow with the help of its comments.