Adding search capabilities to an application is a common requirement. This article describes a framework that developers can use to implement search engine functionality with minimal effort, ideally requiring only one configuration file. The framework is built on a number of open source libraries and tools, such as Apache Lucene, the Spring Framework, and cpdetector. It supports multiple kinds of resources; two typical examples are database resources and file system resources. An indexer indexes the configured resources and transfers the results to a central server, where they can be searched through the API. The Spring-style configuration file allows clear and flexible customization and tuning, and the core API also provides extension interfaces.
Introduction
Adding search capabilities to an application is a common requirement. Although a number of libraries already provide the underlying search infrastructure, for many people building a search engine from scratch with them is still expensive and potentially tedious. On the other hand, many small applications have very similar search requirements and usage scenarios. This article attempts to build a flexible search engine framework in Java that is applicable to most small applications. With this framework, you can build a search engine with minimal effort in most cases; ideally, only one configuration file is required. In special cases, the framework can be extended to meet additional requirements. As the title suggests, this is made possible by building on open source tools.
Basic knowledge
Apache Lucene is the most widely used Java class library for developing search applications, and our framework is based on it. To make the later sections easier to follow, we first need to cover a few Lucene and search basics. Note that this article does not discuss the index file format, word segmentation (tokenization), or similar topics.
What is search and indexing
From a user's point of view, searching is the process of finding specific content in a resource by means of keywords. From the computer's point of view, there are two ways to implement this process. The first is to match every resource against the keyword and return all content that satisfies the match. The second is to build a lookup table in advance, much like the index of a dictionary, that maps each keyword to the resource content containing it; a search then only has to consult this table. Obviously, the second approach is much more efficient. Building this lookup table is, in fact, the process of building an inverted index.
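The two approaches can be contrasted in a minimal sketch. This is not Lucene code and not part of the framework: the class name InvertedIndexSketch, its methods, and the naive tokenizer (lower-casing and splitting on non-alphanumeric characters) are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene or of the framework.
class InvertedIndexSketch {
    // The lookup table: term -> sorted ids of the documents containing it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // The raw resources, kept only so the brute-force scan can be shown.
    private final Map<Integer, String> docs = new HashMap<>();

    // Naive tokenizer (an assumption): lower-case, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Indexing: record, for every term, which document it appears in.
    void add(int docId, String text) {
        docs.put(docId, text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Approach 1: match every resource against the keyword at search time.
    List<Integer> scan(String keyword) {
        String k = keyword.toLowerCase(Locale.ROOT);
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            if (tokenize(e.getValue()).contains(k)) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    // Approach 2: a single lookup in the table built in advance.
    List<Integer> lookup(String keyword) {
        TreeSet<Integer> ids = postings.get(keyword.toLowerCase(Locale.ROOT));
        List<Integer> result = new ArrayList<>();
        if (ids != null) {
            result.addAll(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a full-text search library");
        index.add(2, "Spring wires the configuration");
        index.add(3, "Search the index, not the files");
        System.out.println(index.lookup("search"));
    }
}
```

Both methods return the same document ids; the difference is that scan re-reads every resource on each query, while lookup pays that cost once, at indexing time.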
Lucene Basic Concepts
Lucene is a full-text search library written in Java, originally developed by Doug Cutting. We assume that the reader already has a basic understanding of it, so we give only a brief introduction to some important concepts. For more information, refer to the articles and books listed in the references. The following are some of the more important classes in Lucene.
Document: An index contains multiple Document objects, and each Document contains one or more Field objects. A Document can represent a set of rows from a database table, a file, a Web page, and so on. Note that it is not equivalent to a file in the file system.
Field: A Field has a name and corresponds to one part of a document, representing either the document's content or the document's metadata (which is not the same concept as the resource metadata mentioned later). A Field object has two important attributes: Store (one of three values: YES, NO, COMPRESS) and Index (one of four values: TOKENIZED, UN_TOKENIZED, NO, NO_NORMS).
Query: Abstracts the statements that are used when searching.
IndexSearcher: Given a Query object, it searches an existing index and returns the search results.
Hits: A container that contains pointers to a subset of search results.
Indexing with Lucene works roughly as follows: the input data source is first converted into a uniform form, a string or text stream; the data is then extracted from it, and the appropriate Field objects are created and added to the Document object that represents that data source.
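The shape of that process can be sketched without depending on Lucene itself. Everything in the following block is an assumption made for illustration: SketchField and SketchDocument are simplified stand-ins for Lucene's Field and Document, the two boolean flags only loosely mirror the Store and Index attributes, and fromRow and fromFile are hypothetical helpers, not framework or Lucene APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT Lucene classes.
class IndexingProcessSketch {
    // Simplified field: a name, a string value, and two flags that loosely
    // mirror the Store and Index attributes described above.
    static final class SketchField {
        final String name;
        final String value;
        final boolean stored;
        final boolean tokenized;

        SketchField(String name, String value, boolean stored, boolean tokenized) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.tokenized = tokenized;
        }
    }

    // Simplified document: an ordered collection of fields.
    static final class SketchDocument {
        private final List<SketchField> fields = new ArrayList<>();

        void add(SketchField field) {
            fields.add(field);
        }

        // Returns the value of the first field with this name, or null.
        String get(String name) {
            for (SketchField f : fields) {
                if (f.name.equals(name)) {
                    return f.value;
                }
            }
            return null;
        }

        int size() {
            return fields.size();
        }
    }

    // A database row, already read into strings, becomes one field per column.
    static SketchDocument fromRow(Map<String, String> row) {
        SketchDocument doc = new SketchDocument();
        for (Map.Entry<String, String> column : row.entrySet()) {
            doc.add(new SketchField(column.getKey(), column.getValue(), true, true));
        }
        return doc;
    }

    // A file becomes a metadata field (its path) plus a content field.
    static SketchDocument fromFile(String path, String content) {
        SketchDocument doc = new SketchDocument();
        doc.add(new SketchField("path", path, true, false));      // kept verbatim
        doc.add(new SketchField("content", content, false, true)); // searchable text
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "42");
        row.put("title", "Search framework notes");
        System.out.println(fromRow(row).get("title"));
        System.out.println(fromFile("/docs/readme.txt", "hello index").get("path"));
    }
}
```

The point of the sketch is that two very different sources, a table row and a file, end up in the same uniform structure, which is what lets a single indexer handle both.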
System Overview
To build a general-purpose framework, what the different situations have in common must be abstracted out. In terms of design, two points need attention: the first is to provide extension interfaces, and the second is to minimize the coupling between modules. Our framework is simply divided into two modules: an indexing module and a search module. Indexing modules index resources on different machines, and the resulting index files (and, as we will see later, metadata) are transferred to a single location, either a remote server or the local machine. The search module uses the data collected from the indexing modules to serve the user's search requests.
Figure 1 shows the overall framework. As you can see, the two modules are relatively independent: they are connected not through code but through the index and the metadata. In the following sections, we describe in detail how to design and implement these two modules based on open source tools.
Figure 1. System Architecture Diagram