Lucene.Net Learning 1: Introduction to the Lucene.Net Full-Text Search Project


What is full-text search (full-text index)

Full-text retrieval is a technology for searching massive amounts of data. An indexing program scans every word in an article and builds an index entry for each word, recording how many times the word occurs and at which positions. When a user submits a query, the retrieval program searches the pre-built index and returns the results to the user. The process is similar to looking up a word through the index of a dictionary.
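The indexing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's implementation: the function names build_index and search, the regular-expression tokenizer, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions]} by scanning every word once."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"\w+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, word):
    """Answer a query from the pre-built index instead of rescanning the text."""
    postings = index.get(word.lower(), {})
    # For each matching document, report the occurrence count and the positions.
    return {doc_id: (len(pos), pos) for doc_id, pos in postings.items()}


docs = {
    1: "Lucene builds an index for every word",
    2: "An index maps a word to its positions, word by word",
}
index = build_index(docs)
result = search(index, "word")
```

Here result holds, for each document that contains "word", the number of occurrences and the list of word positions, which is the information the paragraph above says an index records.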

Full-text retrieval is divided into character-based retrieval and word-based retrieval. Character-based retrieval indexes every character in an article, and the query is broken down into combinations of characters at search time. What counts as a "character" differs between languages: in English, the character-level unit and the word are effectively the same thing, whereas in Chinese, characters and words are quite different. Word-based retrieval indexes the words in a document, that is, its semantic units, and makes it possible to handle synonyms. Because English and other Western texts are already segmented by spaces, indexing them is effectively word-based, and synonym handling is easy to add. Chinese and other East Asian texts must first be segmented into words before they can be indexed by word. This makes full-text retrieval, and Chinese full-text retrieval in particular, a difficult problem; we will not go into it in detail here.
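The difference between the two approaches can be made concrete with a small sketch. The functions below are hypothetical helpers written for this article; dictionary_segment uses forward maximum matching against a tiny hand-made vocabulary, which is only one simple way to segment unspaced text and is not what Lucene itself ships.

```python
def word_tokens(text):
    """Word-based units for space-delimited text such as English."""
    return text.lower().split()


def char_tokens(text):
    """Character-based units: every non-space character is indexed on its own."""
    return [ch for ch in text if not ch.isspace()]


def dictionary_segment(text, vocabulary, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no vocabulary word matches.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


english = word_tokens("Full text search")
chinese_chars = char_tokens("全文检索技术")
chinese_words = dictionary_segment("全文检索技术", {"全文", "检索", "技术"})
```

For the English sentence, splitting on spaces already yields words. For the Chinese phrase, char_tokens yields six single characters, while dictionary_segment needs a vocabulary to recover the three words, which is why word-based indexing of Chinese text is harder.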

A full-text retrieval system is a software system built on full-text retrieval theory to provide full-text retrieval services. At a minimum, it must offer indexing and basic query functions. Modern full-text retrieval systems are also expected to provide a convenient user interface, development interfaces for the WWW [1], and interfaces for secondary application development. Functionally, the core of a full-text retrieval system is building the index, processing queries and returning result sets, adding to the index, and optimizing the index structure; peripheral functions are supplied by the various applications built around it. Structurally, the core consists of the index engine, the query engine, the text analysis engine, and the external interfaces. Together with the peripheral application systems, these make up a full-text retrieval system.

What is Lucene.Net?

Lucene.Net is the .NET port of Lucene, a well-known Java-based full-text indexing toolkit. Lucene is not a complete full-text search engine but a full-text search engine architecture: it provides a complete query engine and index engine, plus a partial text analysis engine (for two Western languages, English and German). Lucene aims to give software developers a simple, easy-to-use toolkit with which they can add full-text retrieval to a target system, or build a complete full-text search engine on top of it.

 

Lucene.Net applications, features, and advantages

Lucene, as a full-text search engine, has the following outstanding advantages:

(1) The index file format is independent of the application platform. Lucene defines an index file format based on 8-bit bytes, so that compatible systems and applications on different platforms can share the same index files.

(2) Building on the inverted index of traditional full-text search engines, Lucene implements segmented indexing: it creates small index files for new documents, which speeds up indexing, and then optimizes the index by merging them with the existing index.

(3) An excellent object-oriented architecture makes Lucene extensions easier to learn and new functionality easier to add.

(4) Lucene provides a text analysis interface that is independent of language and file format. The indexer builds the index files from a stream of tokens, so to support a new language or file format, users only need to implement the text analysis interface.

(5) A powerful query engine is implemented by default. Users get strong query capabilities without writing code of their own; by default, Lucene supports Boolean operations, fuzzy search [11], and grouped queries.
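Points (2) and (5) can be illustrated with a simplified in-memory model. This sketch assumes that a "segment" is just a small inverted index held in a dictionary; it says nothing about Lucene's actual on-disk format or API, and the names build_segment, merge_segments, and search_all are invented for the example.

```python
from collections import defaultdict


def build_segment(docs):
    """Build a small inverted index (a 'segment') for one batch of documents."""
    segment = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            segment[word].add(doc_id)
    return segment


def merge_segments(segments):
    """Combine several segments into one index by taking the union of postings."""
    merged = defaultdict(set)
    for segment in segments:
        for word, doc_ids in segment.items():
            merged[word] |= doc_ids
    return merged


def search_all(index, *words):
    """Boolean AND: return the documents that contain every query word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()


# The existing index and a small segment for a newly added document.
old = build_segment({1: "apache lucene index", 2: "full text search"})
new = build_segment({3: "lucene full text index"})
index = merge_segments([old, new])
hits = search_all(index, "lucene", "index")
```

Indexing the new document only touches the small segment; the merge step afterwards folds it into the main index, after which Boolean queries such as the AND query above see documents from both.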

 

Lucene system structure and organization

Lucene, as an excellent full-text search engine, has a strongly object-oriented system structure. First, it defines a platform-independent index file format. Second, the core components of the system are designed as abstract classes, and the concrete platform implementations are designed as implementations of those abstract classes. In addition, platform-specific parts such as file storage are encapsulated as classes. After this layer-by-layer object-oriented treatment, the result is a retrieval engine with low coupling and high efficiency that is easy to extend through secondary development.
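The pattern of an abstract core with platform-specific implementations can be sketched as follows. The class and function names (Storage, MemoryStorage, FileStorage, save_terms, load_terms) are hypothetical and chosen for this illustration only; they are not Lucene's classes.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Storage(ABC):
    """Abstract storage layer; the index-side code depends only on this."""

    @abstractmethod
    def write(self, name, data): ...

    @abstractmethod
    def read(self, name): ...


class MemoryStorage(Storage):
    """Implementation that keeps named byte blobs in a dictionary."""

    def __init__(self):
        self._files = {}

    def write(self, name, data):
        self._files[name] = bytes(data)

    def read(self, name):
        return self._files[name]


class FileStorage(Storage):
    """Implementation that keeps each named blob as a file under a directory."""

    def __init__(self, root):
        self._root = root

    def write(self, name, data):
        with open(os.path.join(self._root, name), "wb") as f:
            f.write(data)

    def read(self, name):
        with open(os.path.join(self._root, name), "rb") as f:
            return f.read()


def save_terms(storage, terms):
    """Index-side code: writes sorted terms without knowing where they end up."""
    storage.write("terms", "\n".join(sorted(terms)).encode("utf-8"))


def load_terms(storage):
    """Read the terms back through the same abstract interface."""
    return storage.read("terms").decode("utf-8").split("\n")


memory = MemoryStorage()
save_terms(memory, ["search", "index"])
memory_terms = load_terms(memory)

with tempfile.TemporaryDirectory() as root:
    disk = FileStorage(root)
    save_terms(disk, ["search", "index"])
    disk_terms = load_terms(disk)
```

Because save_terms and load_terms only see the abstract Storage interface, swapping the in-memory implementation for the file-based one requires no change to the index-side code, which is the low coupling the paragraph above describes.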

 

The following describes the structure of the Lucene system and provides the system structure and source code organization diagram:

We can clearly see that Lucene's system consists of three parts: basic structure encapsulation, the index core, and the external interfaces. The index core, which operates directly on the index files, is the focus of the system. Lucene divides all of its source code into seven modules (each represented by a package in Java), and the diagram shows which part of the system each module belongs to. Note that org.apache.lucene.queryParser exists as the syntax parser for org.apache.lucene.search and is not actually called from outside the system. It is therefore not regarded as an external interface, but is treated as an independent module.

 

From an object-oriented perspective, Lucene applies one of the most basic design principles: introduce an additional layer of abstraction to reduce coupling. First, org.apache.lucene.store provides the basic structure encapsulation; the index core is then built on top of it in org.apache.lucene.index; and, on top of the index core, the external interfaces org.apache.lucene.search and org.apache.lucene.analysis are designed. Lucene applies this principle throughout, down to details such as common data structures and algorithms. Backed by this thoroughly object-oriented design, Lucene is easy to understand and extend.

 

Lucene also introduces an application structure that differs from the traditional client/server model. It can be embedded in an application as a runtime library rather than running as a separate index server. This follows naturally from Lucene being open source, but it also reflects Lucene's original intent: to provide a full-text index engine architecture rather than a finished implementation.

 

Lucene.Net

Lucene.Net project address: http://lucene.apache.org/lucene.net/

Lucene.Net API documentation: http://lucene.apache.org/lucene.net/docs/

Lucene.Net download: https://svn.apache.org/repos/asf/lucene/lucene.net/tags/
