Core classes of Apache Lucene

Last Update:2018-12-07 Source: Internet

Author: User

The following describes the core classes of Lucene (based on Lucene in Action).
They fall into two groups: the core indexing classes and the core search classes.

IndexWriter: Gives you write access to an index but does not let you read or search it. It is the class that creates an index and adds documents to it.

Directory: The Directory class represents the location of a Lucene index. It is an abstract class whose subclasses (two of which ship with Lucene) store the index as they see fit. In our Indexer example, we use the path of an actual file system directory to obtain a Directory instance for the IndexWriter constructor. IndexWriter then uses a concrete Directory implementation, FSDirectory, and creates the index in a directory of the file system.

If your application stores its Lucene index on disk, use FSDirectory, a Directory subclass that maintains a list of real files in the file system, as we do in Indexer.

The other concrete Directory subclass is RAMDirectory. It offers the same interface as FSDirectory but keeps all its data in memory. This implementation is therefore useful for small indexes that fit entirely in memory and can be discarded when the program exits. Because the data lives in fast memory rather than on a slower hard disk, RAMDirectory suits situations that call for very quick access to the index, whether during indexing or searching. For example, Lucene's developers use it extensively in their unit tests: when a test runs, a fast in-memory index is created or searched, and when the test completes the index is destroyed automatically, leaving no residue on disk. Of course, the performance difference between RAMDirectory and FSDirectory is smaller when the operating system caches files in memory.
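
The split between one abstract storage class and two concrete implementations can be pictured with a small sketch. This is a toy model, not Lucene's API: the class names ToyDirectory, ToyRamDirectory and ToyFsDirectory and their two methods are invented for illustration, and the real Directory class has a much richer interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the Directory abstraction; NOT Lucene's API.
abstract class ToyDirectory {
    abstract void writeFile(String name, byte[] data);
    abstract byte[] readFile(String name);
}

// Keeps every "index file" in a map; the contents vanish when the program exits.
final class ToyRamDirectory extends ToyDirectory {
    private final Map<String, byte[]> files = new HashMap<>();

    @Override
    void writeFile(String name, byte[] data) {
        files.put(name, data.clone());
    }

    @Override
    byte[] readFile(String name) {
        byte[] data = files.get(name);
        if (data == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return data.clone();
    }
}

// Stores each "index file" as a real file under a root folder on disk.
final class ToyFsDirectory extends ToyDirectory {
    private final Path root;

    ToyFsDirectory(Path root) {
        this.root = root;
    }

    @Override
    void writeFile(String name, byte[] data) {
        try {
            Files.write(root.resolve(name), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    byte[] readFile(String name) {
        try {
            return Files.readAllBytes(root.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Code written against ToyDirectory works unchanged with either subclass, which is the property that lets tests swap an on-disk index for an in-memory one.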

Analyzer: Analyzes text content before it is indexed and extracts the tokens (keywords) to be indexed.
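
To make the idea of "extracting keywords" concrete, here is a minimal sketch of what an analysis step might do. It is not a Lucene class: ToyAnalyzer, its analyze method and the stop-word list are assumptions made up for this example, and real analyzers are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: NOT a Lucene class, only an illustration of the idea.
final class ToyAnalyzer {
    // Hypothetical stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to");

    // Splits text on runs of non-alphanumeric characters, lowercases each
    // piece, and drops empty pieces and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            String token = piece.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With this sketch, analyzing "The core classes of Apache Lucene" yields the tokens core, classes, apache and lucene.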

Document: A Document represents a collection of fields. You can think of it as a virtual document: a chunk of data, such as a web page, an email message, or a text file, that you want to be able to retrieve later. The fields of a document represent the document itself or metadata associated with it.

Field: Each Document in an index contains one or more named fields, embodied in the Field class. Each field corresponds to a piece of data that is either queried against or retrieved from the index during search.
Lucene provides four different field types to choose from:

Keyword: not analyzed, but indexed and stored in the index verbatim. This type suits fields whose original value must be preserved in its entirety, such as URLs, file system paths, dates, personal names, social security numbers, telephone numbers, and so on. For example, in Indexer (Listing 1.1) we use the file system path as a Keyword field.

UnIndexed: neither analyzed nor indexed, but its value is stored in the index as is. This type suits fields that need to be displayed with the search results (such as a URL or a database primary key) but whose values you never search directly. Because the original value of such a field is stored in the index, this type is not suitable for relatively large values if index size is a concern.

UnStored: the opposite of UnIndexed. This field type is analyzed and indexed but not stored in the index. It suits indexing a large amount of text that does not need to be retrieved in its original form, such as the body of a web page or any other type of text document.

Text: analyzed and indexed. Fields of this type can be searched, but be careful with the field size. If the data to be indexed is a String, it is also stored; if the data comes from a Reader (as in our Indexer example), it is not stored. This is a common source of confusion, so keep the difference in mind when using Field.Text.

All fields consist of a name and a value. Which type to use depends on how you want to use the field and its value. Strictly speaking, Lucene has a single Field type: fields are distinguished from one another by their characteristics. Some are analyzed and some are not; some are indexed, and some are stored verbatim.
Note the difference between Field.Text(String, String) and Field.Text(String, Reader): the String variant stores the field data, while the Reader variant does not. To index a String without storing it, use Field.UnStored(String, String).
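
Since the field types differ only in whether a value is analyzed, indexed and stored, they can be summarized as combinations of three flags. The enum below is a toy model of that summary, not Lucene's Field class: the name ToyFieldKind and its helper methods are invented for illustration, and the flag values simply restate the descriptions above.

```java
// Toy model of the field types described above; NOT Lucene's Field class.
enum ToyFieldKind {
    //           analyzed, indexed, stored
    KEYWORD     (false,    true,    true),
    UNINDEXED   (false,    false,   true),
    UNSTORED    (true,     true,    false),
    TEXT_STRING (true,     true,    true),   // Text built from a String
    TEXT_READER (true,     true,    false);  // Text built from a Reader

    final boolean analyzed;
    final boolean indexed;
    final boolean stored;

    ToyFieldKind(boolean analyzed, boolean indexed, boolean stored) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }

    // A field can be matched by a query only if it is indexed.
    boolean searchable() {
        return indexed;
    }

    // The original value can be shown with results only if it is stored.
    boolean retrievable() {
        return stored;
    }
}
```

Read this way, a Text field built from a Reader carries the same flags as an UnStored field, which is the point of the note about the two Text variants.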

The core search classes are as follows:
IndexSearcher: IndexSearcher is to searching what IndexWriter is to indexing: the central link to the index, exposing several search methods. You can think of IndexSearcher as a class that opens an index in read-only mode. It provides several search methods, some of which are implemented in its abstract base class Searcher. The simplest one takes a single Query object as a parameter and returns a Hits object. A typical use of this method looks like this:
// Open the existing index under /tmp/index (false: do not create a new one)
IndexSearcher is = new IndexSearcher(
    FSDirectory.getDirectory("/tmp/index", false));
// Match documents whose "contents" field contains the term "lucene"
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);

Term:
A Term is the basic unit of searching. Like the Field object, it consists of a pair of string elements: the field name and the field value. Term objects are also involved in the indexing process, but there they are created internally by Lucene, so you generally do not need to think about them. When searching, you can construct a Term object and use it together with TermQuery:
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
This code tells Lucene to find all documents that contain the word lucene in the contents field. Because TermQuery derives from the abstract parent class Query, you can use the Query type on the left side of the assignment.
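
Why a (field name, value) pair is a natural unit of search becomes clearer with a toy inverted index, in which each such pair maps directly to the documents that contain it. The sketch below is a simplified illustration, not Lucene's implementation: ToyIndex and its methods are hypothetical names, and integer document ids are an assumption of the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: NOT Lucene's implementation, only the lookup idea.
final class ToyIndex {
    // field name -> term text -> ids of the documents containing that term
    private final Map<String, Map<String, SortedSet<Integer>>> postings = new HashMap<>();

    // Records that document docId contains termText in the given field.
    void add(int docId, String field, String termText) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(termText, t -> new TreeSet<>())
                .add(docId);
    }

    // Toy counterpart of a term query: exact match on (field, termText).
    SortedSet<Integer> termQuery(String field, String termText) {
        Map<String, SortedSet<Integer>> byText = postings.get(field);
        if (byText == null || !byText.containsKey(termText)) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(byText.get(termText));
    }
}
```

In this model, a query for the pair ("contents", "lucene") is a single map lookup and never matches the same word stored under a different field such as "title".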

Query
Lucene comes with a number of concrete Query subclasses. So far, we have mentioned only the most basic one, TermQuery. Other Query types include BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery.

TermQuery
TermQuery is the most basic query type supported by Lucene, and one of the primitive query types. It matches documents that contain a field with a specific value.

Hits
The Hits class is a simple container of pointers to ranked search results (the documents that match a given query). For performance reasons, a Hits instance does not load all the documents that match a query from the index, but only a small portion of them at a time.
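
The idea of a result container that fetches documents in small portions rather than all at once can be sketched as follows. This is a toy model, not Lucene's Hits class: ToyHits, its method names, the batch size parameter and the loader function are all assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Toy lazily loading result container; NOT Lucene's Hits class.
final class ToyHits {
    private final int[] docIds;                 // ids of all matching documents
    private final IntFunction<String> loader;   // fetches one document by id
    private final int batchSize;                // hypothetical portion size
    private final List<String> loaded = new ArrayList<>();

    ToyHits(int[] docIds, IntFunction<String> loader, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.docIds = docIds.clone();
        this.loader = loader;
        this.batchSize = batchSize;
    }

    // Total number of matches, known without loading any document.
    int length() {
        return docIds.length;
    }

    // Number of documents actually fetched so far.
    int loadedCount() {
        return loaded.size();
    }

    // Returns the n-th match, loading further batches only when needed.
    String doc(int n) {
        if (n < 0 || n >= docIds.length) {
            throw new IndexOutOfBoundsException("hit " + n);
        }
        while (loaded.size() <= n) {
            int end = Math.min(loaded.size() + batchSize, docIds.length);
            for (int i = loaded.size(); i < end; i++) {
                loaded.add(loader.apply(docIds[i]));
            }
        }
        return loaded.get(n);
    }
}
```

A caller that only displays the first page of results triggers a single batch load, while the total match count stays available from the start.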

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
