Understanding Lucene (2): Understanding the Core Indexing Classes

These notes cover section 1.5 of Lucene in Action. Most of the content is translation and commentary, with the original text and my understanding of it shown side by side.

■ IndexWriter

IndexWriter is the central component of the indexing process. This class creates a new index and adds documents to an existing index. You can think of IndexWriter as an object that gives you write access to the index but doesn't let you read or search it. Despite its name, IndexWriter isn't the only class that's used to modify an index; section 2.2 describes how to use the Lucene API to modify an index.

IndexWriter is the core component of the indexing process. It can create a new index and add documents to an existing one. You can regard it as an object that can only write to the index, not read or search it. Despite its name, IndexWriter is not the only class that can modify an index; section 2.2 describes how to use the Lucene API to modify an index.
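
The "write access only" idea can be illustrated with a small sketch. This is a toy model in plain Java, not Lucene's IndexWriter: the class name ToyIndexWriter, the list used as storage, and the create flag are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a handle that can add documents to an index
// but deliberately offers no way to read or search them.
class ToyIndexWriter {
    private final List<String> index;

    // Hypothetical flag: create == true starts from an empty index,
    // create == false appends to whatever the index already holds.
    ToyIndexWriter(List<String> index, boolean create) {
        if (create) {
            index.clear();
        }
        this.index = index;
    }

    void addDocument(String document) {
        index.add(document);
    }

    // Reports how many documents the index holds; no contents are exposed.
    int docCount() {
        return index.size();
    }

    public static void main(String[] args) {
        List<String> storage = new ArrayList<>();
        ToyIndexWriter first = new ToyIndexWriter(storage, true);
        first.addDocument("doc one");
        ToyIndexWriter second = new ToyIndexWriter(storage, false);
        second.addDocument("doc two");
        System.out.println(second.docCount()); // prints 2
    }
}
```

Reading and searching would live in a separate class, which is the separation the paragraph above describes.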

■ Directory

The Directory class represents the location of a Lucene index. It's an abstract class that allows its subclasses (two of which are supported in Lucene) to store the index as they see fit.

In your applications, you will most likely be storing a Lucene index on a disk. To do so, use FSDirectory, a Directory subclass that maintains a list of real files in the file system, as we did in Indexer.

The other implementation of Directory is a class called RAMDirectory. Because all data is held in the fast-access memory and not on a slower hard disk, RAMDirectory is suitable for situations where you need very quick access to the index, whether during indexing or searching. Of course, the performance difference between RAMDirectory and FSDirectory is less visible when Lucene is used on operating systems that cache files in memory.

Directory describes the location of a Lucene index. It is an abstract class, so its subclasses (Lucene ships with two) can store the index however they see fit.

The FSDirectory subclass stores the index on the hard disk, and the RAMDirectory subclass stores the index in memory.

RAMDirectory is suitable when the index must be accessed very quickly. Lucene's own test cases use RAMDirectory.

Because the operating system may cache files in memory, the performance difference between the two subclasses becomes less noticeable.
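
The relationship between the abstract class and its two storage strategies can be sketched in plain Java. This is a toy model, not Lucene's Directory API: the names ToyDirectory, ToyRamDirectory, ToyFsDirectory, their methods, and the file name "segments" are assumptions made for the illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class ToyDirectoryDemo {
    // Code written against the abstract type works with either storage.
    static String roundTrip(ToyDirectory dir, String name, String content) {
        dir.writeFile(name, content);
        return dir.readFile(name);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new ToyRamDirectory(), "segments", "held in memory"));
        System.out.println(roundTrip(ToyFsDirectory.inTempFolder(), "segments", "held on disk"));
    }
}

// Toy stand-in for the abstract location of an index: callers read and
// write named files without knowing where the bytes actually live.
abstract class ToyDirectory {
    abstract void writeFile(String name, String content);
    abstract String readFile(String name);
}

// Keeps every "file" in a map in memory.
class ToyRamDirectory extends ToyDirectory {
    private final Map<String, String> files = new HashMap<>();

    @Override
    void writeFile(String name, String content) {
        files.put(name, content);
    }

    @Override
    String readFile(String name) {
        String content = files.get(name);
        if (content == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return content;
    }
}

// Keeps every "file" as a real file under one folder on disk.
class ToyFsDirectory extends ToyDirectory {
    private final Path root;

    ToyFsDirectory(Path root) {
        this.root = root;
    }

    // Hypothetical convenience factory that stores files in a fresh temp folder.
    static ToyFsDirectory inTempFolder() {
        try {
            return new ToyFsDirectory(Files.createTempDirectory("toy-index"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    void writeFile(String name, String content) {
        try {
            Files.write(root.resolve(name), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    String readFile(String name) {
        try {
            return new String(Files.readAllBytes(root.resolve(name)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The roundTrip method never needs to know which subclass it was given, which is the point of making the storage location an abstract class.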

■ Analyzer

Before text is indexed, it's passed through an Analyzer. The Analyzer, specified in the IndexWriter constructor, is in charge of extracting tokens out of text to be indexed and eliminating the rest. If the content to be indexed isn't plain text, it should first be converted to it, as depicted in figure 2.1. Chapter 7 shows how to extract text from the most common rich-media document formats. Analyzer is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don't help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren't case-sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design. You'll learn much more about them in chapter 4.

Text is passed through an Analyzer before it is indexed. The Analyzer, which is specified in the IndexWriter constructor, extracts tokens from the text to be indexed and discards the rest. If the content to be indexed is not plain text, it must first be converted to plain text. Chapter 7 describes this in detail.

Analyzer is an abstract class, and Lucene provides several subclasses. For example, some skip stop words that do not help distinguish one document from another, and some convert tokens to lowercase so that searches are case-insensitive.

Analyzers are an important part of Lucene and can be used for much more than simple input filtering.

For developers who integrate Lucene into their own applications, the choice of Analyzer is a crucial design decision.

Chapter 4 covers analyzers in more detail.
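
The two behaviours mentioned above, dropping stop words and lowercasing tokens, can be sketched in a few lines of plain Java. This is a toy model and not one of Lucene's Analyzer implementations: the class name ToyAnalyzer, the tokenizing rule, and the stop list (taken from the examples in the text) are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    // Hypothetical stop list, limited to the example words named in the text.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    // Turns raw text into the tokens that would be indexed.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Lowercase so that later matching is case-insensitive.
            String token = raw.toLowerCase(Locale.ROOT);
            // Skip words that do not help tell documents apart.
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick brown fox, in A box"));
        // prints [quick, brown, fox, box]
    }
}
```

A query would have to be passed through the same steps for a search on "QUICK" to match the indexed token "quick".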

■ Document

A Document represents a collection of fields. You can think of it as a virtual document: a chunk of data, such as a web page, an email message, or a text file, that you want to make retrievable at a later time. Fields of a document represent the document or meta-data associated with that document. The original source (such as a database record, a Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. The meta-data such as author, title, subject, date modified, and so on, are indexed and stored separately as fields of a document.

A Document represents a collection of fields. You can think of it as a virtual document: a piece of data, such as a web page, an email message, or a text file, that you want to be able to retrieve later.

The fields of a document represent the document itself or metadata related to it. The original source of the document data, such as a database record, a Word document, or a chapter of a book, is irrelevant to Lucene. Metadata such as the author, title, subject, and modification date are indexed and stored separately as fields of the document.

Note:

When we refer to a document in this book, we mean a Microsoft Word, RTF, PDF, or other type of a document; we aren't talking about Lucene's Document class. Note the distinction in the case and font.

In our Indexer, we're concerned with indexing text files. So, for each text file we find, we create a new instance of the Document class, populate it with Fields (described next), and add that Document to the index, effectively indexing the file.

Our Indexer deals with indexing text files. For each text file it finds, it creates a Document, populates it with Fields, and adds the Document to the index, which effectively indexes the file.
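
As a rough illustration of "one document per text file, populated with fields", here is a toy model in plain Java. It is not Lucene's Document class: the name ToyDocument, the helper forTextFile, and the field names "path" and "contents" are assumptions made for the sketch, and it simplifies by allowing only one value per field name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: a document is nothing more than a collection of named fields.
class ToyDocument {
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Adds a field and returns this document so calls can be chained.
    ToyDocument add(String name, String value) {
        fields.put(name, value);
        return this;
    }

    // Returns the field's value, or null if the document has no such field.
    String get(String name) {
        return fields.get(name);
    }

    int fieldCount() {
        return fields.size();
    }

    // Hypothetical helper: builds one document for one text file.
    static ToyDocument forTextFile(String path, String contents) {
        return new ToyDocument()
                .add("path", path)          // metadata about the file
                .add("contents", contents); // the text itself
    }

    public static void main(String[] args) {
        ToyDocument doc = forTextFile("/tmp/notes.txt", "remember the milk");
        System.out.println(doc.get("path") + " has " + doc.fieldCount() + " fields");
    }
}
```

Where the data came from (a file, a database row, a web page) leaves no trace in the document; only the fields remain.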

■ Field

Each Document in an index contains one or more named fields, embodied in a class called Field. Each field corresponds to a piece of data that is either queried against or retrieved from the index during search.

Lucene offers four different types of fields from which you can choose:

Each Document in an index contains one or more named fields, which are represented by the Field class.

Each field corresponds to a piece of data that is either queried against or retrieved from the index during a search.

Lucene provides four different field types:

■ Keyword: Isn't analyzed, but is indexed and stored in the index verbatim. This type is suitable for fields whose original value should be preserved in its entirety, such as URLs, file system paths, dates, personal names, Social Security numbers, telephone numbers, and so on. For example, we used the file system path in Indexer (listing 1.1) as a Keyword field.

■ UnIndexed: Is neither analyzed nor indexed, but its value is stored in the index as is. This type is suitable for fields that you need to display with search results (such as a URL or database primary key), but whose values you'll never search directly. Since the original value of a field of this type is stored in the index, this type isn't suitable for storing fields with very large values, if index size is an issue.

The value is neither analyzed nor indexed, but it is stored in the index as is. This type suits fields that are displayed with search results but are never searched directly.

Because the original value is stored in the index, this type is a poor fit for very large values if index size is a concern.

■ UnStored: The opposite of UnIndexed. This field type is analyzed and indexed but isn't stored in the index. It's suitable for indexing a large amount of text that doesn't need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.

The opposite of UnIndexed: the value is analyzed and indexed but not stored. This type suits large amounts of text, such as the body of a document, that do not need to be retrieved in their original form. Because the value is not stored, the original text cannot be recovered from the index.

■ Text: Is analyzed, and is indexed. This implies that fields of this type can be searched against, but be cautious about the field size. If the data indexed is a String, it's also stored; but if the data (as in our Indexer example) is from a Reader, it isn't stored. This is often a source of confusion, so take note of this difference when using Field.Text.

Because the value is analyzed and indexed, fields of this type can be searched, but you need to be careful about the size of the field.

All fields consist of a name and value pair.

Every field consists of a name and a value.
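
The four field types differ only in three yes/no properties, so they can be summarized as a small table in code. This is a toy model in plain Java, not Lucene's Field class: the enum ToyFieldType and its member names are assumptions made for the illustration, and the flag values are taken directly from the descriptions above.

```java
class ToyFieldTypeDemo {
    public static void main(String[] args) {
        // Print one summary line per field type.
        for (ToyFieldType type : ToyFieldType.values()) {
            System.out.println(type
                    + ": analyzed=" + type.analyzed
                    + ", indexed=" + type.indexed
                    + ", stored(String value)=" + type.isStored(false));
        }
    }
}

// Toy model only: each constant records whether the value is
// analyzed, indexed, and stored, as described in the text.
enum ToyFieldType {
    //        analyzed, indexed, stored
    KEYWORD  (false,    true,    true),
    UNINDEXED(false,    false,   true),
    UNSTORED (true,     true,    false),
    TEXT     (true,     true,    true); // stored only for String values

    final boolean analyzed;
    final boolean indexed;
    private final boolean storedForStringValue;

    ToyFieldType(boolean analyzed, boolean indexed, boolean storedForStringValue) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.storedForStringValue = storedForStringValue;
    }

    // A Text value that comes from a Reader is not stored; in this sketch
    // the flag is simply ignored for the other three types.
    boolean isStored(boolean valueComesFromReader) {
        if (this == TEXT && valueComesFromReader) {
            return false;
        }
        return storedForStringValue;
    }
}
```

Read this way, "indexed" decides whether a field can be searched, "stored" decides whether its original value can be shown with results, and "analyzed" decides whether the value is broken into tokens first.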

To tell the four field types apart, we need to be clear on two points:

1. What analyzed, indexed, and stored each mean, and how each relates to search.

2. The difference between a field's name and a field's value.
