"Develop your own search engine" reading notes--indexing of the establishment of __ search engine

Source: Internet
Author: User

Lucene's Document

Document means "document"; in Lucene it represents a logical file. Lucene cannot index physical files directly; it recognizes and processes only objects of the Document type. Sometimes a Document maps to a physical file and stands in for it, but more often a Document has no fixed relationship to any physical file: it serves as a collection of data sources that supplies Lucene with the original text to be indexed. Lucene extracts the data from the Document and processes it according to each field's attribute settings.

In this way, you can also extract data from several different physical files and put it in the same Document.

Moreover, because a Document simply collects data sources, you can build one without any physical file at all: a piece of text, a few numbers, or even some links can serve as its data sources. Once they are added to the Document object, Lucene can index them and users can search for them.

Adding multiple Fields to a Document

For a Document, how is the data it collects represented? In Lucene, each data source is represented by a class called Field, which can be thought of as a field in the database sense.

Typically, you create a Field object directly through one of Field's constructors. A Field identifies the attributes of its data source and stores the content taken from it. When Lucene processes each Field, it takes these attributes into account and handles the data accordingly.

This Document-Field structure resembles a relational database. A table corresponds to an index in Lucene, each record in the table corresponds to a Document, and each column of the table corresponds to a Field. Just as a column has properties such as a character or numeric type, a Field has its own set of attributes.

The attributes of a data source mentioned above are the following:

whether to store it (whether the data from the data source is kept in the index in full);

whether to index it (whether the data can be matched when the user searches; if a data source is not indexed, the user cannot search on it);

whether to tokenize it (tokenizing means splitting the text of the data source into terms according to some set of rules).

Internal implementation of Document

Document is relatively simple to implement. Its main job is to record and manage its Fields so that Lucene can iterate over all of them. Inside Document, the Fields are kept in a list of type Vector.

Document's main functionality is maintaining this internal Field information, including adding, removing, and looking up Fields.
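The add, remove, and look-up behaviour described above can be sketched with a toy class. This is only an illustration in plain Java, not Lucene's Document: the names ToyDocument and ToyField are hypothetical, and an ArrayList stands in for the Vector mentioned in the notes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a Document: an ordered list of named fields.
// ToyDocument and ToyField are hypothetical names, not Lucene classes.
final class ToyDocument {
    // Minimal field: just a name and a string value.
    static final class ToyField {
        final String name;
        final String value;

        ToyField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final List<ToyField> fields = new ArrayList<>();

    // Add: fields keep insertion order; duplicate names are allowed.
    void add(ToyField field) {
        fields.add(field);
    }

    // Remove: drops the first field with the given name, if any.
    void removeField(String name) {
        for (int i = 0; i < fields.size(); i++) {
            if (fields.get(i).name.equals(name)) {
                fields.remove(i);
                return;
            }
        }
    }

    // Look up: value of the first field with the given name, or null.
    String get(String name) {
        for (ToyField f : fields) {
            if (f.name.equals(name)) {
                return f.value;
            }
        }
        return null;
    }

    int size() {
        return fields.size();
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add(new ToyField("title", "Lucene notes"));
        doc.add(new ToyField("body", "some text to index"));
        doc.removeField("body");
        System.out.println(doc.get("title") + ", fields left: " + doc.size());
    }
}
```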

Internal implementation of Field

In Lucene 2.0, Field contains two static inner classes, Store and Index, which describe how the Field is stored and how it is indexed, respectively.

The Store class has three public static members:

Store.NO: the field is not stored;

Store.YES: the field is stored;

Store.COMPRESS: the field's value is stored in compressed form.

The Index class has four public static members:

Index.NO: the field is not indexed (that is, users cannot search on its value);

Index.TOKENIZED: the field is tokenized first and then indexed;

Index.UN_TOKENIZED: the field is not tokenized but is still indexed (that is, users can search on it as a single value);

Index.NO_NORMS: the field is indexed without passing through an Analyzer, and it is excluded from norm-based scoring, mainly to reduce memory consumption.

By combining Store and Index values, you can describe every possible state of a Field.
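To make the combinations concrete, here is a small sketch that enumerates them. The enums below are toy mirrors of the Store and Index constants, not the Lucene classes themselves, and the helper names are hypothetical. The isValid rule encodes one assumption: that a field which is neither stored nor indexed is rejected as meaningless.

```java
// Toy mirror of the Store/Index constants; not the Lucene classes.
final class FieldFlagsSketch {
    enum Store { NO, YES, COMPRESS }
    enum Index { NO, TOKENIZED, UN_TOKENIZED, NO_NORMS }

    // A stored field's original value can be read back later.
    static boolean isRetrievable(Store store) {
        return store != Store.NO;
    }

    // An indexed field can be matched by a search.
    static boolean isSearchable(Index index) {
        return index != Index.NO;
    }

    // Only TOKENIZED passes the value through the analyzer.
    static boolean isAnalyzed(Index index) {
        return index == Index.TOKENIZED;
    }

    // Assumption: a field that is neither stored nor indexed is not allowed.
    static boolean isValid(Store store, Index index) {
        return isRetrievable(store) || isSearchable(index);
    }

    static int countValidCombinations() {
        int count = 0;
        for (Store s : Store.values()) {
            for (Index i : Index.values()) {
                if (isValid(s, i)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("valid combinations: " + countValidCombinations());
    }
}
```

Under that assumption, 11 of the 12 pairings remain usable; a typical choice is a stored, tokenized field for body text and a stored, untokenized field for identifiers.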

How to construct a Field

In Lucene 2.0, the Field class provides five public constructors:

public Field(String name, String value, Store store, Index index)

public Field(String name, String value, Store store, Index index, TermVector termVector)

public Field(String name, Reader reader)

public Field(String name, Reader reader, TermVector termVector)

public Field(String name, byte[] value, Store store)

In every case the name parameter is the name of the Field. There are three ways to supply a Field's value:

directly as a String;

from outside through a Reader;

directly as a binary byte array.

The TermVector parameter indicates whether to store the field's term vector, which is essentially a record of each term and the number of times it occurs.
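As a rough picture of what a term vector records, the sketch below counts how often each term occurs in one field value. It is a plain-Java illustration, not Lucene code: the class name is hypothetical, and splitting on whitespace with lowercasing is an assumed stand-in for a real Analyzer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy term vector: term -> number of occurrences within one field value.
final class TermVectorSketch {
    // Assumption: terms are lowercased, whitespace-separated tokens.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("to be or not to be"));
    }
}
```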

7. IndexWriter is Lucene's indexing tool. Its main role is to create the index, add Documents, merge index segments, and control every other aspect of index writing; it is the main operator on a Lucene index.

8. IndexWriter has three constructors:

public IndexWriter(String path, Analyzer a, boolean create)

public IndexWriter(File path, Analyzer a, boolean create)

public IndexWriter(Directory d, Analyzer a, boolean create)

The first argument is the location where the index is stored: a String is an absolute path, a File wraps an absolute path, and a Directory is Lucene's own representation of a directory.

Analyzer is a very important tool in Lucene. It is responsible for analyzing the various input data sources, including tokenizing and filtering.

The third parameter is a boolean. When it is true, everything in the target directory is deleted and the index is rebuilt at the location given by the first parameter; when it is false, new Documents are appended to the existing index.

9. After adding all the Documents with the addDocument method, be sure to call IndexWriter's close method so that all data in the I/O cache is written to disk and the various streams are closed. Only then is the index complete. If the writer is not closed, you will find nothing in the index directory except a segments file.

10,

11. Segment

Each segment contains many Documents, and one index may contain more than one segment. The segment is the largest unit of index management in Lucene. All index files within a segment share the same prefix.

An index has only one "segments" file, which has no extension; it records how many segments the current index contains and how many documents each segment holds.

12,

The .fdt file is the main file holding the stored data, and the .fdx file simply records the position of each document within the .fdt file so that it can be read quickly later. Note that the .fdt file stores values only for the Fields in a Document that have the Store.YES attribute.
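The relationship between the two files can be pictured with a toy in-memory model: one growing byte buffer for the stored values and a list of start offsets for looking them up by document number. This is a simplified sketch, not the real file format; the class and method names are hypothetical, and each document is assumed to have a single stored value.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model: "data" plays the role of .fdt, "offsets" the role of .fdx.
final class StoredFieldsSketch {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final List<Integer> offsets = new ArrayList<>();

    // Appends the stored value and returns the new document number.
    int addDocument(String storedValue) {
        byte[] bytes = storedValue.getBytes(StandardCharsets.UTF_8);
        offsets.add(data.size());          // where this document starts
        data.write(bytes, 0, bytes.length);
        return offsets.size() - 1;
    }

    // Uses the offset table to read back exactly one document's bytes.
    String storedValue(int docId) {
        int start = offsets.get(docId);
        int end = (docId + 1 < offsets.size()) ? offsets.get(docId + 1) : data.size();
        return new String(data.toByteArray(), start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StoredFieldsSketch store = new StoredFieldsSketch();
        store.addDocument("first stored value");
        int id = store.addDocument("second stored value");
        System.out.println(store.storedValue(id));
    }
}
```

Because every entry in the offset table has the same size, finding document n takes one table lookup rather than a scan of the data.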

13. In the indexing part of Lucene, the invertDocument method calls the underlying analyzer interface to analyze the data source, collects the position and frequency of each term, and puts this information into the postingTable.

Within invertDocument, note the following:

(1) It iterates over every Field that needs to be indexed. For a Field that does not need tokenizing, the entire value of the Field is treated as one big term and put into the postingTable.

(2) For a Field that does need tokenizing, it calls the underlying tokenizing interface to split the text, and then puts each resulting term into the postingTable.
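The two cases above can be sketched as follows. This is a simplified plain-Java model rather than Lucene's implementation: the class, the Posting shape, and the whitespace tokenizer are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy posting table keyed by (field, term); hypothetical names throughout.
final class InvertSketch {
    static final class Posting {
        final String field;
        final String text;
        final List<Integer> positions = new ArrayList<>();

        Posting(String field, String text) {
            this.field = field;
            this.text = text;
        }

        int freq() {
            return positions.size();
        }
    }

    private final Map<String, Posting> postingTable = new HashMap<>();

    private static String key(String field, String text) {
        return field + "\0" + text; // NUL separator keeps field and term apart
    }

    void invertField(String field, String value, boolean tokenized) {
        if (!tokenized) {
            // Case (1): the whole value becomes one big term.
            addPosition(field, value, 0);
            return;
        }
        // Case (2): assumed whitespace tokenizer in place of a real analyzer.
        int position = 0;
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                addPosition(field, token, position++);
            }
        }
    }

    private void addPosition(String field, String text, int position) {
        postingTable.computeIfAbsent(key(field, text), k -> new Posting(field, text))
                    .positions.add(position);
    }

    Posting get(String field, String text) {
        return postingTable.get(key(field, text));
    }

    int size() {
        return postingTable.size();
    }

    public static void main(String[] args) {
        InvertSketch table = new InvertSketch();
        table.invertField("body", "to be or not to be", true);
        System.out.println("freq of 'to': " + table.get("body", "to").freq());
    }
}
```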

14. Sorting the postingTable

After all terms have been added to the postingTable, Lucene first converts it into an array of Posting objects and then sorts the array so that all terms are in dictionary order. The term information can then be written to the .tii and .tis files, and the frequency and position information to the .frq and .prx files. (Lucene uses a quicksort to sort this Posting array.)

Why does Lucene sort the Posting array?

The answer has to do with how index files are stored. The index is the foundation of everything Lucene does; without it, a search engine is meaningless. The efficiency of writing and reading index files is therefore a key factor limiting search engine performance. In general, write efficiency can be somewhat worse than read efficiency, because indexing time has little effect on the user experience, so the index format should be designed to speed up searching. Such a format should keep the index content small and keep it ordered for lookup.
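One benefit of keeping terms in dictionary order is that a term can be located by binary search instead of a full scan. The sketch below illustrates that idea only; it uses the JDK's Arrays.sort rather than the quicksort the notes attribute to Lucene, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Toy term dictionary: sort once at index time, binary-search at query time.
final class SortedTermsSketch {
    // Returns a sorted copy so the caller's array is left untouched.
    static String[] sortTerms(String[] terms) {
        String[] sorted = terms.clone();
        Arrays.sort(sorted); // JDK sort, used here instead of a hand-written quicksort
        return sorted;
    }

    // Index of the term in the sorted array, or -1 if it is absent.
    static int find(String[] sortedTerms, String term) {
        int i = Arrays.binarySearch(sortedTerms, term);
        return i >= 0 ? i : -1;
    }

    public static void main(String[] args) {
        String[] sorted = sortTerms(new String[] {"search", "index", "lucene", "field"});
        System.out.println(Arrays.toString(sorted) + " -> " + find(sorted, "lucene"));
    }
}
```

The sorting cost is paid once while indexing, which matches the trade-off described above: slower writes are acceptable if they make every later lookup cheaper.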

15. Write posting information to index

16. Index file format

(1) Segments of an index

Each segment is a complete, self-contained piece of a Lucene index, and an index usually contains several segments. All files of a segment share one prefix, which is determined by the number of Documents in the current index: the prefix is that number converted to base 36, preceded by an underscore.

Typically a complete index has only one "segments" file, which has no extension and records information about all the segments in the current index.

(2) The .fnm file contains the names of all the Fields in the Documents.

(3) .fdx and .fdt are two files used together: .fdt stores the data of Fields that have the Store.YES attribute, and .fdx is an index that stores the position of each Document within .fdt.

(4) The .tis file stores the terms produced by tokenizing, and .tii is its index file, which records the positions of terms within the .tis file.

(5) In a Lucene index, a deleted document is not removed from the index immediately; its removal is deferred until the index is next merged or optimized, much like the Windows Recycle Bin. This is implemented through the deletable file: a deleted document first leaves a record there, and it is removed from the index only when the physical deletion actually takes place.

(6) IndexWriter has a property, useCompoundFile, that indicates whether to save the index in the compound index format; its default value is true. When an index is very large and consists of many files, opening all those files consumes a lot of system resources, so Lucene provides a format that packs the index into a single file, called the compound index format. To use it, simply call setUseCompoundFile(boolean) with true after creating the IndexWriter object.
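The segment prefix rule from item (1), a number written in base 36 with a leading underscore, is easy to try out. The sketch below is an illustration only: the class and method names are hypothetical, and the number is simply passed in as a parameter.

```java
// Toy helper for the "_" + base-36 naming rule; names are hypothetical.
final class SegmentNameSketch {
    // Character.MAX_RADIX is 36: digits 0-9 followed by letters a-z.
    static String segmentPrefix(int number) {
        return "_" + Integer.toString(number, Character.MAX_RADIX);
    }

    // Every file of a segment shares the prefix and differs only in extension.
    static String fileName(int number, String extension) {
        return segmentPrefix(number) + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(segmentPrefix(36) + " " + fileName(10, "fdt"));
    }
}
```

For example, 10 maps to `_a`, 35 to `_z`, and 36 to `_10`, so prefixes stay short even for large numbers.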

17. Tuning the indexing process

(1) The merge factor, mergeFactor

This factor determines how often segments are merged as addDocument() is called. With a small value, indexing needs less memory and searches on an unoptimized index are faster, but indexing is slower; this suits an index to which documents are added intermittently. A large value suits batch index creation.

For example:

With mergeFactor set to 10, a new segment is created for every 10 Documents added to the index.

When the 10th such segment is created, the ten are merged into a single new segment of 100 Documents.

From then on, every 100 Documents produce one such segment. By the time document No. 999 has been indexed, there should be 9 segments of 100 documents each on disk, while documents No. 901 to 999 have not yet been merged into a 100-document segment, the most recent of them still being held in memory.

If one more Document is added at this point, the first 9 segments are merged with the newly completed 10th segment into one segment of 1000 Documents. The process continues in the same way.

(2) maxMergeDocs

This parameter places a limit on the effect of mergeFactor: it is the maximum number of Documents a single segment may contain.

(3) minMergeDocs

Documents are held in memory before the index is flushed to disk; minMergeDocs limits how many Documents are buffered in memory in this way.
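The interaction of the three parameters can be replayed with a small simulation. This is a simplified model, not Lucene's merge code: the class name is hypothetical, segments are tracked only by their document counts, a flush happens when minMergeDocs documents are buffered, and a merge happens when the newest mergeFactor segments have equal size and the result would not exceed maxMergeDocs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy merge simulation; each list entry is the document count of one segment.
final class MergePolicySketch {
    private final int mergeFactor;
    private final int minMergeDocs;
    private final int maxMergeDocs;
    private final List<Integer> segments = new ArrayList<>();
    private int buffered = 0;

    MergePolicySketch(int mergeFactor, int minMergeDocs, int maxMergeDocs) {
        if (mergeFactor < 2 || minMergeDocs < 1) {
            throw new IllegalArgumentException("mergeFactor >= 2 and minMergeDocs >= 1 required");
        }
        this.mergeFactor = mergeFactor;
        this.minMergeDocs = minMergeDocs;
        this.maxMergeDocs = maxMergeDocs;
    }

    void addDocument() {
        buffered++;
        if (buffered >= minMergeDocs) {
            segments.add(buffered); // flush the in-memory buffer as a new segment
            buffered = 0;
            mergeTail();
        }
    }

    // Repeatedly merge the newest mergeFactor segments while they are equal in size.
    private void mergeTail() {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            for (int i = n - mergeFactor; i < n; i++) {
                if (segments.get(i) != size) {
                    return;
                }
            }
            if ((long) size * mergeFactor > maxMergeDocs) {
                return; // merged segment would exceed the cap
            }
            segments.subList(n - mergeFactor, n).clear();
            segments.add(size * mergeFactor);
        }
    }

    List<Integer> segments() {
        return new ArrayList<>(segments);
    }

    int bufferedDocs() {
        return buffered;
    }

    public static void main(String[] args) {
        MergePolicySketch sim = new MergePolicySketch(10, 10, Integer.MAX_VALUE);
        for (int i = 0; i < 999; i++) {
            sim.addDocument();
        }
        System.out.println(sim.segments() + " buffered=" + sim.bufferedDocs());
        sim.addDocument();
        System.out.println(sim.segments());
    }
}
```

In this model, after 999 additions there are nine 100-document segments, nine 10-document segments, and nine buffered documents; the walkthrough above treats those last 99 documents together as not yet merged. The 1000th addition cascades into a single 1000-document segment, unless maxMergeDocs caps the merge.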

18. Merging and optimizing indexes

(1) The Directory class has two subclasses: RAMDirectory (which lives in memory) and FSDirectory (which is backed by a directory in the file system).

An FSDirectory refers to a path in the file system, so when Lucene writes an index to it, the index goes directly to disk. A RAMDirectory is a region of memory, and its contents disappear when the virtual machine exits. The contents of a RAMDirectory therefore need to be transferred to an FSDirectory if they are to be kept.

A RAMDirectory instance can be created simply with its constructor. An FSDirectory instance must be obtained through a static method that takes two parameters: the first is the file-system path where the index is stored, and the second is a boolean indicating whether to clear everything already in that directory.

(2) Using IndexWriter to merge indexes

Lucene's IndexWriter class provides an interface for merging different indexes. It does not merge specific file-system directories but rather objects of the Directory type. This means that indexes stored under different file-system paths can be merged, and an in-memory index can be merged with one in the file system, which is how an index held in a RAMDirectory is saved.

When merging an in-memory index, it is important to close the corresponding IndexWriter so that Documents still sitting in its cache are flushed to the RAMDirectory. The same holds for FSDirectory: if you do not call close on the IndexWriter, you will find that the index files have not actually been written to the directory.

(3) Optimizing an index

IndexWriter's optimize method optimizes all the segments in the index directory of the current IndexWriter and in the cache directory it uses, merging them into one complete segment, so that only one file prefix appears in the whole index directory.

19. Deleting documents from an index

(1) Lucene's index package contains a very important tool, IndexReader. It is mainly responsible for reading and maintaining the index: opening an index, getting a document from it, getting the total number of documents in it, and even deleting a document from it.

(2) Deleting a specific document by its document ID

The method is IndexReader.deleteDocument(int id). Lucene manages document deletion with a mechanism similar to a recycle bin. A document removed from the index is in effect thrown into the recycle bin rather than actually deleted. If the reader's close() method is not called to write the deletion information to disk, the index reverts to its original state once the process exits.

IndexReader's undeleteAll() method reverses the deletions (the equivalent of restoring items from the recycle bin).

To delete the marked documents physically, you only need to optimize the index once with IndexWriter; Lucene then reassigns ID values to the remaining documents and the documents marked as deleted are physically removed (the equivalent of emptying the recycle bin).

(3) Deleting documents in bulk by field information

IndexReader's deleteDocuments(Term) method deletes documents in bulk: it removes every document that matches the given term. The Term class represents a term as a <field, value> pair.
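The recycle-bin behaviour described in this section can be modelled with a bit set of deletion marks. The sketch below is a toy in plain Java, not Lucene's IndexReader or IndexWriter: the class name is hypothetical, each document is reduced to a single string value, and deleting by term is simplified to an exact match on that value.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy index: deletions are only marks until optimize() compacts the list.
final class DeletionSketch {
    private List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    int addDocument(String value) {
        docs.add(value);
        return docs.size() - 1; // the document's ID is its position
    }

    // Delete by ID: only sets a mark ("moves to the recycle bin").
    void deleteDocument(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such document: " + docId);
        }
        deleted.set(docId);
    }

    // Bulk delete: marks every live document whose value equals the term.
    int deleteDocuments(String term) {
        int count = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i).equals(term)) {
                deleted.set(i);
                count++;
            }
        }
        return count;
    }

    // Restore everything from the recycle bin.
    void undeleteAll() {
        deleted.clear();
    }

    // Physically drop marked documents; survivors get new consecutive IDs.
    void optimize() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        docs = live;
        deleted.clear();
    }

    int maxDoc() {
        return docs.size(); // counts marked documents too
    }

    int numDocs() {
        return docs.size() - deleted.cardinality(); // live documents only
    }

    public static void main(String[] args) {
        DeletionSketch index = new DeletionSketch();
        index.addDocument("a");
        index.addDocument("b");
        index.deleteDocument(0);
        System.out.println(index.numDocs() + " live of " + index.maxDoc());
        index.optimize();
        System.out.println(index.numDocs() + " live of " + index.maxDoc());
    }
}
```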

20. Synchronization in Lucene

(1) In Lucene, the classes that modify an index are mainly IndexWriter (responsible for writing the index and for overall maintenance such as merging and optimizing) and IndexReader (responsible for deleting documents from the index).

At any one time, only one IndexWriter instance in the system may operate on an index; multiple IndexWriters are not allowed to add Documents to it, optimize it, or merge its segments simultaneously.

At any one time, there must not be more than one IndexReader deleting documents. The next IndexReader may start deleting only after the previous IndexReader's close method has completed.

Before using an IndexWriter to add Documents to an index, you must first close the IndexReader instance that performed deletions.

Before using an IndexReader to delete documents, you must first close the IndexWriter instance that performed the add operations.

(2) Locks in Lucene

write.lock and commit.lock

21. IndexModifier combines most of the features of IndexWriter with the document-deletion ability of IndexReader.
