Basics: Understanding Lucene's index document model from a conceptual perspective

Source: Internet
Author: User
Tags: unique id

Ext.: http://blog.csdn.net/duck_genuine/article/details/6053430


Lucene's document model has two main classes: Document and Field. A document may contain several fields.

Each field carries its own options:

1. Whether to index. If a field is indexed, it is analyzed first, so what enters the index is the analyzed terms, not the original text.

2. If indexed, you can additionally choose to save term vectors, which support similarity searches.

3. Whether to store (Store). A stored field keeps a verbatim copy of the original text, without analyzing it, so it can be returned along with search results.

Lucene's document model resembles a database, but is not quite the same; the differences show up in the following areas:

1. No fixed schema: there are no columns, and documents added to the same index may contain different fields.

2. Not normalized: the document model is a flat structure, with no recursive definitions, joins, or other complex structure.

2.2 Understanding the indexing process

Overall, the indexing process is:

1. Extract text: extract content from the source and create Document and Field objects. Tika provides text extraction for non-text formats such as PDF and Word.

2. Analysis: each field of the document is first tokenized into a token stream, which then passes through a chain of filters (lowercasing, and so on).

3. Indexing: the document is written to the index via IndexWriter.addDocument. Lucene uses an inverted index, which answers "which documents contain word X?" rather than "which words does this document contain?"
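The inverted-index idea can be sketched in plain Java. This is only an illustration of the concept, not Lucene's actual data structures: each term maps to the set of ids of documents containing it.

```java
import java.util.*;

public class InvertedIndexSketch {
    // Build a map from term to the ids of documents containing that term.
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // crude whitespace "analysis", lowercased
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Lucene in Action", "Lucene for search", "Plain text");
        Map<String, Set<Integer>> postings = index(docs);
        // "which documents contain the word X?" becomes a single lookup
        System.out.println(postings.get("lucene")); // [0, 1]
    }
}
```

Answering the reverse question ("which words does document 1 contain?") would require scanning the whole map, which is exactly why the inverted orientation is chosen.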

Index file composition

To ensure efficiency, each index consists of several segments:

_X.cfs: each segment's files, where X is 0, 1, 2, .... If useCompoundFile is enabled, each segment's files are combined into a single .cfs file.

segments_N: records which files belong to each segment.

From time to time, as IndexWriter is used, these segments are merged automatically.

2.3 Basic operation of the index

First, create the IndexWriter:

IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);

dir is the directory where the index is saved, WhitespaceAnalyzer tokenizes on whitespace, and the last argument caps how many terms are indexed per field.

Then create Document and Field objects:

Document doc = new Document();

doc.add(new Field(key, value, store?, index?));

key is the field name to search on; value is the text to be written or analyzed.

The store option controls whether, independent of indexing, a verbatim copy of the original text is additionally stored so it can be returned with search results: Store.NO stores nothing extra; Store.YES stores the original text.

The index option: Index.NO means no indexing; Index.ANALYZED analyzes the text and then indexes it; Index.NOT_ANALYZED indexes without analysis; Index.ANALYZED_NO_NORMS analyzes and indexes but stores no norms; Index.NOT_ANALYZED_NO_NORMS indexes without analysis and stores no norms. With anything other than Index.NO, the field is indexed and searchable. Norms store the information needed for boosting, which can consume considerable memory.
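Putting these options together, here is a sketch of typical field choices, using the Lucene 3.x-era API this article describes (the `id` and `contents` field names are illustrative assumptions):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldOptionsSketch {
    static Document makeDoc(String id, String contents) {
        Document doc = new Document();
        // unique id: stored for retrieval, indexed as a single token for exact lookup
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // body text: analyzed for full-text search, not stored, to save space
        doc.add(new Field("contents", contents, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}
```

This sketch assumes Lucene 3.x on the classpath; later Lucene versions replaced these Field constructors with typed field classes.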

Deleting documents

IndexWriter provides methods for deleting documents:

deleteDocuments(Term)

deleteDocuments(Term[])

deleteDocuments(Query)

deleteDocuments(Query[])

Note that a Term is not necessarily unique, so a single call may delete multiple documents by mistake. It is best to delete by a unique, unanalyzed term (such as a unique ID) to prevent confusion.

Deletions are actually written to the index files only after commit() or close().

After deletion, documents are only marked as deleted: maxDoc() returns the count of all documents (including deleted but not yet cleaned up), while numDocs() returns the count of documents not deleted.

After deleting, optimize() compacts away the space held by documents that were only marked as deleted; commit afterwards to persist the reclaimed space.

Updating documents

updateDocument(Term, Document): Lucene only supports whole-document replacement; the entire document is replaced, and individual fields cannot be updated in place.
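A hedged sketch of replacing a document via a unique id field (this assumes, as recommended above, an `id` field indexed as a single unanalyzed token; the method name `replace` is ours):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateSketch {
    static void replace(IndexWriter writer, String id, Document newDoc) throws Exception {
        // updateDocument = deleteDocuments(term) followed by addDocument(newDoc)
        writer.updateDocument(new Term("id", id), newDoc);
        writer.commit(); // the change is not visible until commit/close
    }
}
```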

2.4 Field options

The options fall into three categories: indexing, storing, and term vectors.

Index options

Index.ANALYZED: analyze, then index.

Index.NOT_ANALYZED: index directly without analysis; useful for exact-match fields such as URLs and filesystem paths.

Index.ANALYZED_NO_NORMS: like Index.ANALYZED, but norms are not stored; saves memory but does not support index-time boosting.

Index.NOT_ANALYZED_NO_NORMS: like Index.NOT_ANALYZED, but norms are not stored; saves memory, does not support index-time boosting, and is commonly used.

Index.NO: not indexed at all, so the field cannot be searched.

By default, Lucene stores each term's frequency and positions. This can be disabled with Field.setOmitTermFreqAndPositions(true), but doing so breaks PhraseQuery and SpanQuery.

Store options

Store.YES: store the original value, so it can be retrieved with search results.

Store.NO: do not store the original value; it cannot be recovered after retrieval.

CompressionTools can be used to compress and decompress byte arrays before storing them.

Term vector options

Term vectors mainly support similarity searches, for example finding other documents like the one that matched "cat".

TermVector.YES: record the term vector.

TermVector.WITH_POSITIONS: record the term vector plus each term's positions.

TermVector.WITH_OFFSETS: record the term vector plus each term's character offsets.

TermVector.WITH_POSITIONS_OFFSETS: record the term vector plus positions and offsets.

TermVector.NO: do not store term vectors.

If the index option is Index.NO, the term vector option must be TermVector.NO.

Using a type other than String as the data source for a Field

Reader: cannot be stored; the resulting token stream is always analyzed and indexed.

TokenStream: the output of an analyzer used directly as the source; cannot be stored, always analyzed and indexed.

byte[]: cannot be indexed and has no term vector; must be Store.YES.

Options related to sorting

Numeric fields can use NumericField. A text field used for sorting must be Field.Index.NOT_ANALYZED; that is, the field must contain only a single token to be sortable.

Multi-valued fields

For example, what if a book has multiple authors?

One approach is to add multiple fields with the same key and different values:

Document doc = new Document();
for (int i = 0; i < authors.length; i++) {
    doc.add(new Field("author", authors[i],
                      Field.Store.YES,
                      Field.Index.ANALYZED));
}

Another approach is presented in Chapter 4.

2.5 Boosting

Boosting influences the ranking of search results.

Boosting can be done at index time or at search time; search-time boosting is more flexible and can be tuned independently of the index, but consumes more CPU.

Boosting documents

With index-time boosting, the boost is stored in the norms. By default every document has an equal boost of 1.0, which can be changed manually per document.

document.setBoost(float boost), where boost is a multiple of the default 1.0.

Boosting fields

Fields can be boosted as well. Boosting a document effectively applies the same boost to each of its fields.

To boost a field independently:

field.setBoost(float)

Note: Lucene's ranking algorithm combines many factors; boost is only one of them, not the deciding one.

Norms

The boost values are stored in the norms, which can cause searches to consume a lot of memory. Norms can therefore be turned off:

Use a NO_NORMS index option, or call Field.setOmitNorms(true) on the field.
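The document- and field-level boosts above can be sketched as follows (the field names and boost values are illustrative assumptions, on the Lucene 3.x-era API):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostSketch {
    static Document importantDoc(String title, String body) {
        Document doc = new Document();
        Field titleField = new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);
        titleField.setBoost(1.5f);  // this field matters more than the default 1.0
        doc.add(titleField);
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        doc.setBoost(2.0f);         // rank the whole document higher; recorded via norms
        return doc;
    }
}
```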

2.6 Indexing numbers, dates, and times

Indexing numbers

There are two scenarios:

1. The number is embedded in text, such as "Be sure to include Form 1099 in your...", and you want the token 1099 to be searchable. Choose an analyzer that does not discard numbers, such as WhitespaceAnalyzer or StandardAnalyzer. SimpleAnalyzer and StopAnalyzer discard numbers, so 1099 could not be found.

2. The number is in its own field. As of 2.9, Lucene supports numeric types via NumericField: doc.add(new NumericField("price").setDoubleValue(19.99)); numeric fields are then stored using a trie structure.

Multiple NumericField values with the same name can be added to one document; they are supported by NumericRangeQuery and NumericRangeFilter, but not by sorting. To sort on the field, add a NumericField with a single unique value.

precisionStep controls the trie granularity: smaller values index more terms per value, enlarging the index but speeding up range queries.
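The trie idea behind precisionStep can be illustrated in plain Java. This is a conceptual sketch, not Lucene's actual encoding: each value is also indexed at coarser precisions by shifting away low-order bits, so a range query can match a few coarse terms instead of enumerating every exact value.

```java
import java.util.ArrayList;
import java.util.List;

public class TriePrecisionSketch {
    // Index a value at progressively coarser precisions by shifting away
    // low-order bits; each entry stands for one indexed term.
    static List<String> terms(long value, int precisionStep) {
        List<String> out = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            out.add(shift + ":" + (value >>> shift));
        }
        return out;
    }

    public static void main(String[] args) {
        // a smaller precisionStep means more terms per value:
        // a larger index, but faster range queries
        System.out.println(terms(19990101L, 16).size()); // 4
        System.out.println(terms(19990101L, 4).size());  // 16
    }
}
```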

Indexing dates and times

Convert the date to a timestamp (a long integer), then follow the NumericField approach.

Or, if millisecond accuracy is not needed, convert it to a coarser unit:

doc.add(new NumericField("day").setIntValue((int) (new Date().getTime() / 24 / 3600)));

You can even index just a part of the date, such as the day of the month, rather than a specific time:

Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth")
    .setIntValue(cal.get(Calendar.DAY_OF_MONTH)));

2.7 Field truncation

Lucene supports truncating fields. IndexWriter.MaxFieldLength represents the maximum field length; the default is MaxFieldLength.UNLIMITED, i.e. no limit.

MaxFieldLength.LIMITED imposes a limit, which can be set via setMaxFieldLength(int n).

With this setting, only the first n terms of a field are indexed.

Detailed log information can be obtained via setInfoStream(System.out).

2.8 Real-time search

Lucene 2.9 supports near-real-time search, i.e. a fast index-then-search cycle.

IndexReader reader = indexWriter.getReader();

This method immediately flushes the writer's buffered changes and returns an IndexReader that can search them right away.
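The near-real-time pattern above can be sketched as follows (Lucene 2.9-era API; the helper name `freshSearcher` is ours):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class NrtSketch {
    static IndexSearcher freshSearcher(IndexWriter writer) throws Exception {
        // flushes pending changes and returns a reader that sees them,
        // without the cost of a full commit plus reopening from disk
        IndexReader reader = writer.getReader();
        return new IndexSearcher(reader);
    }
}
```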

2.9 Optimizing indexes

Optimizing the index improves search speed, not indexing speed. It means merging many small index files into a few.

The IndexWriter provides several optimization methods:

optimize(): merges the index down to a single segment; does not return until finished. Very resource-intensive.

optimize(int maxNumSegments): partial optimization; merges down to at most maxNumSegments segments (for example 5), moderating the extreme behavior above.

optimize(boolean doWait): like optimize(), but if doWait is false it returns immediately while merging continues in the background.

optimize(int maxNumSegments, boolean doWait): like optimize(int maxNumSegments), but can return immediately.

Optimization also requires substantial extra disk space while running: old, abandoned segments are not removed until IndexWriter.commit().

2.10 Directory

Directory encapsulates the storage API, providing an abstract interface with the following implementations:

SimpleFSDirectory: stores the index on local disk using java.io; it does not scale well with many threads, because concurrent reads are internally serialized.

NIOFSDirectory: uses java.nio, is thread-safe, and scales well across threads, but suffers from a JVM bug on Windows.

MMapDirectory: memory-mapped storage (maps index files into memory, mmap-style).

RAMDirectory: holds the entire index in memory.

FileSwitchDirectory: uses two directories, switching between them by file type.

FSDirectory.open automatically picks an appropriate Directory implementation. You can also choose one yourself:

Directory ramDir = new RAMDirectory();
IndexWriter writer = new IndexWriter(ramDir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

RAMDirectory suits indexes small enough to fit in memory.

An existing index can be copied into RAM for speed:

Directory ramDir = new RAMDirectory(otherDir);

Or

Directory.copy(Directory sourceDir,
               Directory destDir,
               boolean closeDirSrc);

2.11 Thread safety and locking

Thread and multi-JVM safety

Any number of IndexReaders can be open simultaneously, even across JVMs.

Only one IndexWriter can be open at a time; it holds an exclusive write lock and has a built-in thread-safety mechanism.

IndexReaders may be open while an IndexWriter is open.

Multiple threads can share an IndexReader or IndexWriter; both are thread-safe, with built-in synchronization and good performance.

Sharing an IndexWriter via a remote file system

Be careful not to open and close it repeatedly, which hurts performance.

Index locks

The lock takes the form of a file lock named write.lock.

Creating an IndexWriter on an index that is already locked raises a LockObtainFailedException.

Other locking implementations are supported, but there is usually no need to change them.

IndexWriter.isLocked(Directory): checks whether a directory is locked.

IndexWriter.unlock(Directory): unlocks a directory. Dangerous!

Attention: whatever operations an IndexWriter performs, you must call close() explicitly; the lock is not released automatically!
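The write.lock mechanism is an ordinary file lock. Its exclusivity can be demonstrated with plain java.nio file locking (a conceptual illustration, not Lucene's LockFactory code):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

public class WriteLockSketch {
    // true if a second attempt to lock the same file is rejected while the first holds it
    static boolean secondWriterRejected(File lockFile) throws Exception {
        try (FileChannel first = new RandomAccessFile(lockFile, "rw").getChannel();
             FileChannel second = new RandomAccessFile(lockFile, "rw").getChannel()) {
            FileLock held = first.tryLock();        // the "IndexWriter" takes write.lock
            try {
                FileLock other = second.tryLock();  // a second "writer" is refused
                return other == null;
            } catch (OverlappingFileLockException e) {
                return true;                        // same-JVM contention surfaces as an exception
            } finally {
                held.release();                     // like IndexWriter.close(): must be explicit
            }
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("write", ".lock");
        System.out.println(secondWriterRejected(f)); // true
        f.delete();
    }
}
```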

2.12 Debugging indexing

2.14 Advanced indexing options

IndexReader can also be used to delete documents from the index; its advantages are:

1. It deletes by specific document number, which is more precise; IndexWriter cannot.

2. Deletions through IndexReader take effect immediately, whereas IndexWriter deletions are only visible after reopening.

3. IndexReader offers undeleteAll(), which can recover all deleted documents (only effective for segments not yet merged).

Reclaiming space after deletions

Call expungeDeletes() to explicitly free space: it merges the segments involved, releasing the space that was only marked as deleted.

Buffering and flushing

When documents are added or deleted, the changes are first buffered in memory to reduce disk I/O; Lucene periodically flushes the buffered changes to the Directory, forming a segment.

IndexWriter flushes the buffer when:

The buffered data exceeds the size set by setRAMBufferSizeMB.

The number of buffered documents exceeds the value set by setMaxBufferedDocs.

The number of buffered delete terms exceeds the value set by setMaxBufferedDeleteTerms.

When any of these conditions triggers a flush, a new segment is written, but the change only becomes a visible, durable part of the index after commit.
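The flush triggers above are set on the writer; a configuration sketch on the Lucene 3.x-era API (the specific values are illustrative, not recommendations):

```java
import org.apache.lucene.index.IndexWriter;

public class FlushConfigSketch {
    static void configure(IndexWriter writer) {
        writer.setRAMBufferSizeMB(32.0);        // flush when buffered changes exceed 32 MB
        writer.setMaxBufferedDocs(1000);        // ...or after 1000 buffered documents
        writer.setMaxBufferedDeleteTerms(500);  // ...or after 500 buffered delete terms
    }
}
```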

Committing the index

Commit persists changes to the index. Only after commit() is called can a newly opened IndexReader or IndexSearcher see the results of the most recent changes.

close() also calls commit() indirectly.

The counterpart of commit is rollback(), which discards all changes since the last commit.

Commit is expensive and should not be called frequently.

"Double buffering" commit

GUI development often uses double buffering: one buffer is being redrawn while the other is displayed, and the two are swapped. Lucene supports a similar two-phase mechanism.

Lucene exposes two calls:

prepareCommit()

commit()

prepareCommit() is the slow step; calling commit() after prepareCommit() is very fast.
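A sketch of the two-phase pattern, for example to coordinate a Lucene commit with some external resource (the helper name and the Runnable parameter are illustrative assumptions; error handling is minimal):

```java
import org.apache.lucene.index.IndexWriter;

public class TwoPhaseCommitSketch {
    static void commitWith(IndexWriter writer, Runnable otherResourceCommit) throws Exception {
        try {
            writer.prepareCommit();    // slow phase: flush and sync, but not yet visible
            otherResourceCommit.run(); // commit the other resource in between
            writer.commit();           // fast phase: make the prepared commit visible
        } catch (Exception e) {
            writer.rollback();         // discard everything since the last commit
            throw e;
        }
    }
}
```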

Deletion policy

IndexDeletionPolicy determines the deletion policy: you can decide whether earlier commit points are kept.

Lucene's ACID transaction support

This is achieved chiefly by allowing only one open IndexWriter at a time.

If the JVM, OS, or machine crashes, Lucene automatically reverts to the last committed state.

Merging segments

When an index accumulates too many segments, they are merged. Benefits:

1. Fewer segment files.

2. A smaller index on disk.

MergePolicy decides when a merge should be executed.

MergePolicy

It selects the files to merge; two policies are available by default:

LogByteSizeMergePolicy: decides whether to merge based on index size.

LogDocMergePolicy: decides whether to merge based on document count.

They are configured via setMergeFactor and setMaxMergeDocs respectively; see the API for the specific parameters.

MergeScheduler

Decides how merges are executed:

ConcurrentMergeScheduler: merges in additional background threads; waitForMerges() blocks until merging finishes.

SerialMergeScheduler: merges serially on the calling thread during addDocument.

