Reposted from: http://blog.csdn.net/duck_genuine/article/details/6053430
Lucene's document model has two main classes: Document and Field; a Document may contain several Fields.
Each Field can follow a different strategy:
1. Indexed or not. If indexed, the field's value is analyzed (tokenized) first, so what enters the index is the token stream, not the original text.
2. If indexed, you can choose whether to also save term vectors, which support features such as "find similar" searches.
3. You can choose whether to store the original value verbatim (independently of indexing) so it can be returned with search results.
The document model in Lucene resembles a database table, but differs in several ways:
1. No fixed schema: there are no columns, and documents added to the same index may contain different fields.
2. No normalization: Lucene's document model is flat, with no recursive definitions, joins, or other complex relational structures.
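The flat, schema-less model above can be pictured with a plain-Java stand-in. This is a hypothetical illustration, not a Lucene class: a document is just a flat list of (name, value) fields, repeated names are allowed, and two documents in the same index need not share any fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's flat, schema-less document model.
class SimpleDocument {
    // A document is nothing more than a flat list of (name, value) pairs:
    // no schema, no nesting, and the same name may repeat.
    final List<String[]> fields = new ArrayList<>();

    void add(String name, String value) {
        fields.add(new String[] { name, value });
    }

    // All values stored under a given field name (possibly several).
    List<String> getValues(String name) {
        List<String> values = new ArrayList<>();
        for (String[] f : fields) {
            if (f[0].equals(name)) values.add(f[1]);
        }
        return values;
    }
}
```

Note how nothing constrains which names appear in a document; that is the "no fixed schema" property in miniature.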
2.2 Understanding the indexing process
Overall, the indexing process is:
1. Extract text: extract text from the source and create Document and Field objects. Tika provides text extraction for non-plain-text formats such as PDF and Word.
2. Analysis: each field of the document is broken into a token stream, which then passes through a chain of filters (lowercasing, for example).
3. Indexing: the document is written to the index with IndexWriter.addDocument(). Lucene uses an inverted index, which answers "which documents contain word X?" rather than "which words does this document contain?"
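The "which documents contain word X" orientation can be sketched in a few lines of plain Java. This is a toy model under stated assumptions (whitespace tokenization plus lowercasing), not Lucene's actual on-disk data structures: each term maps to the ordered list of document IDs containing it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: term -> list of document IDs containing that term.
class ToyInvertedIndex {
    final Map<String, List<Integer>> postings = new HashMap<>();

    void addDocument(int docId, String text) {
        // "Analysis": lowercase and split on whitespace, roughly what
        // WhitespaceAnalyzer plus a lowercase filter would do.
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> list =
                postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Avoid duplicate entries when a term repeats in one document
            // (docIds arrive in increasing order).
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    // Answers "which documents contain this word?" directly.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new ArrayList<>());
    }
}
```

Searching is a single map lookup; the per-document list of words is never scanned, which is the whole point of inverting the index.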
Index file composition
To ensure efficiency, each index consists of several segments:
_X.cfs: each segment is made up of several files, with X running 0, 1, 2, .... If the compound-file format (useCompoundFile) is enabled, each segment has just one .cfs file.
segments_<N>: records which files belong to each segment.
As IndexWriter is used, these segments are merged automatically from time to time.
2.3 Basic operation of the index
First, create the IndexWriter:
IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
dir is the directory where the index is saved, WhitespaceAnalyzer tokenizes on whitespace, and the last argument removes the limit on the number of tokens indexed per field.
Then create Documents and Fields:
Document doc = new Document();
doc.add(new Field(key, value, store, index));
key is the field name to search on, and value is the text to be stored/analyzed.
store: whether to additionally store the original value, independent of indexing. Store.NO does not store it; Store.YES stores it so it can be returned with search results.
index: Index.NO means not indexed; Index.ANALYZED means analyzed (tokenized) and then indexed; Index.NOT_ANALYZED means indexed as a single token without analysis; Index.ANALYZED_NO_NORMS and Index.NOT_ANALYZED_NO_NORMS are the same but without storing norms. Every option except NO makes the field searchable. Norms store the information needed for boosting and may consume extra memory.
Delete Index
IndexWriter provides several methods to delete documents:
deleteDocuments(Term)
deleteDocuments(Term[])
deleteDocuments(Query)
deleteDocuments(Query[])
Note that a Term does not necessarily match a single document, so a delete may remove several documents by mistake. It is best to delete by a unique, non-analyzed field (such as an ID) to avoid this.
Deletions only reach the index file after commit() or close().
A deletion initially only marks the document as deleted: maxDoc() counts all documents (including those deleted but not yet purged), while numDocs() counts only non-deleted documents.
After deleting, call optimize() to compact the space held by deleted documents, then commit().
Update index
updateDocument(Term, Document): Lucene only supports replacing a whole document; an individual field cannot be updated in place.
2.4 Field options
The options fall into three categories: indexing, storing, and term vectors.
Index option
Index.ANALYZED: analyzed (tokenized), then indexed.
Index.NOT_ANALYZED: indexed directly without analysis; good for exact-match fields such as URLs and file-system paths.
Index.ANALYZED_NO_NORMS: like Index.ANALYZED, but does not store norms; saves memory but does not support index-time boosts.
Index.NOT_ANALYZED_NO_NORMS: like Index.NOT_ANALYZED, but does not store norms; saves memory, does not support boosts, and is very commonly used.
Index.NO: not indexed at all, so the field cannot be searched.
By default Lucene records the frequency and positions of every term occurrence. This can be turned off with Field.setOmitTermFreqAndPositions(true), but doing so breaks PhraseQuery and SpanQuery.
Store options
Store.YES: stores the original value, which can be retrieved with search results.
Store.NO: the original value is not stored and cannot be retrieved afterward.
CompressionTools can be used to compress and decompress byte arrays before storing them.
Term vector options
Term vectors mainly support "more like this" features, such as finding documents similar to one containing cat.
TermVector.YES: record the term vector.
TermVector.WITH_POSITIONS: record the term vector plus the position of each term.
TermVector.WITH_OFFSETS: record the term vector plus the character offsets of each term.
TermVector.WITH_POSITIONS_OFFSETS: record the term vector plus both positions and offsets.
TermVector.NO: no term vector is stored.
If the index option is Index.NO, the term vector option must be TermVector.NO.
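What WITH_POSITIONS and WITH_OFFSETS record can be illustrated with plain Java. This is a simplified model of the data, not Lucene's file format: for each whitespace-separated token we record its term position (word number) and its character start/end offsets in the original text.

```java
import java.util.ArrayList;
import java.util.List;

class TermVectorSketch {
    // For each token, return {position, startOffset, endOffset}:
    // the kind of data a term vector with positions and offsets stores.
    static List<int[]> tokenPositionsAndOffsets(String text) {
        List<int[]> result = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++; // skip spaces
            if (i >= text.length()) break;
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++; // scan token
            result.add(new int[] { position++, start, i });
        }
        return result;
    }
}
```

Positions enable phrase and proximity logic; offsets let a highlighter mark the exact characters in the stored text without re-analyzing it.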
Using types other than String as a Field source
Reader: cannot be stored; the resulting token stream is always analyzed and indexed.
TokenStream: pre-analyzed tokens used directly as the source; cannot be stored; always indexed.
byte[]: cannot be indexed, no term vector; must use Store.YES.
Options related to sorting
Numeric fields should use NumericField. A text field used for sorting must be Field.Index.NOT_ANALYZED, i.e. the field must contain only a single token for sorting to work.
Multi-valued field (multi-valued fields)
For example, if a book has multiple authors, what should I do?
One way is to add several fields with the same name and different values:
Document doc = new Document();
for (int i = 0; i < authors.length; i++) {
    doc.add(new Field("author", authors[i],
                      Field.Store.YES,
                      Field.Index.ANALYZED));
}
Another approach is presented in Chapter 4.
2.5 Boost (boost)
Boosting influences the ranking of search results.
Boosting can be applied at index time or at search time; search-time boosting is more flexible and can be changed without re-indexing, but costs more CPU.
Boosting a document
At index time, the boost is stored in the norms. By default every document has a boost of 1.0; it can be set manually:
document.setBoost(float b), where b is a multiple of the default 1.0.
Boosting a field
Boosting a document also applies the same boost to every field in it.
A field can also be boosted independently:
field.setBoost(float)
Note: Lucene's ranking algorithm combines many factors; boost is only one of them, not the deciding one.
Norms
Boost values are stored in the norms, which can consume a lot of memory during searches. They can be turned off:
choose one of the NO_NORMS index options, or call field.setOmitNorms(true) on the field.
2.6 Index numbers, dates, times, etc.
Indexed numbers
There are two kinds of scenarios:
1. The number is embedded in text, such as "Be sure to include Form 1099 in your return", and you want to search for 1099 as a word. You need an analyzer that does not discard numbers, such as WhitespaceAnalyzer or StandardAnalyzer. SimpleAnalyzer and StopAnalyzer drop numbers, so 1099 could not be found.
2. The number is a field by itself. Since 2.9, Lucene supports numeric types through NumericField: doc.add(new NumericField("price").setDoubleValue(19.99)); Numeric fields are stored using a trie structure.
Several NumericFields with the same name can be added to one document; this works with NumericRangeQuery and NumericRangeFilter, but not for sorting. To sort on a numeric value, add it under a name that is unique within the document.
precisionStep controls the trade-off in range scans: a smaller step indexes more terms per value, making range queries faster but the index larger.
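The trie idea behind precisionStep can be sketched as follows. This is a simplified illustration of the encoding, not Lucene's actual NumericUtils code: each value is indexed at several precisions by shifting away low-order bits, so a range query can match whole buckets with coarse terms and only the range edges with fine ones.

```java
import java.util.ArrayList;
import java.util.List;

class TriePrecisionSketch {
    // Index an int at several precisions: at shift s, the low s bits are
    // dropped, so one coarse term covers a bucket of 2^s consecutive values.
    // Terms here are "shift:prefixValue" strings for readability.
    static List<String> prefixTerms(int value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add(shift + ":" + (value >>> shift));
        }
        return terms;
    }
}
```

With precisionStep 4 each value produces 8 terms, with step 8 only 4: the smaller the step, the more terms (larger index) and the fewer terms a range query must visit (faster scans).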
Index Date and time
To index dates and times, convert the date to a timestamp (a long) and index it with NumericField as above.
If millisecond precision is not needed, the timestamp can be reduced to seconds, or even to days:
doc.add(new NumericField("day").setIntValue((int) (new Date().getTime() / (24 * 3600 * 1000L))));
Or index a calendar component, such as the day of the month:
Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth")
    .setIntValue(cal.get(Calendar.DAY_OF_MONTH)));
2.7 Field truncation
Lucene supports truncating fields. IndexWriter.MaxFieldLength sets the maximum number of tokens indexed per field; the default, MaxFieldLength.UNLIMITED, means no limit.
MaxFieldLength.LIMITED imposes a limit, whose value can be set with setMaxFieldLength(int n).
With that setting, only the first n tokens of a field are indexed.
Detailed log output can be obtained with setInfoStream(System.out).
2.8 Real-time search
Lucene 2.9 supports near-real-time search, i.e. a short turnaround from indexing to searching.
IndexReader reader = indexWriter.getReader();
This call flushes the writer's buffered changes and returns an IndexReader that can search them immediately.
2.9 Optimizing indexes
Index optimization improves search speed, not indexing speed. It merges the many small index files into a few.
IndexWriter provides several optimization methods:
optimize(): merges the index down to a single segment and does not return until finished. Very resource-intensive.
optimize(int maxNumSegments): partial optimization, merging down to at most maxNumSegments segments (5, for example); a lighter alternative to the extreme single-segment case above.
optimize(boolean doWait): like optimize(), but with doWait=false it returns immediately while merging continues in the background.
optimize(int maxNumSegments, boolean doWait): the combination of the two above.
Optimization also needs a lot of temporary disk space: the old, superseded segments are not removed until IndexWriter.commit() is called.
2.10 Directory
Directory encapsulates the storage API, providing an abstract interface with these main implementations:
SimpleFSDirectory: stores the index on local disk using java.io; it synchronizes internally, so concurrency across threads is poor.
NIOFSDirectory: uses java.nio and scales better across threads, but suffers from a JVM bug on Windows.
MMapDirectory: memory-mapped storage (maps the index files into memory, as with mmap).
RAMDirectory: keeps everything in memory.
FileSwitchDirectory: uses two directories and switches between them.
FSDirectory.open() automatically picks a suitable implementation. You can also choose one yourself:
Directory ramDir = new RAMDirectory();
IndexWriter writer = new IndexWriter(ramDir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
RAMDirectory is suitable when the index is small enough to fit in memory.
An existing index can be copied into RAM for speed:
Directory ramDir = new RAMDirectory(otherDir);
or
Directory.copy(Directory sourceDir,
               Directory destDir,
               boolean closeDirSrc);
2.11 Thread Safety, lock
Thread and multi-JVM safety
Any number of IndexReaders can be open on an index at the same time, even across JVMs.
Only one IndexWriter can be open at a time; it holds an exclusive write lock.
IndexReaders may be open while an IndexWriter is open.
Multiple threads can share an IndexReader or an IndexWriter: both are thread-safe, with built-in synchronization and good concurrency.
Sharing an index over a remote file system
Take care not to open and close readers and writers repeatedly, or performance will suffer.
Index locks
Locking takes the form of a file lock named write.lock.
Creating an IndexWriter on an already-locked index throws LockObtainFailedException.
Other locking implementations are supported, but there is usually no need to change the default.
IndexWriter.isLocked(directory): checks whether a directory is locked.
IndexWriter.unlock(directory): unlocks a directory. Dangerous!
Note: whatever operations an IndexWriter performs, always close it explicitly; the lock is not released automatically.
2.12 Debugging indexing
2.14 Advanced indexing options
Deletions can also be performed through IndexReader, which has these advantages:
1. It can delete by specific document number, which is more precise; IndexWriter cannot.
2. Deletions made through IndexReader are visible immediately, whereas with IndexWriter the reader must be reopened to see them.
3. IndexReader offers undeleteAll(), which undoes all pending deletions (it only affects deletions in segments that have not yet been merged away).
Reclaiming space after deletions
expungeDeletes() explicitly frees the space: it runs the merges needed to purge documents that are marked deleted but not yet physically removed.
Caching and refreshing
When documents are added or deleted, the changes are first buffered in memory to reduce disk I/O; Lucene periodically flushes these buffered changes to the Directory as a new segment.
IndexWriter flushes the buffer when any of these conditions holds:
the buffered data exceeds the size set by setRAMBufferSizeMB();
the number of buffered documents exceeds setMaxBufferedDocs();
the number of buffered delete terms exceeds setMaxBufferedDeleteTerms().
When one of these triggers fires, the buffer is flushed into a new segment, but the segment only becomes visible to searches after commit() writes the commit point to disk.
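The flush triggers above can be sketched as a toy model. The class below is hypothetical; only the field names mirror the real IndexWriter setters (setRAMBufferSizeMB, setMaxBufferedDocs, setMaxBufferedDeleteTerms):

```java
// Toy model of IndexWriter's flush policy: buffer in memory,
// flush a new segment when ANY of three thresholds is exceeded.
class FlushPolicySketch {
    double ramBufferSizeMB = 16.0;    // cf. setRAMBufferSizeMB
    int maxBufferedDocs = 1000;       // cf. setMaxBufferedDocs
    int maxBufferedDeleteTerms = 100; // cf. setMaxBufferedDeleteTerms

    double bufferedMB = 0;
    int bufferedDocs = 0;
    int bufferedDeletes = 0;
    int segmentsFlushed = 0;

    void addDocument(double docSizeMB) {
        bufferedDocs++;
        bufferedMB += docSizeMB;
        maybeFlush();
    }

    void deleteDocuments() {
        bufferedDeletes++;
        maybeFlush();
    }

    // A flush writes a new segment, but in real Lucene it is not visible
    // to searches until commit().
    void maybeFlush() {
        if (bufferedMB > ramBufferSizeMB
                || bufferedDocs > maxBufferedDocs
                || bufferedDeletes > maxBufferedDeleteTerms) {
            segmentsFlushed++;
            bufferedMB = 0;
            bufferedDocs = 0;
            bufferedDeletes = 0;
        }
    }
}
```

Any one threshold is sufficient to trigger the flush; the three counters are all reset together once the segment is written.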
Commit of the Index
commit() persists the changes to the index. Only after commit() is called will a newly opened IndexReader or IndexSearcher see the results of the most recent changes.
close() also calls commit() indirectly.
The counterpart of commit() is rollback(), which discards all changes made since the last commit.
commit() is expensive and should not be called too frequently.
"Double buffering" commit
GUI development often uses double buffering: one buffer is being redrawn while the other is displayed, and the two are swapped. Lucene supports a similar two-phase mechanism.
Lucene exposes two calls:
prepareCommit()
commit()
prepareCommit() does most of the work and is slow; calling commit() after prepareCommit() is very fast.
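The prepareCommit/commit split can be sketched with a toy two-phase store. The class is hypothetical; Lucene's real implementation works on segment files, but the shape is the same: the slow phase writes out everything the commit will need, and the fast phase merely publishes it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy two-phase commit: slow prepare, fast publish.
class TwoPhaseCommitSketch {
    private final List<String> pending = new ArrayList<>();
    private List<String> prepared = null;
    private List<String> committed = new ArrayList<>();

    void add(String doc) { pending.add(doc); }

    // Slow phase: assemble the full state the commit will publish.
    void prepareCommit() {
        prepared = new ArrayList<>(committed);
        prepared.addAll(pending);
        pending.clear();
    }

    // Fast phase: just swap in the prepared state.
    void commit() {
        if (prepared == null) prepareCommit(); // plain commit() also works
        committed = prepared;
        prepared = null;
    }

    List<String> visible() { return committed; } // what a reader sees
}
```

Between prepareCommit() and commit() a reader still sees the old state, which is exactly what makes the pattern useful for coordinating a commit with other resources.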
Delete Policy
IndexDeletionPolicy determines the deletion policy for commit points: you can decide whether to keep previous commits.
Lucene's ACID transaction support
This is achieved mainly by allowing only one open IndexWriter at a time.
If the JVM, the OS, or the machine crashes, Lucene automatically recovers to the last successful commit.
Merging
A merge is needed when the index accumulates too many segments. Benefits:
1. Fewer segment files.
2. Less space occupied by the index files.
MergePolicy: decides when a merge should run
It selects the files that need to be merged; two policies are available by default:
LogByteSizeMergePolicy: decides whether to merge based on segment size.
LogDocMergePolicy: decides whether to merge based on document count.
They are configured through setMergeFactor() and setMaxMergeDocs() respectively; see the API documentation for the parameters.
MergeScheduler: decides how merges are executed
ConcurrentMergeScheduler: merges in extra background threads; completion can be awaited with waitForMerges().
SerialMergeScheduler: merges serially on the thread that calls addDocument().