Lucene--Real-time indexing

Source: Internet
Author: User

Real-time search in Lucene can be divided into two kinds: real-time search and near real-time search.

Real-time search depends on keeping the newest part of the index in memory. Near real-time search is supported by Lucene itself: org.apache.lucene.index.DirectoryReader.open(IndexWriter writer, boolean applyAllDeletes) throws IOException gives near real-time results without a large performance cost (for example, reopening the reader for search every 1 s, similar to the implementation in Solr).

First, real-time search

Lucene typically stores an index in one of two ways: RAMDirectory or FSDirectory.

Lucene is transactional and can add segments incrementally. An inverted index has a fixed format that is very hard to modify once it has been written, so how can the index be built incrementally?
  Lucene solves this problem with the concept of segments. Once a segment has been created, its inverted index structure never changes; newly added documents go into new segments, and segments are merged at certain times to produce a new inverted index structure. Because of this transactional design, a Lucene index is not real-time. To make new documents searchable, IndexWriter must commit after the documents are added, and the IndexReader must be reopened before searching. However, when the index is on disk, and especially when it is large, both the IndexWriter commit and the IndexReader open are slow and cannot meet real-time requirements.
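
The segment idea can be illustrated with a toy model that does not use Lucene at all. This is only a sketch under simplifying assumptions: the class SegmentedIndexSketch and its methods are hypothetical names, a segment is just an unmodifiable map from term to doc ids, and merging is triggered by hand rather than by a merge policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of segment-based incremental indexing; not Lucene's real data structures.
final class SegmentedIndexSketch {
    // Each segment is a small inverted index (term -> ascending doc ids) that is never edited.
    private final List<Map<String, List<Integer>>> segments = new ArrayList<>();
    private int nextDocId = 0;

    // Index a batch of documents into a brand-new segment; existing segments are untouched.
    void addSegment(List<String> docs) {
        Map<String, List<Integer>> inverted = new TreeMap<>();
        for (String doc : docs) {
            int id = nextDocId++;
            for (String term : doc.toLowerCase().split("\\s+")) {
                List<Integer> postings = inverted.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != id) {
                    postings.add(id);
                }
            }
        }
        segments.add(Collections.unmodifiableMap(inverted));
    }

    // Build one new segment from all existing ones, then drop the old segments.
    void mergeAll() {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map<String, List<Integer>> segment : segments) {
            for (Map.Entry<String, List<Integer>> e : segment.entrySet()) {
                merged.computeIfAbsent(e.getKey(), t -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        segments.clear();
        segments.add(Collections.unmodifiableMap(merged));
    }

    // A search consults every segment and concatenates the hits.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map<String, List<Integer>> segment : segments) {
            hits.addAll(segment.getOrDefault(term, Collections.emptyList()));
        }
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        SegmentedIndexSketch index = new SegmentedIndexSketch();
        index.addSegment(Arrays.asList("lucene real time", "near real time"));
        index.addSegment(Arrays.asList("lucene segments"));
        System.out.println(index.search("lucene") + " in " + index.segmentCount() + " segments");
        index.mergeAll();
        System.out.println(index.search("lucene") + " in " + index.segmentCount() + " segments");
    }
}
```

The search result is the same before and after the merge; only the number of segments that have to be consulted changes.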

In practice, if an application can tolerate a delay of 1 or 2 minutes, FSDirectory alone is enough: add to the index and commit once a minute.
But if real-time search is required, RAMDirectory and FSDirectory must be combined.

The general principle is to combine multiple IndexSearchers through a MultiReader.
(MultiReader can be used for real-time search services or for distributed indexes.)
The steps for real-time search are:

1. First open the FSDIndex for searching. When a document is added, add it to the RAMIndex and then reopen the RAMSearcher. Reopening a RAM index is fast. Then periodically write the RAMIndex to disk.
2. At write time, the FSD index needs to commit and reopen a reader, and a new RAMIndex must be opened. During this period a search needs three searchers: the original RAMSearcher, the original FSDSearcher, and the new RAMSearcher. The original RAMIndex is now being written to disk; as long as it has not been committed, no duplicate results appear.
3. When the RAMIndex has been written to disk, a new FSDSearcher must be opened, which is relatively slow. So the three searchers from step 2 are kept unchanged and continue to serve requests.
4. When the new FSDSearcher has finished opening, discard the original FSDSearcher and the original RAMSearcher, and use the new FSDSearcher and the new RAMSearcher.

These four steps should ideally be atomic. If (2) has been done but (3) has not, a search sees less data than it should; if (3) has been done but (2) has not, a search sees extra data. So a synchronization lock is needed to prevent inconsistent results.
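
The locking requirement in the steps above can be sketched without Lucene. The following is a toy model under stated assumptions: plain lists of strings stand in for the RAM and disk indexes, and the names RamDiskSwapSketch, beginFlush, and finishFlush are hypothetical. It shows one possible way to keep searches consistent while the old RAM index is being moved to disk, not how any particular system implements it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the RAM + disk searcher swap; lists of strings stand in for real indexes.
final class RamDiskSwapSketch {
    private final Object lock = new Object();
    private List<String> disk = new ArrayList<>();  // stands in for the on-disk index
    private List<String> ram = new ArrayList<>();   // current in-memory index
    private List<String> flushingRam = null;        // old RAM index while it is being written

    // Step 1: new documents only touch the current RAM index.
    void add(String doc) {
        synchronized (lock) {
            ram.add(doc);
        }
    }

    // Step 2: freeze the current RAM index and start a fresh one for incoming documents.
    void beginFlush() {
        synchronized (lock) {
            if (flushingRam != null) throw new IllegalStateException("flush already running");
            flushingRam = ram;
            ram = new ArrayList<>();
        }
    }

    // Steps 3 and 4: build the new disk view outside the lock (the slow part),
    // then replace the old disk view and drop the old RAM view in one locked step.
    void finishFlush() {
        List<String> oldDisk;
        List<String> oldRam;
        synchronized (lock) {
            if (flushingRam == null) throw new IllegalStateException("no flush running");
            oldDisk = disk;
            oldRam = flushingRam;
        }
        List<String> newDisk = new ArrayList<>(oldDisk);
        newDisk.addAll(oldRam);
        synchronized (lock) {
            disk = newDisk;
            flushingRam = null;
        }
    }

    // A search sees disk + old RAM (only while a flush is in progress) + current RAM.
    List<String> search(String term) {
        synchronized (lock) {
            List<String> hits = new ArrayList<>();
            collect(disk, term, hits);
            if (flushingRam != null) collect(flushingRam, term, hits);
            collect(ram, term, hits);
            return hits;
        }
    }

    private static void collect(List<String> docs, String term, List<String> hits) {
        for (String doc : docs) {
            if (doc.contains(term)) hits.add(doc);
        }
    }

    public static void main(String[] args) {
        RamDiskSwapSketch index = new RamDiskSwapSketch();
        index.add("doc one");
        index.beginFlush();
        index.add("doc two");
        System.out.println("during flush: " + index.search("doc"));
        index.finishFlush();
        System.out.println("after flush:  " + index.search("doc"));
    }
}
```

Because the disk view and the old RAM view are exchanged inside a single locked step, a search never observes the state in which a document is missing from both or present in both.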

Second, near real-time search

Implementation principle:
The principles of near real time search are recorded in LUCENE-1313 and LUCENE-1516.

LUCENE-1313: a RAMDirectory is maintained inside IndexWriter. As long as memory is sufficient, flush and merge operations only write data to this RAMDirectory; only optimize and commit operations on the IndexWriter cause the data in the RAMDirectory to be fully synchronized to files.

LUCENE-1516: IndexWriter provides an API for obtaining a reader in real time. Calling it triggers a flush, which generates a new segment but does not commit (fsync), thus reducing I/O. The new segment is included in the newly created reader, so the updates are visible through the returned reader. As long as each new search obtains a new reader from the IndexWriter, it can search the latest content. The cost of this operation is only a flush, which is small compared with a commit.
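
The difference between writing data out and forcing it to disk can be seen with the plain Java file API. This is an analogy only, not Lucene code: the class FlushVsCommitSketch and its methods flush and commit are hypothetical names, with FileChannel.write standing in for a flush and FileChannel.force standing in for the fsync performed by a commit.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Analogy for flush versus commit using the JDK file API; no Lucene classes involved.
final class FlushVsCommitSketch {
    // "Flush": hand the bytes to the operating system. No fsync is requested.
    static void flush(FileChannel channel, String data) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(data.getBytes(StandardCharsets.UTF_8));
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }

    // "Commit": ask the OS to write file content and metadata to the storage device.
    static void commit(FileChannel channel) throws IOException {
        channel.force(true);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".tmp");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            flush(channel, "doc-1\n");
            // On typical operating systems the data is already readable here,
            // even though it may still sit in the OS cache.
            System.out.println("after flush:  " + Files.readAllLines(file));
            flush(channel, "doc-2\n");
            commit(channel); // only now is durability requested
            System.out.println("after commit: " + Files.readAllLines(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

A reader that only needs to see new data does not have to wait for the expensive force call, which is the reason a near real-time reader is cheaper than a commit followed by a reopen.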

A Lucene index is organized as multiple segments in one index directory. New documents are added to new segments, and these small new segments are merged from time to time. Because of merging, the total number of segments stays small and overall search speed remains fast. To prevent read-write conflicts, Lucene only ever creates new segments, and removes old segments once no active reader is using them.
Flush writes data to the operating system's buffers; as long as the buffers are not full, no disk operation takes place.
Commit writes all buffered in-memory data to the hard disk; it is a complete disk operation.
Optimize merges multiple segments, which involves re-reading the old segments and building the new merged segment; it is a heavyweight operation that is both CPU-bound and I/O-bound. This is because the most important structure in a Lucene index, the postings, is stored in VInt and delta format. Merging the postings of the same term is a process of reading, merging, and rebuilding.
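
Why merging postings means reading and rebuilding can be shown with a small sketch. It assumes a simplified postings list that holds only ascending doc ids (real postings also carry frequencies and positions), and the class name VIntDeltaSketch and the docBase parameter are hypothetical. The variable-length encoding follows the usual VInt scheme of 7 data bits per byte with the high bit marking a continuation.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified postings: ascending doc ids stored as VInt-encoded gaps (deltas).
final class VIntDeltaSketch {
    // Write a non-negative int using 7 data bits per byte; the high bit means "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode ascending doc ids as the gaps between consecutive ids.
    static byte[] encode(List<Integer> sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedDocIds) {
            writeVInt(out, id - previous);
            previous = id;
        }
        return out.toByteArray();
    }

    // Decode the gaps back into absolute doc ids.
    static List<Integer> decode(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int pos = 0;
        int previous = 0;
        while (pos < bytes.length) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += value;
            ids.add(previous);
        }
        return ids;
    }

    // Merge the postings of one term from two segments. The ids of the second segment
    // are shifted by docBase, so the bytes cannot simply be appended: both lists are
    // decoded, combined, and encoded again.
    static byte[] merge(byte[] first, byte[] second, int docBase) {
        List<Integer> merged = new ArrayList<>(decode(first));
        for (int id : decode(second)) {
            merged.add(id + docBase);
        }
        return encode(merged);
    }

    public static void main(String[] args) {
        byte[] a = encode(Arrays.asList(5, 300));  // stored as the gaps 5 and 295
        byte[] b = encode(Arrays.asList(0, 2));
        byte[] merged = merge(a, b, 1000);
        System.out.println(decode(merged) + " stored in " + merged.length + " bytes");
    }
}
```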


Code interpretation:
IndexWriter's method for obtaining a reader calls two methods, doFlush() and maybeMerge(). doFlush() calls DocumentsWriter's flush method to generate a new segment, and the returned reader can access that new segment. DocumentsWriter accepts multiple added documents and writes them into the same segment. Each added document passes through a chain of DocConsumers, including StoredFieldsWriter (which calls FieldsWriter internally), TermVectorsTermsWriter, FreqProxTermsWriter, NormsWriter, and so on. If flush is not called explicitly, a new segment is created and flushed to the directory once the RAM buffer is exhausted or the number of added documents is large enough.
FreqProxTermsWriter calls TermsHashPerField, which is responsible for indexing terms. When a term of a field is indexed, the add() function of the corresponding TermsHashPerField completes the indexing of that term and stores the indexed content (the term string, pointer information, position information, and so on) in memory buffers. The process uses CharBlockPool, IntBlockPool, and ByteBlockPool; as long as memory is sufficient, terms can continue to be added.
Behavior test:
A document retrieval program was designed in which one process manages an IndexWriter and two threads. Thread A indexes new documents, and thread B handles search requests, using the new IndexWriter API to obtain a new reader for each search. Index and search requests were issued alternately while observing the search results and the changes in the index directory. The experimental results are as follows:

1. When IndexWriter is opened, a lock file is generated.
2. Each time a reader is obtained, if there are updates, a flush is performed first: the updated data held in memory is written out as a new segment, adding one .cfs file.
3. The previously added documents can be read from the new reader.
4. Once the number of newly generated segments reaches 10, an optimize occurs, generating 8 files: .fdt, .fdx, .frq, .fnm, .nrm, .prx, .tii, and .tis.
5. Optimize can of course also be triggered explicitly from outside, with the same result. The segment files from before the optimize and the files from earlier optimizes are then no longer needed.
6. Because producing a CFS file during optimize consumes double the disk space and adds processing time, when the optimized index is larger than 10% (or a specified ratio) of the total index size, optimize keeps the multi-file format even if the IndexWriter specifies the CFS format (LUCENE-2773).
7. Calling IndexWriter's close method releases the lock file, but the previously generated files, other than the optimize result files, are not deleted. The unneeded files are deleted only the next time the index directory is opened.
8. IndexWriter attempts to delete unneeded files in three cases: on open, on flushing a new segment, and on finishing a merge. However, a file that is in use by a currently open reader is not deleted.
9. Therefore, when a reader is no longer needed, be sure to call its close method so that the unneeded files can be released.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NRT1 {
    /**
     * Use DirectoryReader.open(writer, false) to achieve near real-time effects.
     *
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));
        IndexWriter w = new IndexWriter(dir, iwc);
        w.commit(); // create an initial (empty) commit so the directory can be opened

        FieldType docType = new FieldType();
        docType.setIndexed(true);
        docType.setStored(true);
        Document doc = new Document();
        doc.add(new Field("title", "Haha", docType));

        DirectoryReader r = DirectoryReader.open(dir);
        for (int i = 0; i < 3; i++) {
            w.addDocument(doc);
            r.close(); // release the previous reader so its files can be deleted
            // true: buffered deletes are applied and visible in the reader (nothing is written to disk);
            // false: deletes are not visible, which performs better than true.
            r = DirectoryReader.open(w, false);
            System.out.println(r.numDocs());
        }
        r.close();
        w.close();
    }
}
