Questions about Lucene (8): How to update documents in a real-time index built with Lucene

Source: Internet
Author: User

In Questions about Lucene (7), we discussed how to combine Lucene in-memory indexes and on-disk indexes to build a real-time index.

Some readers then asked how to keep the index real-time when documents are deleted or updated. This article discusses that question.

1. How Lucene deletes documents

 

  • IndexReader.deleteDocument(int docID) uses the IndexReader to delete the document with the given document number.
  • IndexReader.deleteDocuments(Term term) uses the IndexReader to delete the documents containing the Term.
  • IndexWriter.deleteDocuments(Term term) uses the IndexWriter to delete the documents containing the Term.
  • IndexWriter.deleteDocuments(Term[] terms) uses the IndexWriter to delete the documents containing any of these Terms.
  • IndexWriter.deleteDocuments(Query query) uses the IndexWriter to delete the documents matching the Query.
  • IndexWriter.deleteDocuments(Query[] queries) uses the IndexWriter to delete the documents matching any of these Queries.

A document can be deleted through either a reader or a writer. The difference is that a deletion made through the reader takes effect immediately for that reader, whereas a deletion made through the writer is buffered and becomes visible only after it has been written to the index files and a reader has been reopened.
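The visibility difference can be illustrated with a toy model. This is only a sketch: ToyIndex, ToyReader, and ToyWriter are hypothetical classes built on plain Java sets, not Lucene's API, and they model nothing except when a deletion becomes visible.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: these classes are hypothetical and are NOT Lucene's API.
class ToyIndex {
    // Document numbers that have been committed to the "index files".
    final Set<Integer> committedDocs = new TreeSet<>();
}

class ToyReader {
    // A reader works on the snapshot it saw when it was opened.
    private final Set<Integer> snapshot;

    ToyReader(ToyIndex index) {
        this.snapshot = new TreeSet<>(index.committedDocs);
    }

    // Reader-side deletion: visible to this reader immediately.
    void deleteDocument(int docId) {
        snapshot.remove(docId);
    }

    boolean contains(int docId) {
        return snapshot.contains(docId);
    }
}

class ToyWriter {
    private final ToyIndex index;
    private final Set<Integer> bufferedDeletes = new HashSet<>();

    ToyWriter(ToyIndex index) {
        this.index = index;
    }

    // Writer-side deletion: only buffered until commit() is called.
    void deleteDocument(int docId) {
        bufferedDeletes.add(docId);
    }

    // Writes the buffered deletions to the "index files".
    void commit() {
        index.committedDocs.removeAll(bufferedDeletes);
        bufferedDeletes.clear();
    }
}
```

In this model, a reader opened before commit() keeps seeing a document deleted through the writer; only a reader opened after the commit no longer sees it.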

2. Common questions about updating documents in Lucene

 

2.1 Deleting with IndexReader or with IndexWriter

Since both IndexReader and IndexWriter can delete documents, which one should be used?

We recommend deleting with IndexWriter.

Deleting with IndexReader can cause the following problems:

(1) While an IndexWriter is open, an IndexReader cannot perform deletions; attempting one throws LockObtainFailedException.

(2) When an IndexReader is shared by multiple threads, a deletion made by one thread changes the index that the other threads see, making their results unpredictable.

(3) In Lucene, an update is a deletion followed by an addition. With IndexReader, the deletion is visible immediately but the addition is not, which leaves the data inconsistent.

(4) Even if the problems above can be solved with locks, the locking slows down searches, which is not what we want.

2.2 How to cache document deletions in memory

In the previous article, to keep the index real-time, we used in-memory indexes, while the on-disk index was reopened infrequently and, when it was reopened, this was done in a background thread.

If the document to be deleted is in the on-disk index, the deletion does not take effect until the index is reopened, so the deletion has to be cached in memory.

How, then, can the deletions cached in memory be applied to the on-disk index without reopening its IndexReader?

Lucene provides FilterIndexReader, a subclass of IndexReader that wraps another IndexReader. We can implement a FilterIndexReader that filters out the deleted documents.

An example is as follows:

 

import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

// Wraps an IndexReader and hides the documents recorded in the deletion bit set.
public class MyFilterIndexReader extends FilterIndexReader {

    // One bit per Lucene document number; a set bit marks a deleted document.
    OpenBitSet dels;

    public MyFilterIndexReader(IndexReader in) {
        super(in);
        dels = new OpenBitSet(in.maxDoc());
    }

    public MyFilterIndexReader(IndexReader in, List<String> idToDelete) throws IOException {
        super(in);
        dels = new OpenBitSet(in.maxDoc());
        for (String id : idToDelete) {
            // If the mapping between Lucene document numbers and application IDs
            // can be cached in memory, building this reader will be much faster.
            TermDocs td = in.termDocs(new Term("id", id));
            try {
                if (td.next()) {
                    dels.set(td.doc());
                }
            } finally {
                td.close();
            }
        }
    }

    @Override
    public int numDocs() {
        return in.numDocs() - (int) dels.cardinality();
    }

    @Override
    public TermDocs termDocs(Term term) throws IOException {
        return new FilterTermDocs(in.termDocs(term)) {
            @Override
            public boolean next() throws IOException {
                boolean res;
                // Advance until a document that is not marked as deleted is found.
                while ((res = super.next())) {
                    if (!dels.get(doc())) {
                        break;
                    }
                }
                return res;
            }
        };
    }

    @Override
    public TermDocs termDocs() throws IOException {
        return new FilterTermDocs(in.termDocs()) {
            @Override
            public boolean next() throws IOException {
                boolean res;
                // Advance until a document that is not marked as deleted is found.
                while ((res = super.next())) {
                    if (!dels.get(doc())) {
                        break;
                    }
                }
                return res;
            }
        };
    }
}
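The comment in the constructor suggests caching the mapping between application IDs and Lucene document numbers. The following is a minimal sketch of that idea using only the JDK; DeletionFilterSketch and its methods are hypothetical names, java.util.BitSet stands in for OpenBitSet, and an int array stands in for a posting list. The nextLive method mirrors the skip loop used in next() above.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class DeletionFilterSketch {
    // Cached mapping: application ID -> Lucene document number.
    private final Map<String, Integer> docIdByAppId = new HashMap<>();

    void cache(String appId, int docId) {
        docIdByAppId.put(appId, docId);
    }

    // Builds the deletion bit set from the cache, without any term lookups.
    BitSet toBitSet(List<String> idsToDelete, int maxDoc) {
        BitSet dels = new BitSet(maxDoc);
        for (String id : idsToDelete) {
            Integer docId = docIdByAppId.get(id);
            if (docId != null) {          // unknown IDs are ignored
                dels.set(docId);
            }
        }
        return dels;
    }

    // Returns the position of the first posting at or after 'from' whose
    // document is not deleted, or postings.length if there is none.
    static int nextLive(int[] postings, int from, BitSet dels) {
        int i = from;
        while (i < postings.length && dels.get(postings[i])) {
            i++;
        }
        return i;
    }
}
```

The cache has to be rebuilt whenever the underlying reader changes, because Lucene document numbers are not stable across merges.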

 

2.3 Ordering of document updates

In Lucene, updating a document means deleting the old document and then adding the new one. As described above, deletions are cached in memory and applied to the on-disk index through a FilterIndexReader. However, the new document is added to an index with the same id, so the cached deletion must not filter out the new document, and it must not delete the new document when it is merged into the index.

When the same document is updated twice, the later update must overwrite the earlier one, never the reverse.

Therefore, the in-memory indexes and the on-disk index must be kept in a fixed order, so that it is always known which index is older and which is newer. A cached deletion is applied to every index older than the one it belongs to, but not to that index itself or to any newer index.
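The ordering rule can be sketched with plain Java collections. This is an illustrative model under stated assumptions, not Lucene code: OrderedIndexesSketch and Generation are hypothetical names, each Generation stands in for one index plus the delete list created alongside it, and documents are reduced to an id and a content string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of ordered indexes; not Lucene's API.
class OrderedIndexesSketch {
    static class Generation {
        final Map<String, String> docs = new HashMap<>(); // id -> content
        final Set<String> deletes = new HashSet<>();      // masks OLDER generations only
    }

    // Oldest index first, newest index last.
    private final List<Generation> generations = new ArrayList<>();

    Generation newGeneration() {
        Generation g = new Generation();
        generations.add(g);
        return g;
    }

    // An update goes to the newest generation: record the deletion for all
    // older generations, then store the new version of the document.
    void update(String id, String content) {
        Generation newest = generations.get(generations.size() - 1);
        newest.deletes.add(id);
        newest.docs.put(id, content);
    }

    // Walks from newest to oldest. A generation's delete list never hides
    // its own documents, only documents in generations visited after it.
    List<String> search(String id) {
        List<String> hits = new ArrayList<>();
        boolean masked = false;
        for (int i = generations.size() - 1; i >= 0; i--) {
            Generation g = generations.get(i);
            if (!masked && g.docs.containsKey(id)) {
                hits.add(g.docs.get(id));
            }
            masked |= g.deletes.contains(id);
        }
        return hits;
    }
}
```

Because the delete list is consulted only after its own generation has been searched, the newest version of a document always wins, regardless of how many older copies remain in older indexes.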

3. A Lucene real-time indexing scheme that supports updates

3.1 Initialization

Assume there is an on-disk index, FileSystemIndex, whose IndexReader has been opened in advance. It contains documents 1, 2, 3, 4, 5, and 6.

There is also an in-memory index, MemoryIndex. Every new document is indexed into the in-memory index; after indexing, the IndexWriter is committed and the IndexReader is reopened. It contains documents 7 and 8.

 

3.2 Updating document 5

To update document 5, the old document 5 must be deleted first and the new document 5 added afterwards.

The steps are:

  • First, delete document 5 from the in-memory index. The in-memory index has no document 5, so this deletion has no effect.
  • Next, add document 5 to the in-memory delete list, which is combined with the on-disk IndexReader to form a FilterIndexReader.
  • Finally, add the new document 5 to the in-memory index. From this point on, searches see the new document 5.
  • Adding document 5 to the delete list and committing document 5 to the in-memory index should be one atomic operation. Fortunately, both are relatively fast.
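The steps above can be sketched as a single synchronized method. This is a simplified model, assuming that plain maps stand in for the on-disk and in-memory indexes; AtomicUpdateSketch and its members are hypothetical names, and a real implementation would go through IndexWriter and FilterIndexReader instead.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model; maps of document ID -> content stand in for indexes.
class AtomicUpdateSketch {
    private final Map<Integer, String> diskDocs;                  // opened in advance, not modified here
    private final Map<Integer, String> memoryDocs = new HashMap<>();
    private final Set<Integer> deleteList = new HashSet<>();      // masks diskDocs only

    AtomicUpdateSketch(Map<Integer, String> diskDocs) {
        this.diskDocs = new HashMap<>(diskDocs);
    }

    // All three steps run under one lock, so a search never observes
    // the deletion without the matching addition.
    synchronized void update(int id, String content) {
        memoryDocs.remove(id);       // step 1: no effect if the document is absent
        deleteList.add(id);          // step 2: no effect if the ID is already listed
        memoryDocs.put(id, content); // step 3: new version becomes visible
    }

    // Searches take the same lock; the in-memory index is the newer one.
    synchronized String get(int id) {
        String inMemory = memoryDocs.get(id);
        if (inMemory != null) {
            return inMemory;
        }
        return deleteList.contains(id) ? null : diskDocs.get(id);
    }
}
```

Since the delete list is a set, repeating the update for the same document simply replaces the in-memory copy and leaves the list unchanged.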

Note: Document 5 could also be deleted from the on-disk index at this point; since the IndexReader is not reopened, that deletion would not become visible. We do not do so because we want the whole update to live either in memory or on disk, rather than having the deletion applied on disk while the new document is still in memory. Otherwise, if the system crashed, the new document 5 would be lost while the old document 5 had already been deleted on disk. Instead, document 5 is deleted from the on-disk index during the merge from the in-memory index to the on-disk index.

If document 5 is updated again, delete document 5 from the in-memory index, add the new document 5, and then add document 5 to the delete list. Because it is already in the list, it does not need to be added again.

3.3 Merging indexes

After some time, the in-memory index needs to be merged into the on-disk index.

During the merge, a new, empty in-memory index must be created to receive new documents. The IndexReader of the in-memory index being merged, and the FilterIndexReader formed from the on-disk index and the delete list, stay open and continue to serve searches while the merge runs in the background.

The background merge consists of the following steps:

  • Apply the delete list to the on-disk index.
  • Merge the in-memory index into the on-disk index.
  • Commit the IndexWriter.
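The order of these steps matters: the delete list has to be applied before the in-memory documents are merged in, otherwise it would remove the new versions as well. A minimal sketch, assuming maps of document ID to content as stand-ins for the indexes (BackgroundMergeSketch is a hypothetical name, not a Lucene class):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model; maps of document ID -> content stand in for indexes.
class BackgroundMergeSketch {
    static Map<Integer, String> merge(Map<Integer, String> diskDocs,
                                      Set<Integer> deleteList,
                                      Map<Integer, String> memoryDocs) {
        // Work on a copy so the index that is serving searches is untouched.
        Map<Integer, String> merged = new TreeMap<>(diskDocs);
        // Step 1: apply the delete list to the on-disk documents.
        merged.keySet().removeAll(deleteList);
        // Step 2: merge the in-memory documents, including updated versions.
        merged.putAll(memoryDocs);
        // Step 3: returning the result plays the role of the commit.
        return merged;
    }
}
```

With on-disk documents 1 to 6, a delete list containing 5, and an in-memory index holding the new document 5 plus documents 7 and 8, the result holds eight documents and document 5 is the new version.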

3.4 Updating document 5 during a merge

What happens if an update arrives while the merge is in progress?

  • First, delete document 5 from the index being merged. This does not affect the merge: the IndexReader of the index being merged was opened before the merge started, so its document 5 is still merged to disk. The only effect of the deletion is that subsequent queries no longer see document 5 in the index being merged.
  • Then, add document 5 to the delete list, which is combined with the delete list of the index being merged to form a FilterIndexReader.
  • Add the new document 5 to the in-memory index.
  • Committing the deletion of document 5 from the index being merged, adding document 5 to the delete list, and committing the addition of document 5 to the in-memory index should be one atomic operation. Fortunately, all three are also fast.

3.5 Reopening the IndexReader of the on-disk index

Once the in-memory index has been merged to disk, the on-disk index is reopened. The newly opened IndexReader sees the deletion of document 5.

If a new update arrives at this point, it is likewise added to the in-memory index and the delete list. For example, suppose document 6 is updated.

3.6 Replacing the IndexReader

After the IndexReader has been reopened, discard the merged in-memory index and its delete list, close the original IndexReader of the on-disk index, and switch to the new IndexReader.
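One possible way to make this switch safe for concurrent searches is to publish the reader and its delete list together as one immutable view and swap that view atomically. The sketch below assumes this design; ReaderSwapSketch and View are hypothetical names, and java.io.Closeable merely stands in for the on-disk IndexReader.

```java
import java.io.Closeable;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for the view that searches run against.
class ReaderSwapSketch {
    // Immutable pair: the on-disk reader and the delete list that masks it.
    static final class View {
        final Closeable diskReader;     // stand-in for the on-disk IndexReader
        final Set<Integer> deleteList;

        View(Closeable diskReader, Set<Integer> deleteList) {
            this.diskReader = diskReader;
            this.deleteList = Collections.unmodifiableSet(new HashSet<>(deleteList));
        }
    }

    private final AtomicReference<View> current;

    ReaderSwapSketch(View initial) {
        this.current = new AtomicReference<>(initial);
    }

    // Each search grabs the current view once and uses it until it finishes.
    View acquire() {
        return current.get();
    }

    // Publishes the new view and hands back the old one. The caller closes
    // the old reader once searches still using it have finished, for example
    // by tracking them with reference counting.
    View replace(View fresh) {
        return current.getAndSet(fresh);
    }
}
```

A search that acquired the old view before the swap keeps a consistent reader and delete list; searches that start afterwards see only the new pair.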
