DocumentsWriter of Lucene


Lucene adds a document to the DocumentsWriter through the following call hierarchy:

DocumentsWriter.updateDocument(Document doc, Analyzer analyzer, Term delTerm)
--(1) DocumentsWriterThreadState state = getThreadState(doc, delTerm);
--(2) DocWriter perDoc = state.consumer.processDocument();
--(3) finishDocument(state, perDoc);
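As a minimal sketch of this three-step flow (all classes below are simplified stand-ins, not Lucene's real implementation), the thread-state lookup and in-order document ID assignment can be modeled like this:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for DocumentsWriter's three-step updateDocument flow.
// Every type here is an illustrative skeleton, not Lucene's actual class.
public class UpdateDocumentSketch {
    static class Document {}
    static class ThreadState {
        final long threadId;
        ThreadState(long id) { threadId = id; }
    }

    private final Map<Long, ThreadState> threadBindings = new HashMap<>();
    private int nextDocID = 0;

    // (1) Bind the calling thread to its per-thread state. Synchronized,
    //     because nextDocID must be handed out in add order.
    synchronized ThreadState getThreadState() {
        return threadBindings.computeIfAbsent(
                Thread.currentThread().getId(), ThreadState::new);
    }

    // Models steps (2) and (3); returns the docID assigned to this document.
    public synchronized int updateDocument(Document doc) {
        ThreadState state = getThreadState();
        int docID = nextDocID++;  // IDs are assigned in the order documents are added
        // ... here state.consumer.processDocument() would invert the fields ...
        // ... and finishDocument(state, perDoc) would hand off the result ...
        return docID;
    }

    public static void main(String[] args) {
        UpdateDocumentSketch dw = new UpdateDocumentSketch();
        int a = dw.updateDocument(new Document());
        int b = dw.updateDocument(new Document());
        System.out.println(a + " " + b); // IDs come out in add order: 0 1
    }
}
```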

The DocumentsWriter object mainly consists of the following parts:

* Members used for writing index files
IndexWriter writer;
Directory directory;
Similarity similarity: the Similarity used when computing scoring norms.
String segment: the name of the current segment; whenever a flush happens, the buffered index is written to a segment with this name.
IndexWriter.doFlushInternal()
--String segment = docWriter.getSegment(); // returns segment
--newSegment = new SegmentInfo(segment, ...);
--docWriter.createCompoundFile(segment); // creates a CFS file based on segment

String docStoreSegment: the target segment to which the stored fields are written (described in detail in the article on the index file format).
int docStoreOffset: the offset of the stored fields within the target segment.

int nextDocID: the ID of the next document added to this index; this variable is unique within one index folder and is accessed with synchronization.
DocConsumer consumer: this is the core of the whole indexing process and the head of the entire index chain.

Basic index chain:
The indexing of a document is not done by a single object, but by a chain formed from a group of cooperating objects; each object on the chain performs only part of the indexing work. This chain is called the index chain; because there are other index chains as well, I call this one the basic index chain.

The DocConsumer consumer is of type DocFieldProcessor, which is the head of the entire index chain and contains the following parts:
* Processing of the indexed fields:
The DocFieldConsumer consumer is of type DocInverter and contains the following parts:
The InvertedDocConsumer consumer is of type TermsHash and contains the following parts:
The TermsHashConsumer consumer is of type FreqProxTermsWriter, responsible for writing the frq and prx information
TermsHash nextTermsHash, whose TermsHashConsumer consumer is of type TermVectorsTermsWriter, responsible for writing the tvd and tvf information
The InvertedDocEndConsumer endConsumer is of type NormsWriter, responsible for writing the nrm information
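The nesting above can be pictured as a chain of consumers, each doing its part and delegating the rest downstream. The following is an illustrative sketch of that idea with invented bodies (the class names follow the text, but the real Lucene classes do far more than append to a log):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the "index chain": each consumer does part of the
// indexing work and delegates to the next. Bodies are invented stand-ins.
public class IndexChainSketch {
    interface DocConsumer { void processDocument(List<String> log); }

    static class FreqProxTermsWriter implements DocConsumer {
        public void processDocument(List<String> log) { log.add("frq/prx"); }
    }
    static class TermVectorsTermsWriter implements DocConsumer {
        public void processDocument(List<String> log) { log.add("tvd/tvf"); }
    }
    static class TermsHash implements DocConsumer {
        final DocConsumer consumer; final DocConsumer next;
        TermsHash(DocConsumer consumer, DocConsumer next) {
            this.consumer = consumer; this.next = next;
        }
        public void processDocument(List<String> log) {
            consumer.processDocument(log);               // postings
            if (next != null) next.processDocument(log); // then term vectors
        }
    }
    static class NormsWriter implements DocConsumer {
        public void processDocument(List<String> log) { log.add("nrm"); }
    }
    static class DocInverter implements DocConsumer {
        final DocConsumer consumer = new TermsHash(new FreqProxTermsWriter(),
                new TermsHash(new TermVectorsTermsWriter(), null));
        final DocConsumer endConsumer = new NormsWriter();
        public void processDocument(List<String> log) {
            consumer.processDocument(log);
            endConsumer.processDocument(log);
        }
    }

    public static List<String> run() {
        List<String> log = new ArrayList<>();
        new DocInverter().processDocument(log); // head of the (simplified) chain
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [frq/prx, tvd/tvf, nrm]
    }
}
```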

* Processing of the stored fields:
FieldInfos fieldInfos = new FieldInfos();
StoredFieldsWriter fieldsWriter: responsible for writing the fnm, fdt, and fdx information
* Deletion of documents:
BufferedDeletes deletesInRAM = new BufferedDeletes();
BufferedDeletes deletesFlushed = new BufferedDeletes();

The class BufferedDeletes contains the following member variables:
* HashMap terms = new HashMap(); the deleted terms
* HashMap queries = new HashMap(); the deleted queries
* List docIDs = new ArrayList(); the IDs of the deleted documents
* long bytesUsed: used to decide when the deleted documents should be written to the index files.

This shows that there are three main ways to delete documents:
* IndexWriter.deleteDocuments(Term term): all documents containing this term are deleted.
* IndexWriter.deleteDocuments(Query query): all documents satisfying this query are deleted.
* IndexReader.deleteDocument(int docNum): the document with this ID is deleted.
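To make the buffering concrete, here is a BufferedDeletes-like sketch holding the three kinds of buffered deletions. The field names follow the text; the bookkeeping (fixed per-entry byte costs, String stand-ins for Term and Query) is invented for the example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Illustrative BufferedDeletes-like structure; not Lucene's real class.
// The per-entry byte costs are made-up constants for demonstration.
public class BufferedDeletesSketch {
    final HashMap<String, Integer> terms = new HashMap<>();   // deleted terms
    final HashMap<String, Integer> queries = new HashMap<>(); // deleted queries
    final List<Integer> docIDs = new ArrayList<>();           // deleted doc IDs
    long bytesUsed = 0; // drives the decision to write deletions to the index files

    void addTerm(String term, int docIDUpto)   { terms.put(term, docIDUpto);    bytesUsed += 32; }
    void addQuery(String query, int docIDUpto) { queries.put(query, docIDUpto); bytesUsed += 32; }
    void addDocID(int docID)                   { docIDs.add(docID);             bytesUsed += 4;  }

    int size() { return terms.size() + queries.size() + docIDs.size(); }

    public static void main(String[] args) {
        BufferedDeletesSketch d = new BufferedDeletesSketch();
        d.addTerm("body:lucene", 10);  // delete-by-term
        d.addQuery("title:old", 10);   // delete-by-query
        d.addDocID(3);                 // delete-by-docID
        System.out.println(d.size() + " buffered deletions, " + d.bytesUsed + " bytes");
    }
}
```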

A document can be deleted either through the reader or through the writer. A deletion made through the reader takes effect immediately, while a deletion made through the writer is first cached in deletesInRAM and deletesFlushed; only after it has been written to the index files does it become visible when the reader is opened.

What are deletesInRAM and deletesFlushed for?
In this version of Lucene, document deletion is multi-threaded. When documents are deleted through IndexWriter, the deletions are cached in deletesInRAM until the next flush writes them to the index files. We know that a flush takes some time; what happens when another thread deletes documents while the flush is in progress?

The general process is as follows: at flush time, the synchronized method pushDeletes first moves everything in deletesInRAM into deletesFlushed, then clears deletesInRAM and exits the synchronized method. The flushing thread then writes deletesFlushed to the index files, while other threads add newly deleted documents to the fresh deletesInRAM, to be written to the index files at the next flush.
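This hand-off can be modeled in a few lines. The sketch below (a simplified stand-in, with a String payload in place of Lucene's buffered terms, queries, and docIDs) shows why the scheme works: the synchronized move is quick, so the lock is held only briefly, and the slow flush proceeds outside it while other threads keep buffering:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the pushDeletes hand-off; not Lucene's real code.
public class PushDeletesSketch {
    final List<String> deletesInRAM = new ArrayList<>();
    final List<String> deletesFlushed = new ArrayList<>();

    // Other threads buffer deletions here at any time.
    synchronized void bufferDelete(String term) { deletesInRAM.add(term); }

    // Called at the start of a flush; cheap, so the lock is held only briefly.
    synchronized void pushDeletes() {
        deletesFlushed.addAll(deletesInRAM);
        deletesInRAM.clear();
    }

    public static void main(String[] args) {
        PushDeletesSketch dw = new PushDeletesSketch();
        dw.bufferDelete("a");
        dw.bufferDelete("b");
        dw.pushDeletes();        // flush begins: a and b now belong to deletesFlushed
        dw.bufferDelete("c");    // another thread deletes during the (slow) flush
        System.out.println(dw.deletesFlushed + " " + dw.deletesInRAM); // [a, b] [c]
    }
}
```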

* Cache management
To speed up indexing, Lucene caches a lot of data before writing it to disk. The cache, however, needs to be managed: when is memory allocated, when is it reclaimed, and when is it written to disk?
ArrayList freeCharBlocks = new ArrayList(); free blocks to be used for caching term information
ArrayList freeByteBlocks = new ArrayList(); free blocks to be used for caching document IDs, term frequency (freq), and position (prox) information
ArrayList freeIntBlocks = new ArrayList(); free blocks storing the offsets, within the byteBlocks, of each term's frequency (freq) and position (prox) information
boolean bufferIsFull; indicates whether the cache is full; if it is, it should be written to disk
long numBytesAlloc; the amount of memory allocated
long numBytesUsed; the amount of memory used
long freeTrigger; the memory usage at which memory should start to be reclaimed
long freeLevel; the level down to which reclaimed memory should drop
long ramBufferSize; the user-specified memory usage

The relationships between these cache settings are as follows:
DocumentsWriter.setRAMBufferSizeMB(double mb) {
  ramBufferSize = (long) (mb*1024*1024); // the user-specified memory usage; when more memory than this is used, writing to disk begins
  waitQueuePauseBytes = (long) (ramBufferSize*0.1);
  waitQueueResumeBytes = (long) (ramBufferSize*0.05);
  freeTrigger = (long) (1.05 * ramBufferSize); // when the allocated memory reaches 105%, start releasing memory from the freeBlocks
  freeLevel = (long) (0.95 * ramBufferSize);   // release down to 95%
}
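Reproducing that arithmetic as a standalone helper makes the ordering of the thresholds easy to check (the 16 MB value below is just an example buffer size):

```java
// Standalone reproduction of the thresholds derived from the user-specified
// buffer size in setRAMBufferSizeMB above; for illustration only.
public class RamThresholds {
    long ramBufferSize, waitQueuePauseBytes, waitQueueResumeBytes, freeTrigger, freeLevel;

    RamThresholds(double mb) {
        ramBufferSize = (long) (mb * 1024 * 1024);
        waitQueuePauseBytes = (long) (ramBufferSize * 0.1);   // pause adding when waitQueue holds > 10%
        waitQueueResumeBytes = (long) (ramBufferSize * 0.05); // resume when it drops below 5%
        freeTrigger = (long) (1.05 * ramBufferSize);          // start reclaiming at 105%
        freeLevel = (long) (0.95 * ramBufferSize);            // reclaim down to 95%
    }

    public static void main(String[] args) {
        RamThresholds t = new RamThresholds(16.0); // e.g. a 16 MB buffer
        System.out.println("trigger=" + t.freeTrigger + " level=" + t.freeLevel);
    }
}
```

Note the hysteresis in both pairs: reclaiming starts above the budget (105%) and stops below it (95%), and the wait queue pauses at 10% but only resumes at 5%, which prevents rapid flip-flopping around a single threshold.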

DocumentsWriter.balanceRAM() {
  if (numBytesAlloc + deletesRAMUsed > freeTrigger) {
    // When the allocated memory plus the memory occupied by deleted documents exceeds 105%, start freeing memory
    while (numBytesAlloc + deletesRAMUsed > freeLevel) {
      // Keep releasing free blocks until usage drops to 95%
      byteBlockAllocator.freeByteBlocks.remove(byteBlockAllocator.freeByteBlocks.size() - 1);
      numBytesAlloc -= BYTE_BLOCK_SIZE;
      freeCharBlocks.remove(freeCharBlocks.size() - 1);
      numBytesAlloc -= CHAR_BLOCK_SIZE * CHAR_NUM_BYTE;
      freeIntBlocks.remove(freeIntBlocks.size() - 1);
      numBytesAlloc -= INT_BLOCK_SIZE * INT_NUM_BYTE;
    }
  } else {
    if (numBytesUsed + deletesRAMUsed > ramBufferSize) {
      // When the used memory plus the memory occupied by deleted documents exceeds the user-specified memory, write to disk
      bufferIsFull = true;
    }
  }
}

Writing to disk is triggered:
* when the memory used is greater than the user-specified memory: bufferIsFull = true
* when the memory used, plus the memory occupied by buffered deletions (both in RAM and being flushed), exceeds the user-specified memory:
  (deletesInRAM.bytesUsed + deletesFlushed.bytesUsed + numBytesUsed) >= ramBufferSize
* when the number of buffered deletions is greater than maxBufferedDeleteTerms

DocumentsWriter.timeToFlushDeletes() {
  return (bufferIsFull || deletesFull()) && setFlushPending();
}

DocumentsWriter.deletesFull() {
  return (ramBufferSize != IndexWriter.DISABLE_AUTO_FLUSH &&
      (deletesInRAM.bytesUsed + deletesFlushed.bytesUsed + numBytesUsed) >= ramBufferSize) ||
    (maxBufferedDeleteTerms != IndexWriter.DISABLE_AUTO_FLUSH &&
      ((deletesInRAM.size() + deletesFlushed.size()) >= maxBufferedDeleteTerms));
}
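The deletesFull predicate can be exercised in isolation. In this sketch the fields of DocumentsWriter/IndexWriter become parameters, and the DISABLE_AUTO_FLUSH sentinel is a local stand-in:

```java
// Illustrative standalone version of the deletesFull predicate above;
// the constant and parameters are local stand-ins for the real members.
public class FlushDecisionSketch {
    static final long DISABLE_AUTO_FLUSH = -1;

    static boolean deletesFull(long ramBufferSize, long maxBufferedDeleteTerms,
                               long deletesInRAMBytes, long deletesFlushedBytes,
                               long numBytesUsed, long numBufferedDeleteTerms) {
        return (ramBufferSize != DISABLE_AUTO_FLUSH
                    && deletesInRAMBytes + deletesFlushedBytes + numBytesUsed >= ramBufferSize)
            || (maxBufferedDeleteTerms != DISABLE_AUTO_FLUSH
                    && numBufferedDeleteTerms >= maxBufferedDeleteTerms);
    }

    public static void main(String[] args) {
        long ram = 16L * 1024 * 1024;
        // Under both limits: no flush needed yet.
        System.out.println(deletesFull(ram, 1000, 1024, 0, 1024, 10));  // false
        // Byte budget exceeded: flush.
        System.out.println(deletesFull(ram, 1000, ram, 0, 1024, 10));   // true
        // Deletion-count limit reached: flush.
        System.out.println(deletesFull(ram, 1000, 0, 0, 0, 1000));      // true
    }
}
```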

Multi-threaded concurrent indexing
To support multi-threaded concurrent indexing, there is a DocumentsWriterThreadState for each thread, and for each thread a per-thread index chain (xxxPerThread) is created, modeled on the index chain rooted at the DocConsumer consumer, so that documents can be processed concurrently.
documentsWriterThreadState[] threadStates = new DocumentsWriterThreadState[0];
HashMap threadBindings = new HashMap();
Although documents can be processed in parallel, writing them to the index files must be done serially; the serial writing code is in DocumentsWriter.finishDocument.
WaitQueue waitQueue = new WaitQueue();
long waitQueuePauseBytes;
long waitQueueResumeBytes;

In Lucene, documents are numbered in the order in which they are added, and nextDocID in DocumentsWriter records the ID of the next document to be added. To support multiple threads, a synchronized method must assign the document ID and increment nextDocID; this is done in the DocumentsWriter.getThreadState function.

Assigning the document ID is no problem, but from the article on the Lucene index file format we know that documents must be written to the index files in order of increasing ID. Document processing speeds differ, however: while one thread spends a long time processing a large document, another thread may have processed many small documents. These later small documents have larger IDs than the large document still being processed, so they cannot be written to the index files immediately; instead they are placed in the waitQueue and are written to the index files only once the large document has been processed.

The WaitQueue has a variable nextWriteDocID, the next ID that can be written to file. When a thread is given the large document with ID 4, nextWriteDocID is 4; even if the later small documents 5, 6, 7, 8, etc. have already been processed, the following code makes them wait:
WaitQueue.add() {
  if (doc.docID == nextWriteDocID) {
    ............
  } else {
    waiting[loc] = doc;
    waitingBytes += doc.sizeInBytes();
  }
  doPause();
}
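A miniature, single-threaded model makes the ordering behavior clear: a document is written only when its ID equals nextWriteDocID; otherwise it is parked, and each write drains every consecutive waiter. This is a sketch of the idea, not Lucene's real WaitQueue:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Miniature model of the waitQueue ordering logic; not Lucene's real class.
public class WaitQueueSketch {
    private final Map<Integer, String> waiting = new HashMap<>();
    private int nextWriteDocID = 0;
    final List<String> written = new ArrayList<>();

    void add(int docID, String doc) {
        if (docID == nextWriteDocID) {
            written.add(doc);
            nextWriteDocID++;
            String next;
            // Drain every document that was waiting in line behind this one.
            while ((next = waiting.remove(nextWriteDocID)) != null) {
                written.add(next);
                nextWriteDocID++;
            }
        } else {
            waiting.put(docID, doc); // out of order: park it (this uses memory)
        }
    }

    public static void main(String[] args) {
        WaitQueueSketch q = new WaitQueueSketch();
        q.add(1, "small-1");
        q.add(2, "small-2");  // both wait: doc 0 (the big one) is not finished yet
        q.add(0, "big-0");    // finishing doc 0 releases 1 and 2 as well
        System.out.println(q.written); // [big-0, small-1, small-2]
    }
}
```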

There is a problem, though: when the large document is very large and slow to process, the other thread may have processed many small documents that all sit in the waitQueue, occupying more and more memory, with the danger of running out of memory in the meantime. Therefore, inside finishDocument, the doPause() function called at the end of WaitQueue.add is checked:
DocumentsWriter.finishDocument() {
  doPause = waitQueue.add(docWriter);
  if (doPause)
    waitForWaitQueue();
  notifyAll(); // Lucene.Net uses System.Threading.Monitor.PulseAll(this) to notify each thread
}
WaitQueue.doPause() {
  return waitingBytes > waitQueuePauseBytes;
}

When waitingBytes is large enough (10% of the user-specified memory usage), doPause returns true, and the faster thread enters wait() instead of processing more documents, waiting for the slower thread to finish the large document.

When the slower thread finishes processing the large document, it calls notifyAll to wake up the waiting threads.
DocumentsWriter.waitForWaitQueue() {
  do {
    try {
      wait();
    } catch (InterruptedException ie) {
      throw new ThreadInterruptedException(ie);
    }
  } while (!waitQueue.doResume());
}

WaitQueue.doResume() {
  return waitingBytes <= waitQueueResumeBytes;
}

When waitingBytes is small enough (5% of the user-specified memory usage), doResume returns true, and the waiting thread stops waiting and can continue processing more documents.
* Some flag bits
* int maxFieldLength: the maximum number of terms within one field of a document that can be indexed.
* int maxBufferedDeleteTerms: the maximum number of deleted terms that can be cached; when there are more than this number, they must be written to file.
The process of adding a document thus consists of the following three sub-processes:

1. Get the document processing object (DocumentsWriterThreadState) corresponding to the current thread

2. Process the document with the obtained DocumentsWriterThreadState

3. Finish adding this document with DocumentsWriter.finishDocument
