Lucene Index Creation process


  • One, Lucene build index API
  • Two, create IndexWriter
  • Three, create document
  • Four, add document
    • 1 Lucene Usage Scenarios
    • 2 Important Basic Classes
      • 2.1 DocumentsWriterPerThreadPool
      • 2.2 ThreadState
      • 2.3 DocumentsWriterPerThread
      • 2.4 DocumentsWriterFlushControl
      • 2.5 FlushPolicy
    • 3 docWriter.updateDocument
    • 4 docWriter.updateDocument Detailed steps
    • 5 DocumentsWriterPerThread.updateDocument Detailed steps
    • 6 DefaultIndexingChain.processDocument Detailed steps
  • Five, Commit Document
  • Six, close IndexWriter

The purpose of this document is to analyze how Lucene writes business data to disk. It does not cover how each field of a document is stored (that part is presented in a separate wiki).

One, Lucene build index API 
Directory dir = NIOFSDirectory.open(FileSystems.getDefault().getPath(indexDirectory));
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setRAMBufferSizeMB(64); // flush to disk once the RAM buffer exceeds 64 MB (the default is 16 MB)
IndexWriter indexWriter = new IndexWriter(dir, iwc);
Document doc = createDocument(article, skuId);
indexWriter.addDocument(doc);
indexWriter.commit();
indexWriter.close();
Two, create IndexWriter
NIOFSDirectory.open()

On a 64-bit JRE, open() returns an MMapDirectory (which accesses the index files in memory-mapped form).

 
// IndexWriterConfig default properties
this.analyzer = analyzer;
ramBufferSizeMB = IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB;        // default 16 MB; exceeding it triggers a flush to disk
maxBufferedDocs = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS;         // by default flushing is triggered by RAM size, not doc count
maxBufferedDeleteTerms = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DELETE_TERMS;
mergedSegmentWarmer = null;
delPolicy = new KeepOnlyLastCommitDeletionPolicy();                    // deletion policy
commit = null;
useCompoundFile = IndexWriterConfig.DEFAULT_USE_COMPOUND_FILE_SYSTEM;
openMode = OpenMode.CREATE_OR_APPEND;                                  // IndexWriter open mode
similarity = IndexSearcher.getDefaultSimilarity();                     // similarity; generally used when initializing a searcher (only queries need similarity)
mergeScheduler = new ConcurrentMergeScheduler();                       // each segment merge is completed by a thread
writeLockTimeout = IndexWriterConfig.WRITE_LOCK_TIMEOUT;               // timeout for a write operation that hits the lock
indexingChain = DocumentsWriterPerThread.defaultIndexingChain;
codec = Codec.getDefault();
if (codec == null) {
  throw new NullPointerException();
}
infoStream = InfoStream.getDefault();
mergePolicy = new TieredMergePolicy();                                 // merge policy
flushPolicy = new FlushByRamOrCountsPolicy();                          // flush policy
readerPooling = IndexWriterConfig.DEFAULT_READER_POOLING;
indexerThreadPool = new DocumentsWriterPerThreadPool(IndexWriterConfig.DEFAULT_MAX_THREAD_STATES); // pool for concurrent index-writing threads
perThreadHardLimitMB = IndexWriterConfig.DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB;

IndexWriter accepts a rich variety of configuration options through IndexWriterConfig.
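
For illustration, here is a minimal sketch of setting a few of these options through the public setters (the values are arbitrary examples, not recommendations):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.TieredMergePolicy;

IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);  // create the index if absent, otherwise append
iwc.setRAMBufferSizeMB(64);                  // raise the flush trigger from the 16 MB default
iwc.setMergePolicy(new TieredMergePolicy()); // the default merge policy, set explicitly here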

Three, create document

This step is simple: it mainly assembles the business fields into a document. A document is made up of multiple fields.

Each field typically consists of four attributes (a minimal sketch follows this list):

    • Name: the name of the field
    • Value: the value of the field
    • Whether the value is stored: if it is stored in the index file, a search can read the field's value back from the matched document
    • Whether the value is indexed: if the field is indexed, documents can be retrieved by that field
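
As an illustration of these four attributes, here is a minimal sketch using Lucene's prepackaged field types (the field names and values are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
// indexed as a single token and stored: exact lookup by skuId, value readable from results
doc.add(new StringField("skuId", "12345", Store.YES));
// tokenized, indexed and stored: full-text searchable title
doc.add(new TextField("title", "Lucene in Action", Store.YES));
// stored only, not indexed: can be read back from the document but not searched on
doc.add(new StoredField("price", 59.99));
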
Four, add document

Adding a document actually calls updateDocument. Unlike MySQL, Lucene cannot update a record in place, so the only option is to delete the old document and then add the new one. The term parameter below is a query condition: documents that satisfy it are the ones to be updated.

public void updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
  ensureOpen();
  try {
    boolean success = false;
    try {
      if (docWriter.updateDocument(doc, analyzer, term)) {
        processEvents(true, false);
      }
      success = true;
    } finally {
      if (!success) {
        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "hit exception updating document");
        }
      }
    }
  } catch (AbortingException | OutOfMemoryError tragedy) {
    tragicEvent(tragedy, "updateDocument");
  }
}
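
As a usage note, a minimal sketch of calling it (the term's field and value, and newDoc, are made up):

import org.apache.lucene.index.Term;

// replace whatever document(s) currently match skuId == 12345 with the new one
indexWriter.updateDocument(new Term("skuId", "12345"), newDoc);
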
1 Lucene Usage Scenarios

First, a few points to explain why Lucene cannot update a document in place:

    1. Lucene is by design a retrieval-oriented, or read-oriented, system. To serve retrieval, its index format makes many read-optimized storage choices; in short, it trades away easy writes and updates for read performance.
    2. This design background implies that Lucene is suited to (good at) scenarios that read frequently and write infrequently.

So adding a document, as above, ultimately becomes updating a document, and updateDocument consists of two serial operations (a minimal manual equivalent follows):

(1) Search first: if any document satisfies the condition, delete it.

(2) Then add the new document directly to memory.
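
Conceptually this is the same as the manual sequence below, except that updateDocument performs both steps atomically while separate calls do not (same made-up term and newDoc as above):

indexWriter.deleteDocuments(new Term("skuId", "12345")); // (1) delete documents matching the condition, if any
indexWriter.addDocument(newDoc);                         // (2) add the new document to the in-memory buffer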

2 Important Basic Classes

Before looking at the docWriter.updateDocument(doc, analyzer, term) code, let's look at a few of Lucene's built-in classes, with emphasis on the following:

2.1 DocumentsWriterPerThreadPool
Lucene internally implements a pool of DocumentsWriterPerThread instances (not a thread pool in the strict sense), mainly to reuse DocumentsWriterPerThread (more precisely, to reuse ThreadState). The class can loosely be understood as a thread pool.
2.2 ThreadState 
/** {@link ThreadState} references and guards a
 *  {@link DocumentsWriterPerThread} instance that is used during
 *  indexing to build an in-memory index segment. */
final static class ThreadState extends ReentrantLock {
  DocumentsWriterPerThread dwpt;
  // TODO this should really be part of the DocumentsWriterFlushControl
  // write access guarded by DocumentsWriterFlushControl
  volatile boolean flushPending = false;
  // TODO this should really be part of the DocumentsWriterFlushControl
  // write access guarded by DocumentsWriterFlushControl
  long bytesUsed = 0;
  // guarded by the reentrant lock
  private boolean isActive = true;

  ThreadState(DocumentsWriterPerThread dpwt) {
    this.dwpt = dpwt;
  }
}
In essence it is a reentrant lock, used together with a DocumentsWriterPerThread to complete the write of a document.
2.3 DocumentsWriterPerThread

It can be understood simply as a thread that writes documents. The pool guarantees reuse of DocumentsWriterPerThread instances.

2.4 DocumentsWriterFlushControl

Controls when a DocumentsWriterPerThread performs its flush during indexing.

2.5 FlushPolicy

The flush policy.

With this, the ThreadState class should be easy to understand; it can even be viewed directly as a write thread guarded by a lock. Internally, a ThreadState references a DocumentsWriterPerThread instance. When the pool is initialized, 8 ThreadStates are created (at this point no DocumentsWriterPerThread is created yet; initialization of the actual per-thread writer is deferred). Afterwards the pool tries to reuse these 8 ThreadStates.

 
DocumentsWriterPerThreadPool(int maxNumThreadStates) { // default maxNumThreadStates = 8
  if (maxNumThreadStates < 1) {
    throw new IllegalArgumentException("maxNumThreadStates must be >= 1 but was: " + maxNumThreadStates);
  }
  threadStates = new ThreadState[maxNumThreadStates];
  numThreadStatesActive = 0;
  for (int i = 0; i < threadStates.length; i++) {
    threadStates[i] = new ThreadState(null);
  }
  freeList = new ThreadState[maxNumThreadStates];
}
3 docWriter.updateDocument

Well, after reviewing these basic classes, back to updateDocument above; the most critical line is this one.

 
// docWriter.updateDocument(doc, analyzer, term)
boolean updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
                       final Term delTerm) throws IOException, AbortingException {
  boolean hasEvents = preUpdate();

  final ThreadState perThread = flushControl.obtainAndLock();
  final DocumentsWriterPerThread flushingDWPT;

  try {
    if (!perThread.isActive()) {
      ensureOpen();
      assert false : "perThread is not active but we are still open";
    }
    ensureInitialized(perThread); // true initialization of the thread-specific DocumentsWriterPerThread
    assert perThread.isInitialized();
    final DocumentsWriterPerThread dwpt = perThread.dwpt;
    final int dwptNumDocs = dwpt.getNumDocsInRAM();
    try {
      dwpt.updateDocument(doc, analyzer, delTerm); // the DocumentsWriterPerThread actually updates the document
    } catch (AbortingException ae) {
      flushControl.doOnAbort(perThread);
      dwpt.abort();
      throw ae;
    } finally {
      // We don't know whether the document actually
      // counted as being indexed, so we must subtract here to
      // accumulate our separate counter:
      numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
    }
    final boolean isUpdate = delTerm != null;
    flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);
  } finally {
    perThreadPool.release(perThread); // put the ThreadState back into the pool, releasing the resource
  }

  return postUpdate(flushingDWPT, hasEvents);
}
4 docWriter.updateDocument Detailed steps
  1. Get a ThreadState

    ThreadState obtainAndLock() {
      final ThreadState perThread = perThreadPool.getAndLock(Thread.currentThread(), documentsWriter); // fetch a ThreadState from the pool
      boolean success = false;
      try {
        if (perThread.isInitialized() && perThread.dwpt.deleteQueue != documentsWriter.deleteQueue) {
          // There is a flush-all in process and this DWPT is
          // now stale -- enroll it for flush and try for
          // another DWPT:
          addFlushableState(perThread);
        }
        success = true;
        // simply return the ThreadState even in a flush-all case since we already hold the lock
        return perThread;
      } finally {
        if (!success) { // make sure we unlock if this fails
          perThreadPool.release(perThread);
        }
      }
    }
  2. Initialize the ThreadState's DocumentsWriterPerThread
  3. The thread updates the document
  4. The ThreadState is returned to the pool. The pool maintains a freeList, and reusable ThreadStates are placed on it (a minimal sketch of this obtain/release pattern follows).
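
To make the obtain/release pattern concrete, here is a minimal standalone sketch (not Lucene's code) of a pool whose entries are themselves locks, in the spirit of ThreadState and DocumentsWriterPerThreadPool:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Each pooled state is a lock, like Lucene's ThreadState extends ReentrantLock.
class PooledState extends ReentrantLock {
  Object writer; // in Lucene this would be the lazily created DocumentsWriterPerThread
}

class StatePool {
  private final Deque<PooledState> freeList = new ArrayDeque<>();

  // Hand out a free state (states on the free list are always unlocked),
  // growing the pool lazily, and lock it for the calling thread.
  synchronized PooledState obtainAndLock() {
    PooledState state = freeList.pollFirst();
    if (state == null) {
      state = new PooledState();
    }
    state.lock(); // held while the caller indexes a document
    return state;
  }

  // Unlock the state and put it back on the free list for reuse.
  synchronized void release(PooledState state) {
    state.unlock();
    freeList.addFirst(state);
  }
}
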
5 DocumentsWriterPerThread.updateDocument Detailed steps

After the update of the document is handed to a DocumentsWriterPerThread, we keep looking down.

 
public void updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm)
    throws IOException, AbortingException {
  testPoint("DocumentsWriterPerThread addDocument start");
  assert deleteQueue != null;
  reserveOneDoc();
  docState.doc = doc;
  docState.analyzer = analyzer;
  docState.docID = numDocsInRAM;
  if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
    infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm
        + " docID=" + docState.docID + " seg=" + segmentInfo.name);
  }
  // Even on exception, the document is still added (but marked
  // deleted), so we don't need to un-reserve at that point.
  // Aborting exceptions will actually "lose" more than one
  // document, so the counter will be "wrong" in this case, but
  // it's very hard to fix (we can't easily distinguish aborting
  // vs non-aborting exceptions):
  boolean success = false;
  try {
    try {
      consumer.processDocument();
    } finally {
      docState.clear();
    }
    success = true;
  } finally {
    if (!success) {
      // mark document as deleted
      deleteDocID(docState.docID);
      numDocsInRAM++;
    }
  }
  finishDocument(delTerm);
}

Inside this method, we only care about one line of code.

consumer.processDocument();

From here it is almost clear: all final document processing is handed over to a DocConsumer. And this DocConsumer is obtained as follows:

abstract DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) throws IOException;

Lucene implements a default DocConsumer: DefaultIndexingChain. The next step is to see how this DocConsumer handles the document.

6 DefaultIndexingChain.processDocument Detailed steps
@Override
public void processDocument() throws IOException, AbortingException {
  // How many indexed field names we've seen (collapses
  // multiple field instances by the same name):
  int fieldCount = 0;

  long fieldGen = nextFieldGen++;

  // NOTE: we need two passes here, in case there are
  // multi-valued fields, because we must process all
  // instances of a given field at once, since the
  // analyzer is free to reuse TokenStream across fields
  // (i.e., we cannot have more than one TokenStream
  // running "at once"):

  termsHash.startDocument();

  fillStoredFields(docState.docID);
  startStoredFields();

  boolean aborting = false;
  try {
    for (IndexableField field : docState.doc) {
      // walk through each field and process it -- haha, finally the lovely tail is revealed
      fieldCount = processField(field, fieldGen, fieldCount);
    }
  } catch (AbortingException ae) {
    aborting = true;
    throw ae;
  } finally {
    if (aborting == false) {
      // Finish each indexed field name seen in the document:
      for (int i = 0; i < fieldCount; i++) {
        fields[i].finish();
      }
      finishStoredFields();
    }
  }

  try {
    termsHash.finishDocument();
  } catch (Throwable th) {
    // Must abort, on the possibility that on-disk term
    // vectors are now corrupt:
    throw AbortingException.wrap(th);
  }
}

Seeing the code above, I laughed. It gets clearer and clearer, doesn't it? Processing a document is nothing more than iterating over each field and processing it. What exactly is done with each field is not covered in this wiki; it is recorded in depth in another one (see: Document storage details).

Five, Commit Document
indexWriter.commit();

The commit does the following:

    • All pending changes are committed to the index: newly added documents, documents deleted from segments, and completed segment merges.
    • The operation performs a directory.sync, which flushes the file system's cache to disk. Although sync is time-consuming, once the data has been flushed to disk these pending updates survive even if the VM crashes (or the machine loses power).

A concrete explanation of the sync operation is given in the following passage:

Traditional UNIX implementations have a buffer cache or page cache in the kernel, and most disk I/O goes through these buffers. When writing data to a file, the kernel usually copies the data into one of the buffers; if the buffer is not yet full, it is not queued for output, but waits until it fills up, or until the kernel needs to reuse the buffer to hold other disk-block data, at which point the buffer is queued to the output queue and the actual I/O is performed when it reaches the head of the queue. This kind of output is called delayed write (Bach [1986] discusses the buffer cache in detail in Chapter 3). Delayed write reduces the number of disk reads and writes, but it also delays file-content updates, so data written to a file may not reach the disk for some time. If the system fails during that window, the file updates may be lost.

To ensure consistency between the actual file system on disk and the contents of the buffer cache, UNIX systems provide the sync, fsync, and fdatasync functions. The sync function simply queues all modified block buffers for writing and then returns; it does not wait for the actual disk writes to finish. A system daemon, commonly called update, periodically calls sync (typically every 30 seconds), which guarantees that the kernel's block buffers are flushed regularly; the sync(1) command also calls the sync function. The fsync function works only on the single file specified by the file descriptor, and waits for the disk writes to finish before returning; it can be used in applications such as databases that need to ensure modified blocks are written to disk immediately. The fdatasync function is similar to fsync, but it affects only the data portion of the file, whereas fsync also synchronizes the file's updated attributes. For a database with transactional support, when a transaction commits, the transaction log (containing the modifications and a commit record) must be fully written to disk before the commit returns to the application layer.

After reading this explanation: the sync operation flushes cached data from the file system (and even the kernel) to disk, ensuring data safety (if the OS crashes or power is lost, the data is not lost).
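
For reference, the Java-level analog of fsync/fdatasync is FileChannel.force; here is a minimal sketch (the file name is made up):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FsyncDemo {
  public static void main(String[] args) throws IOException {
    try (FileChannel ch = FileChannel.open(Paths.get("commit.log"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap("commit record".getBytes(StandardCharsets.UTF_8)));
      ch.force(true);  // like fsync: flush both data and metadata to the device
      // ch.force(false) would flush only the data, like fdatasync
    }
  }
}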

So what exactly does Lucene do here?

private final void commitInternal(MergePolicy mergePolicy) throws IOException {
  if (infoStream.isEnabled("IW")) {
    infoStream.message("IW", "commit: start");
  }

  synchronized (commitLock) {
    ensureOpen(false);

    if (infoStream.isEnabled("IW")) {
      infoStream.message("IW", "commit: enter lock");
    }

    if (pendingCommit == null) {
      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "commit: now prepare");
      }
      prepareCommitInternal(mergePolicy); // the most critical line
    } else {
      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "commit: already prepared");
      }
    }

    finishCommit();
  }
}

Going into prepareCommitInternal leads to the detailed flush operation; the index flush is described in another wiki.
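
As a usage note, the prepare step is also exposed on IndexWriter's public API as prepareCommit(), so a commit can be split into two explicit phases; a minimal sketch:

try {
  indexWriter.prepareCommit(); // phase 1: flush and directory.sync; changes are durable but not yet visible
  indexWriter.commit();        // phase 2: finish the commit, making the changes visible to new readers
} catch (IOException e) {
  indexWriter.rollback();      // abandon the prepared (and any uncommitted) changes; also closes the writer
}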

Six, close IndexWriter

Closing flushes the remaining data and releases resources. The internal logic is quite rich; we will come back to this part after covering flush.
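
As a usage note, IndexWriter and Directory are both Closeable, so the whole lifecycle from section one can be written with try-with-resources; a minimal sketch (indexDirectory, createDocument, article and skuId are the same hypothetical helpers as in section one):

import java.nio.file.FileSystems;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;

try (Directory dir = NIOFSDirectory.open(FileSystems.getDefault().getPath(indexDirectory));
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
  writer.addDocument(createDocument(article, skuId));
  writer.commit();
} // close() flushes any remaining buffered documents and releases the index write lock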
