Step by Step Learning Lucene with Me: Incremental Index Updates and NRT (Near-Real-Time) Queries

Source: Internet
Author: User

I have been working overtime these past two days and could not keep the blog updated. Please bear with me.

Sometimes, after we have built an index, the data source receives new or updated content, and we would like those changes to be searchable right away. This is what we call an incremental index. How can we meet this requirement? Lucene itself does not provide an incremental-index implementation out of the box.

The naive approach that first comes to mind is to delete the existing index entirely and rebuild it from scratch.

This works when the data source does not contain many rows. But when the data source is very large, scanning it takes a long time and rebuilding the index takes even longer. With these phases stacked on top of each other, queries issued during the rebuild will see missing data, which seriously hurts the user experience.

A more common way to implement incremental indexing is:

    • Use a timer to periodically read, from the data source, records that are newer than the existing index or that carry an "updated" flag.
    • Convert those records into Lucene Documents and index them (see the sketch after this list).
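
As an illustration of the second step, here is a minimal sketch of mapping a data record to a Lucene Document and writing it with an IndexWriter. The XXX bean and its field names (id, name, createTime) are assumptions made for this example, not taken from the original data model:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class XxxDocumentMapper {

    /** Map one (assumed) XXX bean to a Lucene Document; field names are illustrative. */
    public static Document toDocument(XXX bean) {
        Document doc = new Document();
        // store the primary key so the record can later be found and removed from the RAM index
        doc.add(new StringField("id", String.valueOf(bean.getId()), Field.Store.YES));
        doc.add(new TextField("name", bean.getName(), Field.Store.YES));
        doc.add(new StringField("createTime", bean.getCreateTime(), Field.Store.YES));
        return doc;
    }

    /** Index a batch of beans and commit so the changes become visible to newly opened readers. */
    public static void indexAll(IndexWriter writer, List<XXX> beans) throws IOException {
        for (XXX bean : beans) {
            writer.addDocument(toDocument(bean));
        }
        writer.commit();
    }
}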

Compared with deleting and rebuilding the whole index as described above, this approach has the following advantages:

    • Queries against the data source only scan a small amount of data.
    • Only a small number of records need to be written to the index, which avoids many time-consuming IndexWriter commit and close operations.

This solves the incremental-update problem, but the real-time problem remains:

    • Index changes only become visible to searches after IndexWriter commit has run (see the sketch below).
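
A minimal sketch of that visibility rule; this is an assumption-level demo that uses RAMDirectory and StandardAnalyzer for brevity rather than the IKAnalyzer used later in the article. A reader opened before commit() only sees previously committed documents:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;

public class CommitVisibilityDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        writer.addDocument(doc("1"));
        writer.commit(); // the first commit also creates the initial segments file

        writer.addDocument(doc("2")); // not committed yet
        DirectoryReader before = DirectoryReader.open(dir);
        System.out.println(before.numDocs()); // 1 -- the uncommitted document is invisible

        writer.commit();
        DirectoryReader after = DirectoryReader.open(dir);
        System.out.println(after.numDocs()); // 2 -- visible after commit

        before.close();
        after.close();
        writer.close();
        dir.close();
    }

    private static Document doc(String id) {
        Document d = new Document();
        d.add(new StringField("id", id, Field.Store.YES));
        return d;
    }
}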

So how do we improve real-time behavior? As we know, a Lucene index can live either on disk (a file index) or in memory (a RAM index). Compared with a file index, a memory index is more efficient because it avoids frequent I/O. Combining the considerations above, we implement Lucene incremental updates with a file index plus a memory index. The mechanism works as follows:

    • A scheduled task scans the data source for changes.
    • The records obtained are written into the memory index.
    • When the number of documents in memory reaches a limit, documents are dequeued from the memory index and added to the file index.
    • Queries search the file index and the memory index together (a federated query) to achieve the NRT effect.

Timed Task Scheduler

Java has a built-in TimerTask class that can run timed tasks, but TimerTask jobs are stateless, and here we need the executions to run serially rather than in parallel. The Quartz task scheduling framework provides a stateful job type, StatefulJob: the next execution will not start while the previous one is still running.

A common way to start a Quartz job is as follows:

Date runTime = DateBuilder.evenSecondDate(new Date());
StdSchedulerFactory sf = new StdSchedulerFactory();
Scheduler scheduler = sf.getScheduler();
JobDetail job = JobBuilder.newJob(XXX.class).build();
Trigger trigger = TriggerBuilder.newTrigger()
        .startAt(runTime)
        .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                .withIntervalInSeconds(3)
                .repeatForever())
        .forJob(job)
        .build();
scheduler.scheduleJob(job, trigger);
scheduler.start();

The code above schedules the task to run every three seconds; the job class is XXX.

Common methods for task classes

Here I define a parent class for XXX, whose definition is as follows:

package com.chechong.lucene.indexcreasement;

import java.util.List;

import org.apache.lucene.store.RAMDirectory;
import org.quartz.StatefulJob;

/**
 * Stateful job: executions run serially, i.e. the next run is not allowed to start
 * while the previous run has not finished. If parallel runs are needed, implement
 * the Job interface instead.
 * @author Lenovo
 */
public abstract class BaseIncreasementIndex implements StatefulJob {

    /** Memory index */
    private RAMDirectory ramDirectory;

    public BaseIncreasementIndex() {
    }

    public BaseIncreasementIndex(RAMDirectory ramDirectory) {
        super();
        this.ramDirectory = ramDirectory;
    }

    /** Update the index.
     * @throws Exception */
    public abstract void updateIndexData() throws Exception;

    /** Consume the fetched data.
     * @param list */
    public abstract void consume(List list) throws Exception;
}

The task class implementation follows. The method below, in XxxIncreasementIndex, obtains the data source to be indexed and triggers the update:

@Override
public void execute(JobExecutionContext context) throws JobExecutionException {
    try {
        XxxIncreasementIndex index = new XxxIncreasementIndex(
                Constants.XXX_INDEX_PATH,
                XxxDao.getInstance(),
                RamDirectoryControl.getRamDireactory());
        index.updateIndexData();
    } catch (Exception e) {
        e.printStackTrace();
    }
}


@Override
public void updateIndexData() throws Exception {
    int maxBeanId = SearchUtil.getLastIndexBeanId();
    System.out.println(maxBeanId);
    List<XXX> sources = XxxDao.getListInfoBefore(maxBeanId);
    if (sources != null && sources.size() > 0) {
        this.consume(sources);
    }
}


Here, XXX stands for the entity class whose data we want to index.

The consume method mainly does two things:

    • Store the data in the memory index.
    • Check the number of documents in the memory index; if the limit is exceeded, dequeue the excess documents and move them to the file index.

@Override
public void consume(List list) throws Exception {
    IndexWriter writer = RamDirectoryControl.getRamIndexWriter();
    RamDirectoryControl.consume(writer, list);
}

Above, the memory index and the queue logic are placed in RamDirectoryControl.

Memory Index Controller

First we initialize the IndexWriter for the memory index. Note that we must run a commit during initialization; otherwise a "no segments" exception is thrown when a reader is opened.

private static IndexWriter ramIndexWriter;
private static RAMDirectory directory;

static {
    directory = new RAMDirectory();
    try {
        ramIndexWriter = getRamIndexWriter();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static RAMDirectory getRamDireactory() {
    return directory;
}

public static IndexSearcher getIndexSearcher() throws IOException {
    IndexReader reader = null;
    IndexSearcher searcher = null;
    try {
        reader = DirectoryReader.open(directory);
    } catch (IOException e) {
        e.printStackTrace();
    }
    searcher = new IndexSearcher(reader);
    return searcher;
}

/** Singleton-style access to the RAM IndexWriter.
 * @return
 * @throws Exception */
public static IndexWriter getRamIndexWriter() throws Exception {
    if (ramIndexWriter == null) {
        synchronized (IndexWriter.class) {
            Analyzer analyzer = new IKAnalyzer();
            IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
            iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
            try {
                // commit and close once so the directory gets its initial segments file,
                // otherwise opening a reader would throw a "no segments" exception
                ramIndexWriter = new IndexWriter(directory, iwConfig);
                ramIndexWriter.commit();
                ramIndexWriter.close();
                iwConfig = new IndexWriterConfig(analyzer);
                iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
                ramIndexWriter = new IndexWriter(directory, iwConfig);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return ramIndexWriter;
}

Next, define the method used to determine the number of documents in the memory index:

/** Query by searcher, query condition, paging parameters and sort condition.
 * @param searcher the searcher
 * @param query query condition
 * @param first start offset
 * @param max page size
 * @param sort sort condition
 * @return */
public static TopDocs getScoreDocsByPerPageAndSortField(IndexSearcher searcher, Query query,
        int first, int max, Sort sort) {
    try {
        if (query == null) {
            System.out.println("query is null return null");
            return null;
        }
        TopFieldCollector collector = null;
        if (sort != null) {
            collector = TopFieldCollector.create(sort, first + max, false, false, false);
        } else {
            SortField[] sortField = new SortField[1];
            sortField[0] = new SortField("createTime", SortField.Type.STRING, true);
            Sort defaultSort = new Sort(sortField);
            collector = TopFieldCollector.create(defaultSort, first + max, false, false, false);
        }
        searcher.search(query, collector);
        return collector.topDocs(first, max);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

This method returns a TopDocs. We use its totalHits field to get the number of documents in the memory index, so we can track memory usage and prevent a memory overflow.

The implementation of the consume method is as follows:

/** Consume the fetched data.
 * @param writer the RAM IndexWriter
 * @param list the records fetched from the data source
 * @throws Exception */
public static void consume(IndexWriter writer, List list) throws Exception {
    Query query = new MatchAllDocsQuery();
    IndexSearcher searcher = getIndexSearcher();
    System.out.println(directory);
    TopDocs topDocs = getScoreDocsByPerPageAndSortField(searcher, query, 1, 1, null);
    int currentTotal = topDocs.totalHits;
    if (currentTotal + list.size() > Constants.XXX_RAM_LIMIT) {
        // the memory limit would be exceeded
        int pulCount = Constants.XXX_RAM_LIMIT - currentTotal;
        List<Document> docs = new LinkedList<Document>();
        if (pulCount <= 0) {
            // memory is already full: flush everything currently in memory to the file index
            TopDocs allDocs = SearchUtil.getScoreDocsByPerPageAndSortField(searcher, query, 0, currentTotal, null);
            ScoreDoc[] scores = allDocs.scoreDocs;
            for (int i = 0; i < scores.length; i++) {
                // take the data out of the memory index
                Document doc1 = searcher.doc(scores[i].doc);
                Integer pollId = Integer.parseInt(doc1.get("id"));
                Document doc = delDocumentFromRamDirectory(pollId);
                if (doc != null) {
                    XXX carSource = (XXX) BeanTransferUtil.doc2Bean(doc, XXX.class);
                    Document doc2 = carSource2Document(carSource);
                    if (doc2 != null) {
                        docs.add(doc2);
                    }
                }
            }
            addDocumentToFSDirectory(docs);
            writer = getRamIndexWriter();
            consume(writer, list);
        } else {
            // first consume the part that still fits in memory, then recurse for the rest
            List subProcessList = list.subList(0, pulCount);
            consume(writer, subProcessList);
            List leaveList = list.subList(pulCount, list.size());
            consume(writer, leaveList);
        }
    } else {
        // the limit is not exceeded: store directly into memory
        int listSize = list.size();
        if (listSize > 0) {
            // store to memory
        }
    }
}

The logic above is:

    1. Get the current number of documents in memory using getScoreDocsByPerPageAndSortField.
    2. Compare the number of documents currently in memory (A) plus the number of records fetched this time (B) against the in-memory limit (C).
    3. If A + B <= C, the memory-index limit is not exceeded and all the data is stored in memory.
    4. Otherwise, if the data already in memory has reached the limit, the in-memory contents are flushed to the file index first and then the method is called again.
    5. If the limit has not yet been reached, first consume the portion that still fits in memory, then call the method again for the remainder.

Our BeanTransferUtil here converts a Document back into the corresponding bean, using reflection and commons-beanutils.jar.

package com.chechong.util;

import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.lucene.document.Document;

public class BeanTransferUtil {

    public static Object doc2Bean(Document doc, Class clazz) {
        try {
            Object obj = clazz.newInstance();
            Field[] fields = clazz.getDeclaredFields();
            for (Field field : fields) {
                field.setAccessible(true);
                String fieldName = field.getName();
                BeanUtils.setProperty(obj, fieldName, doc.get(fieldName));
            }
            return obj;
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (InvocationTargetException e) {
            e.printStackTrace();
        }
        return null;
    }
}

The method for reading a document from the memory index (and removing it) is as follows:

/** Remove the specified doc from the memory index.
 * @param pollId
 * @throws IOException */
private static Document delDocumentFromRamDirectory(Integer pollId) throws IOException {
    Document doc = null;
    Query query = SearchUtil.getQuery("id", "int", pollId + "", false);
    IndexSearcher searcher = getIndexSearcher();
    try {
        TopDocs queryDoc = SearchUtil.getScoreDocsByPerPageAndSortField(searcher, query, 0, 1, null);
        ScoreDoc[] docs = queryDoc.scoreDocs;
        System.out.println(docs.length);
        if (docs.length > 0) {
            doc = searcher.doc(docs[0].doc);
            System.out.println(doc);
            ramIndexWriter.deleteDocuments(query);
            ramIndexWriter.commit();
        }
        return doc;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

Here, the document is looked up in the memory index by ID, returned to the caller, and the corresponding record is deleted from memory at the same time.

Implementation of the NRT (near-real-time) query

For the indexes above we need an appropriate query method. To achieve the near-real-time effect at query time, the memory index must be added to the scope of the query, i.e. included in the IndexReader.

The method for obtaining the IndexSearcher is as follows:

/** Multi-folder, multi-threaded query.
 * @param parentPath parent index folder
 * @param service thread pool for the multi-threaded query
 * @param isAddRamDirectory whether to include the memory index in the query
 * @return
 * @throws IOException */
public static IndexSearcher getMultiSearcher(String parentPath, ExecutorService service,
        boolean isAddRamDirectory) throws IOException {
    File file = new File(parentPath);
    File[] files = file.listFiles();
    IndexReader[] readers = null;
    if (!isAddRamDirectory) {
        readers = new IndexReader[files.length];
    } else {
        readers = new IndexReader[files.length + 1];
    }
    for (int i = 0; i < files.length; i++) {
        readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath(), new String[0])));
    }
    if (isAddRamDirectory) {
        readers[files.length] = DirectoryReader.open(RamDirectoryControl.getRamDireactory());
    }
    MultiReader multiReader = new MultiReader(readers);
    IndexSearcher searcher = new IndexSearcher(multiReader, service);
    return searcher;
}

This way, a query reads from the file index and from the memory index at the same time.
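
For example, a query might use it as in the hedged sketch below, where the index path, thread-pool size, and page size are illustrative assumptions and getMultiSearcher is the method shown above (assumed to live in SearchUtil):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;

public class NrtQueryDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService service = Executors.newFixedThreadPool(4);
        try {
            // true: also open the RAM index, so documents not yet flushed to disk are searchable
            IndexSearcher searcher = SearchUtil.getMultiSearcher("/data/index/xxx", service, true);
            TopDocs docs = searcher.search(new MatchAllDocsQuery(), 10);
            System.out.println("total hits: " + docs.totalHits);
            searcher.getIndexReader().close();
        } finally {
            service.shutdown();
        }
    }
}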

This "Step by step learning Lucene with me" series is a summary of my recent work with Lucene indexes. If you have any questions, contact me on QQ: 891922381, or join my new QQ group: 106570134 (Lucene, Solr, Netty, Hadoop) so we can discuss together. I aim to post daily, so please keep following; there will be more to come.



