Step by Step Learning Lucene with Me --- Lucene incremental updates and NRT (near-real-time) queries


I have been working overtime these past two days and have not been able to keep the blog updated; please bear with me.

Sometimes, after we create an index, the data source gets new or updated content, and we want those database changes to be reflected directly in query results. This is what we call an incremental index. How can we meet this requirement? Lucene itself does not provide an incremental-index implementation.

A common first thought is to delete all existing indexes and rebuild them from scratch. If the data source holds only a small number of records this may be acceptable, but if it is large, both the data-source scan and the index construction become time-consuming. With these phases stacked on top of each other, queries will return incomplete data for a while, which seriously hurts the user experience.

A more common incremental-index implementation works as follows:

    • Set a timer that periodically reads, from the data source, records that are newer than the existing index or that carry an "updated" flag.
    • Convert that data into the required Documents and index them (a minimal sketch of this step follows the list).
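
A minimal sketch of the conversion step, assuming a hypothetical record with id, title and createTime columns; the field names and the upsert strategy are illustrative, not from the original:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IncrementalDocBuilder {

    /** Build a Lucene Document from one changed row (hypothetical columns). */
    public static Document toDocument(long id, String title, String createTime) {
        Document doc = new Document();
        doc.add(new StringField("id", String.valueOf(id), Store.YES));     // exact-match key
        doc.add(new TextField("title", title, Store.YES));                 // analyzed text
        doc.add(new StringField("createTime", createTime, Store.YES));     // date string stored as-is
        return doc;
    }

    /** Upsert: replace any existing document with the same id. */
    public static void upsert(IndexWriter writer, long id, String title, String createTime) throws Exception {
        writer.updateDocument(new Term("id", String.valueOf(id)),
                toDocument(id, title, createTime));
        // the caller decides when to commit; batching commits keeps them cheap
    }
}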

The benefits of this approach, compared with deleting the whole index and rebuilding it as described above, are:

    • The data-source query only scans a small amount of data.
    • The index update touches fewer records, reducing the number of time-consuming IndexWriter commit and close operations.

The above solves the incremental problem, but the real-time problem remains:

    • Index changes are only visible to queries after IndexWriter commit has executed (see the sketch below).
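
A small, self-contained sketch of this behaviour (Lucene 5.x-style API; the field name is illustrative): a reader opened from the directory only sees committed segments, so a document that has been added but not committed is invisible.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;

public class CommitVisibilityDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        writer.commit(); // create the initial (empty) commit so a reader can be opened at all

        Document doc = new Document();
        doc.add(new StringField("id", "1", Store.YES));
        writer.addDocument(doc);

        DirectoryReader before = DirectoryReader.open(dir);
        System.out.println(before.numDocs()); // 0 -- the uncommitted document is not visible

        writer.commit();
        DirectoryReader after = DirectoryReader.open(dir);
        System.out.println(after.numDocs()); // 1 -- visible only after the commit

        before.close();
        after.close();
        writer.close();
        dir.close();
    }
}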

So how do we improve the real-time behavior? We know that Lucene can build indexes in two ways: file indexes and memory indexes. Compared with a file index, a memory index is faster to build, because the file index requires frequent IO operations. Combining the considerations above, we use a file index plus a memory index to implement Lucene incremental updates. The mechanism is as follows:

    • A scheduled task scans the data source for changes.
    • The resulting list of records is indexed into memory.
    • When the number of in-memory documents reaches a limit, documents are dequeued from the memory index and added to the file index.
    • Queries run against the file index and the memory index together (a federated query), achieving the NRT effect.

Timed Task Scheduler

Java has a built-in TimerTask class that can run scheduled tasks, but TimerTask jobs are stateless, so we would have to prevent overlapping runs ourselves. The Quartz scheduling framework provides a stateful job type, StatefulJob, which guarantees that the next run does not start until the previous run has finished.
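
For reference, in Quartz 2.x the StatefulJob interface is deprecated; the same "do not start the next run until the previous one has finished" behaviour is requested with the @DisallowConcurrentExecution annotation. A minimal sketch (the job class name is illustrative):

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// Quartz will not fire this job again while a previous execution is still running.
@DisallowConcurrentExecution
public class IncrementalIndexJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // update the incremental index here
    }
}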

The common way to start a Quartz task is as follows:

Date runTime = DateBuilder.evenSecondDate(new Date());
StdSchedulerFactory sf = new StdSchedulerFactory();
Scheduler scheduler = sf.getScheduler();
JobDetail job = JobBuilder.newJob(XXX.class).build();
Trigger trigger = TriggerBuilder.newTrigger()
        .startAt(runTime)
        .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                .withIntervalInSeconds(3)
                .repeatForever())
        .forJob(job)
        .build();
scheduler.scheduleJob(job, trigger);
scheduler.start();

Above, we schedule the task to run every three seconds; the task class is XXX.

Common methods for task classes

Here I define a parent class for the XXX task, as follows:

package com.chechong.lucene.indexcreasement;

import java.util.List;
import java.util.TimerTask;

import org.apache.lucene.store.RAMDirectory;
import org.quartz.Job;
import org.quartz.StatefulJob;

/**
 * Stateful task: executes serially, i.e. the next run is not allowed to start
 * before the previous one has finished. If parallel execution is needed,
 * change the implemented interface to Job.
 * @author Lenovo
 */
public abstract class BaseIncreasementIndex implements StatefulJob {

    /** In-memory index */
    private RAMDirectory ramDirectory;

    public BaseIncreasementIndex() {
    }

    public BaseIncreasementIndex(RAMDirectory ramDirectory) {
        super();
        this.ramDirectory = ramDirectory;
    }

    /**
     * Update the index.
     * @throws Exception
     */
    public abstract void updateIndexData() throws Exception;

    /**
     * Consume data.
     * @param list
     * @throws Exception
     */
    public abstract void consume(List list) throws Exception;
}

The task class implementation, XXXIncreasementIndex: the following method fetches the data to be indexed from the data source.

@Override
public void execute(JobExecutionContext context) throws JobExecutionException {
    try {
        XXXIncreasementIndex index = new XXXIncreasementIndex(Constants.XXX_INDEX_PATH,
                XXXDao.getInstance(), RAMDirectoryControl.getRAMDirectory());
        index.updateIndexData();
    } catch (Exception e) {
        e.printStackTrace();
    }
}


@Override
public void updateIndexData() throws Exception {
    int maxBeanId = SearchUtil.getLastIndexBeanId();
    System.out.println(maxBeanId);
    List<XXX> sources = XXXDao.getListInfoBefore(maxBeanId);
    if (sources != null && sources.size() > 0) {
        this.consume(sources);
    }
}


Here XXX stands for the entity class whose data we want to fetch.

The consume method mainly does two things:

    • Stores the data in the memory index.
    • Checks the number of documents in the memory index; if it exceeds the limit, the excess documents are dequeued and stored in the file index.
@Override
public void consume(List list) throws Exception {
    IndexWriter writer = RAMDirectoryControl.getRAMIndexWriter();
    RAMDirectoryControl.consume(writer, list);
}

Above, the memory index and the queue handling are implemented in RAMDirectoryControl.

Memory Index Controller

First we initialize the IndexWriter for the memory index. Note that we must commit once right after creating it, otherwise opening a reader will throw a "no segments" exception.

private static IndexWriter ramIndexWriter;
private static RAMDirectory directory;

static {
    directory = new RAMDirectory();
    try {
        ramIndexWriter = getRAMIndexWriter();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static RAMDirectory getRAMDirectory() {
    return directory;
}

public static IndexSearcher getIndexSearcher() throws IOException {
    IndexReader reader = null;
    IndexSearcher searcher = null;
    try {
        reader = DirectoryReader.open(directory);
    } catch (IOException e) {
        e.printStackTrace();
    }
    searcher = new IndexSearcher(reader);
    return searcher;
}

/**
 * Get the RAM IndexWriter (singleton pattern).
 * @return
 * @throws Exception
 */
public static IndexWriter getRAMIndexWriter() throws Exception {
    if (ramIndexWriter == null) {
        synchronized (IndexWriter.class) {
            Analyzer analyzer = new IKAnalyzer();
            IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
            iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
            try {
                ramIndexWriter = new IndexWriter(directory, iwConfig);
                ramIndexWriter.commit();
                ramIndexWriter.close();
                iwConfig = new IndexWriterConfig(analyzer);
                iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
                ramIndexWriter = new IndexWriter(directory, iwConfig);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return ramIndexWriter;
}

Next we define a method to get the number of documents in the memory index:

/**
 * Query by searcher, query condition, page range and sort condition.
 * @param searcher the searcher to use
 * @param query query condition
 * @param first start offset
 * @param max maximum number of hits
 * @param sort sort condition
 * @return
 */
public static TopDocs getScoreDocsByPerPageAndSortField(IndexSearcher searcher, Query query,
        int first, int max, Sort sort) {
    try {
        if (query == null) {
            System.out.println("query is null return null");
            return null;
        }
        TopFieldCollector collector = null;
        if (sort != null) {
            collector = TopFieldCollector.create(sort, first + max, false, false, false);
        } else {
            SortField[] sortField = new SortField[1];
            sortField[0] = new SortField("createTime", SortField.Type.STRING, true);
            Sort defaultSort = new Sort(sortField);
            collector = TopFieldCollector.create(defaultSort, first + max, false, false, false);
        }
        searcher.search(query, collector);
        return collector.topDocs(first, max);
    } catch (IOException e) {
        // TODO auto-generated catch block
    }
    return null;
}

This method returns a TopDocs; we use its totalHits to get the number of documents in the memory index, so that we can track memory usage and prevent memory overflow.

The consume method is implemented as follows:

/**
 * Consume data.
 * @param writer
 * @param list
 * @throws Exception
 */
public static void consume(IndexWriter writer, List list) throws Exception {
    Query query = new MatchAllDocsQuery();
    IndexSearcher searcher = getIndexSearcher();
    System.out.println(directory);
    TopDocs topDocs = getScoreDocsByPerPageAndSortField(searcher, query, 1, 1, null);
    int currentTotal = topDocs.totalHits;
    if (currentTotal + list.size() > Constants.XXX_RAM_LIMIT) { // memory limit exceeded
        int pulCount = Constants.XXX_RAM_LIMIT - currentTotal;
        List<Document> docs = new LinkedList<Document>();
        if (pulCount <= 0) { // memory is already full, flush its contents first
            TopDocs allDocs = SearchUtil.getScoreDocsByPerPageAndSortField(searcher, query, 0, currentTotal, null);
            ScoreDoc[] scores = allDocs.scoreDocs;
            for (int i = 0; i < scores.length; i++) { // take the data out of memory
                Document doc1 = searcher.doc(scores[i].doc);
                Integer pollId = Integer.parseInt(doc1.get("id"));
                Document doc = delDocumentFromRAMDirectory(pollId);
                if (doc != null) {
                    XXX carSource = (XXX) BeanTransferUtil.doc2Bean(doc, XXX.class);
                    Document doc2 = carSource2Document(carSource);
                    if (doc2 != null) {
                        docs.add(doc2);
                    }
                }
            }
            addDocumentToFSDirectory(docs);
            writer = getRAMIndexWriter();
            consume(writer, list);
        } else { // first consume the part that still fits in memory
            List subProcessList = list.subList(0, pulCount);
            consume(writer, subProcessList);
            List leaveList = list.subList(pulCount, list.size());
            consume(writer, leaveList);
        }
    } else { // limit not exceeded, store directly into memory
        int listSize = list.size();
        if (listSize > 0) {
            // store to memory
        }
    }
}

The logic above is:

    1. Get the current number of documents in memory via getScoreDocsByPerPageAndSortField.
    2. Compare the amount of data already in memory (A), the number of records fetched this time (B), and the in-memory limit (C).
    3. If A + B <= C, the memory index limit is not exceeded and all data is stored in memory.
    4. Otherwise, check whether the in-memory data has already reached the limit; if it has, flush the contents of memory to the file index (a sketch of this file-index write follows the list), then call this method again.
    5. If the limit has not yet been reached, first consume the part that still fits in memory, then recurse on the remainder.
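
The file-index side of step 4, addDocumentToFSDirectory, is called from consume above but not shown in the original. A minimal sketch, assuming the file index lives under a path such as Constants.XXX_INDEX_PATH and using StandardAnalyzer as a stand-in for the IKAnalyzer used elsewhere:

import java.nio.file.Paths;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;

public class FSDirectoryControl {

    /** Flush a batch of documents evicted from the RAM index into the file index. */
    public static void addDocumentToFSDirectory(String indexPath, List<Document> docs) throws Exception {
        if (docs == null || docs.isEmpty()) {
            return;
        }
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer()); // stand-in for IKAnalyzer
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(indexPath)), config)) {
            writer.addDocuments(docs); // one batch, one commit -- keeps commit/close costs low
            writer.commit();
        }
    }
}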

Our BeanTransferUtil converts a Document back into the corresponding bean, using reflection and commons-beanutils.jar:

package com.chechong.util;

import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.lucene.document.Document;

public class BeanTransferUtil {

    public static Object doc2Bean(Document doc, Class clazz) {
        try {
            Object obj = clazz.newInstance();
            Field[] fields = clazz.getDeclaredFields();
            for (Field field : fields) {
                field.setAccessible(true);
                String fieldName = field.getName();
                BeanUtils.setProperty(obj, fieldName, doc.get(fieldName));
            }
            return obj;
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (InvocationTargetException e) {
            e.printStackTrace();
        }
        return null;
    }
}

A document is read from the memory index (and removed from it) as follows:

/**
 * Remove the specified doc from the memory index.
 * @param pollId
 * @throws IOException
 */
private static Document delDocumentFromRAMDirectory(Integer pollId) throws IOException {
    Document doc = null;
    Query query = SearchUtil.getQuery("id", "int", pollId + "", false);
    IndexSearcher searcher = getIndexSearcher();
    try {
        TopDocs queryDoc = SearchUtil.getScoreDocsByPerPageAndSortField(searcher, query, 0, 1, null);
        ScoreDoc[] docs = queryDoc.scoreDocs;
        System.out.println(docs.length);
        if (docs.length > 0) {
            doc = searcher.doc(docs[0].doc);
            System.out.println(doc);
            ramIndexWriter.deleteDocuments(query);
            ramIndexWriter.commit();
        }
        return doc;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

This reads the record with the given ID from the memory index, returns it as a Document, and deletes the corresponding record from memory.

Implementation of the NRT (near-real-time) query

To query the indexes above we need an appropriate query method. To achieve the near-real-time effect, the memory index must be added to the query scope, i.e. included in the IndexReader.

The IndexSearcher here is obtained as follows:

/**
 * Multi-directory, multi-threaded query.
 * @param parentPath parent index directory
 * @param service thread pool for the multi-threaded query
 * @param isAddRAMDirectory whether to include the memory index in the query
 * @return
 * @throws IOException
 */
public static IndexSearcher getMultiSearcher(String parentPath, ExecutorService service,
        boolean isAddRAMDirectory) throws IOException {
    File file = new File(parentPath);
    File[] files = file.listFiles();
    IndexReader[] readers = null;
    if (!isAddRAMDirectory) {
        readers = new IndexReader[files.length];
    } else {
        readers = new IndexReader[files.length + 1];
    }
    for (int i = 0; i < files.length; i++) {
        readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath(), new String[0])));
    }
    if (isAddRAMDirectory) {
        readers[files.length] = DirectoryReader.open(RAMDirectoryControl.getRAMDirectory());
    }
    MultiReader multiReader = new MultiReader(readers);
    IndexSearcher searcher = new IndexSearcher(multiReader, service);
    return searcher;
}

In this way, at query time we read from the file index and also retrieve data from the memory index.
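
A small usage sketch of the federated query, assuming the getMultiSearcher method above lives in the SearchUtil helper class referenced earlier; the index path, field name and thread-pool size are illustrative:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class NrtQueryDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService service = Executors.newFixedThreadPool(4);
        // file indexes under /data/index/xxx plus the shared RAM index
        IndexSearcher searcher = SearchUtil.getMultiSearcher("/data/index/xxx", service, true);
        TopDocs docs = searcher.search(new TermQuery(new Term("id", "1")), 10);
        System.out.println("hits: " + docs.totalHits);
        // (in real code, also close the underlying reader when done)
        service.shutdown();
    }
}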

Step by Step Learning Lucene with Me is a summary of my recent work with Lucene indexes. If you have questions, contact me on QQ: 891922381. I have also created a new QQ group: 106570134 (Lucene, Solr, Netty, Hadoop); you are very welcome to join and discuss. I aim to post roughly one article a day, so please keep following; I hope to keep bringing you surprises.



