How can Lucene quickly look up field values and the nearest distance by docId?

Source: Internet
Author: User

1 Problem description

Our search ranking service often needs to re-rank results with a personalization algorithm. Ranking is generally divided into two stages: 1) coarse ranking, which the retrieval engine completes quickly, and 2) re-ranking, in which the coarse results are sent to the personalization service engine for deeper ranking. In our business scenario, besides the doc list, the search engine also passes business fields, such as the merchant ID and the nearest distance from the user's location to each doc.

Our search engine is based on Lucene. The result of a Lucene query contains only the docId and the corresponding score; it does not directly provide the business fields and distances that we want to pass to the personalization service. So the question this article addresses is: how can we quickly look up a doc's field values and its nearest distance by docId?

2 Traditional method: get data from the forward index files

A search over the inverted index returns docIds, so the intuitive approach is to use the docId to fetch the doc's concrete fields, such as the deal ID, from the forward index.

First, the data has to be written to the forward index; what is not written cannot be queried. How do we write it? We write the deal ID (DealLuceneField.ATTR_ID) and the deal's latitude/longitude string (DealLuceneField.ATTR_LOCATIONS, multiple values separated by ",") into the index. Field.Store.YES means the value is stored in the forward index. Lucene keeps stored fields in two files, .fdx and .fdt: .fdt holds the actual data, and .fdx is an index into .fdt (the position of the nth doc's data in .fdt).

    Document doc = new Document();
    doc.add(new StringField(DealLuceneField.ATTR_ID, String.valueOf(deal.getDid()), Field.Store.YES));
    doc.add(new StringField(DealLuceneField.ATTR_LOCATIONS, buildMlls(mllsSet, deal.getDid()), Field.Store.YES));
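The relationship between the two stored-field files can be pictured with a toy model. The sketch below is not Lucene's actual on-disk format (real .fdt data is organized quite differently); the class ToyStoredFields and its methods are hypothetical names used only to illustrate how an offset index lets the nth doc's data be located inside a data blob.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy forward store: "data" plays the role of .fdt, "offsets" the role of .fdx. */
final class ToyStoredFields {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final List<Integer> offsets = new ArrayList<>();

    /** Appends one document's stored value and returns its docId. */
    int add(String storedValue) {
        byte[] bytes = storedValue.getBytes(StandardCharsets.UTF_8);
        offsets.add(data.size());            // start position of this doc's bytes
        data.write(bytes, 0, bytes.length);
        return offsets.size() - 1;
    }

    /** Looks up the start offset in the index, then reads and decodes the bytes. */
    String get(int docId) {
        byte[] all = data.toByteArray();
        int start = offsets.get(docId);
        int end = (docId + 1 < offsets.size()) ? offsets.get(docId + 1) : all.length;
        return new String(all, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Even in this toy version, every lookup has to locate the bytes and decode them into a string, which is the kind of per-query cost a forward-index read carries.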

How do we query it?

1) Direct Query

The document is fetched directly by docId and its contents are read out. For example, the latitude/longitude string is read and the nearest distance is then computed from it.

    for (int i = 0; i < sd.length; i++) {
        // sd[i].doc is the docId; searcher.doc(docId) fetches the corresponding document
        Document doc = searcher.doc(sd[i].doc);
        didList.add(Integer.parseInt(doc.get(DealLuceneField.ATTR_ID)));
        if (query.getSortField() == DealSortEnum.DISTANCE) {
            ...
            String[] mlls = locations.split("");
            double dis = findMinDistance(mlls, query.getMyPos()) / 1000;
            distBuilder.append(dis).append(",");
        }
    }

In production, fetching the latitude/longitude by docId and computing the shortest distance takes about 8 ms on average, and sometimes jitters to more than 20 ms.
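The article does not show the findMinDistance helper used in the loop above. The following is a minimal sketch of what such a helper might look like, under several assumptions that are not from the original: each entry is a "lat,lon" pair, the result is in meters, the user position is passed as two doubles, and the haversine formula is used for the distance. The class name MinDistanceSketch is hypothetical.

```java
/** Hypothetical stand-in for the article's findMinDistance helper (assumed behavior). */
final class MinDistanceSketch {
    private static final double EARTH_RADIUS_M = 6_371_000.0; // mean Earth radius in meters

    /** Great-circle distance in meters between two lat/lon points (haversine formula). */
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 so rounding error can never push asin out of its domain.
        return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Smallest distance in meters from the user to any "lat,lon" entry (assumed format). */
    static double findMinDistance(String[] latLons, double userLat, double userLon) {
        double min = Double.POSITIVE_INFINITY;
        for (String latLon : latLons) {
            String[] parts = latLon.split(",");
            double d = haversineMeters(userLat, userLon,
                    Double.parseDouble(parts[0]), Double.parseDouble(parts[1]));
            min = Math.min(min, d);
        }
        return min;
    }
}
```

The work is linear in the number of coordinates per doc, and each coordinate needs string parsing plus trigonometry, so a doc with thousands of locations makes this loop noticeably expensive.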

2) Query optimization

A direct query returns the data of every Field.Store.YES field, but we only need two fields, dealid and location. The optimization is therefore to pass the set of fields to load (fieldsToLoad) when calling doc(), which avoids the overhead of fetching all of the stored data.

    for (int i = 0; i < sd.length; i++) {
        Document doc = searcher.doc(sd[i].doc, fieldsToLoad);
        didList.add(Integer.parseInt(doc.get(DealLuceneField.ATTR_ID)));
        if (query.getSortField() == DealSortEnum.DISTANCE) {
            String locations = doc.get(DealLuceneField.ATTR_LOCATIONS);
            String[] mlls = locations.split("");
            double dis = findMinDistance(mlls, query.getMyPos()) / 1000;
            distBuilder.append(dis).append(",");
        }
    }

  

However, in practice this brought no performance improvement over the direct query.

There are two reasons: 1) We use Field.Store.YES for very few fields. Besides dealid and location, only two other fields are stored in the forward index, and this optimization pays off only when a large number of fields are stored. 2) Data in the forward index is obtained by reading the file. Although we open the index file through a memory map, every query still has to locate and parse the data, which costs a significant amount of overhead.

3 Optimization method 1: use FieldCache to retrieve data from the inverted index

Reading the dealid and location fields from the forward index is slow. If the two fields could be cached, for example in a map whose key is the docId and whose value is the dealid or mlls, computation would be much faster. Unfortunately, Lucene does not provide such a cache for the forward index, because Lucene is primarily optimized for the inverted index.

Lucene does, however, cache some fields used for sorting, such as the "weight" field we use. To speed things up, the first time the field is used Lucene converts all terms of the "weight" field to float values (as shown below) and stores them in the FieldCache, so that from the second use onward they can be read directly from the cache.

    // Get the cache for the "weight" field; the key is the docId, the value is the corresponding float
    FieldCache.Floats weights = FieldCache.DEFAULT.getFloats(reader, "weight", true);
    // Get the value by docId
    float weightValue = weights.get(docId);
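The idea behind such a cache can be sketched in plain Java without Lucene. The example below is an illustration under assumptions, not Lucene's implementation: the inverted index is modeled as a map from term text to a list of docIds, and the hypothetical uninvert method walks it once to build an array indexed by docId.

```java
import java.util.List;
import java.util.Map;

/** Toy "un-inversion": turn term -> docIds postings into a docId -> value array. */
final class UninvertSketch {
    /** Walks every term's postings once and fills values[docId] with the parsed term. */
    static float[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        float[] values = new float[maxDoc];   // docs without the field keep 0.0f
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            float value = Float.parseFloat(entry.getKey());
            for (int docId : entry.getValue()) {
                values[docId] = value;
            }
        }
        return values;
    }
}
```

The one-time pass is expensive, but afterwards a lookup by docId is a single array access with no file positioning or parsing. The price is that the whole array stays in memory.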

 

The location string is read through the cache in the same way, by docId (mllsValues below holds the cached values of the location field):

    for (int i = 0; i < sd.length; i++) {
        ...
        if (query.getSortField() == DealSortEnum.DISTANCE) {
            BytesRef bytesRefMlls = new BytesRef();
            mllsValues.get(sd[i].doc, bytesRefMlls);
            String locations = bytesRefMlls.utf8ToString();
            if (StringUtils.isBlank(locations))
                continue;
            String[] mlls = locations.split("");
            double dis = findMinDistance(mlls, query.getMyPos()) / 1000;
            distBuilder.append(dis).append(",");
        }
    }

With this approach, the average time to obtain the latitude/longitude by docId and compute the shortest distance drops from 8 ms to about 2 ms, and even under jitter the response time stays below 10 ms.

4 Optimization method 2: use ShapeFieldCache

Using FieldCache increases memory consumption. The location field is particularly expensive because it holds each document's latitude/longitude string, and for some documents that string contains thousands of coordinates (which is not uncommon in our business scenario).

In fact, we do not need the location field at all: the latitude and longitude were already written to the index when the index was built, as shown below, and Lucene loads every doc's latitude and longitude into ShapeFieldCache the first time they are used.

    for (String mll : mllsSet) {
        String[] mlls = mll.split(",");
        Point point = ctx.makePoint(Double.parseDouble(mlls[1]), Double.parseDouble(mlls[0]));
        for (IndexableField f : strategy.createIndexableFields(point)) {
            doc.add(f);
        }
    }

The query code is as follows.

    StringBuilder distBuilder = new StringBuilder();
    BinaryDocValues idValues = binaryDocValuesMap.get(DealLuceneField.ATTR_ID);
    FunctionValues functionValues = distanceValueSource.getValues(null, context);
    for (int i = 0; i < sd.length; i++) {
        BytesRef bytesRef = new BytesRef();
        idValues.get(sd[i].doc, bytesRef);
        String id = bytesRef.utf8ToString();
        didList.add(Integer.parseInt(id));
        if (query.getSortField() == DealSortEnum.DISTANCE) {
            double dis = functionValues.doubleVal(sd[i].doc) / 1000;
            distBuilder.append(dis).append(",");
        }
    }

  

A) Further optimization

The method above saves memory but does not avoid the computation. Lucene provides sorting by distance, but it only completes the sort and returns the docId and score; it does not tell us the nearest distance between each deal and the user. Is there a way to keep that distance?

Our approach is to save the distance value as the score by customizing Lucene's collector and the PriorityQueue it uses, which avoids computing the distance a second time. The core code is as follows:

    @Override
    protected void populateResults(ScoreDoc[] results, int howMany) {
        // Avoid casting if unnecessary.
        SieveFieldValueHitQueue<SieveFieldValueHitQueue.Entry> queue =
                (SieveFieldValueHitQueue<SieveFieldValueHitQueue.Entry>) pq;
        for (int i = howMany - 1; i >= 0; i--) {
            FieldDoc fieldDoc = queue.fillFields(queue.pop());
            results[i] = fieldDoc;
            // Record the distance as the score
            results[i].score = Float.valueOf(String.valueOf(fieldDoc.fields[0]));
        }
    }
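The same trick can be shown in a self-contained form. The sketch below does not use Lucene; DistanceTopN and Hit are hypothetical names, and the distance is assumed to be already computed when a doc is collected. It keeps the N nearest hits in a bounded heap and hands the distance back in the score field, so the caller never has to recompute it.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Toy top-N collector that returns the sort distance as each hit's score. */
final class DistanceTopN {
    /** A hit whose score field carries the distance. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    // Max-heap on distance: the head is the farthest of the hits kept so far.
    private final PriorityQueue<Hit> pq =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score).reversed());

    DistanceTopN(int n) {
        if (n <= 0) throw new IllegalArgumentException("n must be positive");
        this.n = n;
    }

    /** Called once per matching doc with its already-computed distance. */
    void collect(int doc, float distance) {
        if (pq.size() < n) {
            pq.add(new Hit(doc, distance));
        } else if (distance < pq.peek().score) {
            pq.poll();                       // evict the current farthest hit
            pq.add(new Hit(doc, distance));
        }
    }

    /** Pops farthest-first and fills the array from the back, so results are nearest-first. */
    Hit[] results() {
        Hit[] out = new Hit[pq.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = pq.poll();
        }
        return out;
    }
}
```

As in populateResults above, the result array is filled from the last slot backwards because the queue yields its worst entry first.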

After this optimization, the average time to get the data drops from 2 ms to roughly 0 ms, and the jitter no longer occurs.

In addition, because the location field no longer has to be loaded into memory, GC time is cut in half, and the overall average response time of the service is also much lower.

5 Outlook

This article set out to answer how to quickly look up field values and the nearest distance by docId, and described several approaches: reading from the forward index files, reading from the FieldCache built from the inverted index, and obtaining latitude and longitude from ShapeFieldCache. In addition, customizing Lucene's collector and queue avoids computing the distance twice. Together these optimizations greatly improve the performance of the retrieval service.

There are other ways to get field data by docId, such as DocValues, which we will explore in the future.
