1 Background
Displaying several services (such as group-buying deals and direct connection) under a single merchant (POI) is becoming more and more common (Figure 1a). For example, the food, hotel, and other categories in the mobile app now show a POI list page instead of a deal list.
Figure 1. a: information displayed by merchant (POI) dimension; b: join illustration
This makes filtering more complex. Previously, filters were flat: filtering a POI list used only POI attributes (such as rating and category), and filtering a deal list used only deal attributes (such as room type and price). Now filtering is hierarchical: we need to filter POIs by the attributes of their deals. For example, we may need to list only the hotels that have at least one deal priced between 100 and 200.
In essence this filter is a join whose core is associating POIs with deals. From a database perspective (Figure 1b), we have a POI table and a deal table, and the deal table stores a foreign key (ParentID) identifying the POI each deal belongs to. The filter takes three steps: 1) select the deals whose price is in the range 100~200 (the deals with dealid 2 and 3); 2) find the POIs those deals belong to (poiid 1 and 1); 3) deduplicate, because several deals may map to the same POI and each POI should be returned only once.
We currently use Lucene to provide the filtering service, so how does Lucene handle a filter that requires a join?
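The three steps above can be sketched in plain Java, independent of Lucene. This is only an illustration: the Deal class, the method name filterPoiIdsByDealPrice, and the sample prices are assumptions; only the deal-to-POI assignments follow the example in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class JoinSketch {
    // Hypothetical row of the deal table; parentId is the foreign key to the POI.
    static final class Deal {
        final int dealId;
        final int parentId;
        final double price;

        Deal(int dealId, int parentId, double price) {
            this.dealId = dealId;
            this.parentId = parentId;
            this.price = price;
        }
    }

    // Returns the distinct POI ids that own at least one deal priced in [min, max].
    static List<Integer> filterPoiIdsByDealPrice(List<Deal> deals, double min, double max) {
        Set<Integer> poiIds = new LinkedHashSet<>();        // step 3: a set drops duplicates
        for (Deal deal : deals) {
            if (deal.price >= min && deal.price <= max) {   // step 1: filter deals by price
                poiIds.add(deal.parentId);                  // step 2: follow the foreign key
            }
        }
        return new ArrayList<>(poiIds);
    }

    public static void main(String[] args) {
        List<Deal> deals = Arrays.asList(
            new Deal(1, 2, 80.0),    // prices are made up for this example
            new Deal(2, 1, 120.0),
            new Deal(3, 1, 150.0));
        System.out.println(filterPoiIdsByDealPrice(deals, 100, 200)); // prints [1]
    }
}
```

Deals 2 and 3 both point to POI 1, so the set collapses them into a single result.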
2 Lucene Join Solution
In our application, each POI is stored as a document and each deal is also stored as a document; the core of the join is associating POI documents with deal documents. Lucene provides two kinds of join, query time join and index time join, which are described in turn below.
2.1. Query time Join
Query time join associates deal documents with POI documents in a way similar to a database foreign key.
a) Index
Create a document for each POI and a document for each deal. When building a deal document, add a field (ParentID) that stores the poiid of the POI the deal belongs to; this field can be thought of simply as a foreign key.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoubleField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// A POI document stores its own ID and name.
public static Document createPoiDocument(PoiMsg poiMsg) {
    Document document = new Document();
    document.add(new StringField("Poiid", String.valueOf(poiMsg.getId()), Field.Store.YES));
    document.add(new StringField("name", poiMsg.getName(), Field.Store.YES));
    return document;
}

// A deal document additionally stores the ID of its POI in ParentID (the "foreign key").
public static Document createDealDocument(DealModel dealModel, PoiMsg poiMsg) {
    Document document = new Document();
    document.add(new StringField("Did", String.valueOf(dealModel.getDid()), Field.Store.YES));
    document.add(new StringField("name", dealModel.getBrandName(), Field.Store.YES));
    document.add(new DoubleField("Price", dealModel.getPrice(), Field.Store.YES));
    document.add(new StringField("ParentID", String.valueOf(poiMsg.getId()), Field.Store.YES));
    return document;
}

// Documents can be written in any order; the association lives in the ParentID field.
IndexWriter writer = new IndexWriter(directory, config);
writer.addDocument(createPoiDocument(poiMsg1));
writer.addDocument(createPoiDocument(poiMsg2));
writer.addDocument(createDealDocument(dealModel1, poiMsg2));
writer.addDocument(createDealDocument(dealModel2, poiMsg1));
writer.addDocument(createDealDocument(dealModel3, poiMsg1));
b) Query
Two queries are needed: the first finds the matching deal documents, and the second uses the ParentID values of those deals to find the POI documents.
1) The first query runs inside JoinUtil.createJoinQuery. It creates a TermsCollector, which collects the ParentID values of the documents that match fromQuery, and then builds a TermsQuery from them.
In this example the collector receives the value "1" twice, once for each matching deal; because the collected terms are kept as a set, the result is the single term "1".
2) Execute the TermsQuery, which matches documents whose toField value is in the term set gathered by the TermsCollector. Here it finds the document whose toField (Poiid) is "1".
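import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

IndexSearcher indexSearcher = new IndexSearcher(indexReader);
String fromField = "ParentID";
// "Price" is indexed as a DoubleField, so a double range is used here.
// The bounds follow the 100~200 example in the text; both are exclusive.
Query fromQuery = NumericRangeQuery.newDoubleRange("Price", 100.0, 200.0, false, false);
String toField = "Poiid";
// false: each deal document holds a single ParentID value.
Query toQuery = JoinUtil.createJoinQuery(fromField, false, toField, fromQuery, indexSearcher, ScoreMode.Max);
TopDocs results = indexSearcher.search(toQuery, 10);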
Excerpt from JoinUtil.createJoinQuery:

TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument);
fromSearcher.search(fromQuery, termsCollector);
return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms());
c) Advantages and disadvantages
The advantage of query time join is that it is intuitive and flexible. Its disadvantages are that results cannot be scored and ranked, and that performance suffers because two queries are executed.
2.2. Index time Join
Query time join establishes the relationship by explicitly adding a "foreign key" to the deal document: after finding the matching deals, it must gather the set of their ParentID values and then query again for the POI documents whose poiid is in that set. If the corresponding POI document could be located directly once a deal is found, efficiency would improve considerably. Index time join does exactly this by cleverly establishing a mapping between deal document IDs and POI document IDs.
a) Principle
How can we find the POI document ID from a deal document ID?
In Lucene, doc IDs are auto-incrementing: each document written gets the previous doc ID plus 1 (a simplification that is good enough here). Index time join requires documents to be written in a specific order: the child documents first, then their parent document. For example, suppose we have two POIs, POI1 and POI2, where POI1 has Deal2 and Deal3 and POI2 has only Deal1. We first write Deal2 and Deal3, then the POI1 document they belong to, and so on, finally producing the structure shown in Figure 2.
Once the index is built, we have the set of parent document IDs (3, 5). When we find a deal document by its attributes, say the matching deal is Deal3 with document ID 2, we only need to look in the set of parent document IDs for the first ID greater than 2, which in this example is 3.
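The lookup just described can be sketched with java.util.BitSet standing in for Lucene's internal bit set. The class and method names are illustrative assumptions, and the document IDs use the numbering of the example (Deal2 = 1, Deal3 = 2, POI1 = 3, Deal1 = 4, POI2 = 5) rather than real Lucene doc IDs.

```java
import java.util.BitSet;

class ParentLookupSketch {
    // Returns the first parent doc ID greater than childDocId, or -1 if there is none.
    static int parentOf(BitSet parentDocIds, int childDocId) {
        return parentDocIds.nextSetBit(childDocId + 1);
    }

    public static void main(String[] args) {
        BitSet parents = new BitSet();
        parents.set(3); // POI1, written after Deal2 (ID 1) and Deal3 (ID 2)
        parents.set(5); // POI2, written after Deal1 (ID 4)

        System.out.println(parentOf(parents, 2)); // Deal3 -> prints 3
        System.out.println(parentOf(parents, 4)); // Deal1 -> prints 5
    }
}
```

Because children are always written immediately before their parent, the next set bit after a child's ID is guaranteed to be that child's parent, so no second query is needed.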
Figure 2
Lucene uses a bit set internally to store these IDs; its internal implementation is shown in Figure 3.
Figure 3 Implementation principle
b) Index
Following the principle above, we need to build an index that reflects the hierarchy.
First create a list of documents with one property: the last element must be the POI, and everything before it must be its deals. Then call writer.addDocuments(documents) to write the list to the index as one block.
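import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoubleField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// The POI document carries a doctype marker so parent documents can be identified at query time.
public static Document createPoiDocument(PoiMsg poiMsg) {
    Document document = new Document();
    document.add(new StringField("Poiid", String.valueOf(poiMsg.getId()), Field.Store.YES));
    document.add(new StringField("name", poiMsg.getName(), Field.Store.YES));
    document.add(new StringField("doctype", "poi", Field.Store.YES));
    return document;
}

// The deal document no longer needs a ParentID field; its position in the block encodes the parent.
public static Document createDealDocument(DealModel dealModel) {
    Document document = new Document();
    document.add(new StringField("Did", String.valueOf(dealModel.getDid()), Field.Store.YES));
    document.add(new StringField("name", dealModel.getBrandName(), Field.Store.YES));
    document.add(new DoubleField("Price", dealModel.getPrice(), Field.Store.YES));
    return document;
}

IndexWriter writer = new IndexWriter(directory, config);
List<Document> documents = new ArrayList<Document>();

// Block 1: Deal2 and Deal3 first, then their parent POI1.
documents.add(createDealDocument(dealModel2));
documents.add(createDealDocument(dealModel3));
documents.add(createPoiDocument(poiMsg1));
writer.addDocuments(documents); // addDocuments writes the whole list as one contiguous block

// Block 2: Deal1 first, then its parent POI2.
documents.clear();
documents.add(createDealDocument(dealModel1));
documents.add(createPoiDocument(poiMsg2));
writer.addDocuments(documents);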
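When the deals are already grouped by POI, the blocks can be assembled generically. The following is a minimal sketch without Lucene types, using strings in place of documents; the class name, method name, and map layout are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BlockOrderSketch {
    // Builds one block per POI: all of its deals first, the POI itself last.
    static List<List<String>> buildBlocks(Map<String, List<String>> dealsByPoi) {
        List<List<String>> blocks = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : dealsByPoi.entrySet()) {
            List<String> block = new ArrayList<>(entry.getValue()); // child documents
            block.add(entry.getKey());                               // parent document goes last
            blocks.add(block);
        }
        return blocks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dealsByPoi = new LinkedHashMap<>();
        dealsByPoi.put("POI1", Arrays.asList("Deal2", "Deal3"));
        dealsByPoi.put("POI2", Arrays.asList("Deal1"));
        System.out.println(buildBlocks(dealsByPoi));
        // prints [[Deal2, Deal3, POI1], [Deal1, POI2]]
    }
}
```

Each inner list corresponds to one addDocuments call, which keeps the children and their parent adjacent in the index.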
c) Query
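import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.search.join.*;

// Filter that identifies the parent (POI) documents.
Filter poiFilter = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term(PoiLuceneField.attr_doctype, "poi"))));
// dealQuery matches child (deal) documents; the join maps each hit to its parent POI.
ToParentBlockJoinQuery query = new ToParentBlockJoinQuery(dealQuery, poiFilter, ScoreMode.Max);
ToParentBlockJoinCollector collector = new ToParentBlockJoinCollector(
    sort,                      // sort order for the POIs
    getOffset() + getLimit(),  // numHits, derived from POI paging
    true,                      // trackScores
    false);                    // trackMaxScore
indexSearcher.search(query, collector);
// Within each POI, order the matching deals by price.
Sort childSort = new Sort(new SortField(DealLuceneField.attr_price, SortField.Type.DOUBLE));
TopGroups<Integer> hits = collector.getTopGroups(
    query,
    childSort,
    getOffset(),       // parent doc offset
    maxDocsPerGroup,   // maxDocsPerGroup; the value is missing in the source, so this name is a placeholder
    0,                 // withinGroupOffset
    true);             // fillSortFields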
3 Practice
According to the official documentation, index time join is 30% faster than query time join. We therefore use index time join in our project, and the service is running well.