Facets with Lucene

Posted on August 1 by Pascal Dimassimo in Latest articles

During the development of our latest product, Norconex Content Analytics, we decided to add facets to the search interface. They make it easy to explore the indexed content. Solr and Elasticsearch both have facet implementations that work on top of Lucene. But Lucene also offers simple facet implementations that can be used out of the box. And because Norconex Content Analytics is based on Lucene, we decided to go with those implementations.

We'll look at those facet implementations in this blog post, but first, let's talk about a new feature of Lucene 4 that is used by all of them.

DocValues

DocValues are a new addition to Lucene 4. What are they? Simply put, they are a means to associate a value with each document. We can have multiple docvalues per document. Later on, we can retrieve the value associated with a specific document.

You may wonder how a docvalue is different from a stored field. The difference is that they are not optimized for the same usage. Whereas all stored fields of a single document are meant to be loaded together (for example, when we need to display the document as a search result), docvalues are meant to be loaded at once for all documents. For example, if you need to retrieve all the values of a field for all documents, then iterating through all documents and retrieving a stored field would be slow. If a docvalue is used, loading all of them for all documents is easy and efficient. A docvalue is stored in a column-stride fashion, so all of the values are kept together for easy access. You can learn more about DocValues in multiple places on the Web.
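To make the difference in access pattern concrete, here is a minimal sketch in plain Java. It only contrasts a row-oriented layout (similar in spirit to stored fields) with a column-oriented one (similar in spirit to a numeric docvalue); the class and method names are made up for illustration, and this is not how Lucene lays out its files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: row-oriented versus column-oriented access.
// Names are hypothetical; this is not Lucene's on-disk format.
class ColumnStrideSketch {

    // Row-oriented: each document keeps all of its fields together,
    // so reading one field for all documents touches every row.
    static long sumFromRows(List<Map<String, Object>> rows, String field) {
        long sum = 0;
        for (Map<String, Object> row : rows) {
            sum += (Long) row.get(field);
        }
        return sum;
    }

    // Column-oriented: one array per field, indexed by document number,
    // so all values of the field sit next to each other.
    static long sumFromColumn(long[] column) {
        long sum = 0;
        for (long value : column) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] prices = {100L, 250L, 40L};
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int doc = 0; doc < prices.length; doc++) {
            Map<String, Object> row = new HashMap<>();
            row.put("title", "doc" + doc);
            row.put("price", prices[doc]);
            rows.add(row);
        }
        System.out.println(sumFromRows(rows, "price"));
        System.out.println(sumFromColumn(prices));
    }
}
```

Both methods return the same sum; the point is that the column version never has to load the unrelated "title" values.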

Lucene allows different kinds of docvalues: numeric, binary, sorted (single string) and sorted set (multiple strings). For example, a docvalue to store a 'price' value should probably be a numeric docvalue. But if you want to save an alphanumeric identifier for each document, a sorted docvalue should be used.

Why is this important for faceting? Before DocValues, when an application wanted to do faceting, a common approach to build the facet values was field uninverting, that is, going over the values of an indexed field and rebuilding the original association between terms and documents. This process needed to be redone every time new documents were indexed. With DocValues, since the association between a document and a value is maintained in the index, the work needed to build facets is simplified.

Let's now look at the facet implementations in Lucene.
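As a rough illustration of what field uninverting involves, the following plain-Java sketch turns a tiny inverted index (term to document IDs) back into a per-document array of values. The class name and the single-valued-field assumption are mine; Lucene's actual uninverting code is considerably more involved.

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of field uninverting for a single-valued field.
// Hypothetical names; not Lucene's implementation.
class UninvertSketch {

    // postings: term -> IDs of the documents containing that term.
    // Returns an array indexed by document ID holding that document's term.
    static String[] uninvert(Map<String, int[]> postings, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("Douglas Adams", new int[] {0, 2});
        postings.put("Terry Pratchett", new int[] {1});
        String[] valueByDoc = uninvert(postings, 3);
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            System.out.println(doc + " -> " + valueByDoc[doc]);
        }
    }
}
```

The whole postings structure has to be walked to rebuild the array, which is why redoing this after every indexing round is costly compared to reading a docvalue that already stores the document-to-value association.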

STRING FACET

The first facet implementation available in Lucene that we'll look at is what we expect when we think of facets. It allows for counting the documents that share the same string value.

When indexing a document, you have to use a SortedSetDocValuesFacetField. Here is an example:

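    FacetsConfig config = new FacetsConfig();
    // Index the "author" dimension in the "facet_author" field
    config.setIndexFieldName("author", "facet_author");

    Document doc = new Document();
    doc.add(new SortedSetDocValuesFacetField("author", "Douglas Adams"));
    // config.build() turns the facet field into the indexed field and docvalue
    writer.addDocument(config.build(doc));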

With this, Lucene will create a "facet_author" field with the author value indexed in it. But Lucene will also create a docvalue named "facet_author" containing the value. When building the facets at search time, this docvalue will be used.

You've probably also noticed the FacetsConfig object. It allows us to associate a dimension name ("author") with a field name ("facet_author"). Actually, when Lucene indexes the value in the "facet_author" field and docvalue, it also prefixes the value with the dimension name. This allows us to have different facets (dimensions) indexed in the same field and docvalue. If we had omitted the call to setIndexFieldName, the facets would have been indexed in a field called '$facets' (and the same name for the docvalue).
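The idea of several dimensions sharing one field through a prefix can be sketched in plain Java as follows. The separator character and all names here are assumptions chosen for illustration; the sketch only shows why a prefix makes it possible to pull one dimension's values back out of a shared field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch: several dimensions stored in one field by prefixing
// each value with its dimension name. Separator and names are assumptions.
class DimensionPrefixSketch {
    static final char SEP = '\u001F';

    // Builds the value that would be stored in the shared field.
    static String encode(String dim, String value) {
        return dim + SEP + value;
    }

    // Extracts the values that belong to one dimension.
    static List<String> valuesOf(String dim, List<String> fieldValues) {
        String prefix = dim + SEP;
        List<String> out = new ArrayList<>();
        for (String stored : fieldValues) {
            if (stored.startsWith(prefix)) {
                out.add(stored.substring(prefix.length()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sharedField = Arrays.asList(
            encode("author", "Douglas Adams"),
            encode("year", "1979"),
            encode("author", "Terry Pratchett"));
        System.out.println(valuesOf("author", sharedField));
    }
}
```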

At search time, this is the code we would use to gather the author facets:

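    // Costly to build: loads the dimensions stored in the "facet_author" docvalue
    SortedSetDocValuesReaderState state =
        new DefaultSortedSetDocValuesReaderState(reader, "facet_author");

    // Collect the documents matching the query
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, query, 10, fc);

    // Count the matching documents for each distinct "author" value
    Facets facets = new SortedSetDocValuesFacetCounts(state, fc);
    FacetResult result = facets.getTopChildren(10, "author");
    for (int i = 0; i < result.childCount; i++) {
        LabelAndValue lv = result.labelValues[i];
        System.out.println(String.format("%s (%s)", lv.label, lv.value));
    }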

Here, the DefaultSortedSetDocValuesReaderState is responsible for loading all the dimensions from the specified docvalue (facet_author). Note that this 'state' object is costly to build, so it should be reused if possible. Then, SortedSetDocValuesFacetCounts is able to load the values of a specific dimension using the 'state' object and to compute the count for each distinct value.

You can find more code examples in the file SimpleSortedSetFacetsExample.java in the Lucene sources.
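The counting step itself can be pictured with a small plain-Java sketch. It assumes each distinct value has been given an ordinal, and each document carries the ordinal of its value; all names are hypothetical and this is a simplification, not Lucene's code.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of counting string facets by ordinal.
// Hypothetical names; a simplification of what a facet counter does.
class OrdinalCountSketch {

    // docToOrd[doc] is the ordinal of the value held by that document.
    // hits lists the IDs of the documents matching the query.
    static int[] count(int[] docToOrd, int numOrds, int[] hits) {
        int[] counts = new int[numOrds];
        for (int doc : hits) {
            counts[docToOrd[doc]]++;
        }
        return counts;
    }

    // Returns up to topN "label (count)" strings, highest count first.
    static List<String> top(String[] ordToLabel, int[] counts, int topN) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        ords.sort((a, b) -> Integer.compare(counts[b], counts[a]));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, ords.size()); i++) {
            int ord = ords.get(i);
            out.add(ordToLabel[ord] + " (" + counts[ord] + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        String[] labels = {"Douglas Adams", "Isaac Asimov", "Terry Pratchett"};
        int[] docToOrd = {0, 2, 0, 1, 2, 0};
        int[] hits = {0, 1, 2, 5};
        System.out.println(top(labels, count(docToOrd, labels.length, hits), 10));
    }
}
```

The label table plays the role of the costly 'state' object: it is built once, while the per-query work is only the counting loop over the hits.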

NUMERIC RANGE FACET

This next facet implementation is meant to be used with numbers to build range facets. For example, it can group documents in the same price range together.

When indexing a document, you have to add a numeric docvalue for each document. Like this:

    doc.add(new NumericDocValuesField("price", 100L));

In this case, we need to use a NumericDocValuesField and not a specialized FacetField.

When searching, we need to first define the set of ranges that we want. Here's how it could be built:

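    // Illustrative bounds: each range includes its minimum and excludes its maximum
    LongRange[] ranges = new LongRange[3];
    ranges[0] = new LongRange("0-10", 0, true, 10, false);
    ranges[1] = new LongRange("10-100", 10, true, 100, false);
    ranges[2] = new LongRange("100-1000", 100, true, 1000, false);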

With those ranges, we can build the facets:

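    // Collect the documents matching the query
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, query, 10, fc);

    // Count the matching documents falling in each range of the "price" docvalue
    Facets facets = new LongRangeFacetCounts("price", fc, ranges);
    FacetResult result = facets.getTopChildren(0, "price");
    for (int i = 0; i < result.childCount; i++) {
        LabelAndValue lv = result.labelValues[i];
        System.out.println(String.format("%s (%s)", lv.label, lv.value));
    }

Lucene will calculate the count for each range.

For a code sample, see RangeFacetsExample.java in the Lucene sources.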
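For intuition, range counting amounts to testing each matching document's numeric value against every range. The plain-Java sketch below shows that idea with inclusive and exclusive bounds; the class names and sample values are hypothetical, and a real implementation would be smarter than this nested loop.

```java
// Conceptual sketch of range facet counting over a numeric column.
// Hypothetical names and values; not Lucene's implementation.
class RangeCountSketch {

    static final class Range {
        final String label;
        final long min;
        final boolean minInclusive;
        final long max;
        final boolean maxInclusive;

        Range(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
            this.label = label;
            this.min = min;
            this.minInclusive = minInclusive;
            this.max = max;
            this.maxInclusive = maxInclusive;
        }

        boolean accept(long value) {
            boolean aboveMin = minInclusive ? value >= min : value > min;
            boolean belowMax = maxInclusive ? value <= max : value < max;
            return aboveMin && belowMax;
        }
    }

    // valueByDoc[doc] is the numeric value of the document; hits are matching doc IDs.
    static int[] count(long[] valueByDoc, int[] hits, Range[] ranges) {
        int[] counts = new int[ranges.length];
        for (int doc : hits) {
            for (int r = 0; r < ranges.length; r++) {
                if (ranges[r].accept(valueByDoc[doc])) {
                    counts[r]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Range[] ranges = {
            new Range("0-10", 0, true, 10, false),
            new Range("10-100", 10, true, 100, false),
            new Range("100-1000", 100, true, 1000, false)
        };
        long[] prices = {5L, 10L, 100L, 99L, 1000L};
        int[] counts = count(prices, new int[] {0, 1, 2, 3, 4}, ranges);
        for (int r = 0; r < ranges.length; r++) {
            System.out.println(ranges[r].label + " (" + counts[r] + ")");
        }
    }
}
```

Note that a value equal to an exclusive upper bound (1000 here) lands in no range at all, which is worth keeping in mind when choosing the bounds.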

TAXONOMY FACET

This was the first facet implementation in Lucene, and it was actually available before Lucene 4. This implementation is different from the others in several aspects. First, the unique values for a taxonomy facet are stored in a separate Lucene index (often called the sidecar index). Second, this implementation supports hierarchical facets.

For example, imagine a "path" facet where "path" represents where a file is on a filesystem (or the Web). Imagine the file "/home/mike/work/report.txt". If we were to store the path ("/home/mike/work") as a taxonomy facet, it would actually be split into 3 unique values: "/home", "/home/mike" and "/home/mike/work". Those 3 values are stored in the sidecar index, with each being assigned a unique ID. In the main index, a binary docvalue is created so that each document is assigned the ID of its corresponding path value (the ID from the sidecar index). In this example, if "/home/mike/work" is assigned ID 3 in the sidecar index, the docvalue for the document "/home/mike/work/report.txt" would be 3 in the main index. In the sidecar index, all values are linked together, so it is easy later on to retrieve the parents and children of each value. For example, "/home" would be the parent of "/home/mike", which would be the parent of "/home/mike/work". We'll see how this information is used.
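The splitting of a path into linked values with IDs can be sketched in plain Java like this. The sketch assumes ID 0 is reserved for a root entry so that the numbering happens to match the example above; the class and its methods are hypothetical and far simpler than the real sidecar index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of a taxonomy: every path prefix gets a unique ID
// and remembers its parent. Hypothetical class; ID 0 is an assumed root.
class TaxonomySketch {
    private final Map<String, Integer> idByPath = new HashMap<>();
    private final List<String> paths = new ArrayList<>();
    private final List<Integer> parents = new ArrayList<>();

    TaxonomySketch() {
        idByPath.put("/", 0);
        paths.add("/");
        parents.add(-1); // the root has no parent
    }

    // Registers every prefix of the path and returns the ID of the full path.
    int add(String... components) {
        int parent = 0;
        StringBuilder sb = new StringBuilder();
        for (String component : components) {
            sb.append('/').append(component);
            String path = sb.toString();
            Integer id = idByPath.get(path);
            if (id == null) {
                id = paths.size();
                idByPath.put(path, id);
                paths.add(path);
                parents.add(parent);
            }
            parent = id;
        }
        return parent;
    }

    int parentOf(int id) {
        return parents.get(id);
    }

    String pathOf(int id) {
        return paths.get(id);
    }

    public static void main(String[] args) {
        TaxonomySketch taxo = new TaxonomySketch();
        int id = taxo.add("home", "mike", "work");
        System.out.println(id + " = " + taxo.pathOf(id));
        System.out.println("parent = " + taxo.pathOf(taxo.parentOf(id)));
    }
}
```

Adding the same path twice returns the same ID, so a document only needs to store that one number to point at its whole chain of ancestors.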

Here's some code to index the path facet of a file under "/home/mike/work":

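    // The taxonomy writer writes to the sidecar index
    Directory dirTaxo = FSDirectory.open(pathTaxo);
    TaxonomyWriter taxo = new DirectoryTaxonomyWriter(dirTaxo);

    FacetsConfig config = new FacetsConfig();
    config.setIndexFieldName("path", "facet_path");
    // Mark the "path" dimension as hierarchical
    config.setHierarchical("path", true);

    Document doc = new Document();
    doc.add(new StringField("filename", "/home/mike/work/report.txt", Field.Store.YES));
    doc.add(new FacetField("path", "home", "mike", "work"));
    // Passing the taxonomy writer keeps the sidecar index up to date
    writer.addDocument(config.build(taxo, doc));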

Notice here that we need to create a taxonomy writer, which is used to write to the sidecar index. After that, we can add the actual facets. As with SortedSetDocValuesFacetField, we need to define the configuration of the facet field (dimension name and field name). We also have to indicate that the facets will be hierarchical. Once this is set, we can use FacetField with the dimension name and the whole hierarchy of values for the facet. Finally, we add the document to the main index (via the writer object), but we also need to pass the taxo writer object so that the sidecar index is also updated.

Here is some code to retrieve those facets:

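    // The taxonomy reader reads the facet values from the sidecar index
    DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(dirTaxo);

    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, query, 10, fc);

    Facets facetsFolder = new FastTaxonomyFacetCounts("facet_path", taxoReader, config, fc);
    FacetResult result = facetsFolder.getTopChildren(10, "path");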

For each matching document, the ID of the facet value is retrieved (via the docvalue). Lucene counts the documents for each unique facet value by counting how many documents are assigned to each ID. After this, it can fetch the actual facet values from the sidecar index using those IDs.

In the last example, we did not specify any specific path, so all facets for all paths are returned (including all child paths). But we could restrict the request to a more specific path to get only the facets underneath it, for example "/home/mike/work":

    FacetResult result = facetsFolder.getTopChildren(10, "path", "home", "mike", "work");

This is where the hierarchical aspect of the taxonomy facets gets interesting. Because of the relations kept between the facets in the sidecar index, Lucene is able to count the documents for the facets at different levels in the hierarchy.

Again, for more code examples on taxonomy facets, see MultiCategoryListsFacetsExample.java in the Lucene sources.
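One way to picture counting at several levels is to walk from each document's ID up through its parents, incrementing a counter at every step. The plain-Java sketch below does exactly that with a hypothetical parent array; it illustrates the principle and is not a description of how Lucene performs the roll-up internally.

```java
// Conceptual sketch of hierarchical counting with a parent array.
// Hypothetical names; the roll-up strategy here is an assumption.
class RollupSketch {

    // parents[id] is the parent ID, or -1 for the root.
    // docToId[doc] is the ID of the deepest path assigned to the document.
    static int[] countWithAncestors(int[] parents, int[] docToId, int[] hits) {
        int[] counts = new int[parents.length];
        for (int doc : hits) {
            // Count the document for its own value and for every ancestor.
            for (int id = docToId[doc]; id != -1; id = parents[id]) {
                counts[id]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // 0 = root, 1 = /home, 2 = /home/mike,
        // 3 = /home/mike/work, 4 = /home/mike/play
        int[] parents = {-1, 0, 1, 2, 2};
        int[] docToId = {3, 3, 4, 2};
        int[] counts = countWithAncestors(parents, docToId, new int[] {0, 1, 2, 3});
        for (int id = 0; id < counts.length; id++) {
            System.out.println(id + ": " + counts[id]);
        }
    }
}
```

With counts available for every level, asking for the children of "/home/mike" only requires reading the counters of the IDs whose parent is that node.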

Conclusion

So we've seen that Lucene offers facet implementations out of the box. A lot of interesting features can be built on top of them! For more info, refer to the Lucene sources and Javadoc.

Reposted from:

http://www.norconex.com/facets-with-lucene/
