Lucene Learning Two: index and range query analysis of numeric types

Last Update:2014-12-10 Source: Internet

Author: User

Lucene's index is built on character (term) data, so numeric values are ultimately converted to character form for both indexing and storage.

Early versions of Lucene had no public classes for numeric types. You had to convert the number to a string yourself and then add it to a field.

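Java code:

    Document doc = new Document();
    long i = 123456L;
    // Field.Index.NOT_ANALYZED indexes the value as a single, untokenized term
    doc.add(new Field("id", String.valueOf(i), Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);  // writer is an existing IndexWriter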

Converting numbers this way causes a problem for range queries.

Suppose the three numbers 123456, 123 and 222 are stored as above. The Lucene index keeps its terms in character (dictionary) order, in a skip-list structure.

The final order in the index is therefore 123, 123456, 222. Running a TermRangeQuery for the range [123, 222], as earlier versions required,

    new TermRangeQuery("id", "123", "222", true, true);  // find [123, 222]

returns all three values 123, 123456 and 222, even though 123456 lies outside the numeric range. The usual fix is to store every number with a fixed number of digits, padding shorter values with leading zeros so that string order matches numeric order.

The three values are stored as 000000123, 000123456 and 000000222, so the index order becomes 000000123, 000000222, 000123456.

Apply the same conversion to the bounds when querying:

    new TermRangeQuery("id", "000000123", "000000222", true, true);
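
The ordering problem and the zero-padding fix can be reproduced without Lucene. The following is a minimal sketch in plain Java; the class and method names (PaddedSortDemo, pad, sortAsStrings, inRange) and the fixed width of 9 digits are assumptions chosen to match the example values, not part of any Lucene API.

```java
import java.util.Arrays;
import java.util.Locale;

class PaddedSortDemo {
    // Left-pad a non-negative value with zeros to a fixed width of 9 digits.
    static String pad(long value) {
        return String.format(Locale.ROOT, "%09d", value);
    }

    // Sort the terms as plain strings, which is how a term index orders them.
    static String[] sortAsStrings(String... terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy);
        return copy;
    }

    // True if term lies in [lower, upper] under string comparison.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded: 123456 sorts between 123 and 222.
        System.out.println(Arrays.toString(sortAsStrings("123456", "123", "222")));
        // Padded: string order now agrees with numeric order.
        System.out.println(Arrays.toString(sortAsStrings(pad(123456), pad(123), pad(222))));
        System.out.println(inRange("123456", "123", "222"));               // true (wrong hit)
        System.out.println(inRange(pad(123456), pad(123), pad(222)));     // false
    }
}
```

Padding only works while every value fits in the chosen width and is non-negative, which is one reason a dedicated numeric encoding is preferable.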

This approach has two performance problems:

1: If the range is expanded into individual terms, such as 000000123, 000000124, 000000125 ... 000000222, each term is queried separately and the result sets are merged. That can mean far too many term queries.

2: If the index is instead scanned from the start term 000000123 until the end term 000000222 is reached, a wide range means traversing a very large number of terms.

Later versions added support for numeric types: NumericField is used to instantiate a field, and NumericRangeQuery provides an optimized range query for numeric values.

The latest versions (4.0 and later) provide the more specific numeric field types IntField, LongField, FloatField and DoubleField.

Index code:

    Document doc = new Document();
    // h is the object being indexed; getId() returns its long id
    LongField idField = new LongField("id", h.getId(), Field.Store.YES);
    doc.add(idField);
    writer.addDocument(doc);

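Query code:

    // min and max are placeholders for the lower and upper bounds of the range
    NumericRangeQuery.newLongRange("id", min, max, true, true);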

When a numeric value is indexed, it is converted into several lexicographically sortable strings, which are then indexed in a trie (prefix tree) structure.

For example, suppose num1 is broken down into A, AB, ABC and num2 into A, AB, ABD.

"Figure 1":

Searching for AB finds both num1 and num2, since they share the prefix AB. In a range search, values inside the range that share a prefix can be matched through that single prefix term, returning many docs at once and reducing the number of term lookups.
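
The prefix lookup can be imitated with a tiny inverted index. This is only a sketch: it uses a plain map from prefix term to document names rather than a real trie, and the class PrefixIndexDemo and its methods are hypothetical names introduced for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class PrefixIndexDemo {
    // Maps each prefix term to the documents that were indexed under it.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Index a document under every prefix term derived from its value.
    void add(String doc, String... prefixTerms) {
        for (String term : prefixTerms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(doc);
        }
    }

    // A single term lookup returns every document sharing that prefix.
    Set<String> lookup(String term) {
        return postings.getOrDefault(term, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        PrefixIndexDemo index = new PrefixIndexDemo();
        index.add("num1", "A", "AB", "ABC");
        index.add("num2", "A", "AB", "ABD");
        System.out.println(index.lookup("AB"));   // [num1, num2]
        System.out.println(index.lookup("ABC"));  // [num1]
    }
}
```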

Here's how index and range queries work for numeric types.

1: Binary representation of numeric values

Take long as an example: it has one sign bit followed by 63 value bits. A sign bit of 0 means a positive number and 1 means a negative number.

For positive numbers, the larger the low 63 bits, the larger the number; in two's complement the same holds for negative numbers.

If the sign bit is inverted, Long.MIN_VALUE through Long.MAX_VALUE map to 0x0000,0000,0000,0000 through 0xFFFF,FFFF,FFFF,FFFF.

After this conversion, the values already sort from smallest to largest at the character (bit pattern) level.
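
A short check of the sign-bit trick, as a sketch in plain Java. The class name SignFlipDemo and the helper toSortableBits are illustrative assumptions; the XOR constant is the one that appears in the Lucene method quoted later in this article.

```java
class SignFlipDemo {
    // Invert the sign bit so that signed order equals unsigned bit-pattern order.
    static long toSortableBits(long value) {
        return value ^ 0x8000000000000000L;
    }

    // True if comparing the flipped bits as unsigned agrees with the signed comparison.
    static boolean orderPreserved(long a, long b) {
        int signed = Integer.signum(Long.compare(a, b));
        int unsigned = Integer.signum(
                Long.compareUnsigned(toSortableBits(a), toSortableBits(b)));
        return signed == unsigned;
    }

    public static void main(String[] args) {
        long[] samples = {Long.MIN_VALUE, -1L, 0L, 1L, Long.MAX_VALUE};
        for (long v : samples) {
            // Print each value next to its sortable bit pattern.
            System.out.printf("%20d -> 0x%016x%n", v, toSortableBits(v));
        }
    }
}
```

With the samples above, Long.MIN_VALUE maps to 0x0000000000000000, 0 maps to 0x8000000000000000 and Long.MAX_VALUE maps to 0xffffffffffffffff.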

2: How to split the prefix

Take 0x0000,0000,0000,f234 as an example and shift it right 4 bits at a time.

1: 0x0000,0000,0000,f23 is the common prefix of all values in the range 0x0000,0000,0000,f230 to 0x0000,0000,0000,f23f

2: 0x0000,0000,0000,f2 is the common prefix of all values in the range 0x0000,0000,0000,f200 to 0x0000,0000,0000,f2ff

3: 0x0000,0000,0000,f is the common prefix of all values in the range 0x0000,0000,0000,f000 to 0x0000,0000,0000,ffff

....

0x0

A value shifted right by some number of bits can therefore be used as a key that represents the corresponding range. The key can be understood as a numeric prefix.
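
The relationship between a shifted key and the range it stands for can be computed directly. This is a small sketch under the assumption that the shift is between 0 and 63; PrefixRangeDemo, rangeStart and rangeEnd are hypothetical names.

```java
class PrefixRangeDemo {
    // Smallest value that shares the prefix left after shifting right by `shift` bits.
    static long rangeStart(long value, int shift) {
        return (value >>> shift) << shift;
    }

    // Largest value that shares that prefix: all shifted-out bits set to 1.
    static long rangeEnd(long value, int shift) {
        return rangeStart(value, shift) | ((1L << shift) - 1L);
    }

    public static void main(String[] args) {
        long value = 0xF234L;
        for (int shift = 4; shift <= 12; shift += 4) {
            // Prints the prefix (key) and the value range it represents.
            System.out.printf("shift=%2d prefix=0x%x covers 0x%04x..0x%04x%n",
                    shift, value >>> shift,
                    rangeStart(value, shift), rangeEnd(value, shift));
        }
    }
}
```

For 0xF234 this yields the ranges 0xf230..0xf23f, 0xf200..0xf2ff and 0xf000..0xffff, matching the three steps listed above.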

3: Folding a large range into smaller ranges

When querying, Lucene reduces the number of lookups by covering the large middle part of a range with a few short prefixes, and only the small remainders at each end with longer, more precise ones.

4: Implementation of the index for numeric types

Set a precisionStep (default 4). The value is shifted right repeatedly, the n-th time by (n-1) * precisionStep bits.

After each shift, the remaining bits are stored seven bits to a byte, forming a byte[],

and a special byte is inserted at index 0 of the array to identify the shift used.

Each byte[] can then be turned into a lexicographically sortable string.

The characters of lexicographically sortable strings sort in dictionary order, and for a given shift that order is the same as the numeric order of the values. This is the key to how NumericRangeQuery finds ranges!

A long has 64 bits in total, so with precisionStep=4 each value yields 16 lexicographically sortable strings.

That is, one long value corresponds to 16 prefixes. Lucene's inverted index then stores them in an index structure like the one in "Figure 1".
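
The count of 16 follows from producing one term per shift value 0, precisionStep, 2 * precisionStep, ... below the value size. A minimal sketch of that count, assuming this is how the shifts are enumerated; TermCountDemo and termsPerValue are illustrative names.

```java
class TermCountDemo {
    // One term is produced for each shift 0, p, 2p, ... that stays below valSize bits.
    static int termsPerValue(int valSize, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        int count = 0;
        // A long counter avoids int overflow when precisionStep is very large.
        for (long shift = 0; shift < valSize; shift += precisionStep) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(termsPerValue(64, 4));                 // 16
        System.out.println(termsPerValue(64, 8));                 // 8
        System.out.println(termsPerValue(64, Integer.MAX_VALUE)); // 1
    }
}
```

A smaller precisionStep therefore means more terms per value and a larger index, while a very large one leaves only the full-precision term.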

Key code for the split:

The longToPrefixCodedBytes() method of the org.apache.lucene.util.NumericUtils class

    public static void longToPrefixCodedBytes(final long val, final int shift, final BytesRefBuilder bytes) {
      if ((shift & ~0x3f) != 0)  // ensure shift is 0..63
        throw new IllegalArgumentException("Illegal shift value, must be 0..63");
      // Calculate the size of the byte[]: every seven bits are stored in one byte
      int nChars = (((63-shift)*37)>>8) + 1;    // i/7 is the same as (i*37)>>8 for i in 0..63
      // index 0 additionally holds the shift, hence the +1
      bytes.setLength(nChars+1);   // one extra for the byte that contains the shift info
      bytes.grow(BUF_SIZE_LONG);
      // identify the shift
      bytes.setByteAt(0, (byte)(SHIFT_START_LONG + shift));
      // invert the sign bit
      long sortableBits = val ^ 0x8000000000000000L;
      // shift right; the first shift is 0, then it grows by precisionStep each time
      sortableBits >>>= shift;
      while (nChars > 0) {
        // Store 7 bits per byte for compatibility
        // with UTF-8 encoding of terms
        // 7 bits per byte with a leading 0 bit is an ASCII code in UTF-8; store it in the array
        bytes.setByteAt(nChars--, (byte)(sortableBits & 0x7f));
        sortableBits >>>= 7;
      }
    }
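
The method above depends on Lucene's BytesRefBuilder. As a rough, dependency-free sketch of the same steps, the version below writes into a plain byte[] so the ordering property can be tried out directly. The class name PrefixCodedSketch, the compare helper and the marker base of 0x20 are assumptions made for this illustration and should not be read as Lucene's API.

```java
class PrefixCodedSketch {
    // Assumed base for the shift marker byte; treat the exact value as illustrative.
    static final int SHIFT_START_LONG = 0x20;

    // Encode val at the given shift: marker byte first, then 7 bits per byte.
    static byte[] encode(long val, int shift) {
        if ((shift & ~0x3f) != 0) {
            throw new IllegalArgumentException("Illegal shift value, must be 0..63");
        }
        int nChars = (((63 - shift) * 37) >> 8) + 1; // ceil((64 - shift) / 7)
        byte[] out = new byte[nChars + 1];
        out[0] = (byte) (SHIFT_START_LONG + shift);   // records the shift
        long sortableBits = val ^ 0x8000000000000000L; // invert the sign bit
        sortableBits >>>= shift;                        // drop the low-order bits
        while (nChars > 0) {
            out[nChars--] = (byte) (sortableBits & 0x7f); // fill from the right
            sortableBits >>>= 7;
        }
        return out;
    }

    // Dictionary-order comparison of two encodings, byte by byte as unsigned values.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // At shift 0, dictionary order of the encodings follows numeric order.
        System.out.println(compare(encode(-5L, 0), encode(123L, 0)) < 0);    // true
        System.out.println(compare(encode(222L, 0), encode(123456L, 0)) < 0); // true
        // At shift 4, values that differ only in their low 4 bits share one term.
        System.out.println(compare(encode(0x91L, 4), encode(0x9FL, 4)) == 0); // true
    }
}
```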

5: Range query

The general idea is to split the range starting from both ends. The low-order part at each end is split off as its own interval at the current precision, then the shift grows by precisionStep and the next, higher-order part is split off in the same way.

Finally, each value in the resulting intervals is converted to a lexicographically sortable string in the same way as at index time, using that interval's shift, and these terms are searched.

Code:

The splitRange() method of the org.apache.lucene.util.NumericUtils class

    private static void splitRange(
      final Object builder, final int valSize,
      final int precisionStep, long minBound, long maxBound
    ) {
      if (precisionStep < 1)
        throw new IllegalArgumentException("precisionStep must be >=1");
      if (minBound > maxBound) return;
      for (int shift=0; ; shift += precisionStep) {
        // calculate new bounds for inner precision
        final long diff = 1L << (shift+precisionStep),
          mask = ((1L<<precisionStep) - 1L) << shift;
        final boolean
          hasLower = (minBound & mask) != 0L,
          hasUpper = (maxBound & mask) != mask;
        final long
          nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask,
          nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
        final boolean
          lowerWrapped = nextMinBound < minBound,
          upperWrapped = nextMaxBound > maxBound;

        if (shift+precisionStep>=valSize || nextMinBound>nextMaxBound || lowerWrapped || upperWrapped) {
          // We are in the lowest precision or the next precision is not available.
          addRange(builder, valSize, minBound, maxBound, shift);
          // exit the split recursion loop
          break;
        }

        if (hasLower)
          addRange(builder, valSize, minBound, minBound | mask, shift);
        if (hasUpper)
          addRange(builder, valSize, maxBound & ~mask, maxBound, shift);

        // recurse to next precision
        minBound = nextMinBound;
        maxBound = nextMaxBound;
      }
    }

For example, with precisionStep=4 the range 1001,0001 to 1111,0010 is split into:

1: 1001,0001 to 1001,1111 (at shift 0, 0x91 to 0x9F is 15 terms)

and 1111,0000 to 1111,0010 (at shift 0, 0xF0 to 0xF2 is 3 terms)

2: 1010,0000 to 1110,1111 (after a right shift of 4 bits, 0xA to 0xE is 5 terms)

Looking up these 23 lexicographically sortable strings covers the entire range.
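
The worked example can be replayed with a standalone copy of the splitting loop. This sketch follows the logic of the splitRange() listing but records each sub-range as {shift, lowest prefix, highest prefix} instead of calling addRange(); the class SplitRangeSketch and its methods are hypothetical, and the sketch assumes small non-negative bounds.

```java
import java.util.ArrayList;
import java.util.List;

class SplitRangeSketch {
    // Returns sub-ranges as {shift, lowestPrefix, highestPrefix}.
    static List<long[]> split(int valSize, int precisionStep, long minBound, long maxBound) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<long[]> ranges = new ArrayList<>();
        if (minBound > maxBound) return ranges;
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            boolean hasLower = (minBound & mask) != 0L;
            boolean hasUpper = (maxBound & mask) != mask;
            long nextMin = (hasLower ? minBound + diff : minBound) & ~mask;
            long nextMax = (hasUpper ? maxBound - diff : maxBound) & ~mask;
            boolean lowerWrapped = nextMin < minBound;
            boolean upperWrapped = nextMax > maxBound;
            if (shift + precisionStep >= valSize || nextMin > nextMax
                    || lowerWrapped || upperWrapped) {
                // Lowest precision reached, or no coarser level fits: emit the rest.
                ranges.add(new long[] {shift, minBound >> shift, maxBound >> shift});
                break;
            }
            if (hasLower) {
                ranges.add(new long[] {shift, minBound >> shift, (minBound | mask) >> shift});
            }
            if (hasUpper) {
                ranges.add(new long[] {shift, (maxBound & ~mask) >> shift, maxBound >> shift});
            }
            minBound = nextMin;
            maxBound = nextMax;
        }
        return ranges;
    }

    // Total number of terms needed to cover all sub-ranges.
    static long countTerms(List<long[]> ranges) {
        long total = 0;
        for (long[] r : ranges) total += r[2] - r[1] + 1;
        return total;
    }

    public static void main(String[] args) {
        List<long[]> ranges = split(64, 4, 0x91L, 0xF2L);
        for (long[] r : ranges) {
            System.out.printf("shift=%d prefixes 0x%x..0x%x%n", r[0], r[1], r[2]);
        }
        System.out.println(countTerms(ranges)); // 23
    }
}
```

For the bounds 0x91 and 0xF2 the sketch produces three sub-ranges: 0x91..0x9f and 0xf0..0xf2 at shift 0, and 0xa..0xe at shift 4.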

Official note:

http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/NumericRangeQuery.html

On the other hand, if the precisionStep is smaller, the maximum number of terms to match reduces, which optimizes query speed. The formula to calculate the maximum number of terms that will be visited while executing the query is:

maxQueryTerms = (bitsPerValue / precisionStep - 1) * (2^precisionStep - 1) * 2 + (2^precisionStep - 1)

For longs stored using a precision step of 4, maxQueryTerms = 15*15*2 + 15 = 465, and for a precision step of 2, maxQueryTerms = 31*3*2 + 3 = 189. But the faster search speed is reduced by more seeking in the term enum of the index. Because of this, the ideal precisionStep value can only be found out by testing. Important: You can index with a lower precision step value and test search speed using a multiple of the original step value.
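
The two figures quoted above are consistent with the following computation. It is a sketch inferred from those numbers, assuming precisionStep divides bitsPerValue evenly and is small enough for the shift not to overflow; MaxQueryTermsDemo and maxQueryTerms are illustrative names.

```java
class MaxQueryTermsDemo {
    // Upper bound on the number of terms a range query may visit.
    static long maxQueryTerms(int bitsPerValue, int precisionStep) {
        long perLevel = (1L << precisionStep) - 1;       // 2^precisionStep - 1
        long levels = bitsPerValue / precisionStep - 1;  // levels with two partial ends
        return levels * perLevel * 2 + perLevel;
    }

    public static void main(String[] args) {
        System.out.println(maxQueryTerms(64, 4)); // 465
        System.out.println(maxQueryTerms(64, 2)); // 189
    }
}
```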

http://lucene.apache.org/core/4_10_2/core/index.html

To sort according to a LongField, use the normal numeric sort types, eg SortField.Type.LONG.

If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of Integer.MAX_VALUE.

In other words, if the value is used only as a sort field and never needs a range query, specify SortField.Type.LONG as the sort type when sorting

and index with precisionStep=Integer.MAX_VALUE. Only a single lexicographically sortable string, the one at shift 0, is then produced, which reduces the index size.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
