Lucene's numeric index and range query

Last Update:2015-12-30 Source: Internet

Author: User

Having gained a solid grasp of the inverted index (data structures and algorithms), the scoring system, and the word segmentation system of text search engines, I have long been interested in how numeric values are indexed and searched. I recently studied Lucene's numeric indexing and range search, and the main points are organized as follows:

1. Lucene does not directly support searching numeric values (or ranges); values must be converted to strings;

2. Initial schemes for numeric search in Lucene;

3. How Lucene indexes numeric values and supports range queries.

1. Lucene does not directly support numeric search

Lucene does not directly support searching numeric values (or ranges); values must be converted to strings. This follows from the core of the inverted index: Lucene requires terms to be arranged in dictionary order (lexicographically sortable). Simply converting a numeric value to a string causes a number of problems, as the schemes below show.

2. Initial schemes for numeric search in Lucene

2.1 If you store 11, 24, 3, 50 directly as strings, a dictionary-order query for the range [24,50] will also return 3, because "3" sorts between "24" and "50". A simple fix is to pad the values to fixed-length strings, such as 000011, 000024, 000003, 000050. The string range query [000024,000050] then returns the correct result.

2.2 At indexing time, sort the terms in numeric order instead. For the example above the order is 3, 11, 24, 50, and the search returns the correct result.

Obviously, both schemes have drawbacks:

The problem with scheme 2.1 is that the fixed width is hard to choose: too many digits waste space, and too few limit the range of values that can be stored.

The problem with scheme 2.2 is that a query for the range [24,50] must be expanded into the individual values 24, 25, 26, ..., 50 and run as a Boolean query, which is unacceptably inefficient.
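The ordering problem in scheme 2.1 and its zero-padding fix can be reproduced in a few lines. This is a minimal Python sketch, not Lucene code; the helper `string_range` and the width of 6 are illustrative assumptions.

```python
def string_range(terms, low, high):
    """Return the terms whose string form lies in [low, high] in dictionary order."""
    return sorted(t for t in terms if low <= t <= high)

values = [11, 24, 3, 50]

# Plain decimal strings: "3" sorts between "24" and "50", so it is matched by mistake.
plain = [str(v) for v in values]
plain_hits = string_range(plain, "24", "50")

# Zero-padded to a fixed width (6 is an arbitrary choice): string order equals numeric order.
WIDTH = 6
padded = [str(v).zfill(WIDTH) for v in values]
padded_hits = string_range(padded, "24".zfill(WIDTH), "50".zfill(WIDTH))

print(plain_hits)   # ['24', '3', '50']
print(padded_hits)  # ['000024', '000050']
```

The width has to be fixed up front, which is exactly the trade-off described for scheme 2.1.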

3. How Lucene indexes numeric values and supports range queries

The values can first be converted into strings in a way that preserves their order. That is, if number1 < number2, then transform(number1) < transform(number2), where transform is the function that turns a value into a string. In mathematical terms, transform is monotonic.

* Note that all values indexed in one field must be of the same type; for example, the same field cannot hold both int and float values.

3.1 When Lucene indexes a NumericField, it first converts the numeric value to a lexicographically sortable binary form, then repeatedly shifts it right by a fixed step (the precision step) and indexes each shifted value as a lexicographically sortable string. This is essentially equivalent to building a trie.
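As a rough illustration of the trie view, the following Python snippet lists the (shift, prefix) pairs that would be indexed for one value. It is a sketch under stated assumptions, not Lucene's implementation: the function name `trie_terms` is hypothetical, and the value is assumed to be non-negative and already in sortable form.

```python
def trie_terms(sortable_value, precision_step=4, bits=64):
    """List (shift, prefix) pairs for one value, one pair per precision level."""
    return [(shift, sortable_value >> shift)
            for shift in range(0, bits, precision_step)]

terms = trie_terms(0x2710)  # 10000
print(len(terms))           # 16 levels for a 64-bit value with step 4
print([(s, hex(p)) for s, p in terms[:4]])

# Two nearby values differ only at shift=0 and share every coarser prefix,
# which is what lets one coarse term stand in for many exact values.
a, b = trie_terms(0x2700), trie_terms(0x270F)
print(a[0] != b[0], a[1:] == b[1:])  # True True
```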

How is a numeric value turned into lexicographically sortable binary, that is, a form in which byte-wise dictionary order equals numeric order?

For the binary representation of long, see http://en.wikipedia.org/wiki/Two's_complement

The highest bit is the sign bit: 0 for non-negative numbers, 1 for negative numbers. For positive numbers, the larger the lower 63 bits, the larger the number. The same holds for negative numbers (0xFFFFFFFFFFFFFFFF is -1, the largest negative integer). So flipping the sign bit is enough to map a long to lexicographically sortable binary.
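The sign-bit flip for long values can be checked with a short Python sketch. The function name `long_to_sortable_bytes` is an illustrative assumption, and the big-endian 8-byte output is chosen here only so that byte comparison is easy to see.

```python
import struct

SIGN_BIT = 1 << 63
MASK64 = (1 << 64) - 1

def long_to_sortable_bytes(value):
    """Map a signed 64-bit integer to 8 bytes whose byte order matches numeric order."""
    unsigned = value & MASK64                      # two's-complement bit pattern
    return struct.pack(">Q", unsigned ^ SIGN_BIT)  # flip the sign bit, big-endian

samples = [-(1 << 63), -10000, -1, 0, 1, 10000, (1 << 63) - 1]
encoded = [long_to_sortable_bytes(v) for v in samples]

print(encoded == sorted(encoded))          # True
print(long_to_sortable_bytes(-1).hex())    # 7fffffffffffffff
print(long_to_sortable_bytes(0).hex())     # 8000000000000000
```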

For the binary representation of double, see http://en.wikipedia.org/wiki/Binary64

The real value assumed by a given 64-bit double-precision datum with a given biased exponent e and a 52-bit fraction is

(-1)^sign * (1.b51 b50 ... b0)_2 * 2^(e-1023)

For a positive double, the larger the lower 63 bits, the larger the number; for a negative double, the larger the lower 63 bits, the smaller the number. The negative case is the opposite of long, so for a double less than 0 the lower 63 bits are inverted first, and then the sign bit is flipped exactly as for long. This maps a double to lexicographically sortable binary.

The 32-bit int and float types work on the same principle and are not discussed further.
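The two steps for double (invert the lower 63 bits of negative values, then flip the sign bit) can be sketched in Python as follows. The function name is hypothetical and NaN is deliberately left out of the samples.

```python
import struct

SIGN_BIT = 1 << 63
LOW_63 = SIGN_BIT - 1

def double_to_sortable_bytes(value):
    """Map a double to 8 bytes whose byte order matches numeric order (NaN not handled)."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]  # raw IEEE 754 bits
    if bits & SIGN_BIT:
        bits ^= LOW_63                         # negative: invert the lower 63 bits
    return struct.pack(">Q", bits ^ SIGN_BIT)  # then flip the sign bit, as for long

samples = [float("-inf"), -1e300, -2.5, -1.0, -1e-300,
           0.0, 1e-300, 1.0, 2.5, 1e300, float("inf")]
encoded = [double_to_sortable_bytes(v) for v in samples]

print(encoded == sorted(encoded))           # True
print(double_to_sortable_bytes(1.0).hex())  # bff0000000000000
print(double_to_sortable_bytes(-1.0).hex()) # 400fffffffffffff
```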

3.2 Use the properties of the trie to decompose a range query into as few term queries as possible, and then search with those term queries.

The principle is as follows. shift starts at 0 and increases by precisionStep on each round. For each shift, the algorithm tries to split off up to two sub-ranges, lower and upper, and then recurses on the remaining middle range until no further split is possible, at which point the remaining range becomes the center range. For a sub-range split off at shift=n, every value between minBound with its low n bits set to 0 and maxBound with its low n bits set to 1 lies inside the queried range. The basic idea is similar to a binary indexed tree (Fenwick tree).

An example makes this easier to see. For the range [1, 10000], splitRange produces the following ranges:

shift: 0

lower: [0x1, 0xF] covers 1 (0x1) to 15 (0xF)

upper: [0x2710, 0x2710] covers 10000 (0x2710) to 10000 (0x2710)

shift: 4

lower: [0x10, 0xF0] covers 16 (0x10) to 255 (0xFF)

upper: [0x2700, 0x2700] covers 9984 (0x2700) to 9999 (0x270F)

shift: 8

lower: [0x100, 0xF00] covers 256 (0x100) to 4095 (0xFFF)

upper: [0x2000, 0x2600] covers 8192 (0x2000) to 9983 (0x26FF)

shift: 12

center: [0x1000, 0x1000] covers 4096 (0x1000) to 8191 (0x1FFF)

That is 7 ranges in total. The last one is the center range, and together the 7 ranges cover exactly [1, 10000].
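The splitting procedure can be sketched in Python to reproduce the 7 ranges for [1, 10000]. This is a simplified sketch of the algorithm as described here, not Lucene's source: the function name `split_range` and the tuple layout are assumptions, and the bounds are assumed to be non-negative so that Python's unbounded integers never need the overflow checks a 64-bit implementation would.

```python
def split_range(min_bound, max_bound, precision_step=4, val_size=64):
    """Split [min_bound, max_bound] into (shift, kind, lo, hi) sub-ranges.

    lo and hi are the raw bounds; the values actually covered run from lo
    to hi with its low `shift` bits set to 1.
    """
    ranges = []
    shift = 0
    while True:
        diff = 1 << (shift + precision_step)
        mask = ((1 << precision_step) - 1) << shift
        has_lower = (min_bound & mask) != 0
        has_upper = (max_bound & mask) != mask
        next_min = ((min_bound + diff) if has_lower else min_bound) & ~mask
        next_max = ((max_bound - diff) if has_upper else max_bound) & ~mask
        if shift + precision_step >= val_size or next_min > next_max:
            # No coarser level is usable: what is left is the center range.
            ranges.append((shift, "center", min_bound, max_bound))
            break
        if has_lower:
            ranges.append((shift, "lower", min_bound, min_bound | mask))
        if has_upper:
            ranges.append((shift, "upper", max_bound & ~mask, max_bound))
        min_bound, max_bound = next_min, next_max
        shift += precision_step
    return ranges

ranges = split_range(1, 10000)
total_terms = 0
total_values = 0
for shift, kind, lo, hi in ranges:
    covered_hi = hi | ((1 << shift) - 1)   # fill the low bits with 1s
    n_terms = ((hi - lo) >> shift) + 1     # terms to look up at this level
    total_terms += n_terms
    total_values += covered_hi - lo + 1
    print(f"shift={shift:2d} {kind:6s} [{lo:#x}, {hi:#x}] "
          f"covers {lo}..{covered_hi} terms={n_terms}")

print(total_terms, total_values)  # 55 10000
```

The two totals show the point of the decomposition: 10000 values are covered, but at most 55 terms have to be looked up.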

addRange shifts the minBound and maxBound of each split-off long range right by shift bits and converts them to lexicographically sortable strings. As at indexing time, each string is prefixed with a byte that encodes the shift. Because shift increases by precisionStep, the lexicographically sortable string ranges produced by splitRange are in increasing order (compared byte by byte). Finding all the terms in these ranges therefore only requires seeking forward through the terms of this field, never backward.

For the example above, the 7 ranges are converted to lexicographically sortable strings, which are then used to find all the terms that fall within these ranges.

Take shift: 8 as an example.

lower: [0x100, 0xF00] covers 256 (0x100) to 4095 (0xFFF)

For 0x100, flipping the highest bit gives 0x80,00,00,00,00,00,01,00. Shifting right by 8 bits gives 0x80,00,00,00,00,00,01, and then every 7 bits becomes one byte:

0x40, 00, 00, 00, 00, 00, 00, 01

0xF00 likewise becomes 0x40, 00, 00, 00, 00, 00, 00, 0F.

A byte representing the shift is added at the front, so the final lexicographically sortable strings are:

0x100 -> 0x28, 40, 00, 00, 00, 00, 00, 00, 01

0xF00 -> 0x28, 40, 00, 00, 00, 00, 00, 00, 0F

The first byte, 0x28, encodes shift=8. 0x20 is the offset, which distinguishes between numeric types.
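The byte layout worked out above (shift byte with offset 0x20, then 7 bits per byte) can be reproduced with a small Python sketch. The function name `long_to_prefix_coded` is used here for illustration and the code only mirrors the steps described in this article; it should not be read as the exact encoding of any particular Lucene version.

```python
SHIFT_START_LONG = 0x20   # offset for long values, as given in the article
SIGN_BIT = 1 << 63
MASK64 = (1 << 64) - 1

def long_to_prefix_coded(value, shift):
    """Encode a signed 64-bit value at the given shift as a sortable byte string."""
    sortable = ((value & MASK64) ^ SIGN_BIT) >> shift  # flip sign bit, drop low bits
    n_chars = (63 - shift) // 7 + 1                    # 7 payload bits per byte
    out = [0] * (n_chars + 1)
    out[0] = SHIFT_START_LONG + shift                  # leading byte encodes the shift
    for i in range(n_chars, 0, -1):                    # fill from the last byte backwards
        out[i] = sortable & 0x7F
        sortable >>= 7
    return bytes(out)

def show(encoded):
    return " ".join(f"{b:02x}" for b in encoded)

low = long_to_prefix_coded(0x100, 8)
high = long_to_prefix_coded(0xF00, 8)
print(show(low))    # 28 40 00 00 00 00 00 00 01
print(show(high))   # 28 40 00 00 00 00 00 00 0f
print(low < high)   # True
```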

So to match all 3,840 values in [256, 4095], only 15 terms need to be looked up:

0x28, 40, 00, 00, 00, 00, 00, 00, 01 ~ 0x28, 40, 00, 00, 00, 00, 00, 00, 0F

Overall, [1, 10000] contains 10,000 values, but the maximum number of terms to look up is 55:

[0x1, 0xF] 15

[0x2710, 0x2710] 1

[0x10, 0xF0] 15

[0x2700, 0x2700] 1

[0x100, 0xF00] 15

[0x2000, 0x2600] 7

[0x1000, 0x1000] 1

Without the trie, up to 10,000 terms would have to be looked up.

Theoretically, for precisionStep=4, how many terms does a range query need to look up at most?

From splitRange you can see that every shift except the last produces at most two ranges (lower and upper), and the last shift produces the center range.

A 64-bit value can be shifted at most 64/4 = 16 times. So there are at most 15 lower ranges and 15 upper ranges, plus 1 center range, and each range covers at most 15 terms.

Why not 16 terms? If a range covered 16 terms, there would be no point in keeping it at this level; it can be promoted to the next shift.

Only one case is special and cannot be promoted: the range [Long.MIN_VALUE, Long.MAX_VALUE] has only a center range at shift=60, covering 16 terms.

So theoretically, for precisionStep=4, the number of terms to look up is at most 31 ranges * 15 terms/range = 465.

More generally:

n = [(bitsPerValue/precisionStep - 1) * (2^precisionStep - 1) * 2] + (2^precisionStep - 1)

precisionStep=8, n=3825

precisionStep=2, n=189
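The formula can be checked numerically with a short Python sketch; the function name `max_terms` is an illustrative assumption, and precisionStep is assumed to divide bitsPerValue evenly.

```python
def max_terms(bits_per_value, precision_step):
    """Upper bound on the number of terms one range query has to look up."""
    levels = bits_per_value // precision_step   # number of shift levels
    per_range = (1 << precision_step) - 1       # at most 2^step - 1 terms per range
    # (levels - 1) levels each contribute a lower and an upper range,
    # and the last level contributes the single center range.
    return (levels - 1) * per_range * 2 + per_range

for step in (2, 4, 8):
    print(step, max_terms(64, step), 64 // step)
# 2 189 32
# 4 465 16
# 8 3825 8
```

The third column is the number of terms indexed per 64-bit value, which grows as precisionStep shrinks.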

Obviously, the smaller precisionStep is, the smaller n is. But a smaller precisionStep also means that more terms must be indexed for each value: a 64-bit value needs 64/precisionStep index terms.

The discussion above focuses on searching a LongField. A DoubleField needs only one extra step: for a double less than 0, invert the lower 63 bits; after that the process is exactly the same as for LongField. For int and float, the only change is that the numeric type is 32 bits instead of 64 bits; everything else is the same.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
