Lucene Index File Size Optimization Summary


With the rapid growth of the business, the zip-compressed Lucene index files are approaching the GB scale. Keeping the index file size within an acceptable range is therefore necessary: it not only improves index transfer and read speed, but also improves the efficiency of index caching (Lucene caches index files when they are opened, e.g., MMapDirectory in memory-mapped mode).

How can we reduce the size of our index files? This article describes a few attempts.

1 Numeric data type index optimization

1.1 Numeric type index issues

Lucene is essentially a full-text search engine built on inverted indexes, not a traditional database system. It is ideal for text, but handling numeric types is not its strength.

Consider an application scenario: we index merchants, each merchant has an average price per person, and users want to query merchants whose average price falls within the range 500~1000.

A simple and direct idea is to write the merchant's average price to the inverted index as a string. A range query then takes two steps: 1) traverse the term dictionary for the price field and collect the posting lists (inverted doc-ID lists) of all terms that fall within the range; 2) merge those posting lists. Both steps have performance problems: 1) traversing the term dictionary is brute force, and looking up one posting list per term means far too many term lookups; in Lucene, too many lookups can even throw a "too many boolean clauses" exception; 2) merging the posting lists, which live on disk, is very time consuming.

Of course, one idea is to pad the numbers to a fixed length. Suppose every merchant's average price falls within [0, 10000]; when storing 1, we write 00001 (5 digits) to the inverted index. Because the term dictionary is sorted by string, we no longer have to traverse it: a binary search can quickly locate the terms that fall within the range and their posting lists. However, this still does not solve the problems of too many term lookups and too many posting-list merges. In addition, how much to pad is itself a problem: padding too much wastes space, while padding too little cannot represent large values.
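A minimal sketch of the padding idea (the field name, the padding width, and the use of TermRangeQuery are illustrative assumptions):

// Zero-pad prices so that lexicographic order matches numeric order.
// Width 5 is an illustrative choice that assumes all values fall in [0, 99999].
String paddedPrice = String.format("%05d", 1);    // "00001"
String lower = String.format("%05d", 500);        // "00500"
String upper = String.format("%05d", 1000);       // "01000"
// Because the term dictionary is sorted, a query such as
// TermRangeQuery.newStringRange("price", lower, upper, true, true)
// can binary-search to the first matching term instead of scanning the dictionary.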

1.2 Lucene Solution

To solve this problem, Schindler and Diepenbroek proposed a trie-based approach, published in 2008 in Computers & Geosciences (an SCI journal in geo-information science, impact factor 1.9) and adopted by Lucene since version 2.9. (Schindler, U., Diepenbroek, M., 2008. Generic XML-based framework for metadata portals. Computers & Geosciences 34 (12). Paper: http://epic.awi.de/17813/1/Sch2007br.pdf)

Simply put, the integer 423 is not written to the inverted index directly, but is split into several pieces. Splitting in decimal, 423 becomes the three terms 423, 42, and 4; together these terms form a trie.
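As a minimal sketch of this decimal splitting (the helper is hypothetical, for illustration only):

// Split a value into its decimal prefixes: 423 -> ["423", "42", "4"].
// Taken together, these prefix terms implicitly form the trie.
static java.util.List<String> decimalTrieTerms(int value) {
    java.util.List<String> terms = new java.util.ArrayList<>();
    String s = Integer.toString(value);
    for (int len = s.length(); len >= 1; len--) {
        terms.add(s.substring(0, len));
    }
    return terms;
}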

How does a query work? Suppose we want all docs in the range [422, 642]. At the bottom level of the tree, find the first value greater than 422, i.e., 423, then look for right siblings of 423; finding none, move up and take the right sibling of its parent (finding 44). Symmetrically for 642: take its left sibling (641), then the left sibling of its parent (63), and continue upward until the two sides meet at a common level. In the end only the 6 terms 423, 44, 5, 63, 641, 642 need to be queried. Originally we would have had to look up the posting lists of the 11 terms 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642 and merge all 11 of them; now only 6 lookups and 6 merges are needed. This greatly reduces both the number of lookups and the number of merges, and the larger the query interval, the more pronounced the effect.

This optimization essentially trades space for time: as you can see, the number of terms grows considerably.

In practice, Lucene converts the number to binary, and the trie never needs to be stored as an explicit data structure: a traditional trie node holds pointers to its children (and possibly its parent), whereas here, given any node, its parent and siblings can simply be computed. Lucene exposes the split length through the precisionStep parameter; by default it is 4 for numeric types such as int, double, and float, i.e., the value is split every 4 binary bits. The shorter precisionStep is, the more terms are produced and the faster large range queries become; the longer it is, the fewer terms are produced, and in the extreme case of setting it to infinity no trie splitting happens at all and range queries gain nothing. The right precisionStep has to be tuned against your own business.
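For example, in Lucene 4.x a range query over such a trie-encoded field goes through NumericRangeQuery, which performs the term expansion described above (the field name and bounds are illustrative):

// Matches docs whose price lies in [500, 1000]; the query is expanded into the
// small set of trie terms produced at index time with precisionStep = 4.
Query priceRange = NumericRangeQuery.newIntRange("price", 4, 500, 1000, true, true);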

1.3 Index file size optimization scheme

Many of our fields are numeric, such as id, aveScore (rating), price, and so on, but very few of them are used for range queries; most are used for exact lookups or sorting.

The optimization is therefore very simple: for numeric fields that never need range queries, set precisionStep to Integer.MAX_VALUE, so each number is written to the inverted index as a single term, greatly reducing the total number of terms.

public final class CustomFieldType {
    public static final FieldType INT_TYPE_NOT_STORED_NO_TRIE = new FieldType();
    static {
        INT_TYPE_NOT_STORED_NO_TRIE.setIndexed(true);
        INT_TYPE_NOT_STORED_NO_TRIE.setTokenized(true);
        INT_TYPE_NOT_STORED_NO_TRIE.setOmitNorms(true);
        INT_TYPE_NOT_STORED_NO_TRIE.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);
        INT_TYPE_NOT_STORED_NO_TRIE.setNumericType(FieldType.NumericType.INT);
        INT_TYPE_NOT_STORED_NO_TRIE.setNumericPrecisionStep(Integer.MAX_VALUE); // no trie splitting
        INT_TYPE_NOT_STORED_NO_TRIE.freeze();
    }
}

doc.add(new IntField("price", price, CustomFieldType.INT_TYPE_NOT_STORED_NO_TRIE)); // average price per person
1.4 Effects

The effect of this optimization is obvious: the compressed index package shrank several-fold.

2 Spatial data type index optimization

2.1 Geographic data indexing issues

As before: Lucene, being built on inverted indexes, is good at text but not strong at spatial data.

Consider an application scenario: each merchant has a unique longitude/latitude coordinate (x, y), and a user wants to filter for merchants within 5 kilometers.

An intuitive idea is to write longitude x and latitude y into the inverted index as two numeric fields, then at query time traverse all merchants, compute each one's distance to the user, and keep those within 5 kilometers. The drawbacks are obvious: 1) traversing every merchant is brute force; 2) moreover, computing spherical distance involves many trigonometric functions and is expensive.
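For reference, a minimal sketch of the spherical (haversine) distance computation, which makes the per-merchant trigonometric cost visible:

// Haversine distance in kilometers between two points given in degrees.
// Note how many trigonometric calls are needed per merchant.
static double sphericalDistanceKm(double lat1, double lng1, double lat2, double lng2) {
    final double R = 6371.0; // mean Earth radius in km
    double dLat = Math.toRadians(lat2 - lat1);
    double dLng = Math.toRadians(lng2 - lng1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
             * Math.sin(dLng / 2) * Math.sin(dLng / 2);
    return 2 * R * Math.asin(Math.sqrt(a));
}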

A simple optimization is to first filter merchants with a rectangular bounding box and only compute exact distances for the merchants that pass the filter, keeping those within 5 kilometers. This greatly reduces the amount of distance computation, but still requires traversing all merchants.

2.2 Lucene Solution

Lucene encodes longitude/latitude with the Geohash method (for an introduction see: Geohash). Briefly, Geohash divides space recursively and assigns a code to each subspace. For example, the whole Beijing area might be coded "w"; it is then divided into subspaces, one of which is coded "wx"; "wx" is divided again and its subspaces coded in turn, e.g., "wx4" (simplified here for the sake of understanding).

So how is a longitude/latitude (x, y) written to the inverted index? Suppose the point falls within the "wx4" subspace; then it is written to the inverted index as the three terms "w", "wx", and "wx4".
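A minimal sketch of this prefix expansion (the helper is hypothetical; real indexing goes through Lucene's spatial strategy shown in section 2.3):

// A point whose geohash is "wx4" is indexed as the terms "w", "wx", "wx4".
static java.util.List<String> geohashPrefixTerms(String geohash) {
    java.util.List<String> terms = new java.util.ArrayList<>();
    for (int len = 1; len <= geohash.length(); len++) {
        terms.add(geohash.substring(0, len));
    }
    return terms;
}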

How is a nearby query performed? First, the 5 km neighborhood is divided into grid cells, each of which has a geohash code; these codes are used as terms to query the inverted index. For example, if one grid cell in the 5 km neighborhood has code "wx4", the merchants falling in that subspace can be found directly.

2.3 Index file size optimization scheme

The above method essentially trades space for time: a longitude/latitude (x, y) is only two fields, but geohash encoding produces many terms to write into the inverted index.

Lucene's default maximum geohash length is 24, which means a single longitude/latitude is written to the inverted index as 24 strings. Our initial geohash length was 11, but for our requirements a length of 9 was already sufficient (a geohash of length 9 corresponds to a cell of roughly 5*4 meters).

The following table shows the accuracy of each geohash length, excerpted from Wikipedia: http://en.wikipedia.org/wiki/Geohash

Geohash length | Lat bits | Lng bits | Lat error (°) | Lng error (°) | Km error
1              | 2        | 3        | ±23           | ±23           | ±2500
2              | 5        | 5        | ±2.8          | ±5.6          | ±630
3              | 7        | 8        | ±0.70         | ±0.7          | ±78
4              | 10       | 10       | ±0.087        | ±0.18         | ±20
5              | 12       | 13       | ±0.022        | ±0.022        | ±2.4
6              | 15       | 15       | ±0.0027       | ±0.0055       | ±0.61
7              | 17       | 18       | ±0.00068      | ±0.00068      | ±0.076
8              | 20       | 20       | ±0.000085     | ±0.00017      | ±0.019
private void spatialInit() {
    this.ctx = SpatialContext.GEO; // GEO: coordinates are lat/lng and distances are spherical; otherwise planar Euclidean distance
    int maxLevels = 9; // geohash length 9, about 5*5 m cells; longer lengths raise the query matching cost
    SpatialPrefixTree grid = new GeohashPrefixTree(ctx, maxLevels); // geohash string matching tree
    this.strategy = new RecursivePrefixTreeStrategy(grid, "poi"); // recursive matching
}
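With the strategy above, indexing a merchant's coordinate and issuing a 5 km nearby query could look like the following sketch against the Lucene 4.x spatial API (variable names such as lng, lat, userLng, userLat are illustrative):

// Indexing: the strategy emits the geohash prefix terms for the point.
Point point = ctx.makePoint(lng, lat); // x = longitude, y = latitude
for (Field f : strategy.createIndexableFields(point)) {
    doc.add(f);
}

// Query: merchants intersecting a 5 km circle around the user.
double radiusDegrees = DistanceUtils.dist2Degrees(5, DistanceUtils.EARTH_MEAN_RADIUS_KM);
SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects,
        ctx.makeCircle(userLng, userLat, radiusDegrees));
Query nearby = strategy.makeQuery(args);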

2.4 Effects

The effect of this optimization was not recorded, but the geohash-encoded coordinates account for 25% of all terms, and reducing the geohash length from 11 to 9 cuts those terms by 18%, which amounts to a reduction of roughly 25%*18% = 4.5% of the total number of terms.

3 Index without storing

The two methods above reduce index file size by reducing the number of terms; the following method takes a different route.

After Lucene returns a set of docIds, we normally fetch each document by docId and read the required fields (such as id, average price, etc.) to return to the client. In fact, though, we only need the id: with the ids we can then query the DB/cache for the remaining fields.

The optimization is therefore to store only the required fields such as id, and for most fields to index without storing. This reduced the compressed index file by about 10%.

doc.add(new StringField("price", each, Field.Store.NO)); // indexed for search, but not stored
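By contrast, the id field, which we do need to read back after the search, remains stored (a minimal sketch; the field name is illustrative):

// Only the id is stored; after the search, the ids are used to fetch the
// remaining fields from DB/cache.
doc.add(new StringField("id", id, Field.Store.YES));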

4 Summary

Based on some basic Lucene principles and our own business characteristics, this article optimized the index file size and cut it by more than half.
