Lucene Index Structure Improvement: Supporting Billion-Scale Index Retrieval on a Single Machine

Source: Internet
Author: User

Glossary:

Lucene: a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search engine toolkit; that is, it is not a complete full-text search engine but a full-text search engine framework. It provides a complete query engine and indexing engine, plus some text analyzers (for two Western languages, English and German). Lucene aims to give software developers a simple, easy-to-use toolkit for conveniently adding full-text retrieval to a target system, or for building a complete full-text retrieval engine on top of it.

Binary Search: also known as half-interval search. Its advantages are few comparisons, fast queries, and good average performance; its disadvantage is that the table being searched must be ordered, which makes insertion and deletion difficult. Binary search therefore suits ordered lists that are searched frequently but seldom change. Assume the elements are arranged in ascending order. Compare the record at the middle of the table with the search key: if they are equal, the search succeeds; otherwise the middle position splits the table into two sub-tables. If the middle record's key is greater than the search key, search the lower sub-table; otherwise search the upper sub-table. Repeat until a matching record is found (the search succeeds) or the sub-table becomes empty (the search fails).
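The procedure above can be sketched in a few lines. This is a generic illustration of the algorithm, not Lucene code:

```python
def binary_search(sorted_list, target):
    """Return the index of target in sorted_list, or -1 if absent."""
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2           # compare the middle element
        if sorted_list[mid] == target:
            return mid                 # found: the search succeeds
        elif sorted_list[mid] < target:
            lo = mid + 1               # target lies in the upper sub-table
        else:
            hi = mid - 1               # target lies in the lower sub-table
    return -1                          # sub-table is empty: the search fails

# Each comparison halves the range, so at most about log2(n) comparisons.
print(binary_search([2, 5, 8, 12, 16, 23, 38], 23))  # → 5
```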

Inverted index: (English: Inverted index), also known as a reverse index or postings file, is an index method used in full-text search to store the mapping from a word to the document, or set of documents, that contains it. It is the most common data structure in document retrieval systems.

The standalone version of Lucene can only handle indexes on the order of millions to tens of millions of terms. By modifying the index structure, it can support retrieval over more than one billion terms.

At present, Lucene must read the entire index file into memory on first load. When the data volume is large and the index file is correspondingly large, this becomes a memory bottleneck.

 

 

Lucene uses an inverted index to search documents. Assume three documents exist:

People's Republic of China

People's Heroes

Chinese food

During indexing, Lucene inverts the three documents according to the tokenization results to form an inverted table (the .tis file):

China {1, 3}

People {1, 2}

Republic {1}

Hero {2}

Food {3}

In this way, when a user searches for the keyword "China", the entry "China {1, 3}" tells us that the keyword appears in documents 1 and 3, so documents 1 and 3 are returned to the user.
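The inverted table above can be built with a tiny sketch. This is illustrative only; Lucene's tokenizer and file formats are far more involved, and the tokenization shown is an assumption for the example:

```python
# Build an inverted index: term -> set of document IDs containing it.
docs = {
    1: ["China", "People", "Republic"],   # "People's Republic of China"
    2: ["People", "Hero"],                # "People's Heroes"
    3: ["China", "Food"],                 # "Chinese food"
}

inverted = {}
for doc_id, terms in docs.items():
    for term in terms:
        inverted.setdefault(term, set()).add(doc_id)

# A search for "China" reads the posting list {1, 3}.
print(sorted(inverted["China"]))  # → [1, 3]
```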

We call each word in the inverted table a term. To find out which documents the term "People" corresponds to, we must first locate "People" in the inverted table; only then can we read the document list ({1, 2}) for that term. So the first step of a query is to find the term, that is, to locate its position.

Taking the preceding inverted table as an example, assume the offset of "China" in the inverted table is 1, "People" is 2, "Republic" is 3, "Hero" is 4, and "Food" is 5.

Here, the offset is a position from the start of the file. Once the offset is known, a seek operation can move the file pointer directly to the target term, after which the document IDs, that is, the query result, can be read.

It is also important that when Lucene creates an index, the terms in the inverted table are ordered.

Lucene indexes terms with a skip table whose interval is 128. The principle is as follows:

Because Lucene's terms are ordered, Lucene stores a subset of key terms in an index file. Suppose the inverted table contains 1280 terms in total; then the 1st, 129th, 257th, ..., 1281st terms, 11 key terms in all, are stored together with their offsets in the index file (.tii).

Before term retrieval, Lucene loads all the key terms from the .tii file into memory as an array, which is likewise ordered. To retrieve a term, we first find, in memory, the pair of adjacent key terms that bracket it; because everything is ordered, the target term must lie between them. Then, starting from the offset of the smaller of the two key terms, we scan forward in the inverted table (.tis) file. Because the skip interval is 128, the term is found in at most 128 reads.
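A minimal in-memory sketch of this skip-table lookup (illustrative; real Lucene keeps the key terms in the .tii file and scans the .tis file on disk, and the exact key-term count depends on off-by-one conventions):

```python
import bisect

INTERVAL = 128

def build_skip_table(sorted_terms):
    """Keep every 128th term (the 1st, 129th, 257th, ...) with its position."""
    return [(i, sorted_terms[i]) for i in range(0, len(sorted_terms), INTERVAL)]

def lookup(sorted_terms, skip_table, target):
    """Find target: locate the bracketing key term, then scan <= 128 entries."""
    keys = [t for _, t in skip_table]
    # Index of the last key term <= target (the "smaller" key term).
    k = bisect.bisect_right(keys, target) - 1
    if k < 0:
        return -1
    start = skip_table[k][0]
    for i in range(start, min(start + INTERVAL, len(sorted_terms))):
        if sorted_terms[i] == target:
            return i
    return -1

terms = [f"term{i:05d}" for i in range(1280)]  # 1280 ordered terms
skips = build_skip_table(terms)
print(lookup(terms, skips, "term00500"))  # → 500
```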

Problems with the 128 skip table

Before any term search, the .tii index file must be fully loaded into memory. With few terms this is fine, but with very many, say one billion, it can consume tens of GB of memory (depending on term length). Ordinary physical machines do not have that much memory, so the program crashes and Lucene cannot search at all. In addition, loading the index takes considerable time. If developers use Lucene improperly, opening and closing it on every request, the index file is loaded over and over, which also hurts search performance.

How can this problem be solved?

A notable feature of Lucene's inverted table (.tis) file is that the terms are ordered and each offset is a fixed-length long, which makes the structure well suited to binary search (half-interval search).

Change the .tii index file to store only the offset of each term in the Lucene inverted table (.tis). From an offset, we can read the value of the corresponding term in the .tis file.

Suppose the inverted table contains 1000 terms in total and the target term is the 501st. Binary search first compares the term at the middle (halfway) position, the 500th, and finds that the target must lie in the range 500 to 1000. It then halves that range, narrowing to 500 ~ 750, then 500 ~ 625, 500 ~ 564, 500 ~ 532, 500 ~ 516, and finally finds the offset of the target term.

 

In this process, the index file is never loaded into memory, so memory dependence is low. Even with 10 billion terms, the worst case is only 34 halvings.
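A sketch of the on-disk binary search. Assumptions for illustration: the replacement .tii holds one fixed-length 8-byte big-endian offset per term, and a hypothetical read_term function decodes the term stored at a given offset in the .tis file (both simulated with in-memory buffers here):

```python
import io
import struct

RECORD = 8  # each entry in the new .tii is one fixed-length long: a .tis offset

def seek_read_offset(tii, n):
    """Seek to the n-th fixed-length entry and read the stored offset."""
    tii.seek(n * RECORD)
    return struct.unpack(">q", tii.read(RECORD))[0]

def find_term(tii, num_terms, read_term, target):
    """Binary search over the offset file; the file is never fully loaded."""
    lo, hi = 0, num_terms - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        off = seek_read_offset(tii, mid)   # one seek per halving
        term = read_term(off)              # read the term out of the .tis file
        if term == target:
            return off
        elif term < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Simulated files: terms stored back-to-back in a ".tis" string,
# their offsets in a parallel fixed-length ".tii" byte buffer.
terms = sorted(["china", "food", "hero", "people", "republic"])
tis, offsets, pos = "", [], 0
for t in terms:
    offsets.append(pos)
    tis += t
    pos += len(t)
tii = io.BytesIO(b"".join(struct.pack(">q", o) for o in offsets))
lengths = {o: len(t) for o, t in zip(offsets, terms)}
read_term = lambda off: tis[off:off + lengths[off]]

print(find_term(tii, len(terms), read_term, "hero"))  # → 9 (offset of "hero")
```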

In addition, because of how halving works, the 1/2, 1/4, 1/8, 1/16, ... positions are high-hit points. These can be preloaded into memory according to the physical machine's capacity. Compared with the 128 skip table, memory is spent only on these high-hit regions, so memory utilization is much better. If 16384 positions are cached, 14 seeks are saved per lookup, so 10 billion terms need only about 20 file seeks. By contrast, the old 128 skip table needs up to 128 reads in the worst case and 64 on average, far more than the binary-search seek count. And since the high-hit points are cacheable, the seek count can be reduced even further.

 

Problems and detail optimizations

1. As the number of terms grows, the number of seeks caused by binary search grows too: roughly 12 seeks for 10 thousand terms, 15 for 100 thousand, 20 for 1 million, 23 for 10 million, 26 for 100 million, 29 for 1 billion, and 34 for 10 billion.
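The growth is logarithmic: binary search over n terms needs about ceil(log2 n) seeks in the worst case. A quick check (the article's figures for the smaller sizes run slightly lower, presumably reflecting caching of high-hit points and other implementation details):

```python
import math

# Worst-case halvings needed to isolate one entry among n.
for n in (10_000, 1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n}: {math.ceil(math.log2(n))} seeks")
# ceil(log2(1e10)) = 34, matching the 10-billion figure in the text.
```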

2. Term compression originally stored each record as a difference from the previous record; now complete values are stored at key points and only the entries between key points store differences (this lowers the compression ratio, but the binary-search method requires it).

3. If a binary search narrows the range to fewer than 128 entries, keep the original linked-list layout and switch to a sequential scan. Although this increases the number of reads, it matches the physical characteristics of the disk: the operating system maintains a file buffer, and reading continuous data is faster than repeatedly jumping with seek, since physical hard disks favor sequential reads. In this way, one scan of 1 ~ 128 consecutive entries replaces roughly the last 6 ~ 7 jumping seeks.
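Point 3 can be sketched as a hybrid search (illustrative; the 128-entry cutoff mirrors the skip interval, and the in-memory list stands in for the on-disk linked list):

```python
CUTOFF = 128  # below this range size, sequential reads beat jumping seeks

def hybrid_search(sorted_terms, target):
    """Binary search until the range is < 128 entries, then scan linearly."""
    lo, hi = 0, len(sorted_terms) - 1
    while hi - lo >= CUTOFF:           # each iteration costs one disk seek
        mid = (lo + hi) // 2
        if sorted_terms[mid] <= target:
            lo = mid
        else:
            hi = mid - 1
    # Final stretch: one continuous scan instead of ~7 more jumping seeks,
    # which plays to the OS file buffer and the disk's sequential strengths.
    for i in range(lo, hi + 1):
        if sorted_terms[i] == target:
            return i
    return -1

terms = [f"t{i:06d}" for i in range(100_000)]
print(hybrid_search(terms, "t054321"))  # → 54321
```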

4. Because norms also consume a lot of memory, norms are disabled here when the index is created. This remains to be improved; Lucene itself has the same problem, but the same binary-search approach can solve it by not loading all the data into memory.

 

1. Lucene's TermInfosReader uses the skip table with an interval of 128 (the original example illustration is not reproduced here).


During retrieval, the skip table must be loaded into memory. Query speed is satisfactory for tens of millions of terms (though each lookup still needs 1 ~ 128 reads), but when the number of terms reaches hundreds of millions, the table may exceed the physical memory of a single machine. Today the industry almost universally goes distributed, splitting the data into multiple indexes to shorten the skip table. Yet a single machine can also support retrieval over billions of keys and values: a fixed-length data type can be used for the offsets, and because Lucene's TermInfos structure is ordered, it supports binary search. Combined with a cache, this does not consume memory the way the skip table does; because the number of seeks drops, retrieval time improves; and the number of terms is effectively unlimited, bounded only by disk size.

At present, Lucene must read the entire index file on first load, so developers generally have to keep a long-lived connection open, which demands experience. The binary-search index file does not suffer from this problem.

 

 

 

The following table compares index creation and query performance for 1 million to 1 billion MD5 values.

Read time is the total time to query 100,000 MD5 records, in milliseconds (the per-record column is read time divided by 100,000).

Index creation time is the time to build the full index, in milliseconds.

 

Number of records | Read time (ms) | Time per record (ms) | Index creation time (ms) | Total index size | .tii file size
1 million         | 13667          | 0.13667              | 14338                    | 87.6 MB          | 7.62 MB
2 million         | 14400          | 0.144                | 25508                    | 175 MB           | 15.2 MB
10 million        | 20234          | 0.20234              | 120262                   | 4.26 GB          | 381 MB
100 million       | 2289399        | 22.89399             | 1360215                  | 8.51 GB          | 762 MB
500 million       | 3793413        | 37.93413             | 12249876                 | 42.6 GB          | 3.72 GB
1 billion         | 5063614        | 50.63614             | 27365596                 | 85.2 GB          | 7.45 GB

 

 

Overview of Lucene's compression algorithms

Lucene compresses data when creating indexes.

For string types

The first record in the file's linked list stores the complete value; each subsequent record stores only its difference from the previous record.

For example, if the first record is abcdefg and the second is abcdefh, they differ only in the final character, so the second record stores just the length of the shared prefix plus the differing characters: 6 + h.

Because the terms in a Lucene index are sorted, the compression ratio is considerable.
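The prefix compression described above can be sketched like this (illustrative; Lucene's actual .tis encoding also stores document frequencies and other fields alongside each term):

```python
def compress(sorted_terms):
    """Store each term as (shared-prefix length, suffix) vs. the previous term."""
    out, prev = [], ""
    for term in sorted_terms:
        shared = 0
        while shared < min(len(prev), len(term)) and prev[shared] == term[shared]:
            shared += 1
        out.append((shared, term[shared:]))  # abcdefh after abcdefg -> (6, "h")
        prev = term
    return out

def decompress(records):
    """Rebuild the full terms from the (prefix length, suffix) records."""
    terms, prev = [], ""
    for shared, suffix in records:
        prev = prev[:shared] + suffix
        terms.append(prev)
    return terms

data = ["abcdefg", "abcdefh", "abcxyz"]
packed = compress(data)
print(packed)  # → [(0, 'abcdefg'), (6, 'h'), (3, 'xyz')]
```

Because sorted terms tend to share long prefixes, the stored suffixes stay short, which is why sorting makes the compression ratio considerable.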

For numeric types

Compression of file offsets works like string compression, except that the stored value is the numeric difference from the previous record. If the first record is 8 and the second is 9, the second stores only 9 - 8 = 1, which is shorter than storing the full value. Lucene builds indexes by appending, so offsets grow monotonically and this compression is very effective.

Compression changes in the new index

The Lucene compression method above cannot be used unchanged with binary search, so some compression ratio must be given up. Instead, we define key points: a complete value is stored at each key point, and the entries after it store only their differences from it. This closely resembles key frames in video compression, where key frames store the complete image and subsequent frames store only the changes. It also reduces the cost of computing differences (insignificant for Lucene, and not the main problem solved here), but the compression ratio decreases.
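A sketch of the key-point scheme for numeric offsets (illustrative; the 128-entry spacing is an assumption mirroring the skip interval). Full values live at key points, deltas in between, so decoding any entry needs only its own key-point block; a binary search can jump straight to a key point the way a video player seeks to a key frame:

```python
INTERVAL = 128  # assumed key-point spacing

def compress_offsets(offsets):
    """Full value at every key point; delta from the previous value elsewhere."""
    out = []
    for i, off in enumerate(offsets):
        if i % INTERVAL == 0:
            out.append(off)                   # key point: complete information
        else:
            out.append(off - offsets[i - 1])  # other entries: difference only
    return out

def read_at(compressed, i):
    """Decode entry i starting from its key point -- no earlier data needed."""
    base = (i // INTERVAL) * INTERVAL
    value = compressed[base]
    for j in range(base + 1, i + 1):
        value += compressed[j]
    return value

offsets = [k * 7 for k in range(1000)]  # monotonically growing offsets
packed = compress_offsets(offsets)
print(read_at(packed, 500))             # → 3500
```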
