Lucene Underlying Principles and Optimization Experience Sharing (1): Lucene Introduction and Indexing Principles


We built our own full-text retrieval system on top of the Lucene search engine; it handles retrieval over petabytes of data and trillions of records. Here we share our research into Lucene's internals along with some optimization experience.
The material is organized into two parts:
1. Lucene introduction and indexing principles
2. Summary of Lucene optimization experience

1. Lucene Introduction and indexing principle

This part covers three topics: an introduction to Lucene, indexing principles, and Lucene's index implementation.

1.1 About Lucene

Lucene was originally developed by the famous Doug Cutting and open-sourced in 2000; it remains the leading open-source full-text search solution today. In brief, its features are: pure Java implementation, open source, high performance, functional completeness, and easy extensibility. Its functional completeness shows in its support for tokenization, various query types (prefix, fuzzy, regular expression, etc.), scoring and highlighting, columnar storage (DocValues), and so on.
Although Lucene has been under development for more than ten years, it still sees active development to meet the growing demand for data analysis. The latest 6.0 release introduced block KD-trees, comprehensively improving retrieval performance for numeric and location data. Elasticsearch, the distributed retrieval and analysis system built on Lucene, is also thriving, and Elasticsearch is already used in our project.
Overall usage of Lucene:

The code below illustrates the four-step process:

    IndexWriter iw = new IndexWriter(...);                      // 1. create the IndexWriter
    Document doc = new Document();
    doc.add(new StringField("name", "Donald Trump", Field.Store.YES)); // build the document
    iw.addDocument(doc);                                        // 2. add it to the index library
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
    IndexSearcher searcher = new IndexSearcher(reader);         // 3. open the index
    Query query = parser.parse("name:trump");                   // parse the query
    TopDocs results = searcher.search(query, 100);              // retrieve the first 100 document numbers
    for (ScoreDoc hit : results.scoreDocs) {
        Document d = searcher.doc(hit.doc);                     // 4. fetch the actual document
    }

Lucene is easy to use, but only by understanding the principles behind it can we use it well. Next we introduce general retrieval principles and Lucene's implementation details.

1.2 Indexing principle

Full-text retrieval technology has a long history. The vast majority of implementations are based on the inverted index, though other schemes such as file fingerprints have existed. The inverted index, as the name implies, is the inverse of "which words does this document contain": it starts from the word and records which documents each word appears in. It consists of two parts: the dictionary and the posting lists (inverted lists).
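The core idea can be sketched in a few lines of Java. This is a toy model, not Lucene's implementation: documents are identified by array position, terms are whitespace-split, and the index is a plain map from term to a sorted list of document ids.

```java
import java.util.*;

public class InvertedIndexDemo {
    // Build a toy inverted index: term -> sorted list of doc ids containing it.
    static Map<String, List<Integer>> build(String[] docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            // TreeSet deduplicates terms within one document
            for (String term : new TreeSet<>(Arrays.asList(docs[docId].toLowerCase().split("\\s+")))) {
                index.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] docs = {"lucene is a search library", "lucene builds an inverted index"};
        Map<String, List<Integer>> index = build(docs);
        System.out.println(index.get("lucene")); // → [0, 1]
        System.out.println(index.get("search")); // → [0]
    }
}
```

Because doc ids are appended in increasing order, each posting list comes out sorted for free, which is what makes the merge and compression techniques discussed later possible.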

The dictionary structure is particularly important. Many dictionary structures exist, each with its own advantages and disadvantages: the simplest is a sorted array, searched by binary search; a hash table is faster; for on-disk lookup there are B-trees and B+ trees. But an inverted index that must support terabytes of data needs a dictionary structure that strikes a balance between time and space. Below are the pros and cons of some common dictionary structures:

Among the viable options are: the B+ tree, the skip list, and the FST.
B+ tree:
MySQL's InnoDB uses a B+ tree structure.

    Theoretical basis: balanced multi-way search tree. Advantages: works as an on-disk index, updatable. Disadvantages: takes a lot of space, not fast enough.

Skip list:

    Advantages: simple structure, controllable skip interval and number of levels. Lucene used a skip-list structure before 3.0 and later switched to the FST, but skip lists are still used elsewhere in Lucene, for example in posting-list merging and the document-number index. Disadvantages: poor support for fuzzy queries.

FST:
The index structure Lucene uses now.

Theoretical basis: "Direct construction of minimal acyclic subsequential transducers" — a minimal directed acyclic graph is built from sorted input strings. Advantages: low memory footprint (compression ratios generally between 3x and 20x), good fuzzy-query support, fast lookup. Disadvantages: complex structure, requires sorted input, hard to update. Lucene has an FST implementation; viewed from its external interface it closely resembles a Map, with lookup and iteration:
    String[] inputs = {"abc", "abd", "acf", "acg"};  // keys
    long[] outputs = {1, 3, 5, 7};                   // values
    FST<Long> fst = new FST<>();
    for (int i = 0; i < inputs.length; i++) {
        fst.add(inputs[i], outputs[i]);
    }
    // lookup
    Long value = fst.get("abd");                     // gets 3
    // iteration
    BytesRefFSTEnum<Long> iterator = new BytesRefFSTEnum<>(fst);
    while (iterator.next() != null) { ... }
Performance test with 1,000,000 entries:

    Data structure        HashMap   TreeMap   FST
    Build time (ms)       185       500       1512
    Query all keys (ms)   106       218       890

As you can see, the gap between FST and HashMap query performance is not large, but FST has an unmatched advantage: its memory footprint is tiny, only about a tenth of HashMap's. This matters greatly for large-scale retrieval; after all, speed is useless if the data cannot fit in memory.
A qualified dictionary structure therefore requires:
1. Fast queries.
2. Low memory consumption.
3. A combination of memory and disk.
Later we will analyze Lucene's index structure, focusing on the features of Lucene's FST implementation to illustrate these three points.

1.3 Lucene Index Implementation
*(The discussion of Lucene internals in this article is based on version 4.10.3.)*

After many years of evolution and optimization, Lucene's index file structure can basically be divided into three parts: the dictionary, the posting lists, and the forward file.

Here is a detailed description of each part's structure:

Index structure

The data structure Lucene now uses is the FST, whose characteristics are:
1. Term lookup complexity is O(len(str)).
2. Prefixes are shared, saving space.
3. The prefix index is stored in memory; suffix word blocks are stored on disk.
This matches the three requirements for a dictionary structure mentioned earlier: 1. fast queries; 2. low memory consumption; 3. a combination of memory and disk. Suppose we insert four terms into the index library — abd, abe, acf, acg — and look at the resulting index file contents.

In the .tip file, each column (field) has its own FST index, so there may be multiple FSTs; each FST stores prefixes and pointers to suffix word blocks. Here the prefixes are a, ab, ac. The .tim file contains the suffix blocks and other per-term information, such as posting-list pointers and tf/df; the .doc file holds the posting list of each term.
Retrieval is thus a three-step process:
1. Load the .tip file into memory and match the prefix through the FST to find the location of the suffix word block.
2. Using that location, read the suffix block from the .tim file on disk and find the suffix and the corresponding posting-list position.
3. Load the posting list from the .doc file according to that position.
Two questions remain here: first, how is the prefix computed; second, how are suffixes written to disk and located through the FST? Below we describe how Lucene builds the FST.
As noted, the FST requires sorted input, so Lucene pre-sorts the parsed document terms before building the FST. Assuming the input is abd, abe, acf, acg, the whole build process is as follows:

1. When abd is inserted, there is no output. 2. When abe is inserted, the prefix ab is computed, but at this point we do not yet know whether more words with prefix ab will follow, so nothing is output. 3. When acf is inserted, because the input is sorted we know there will be no more words with prefix ab, so the .tip and .tim files can now be written: .tim gets the suffix blocks d, e and their posting-list positions ip_d, ip_e; .tip gets a, b and the position of the suffix block for prefix ab (in reality more information is written, such as term frequency). 4. When acg is inserted, the shared prefix ac with acf is computed; the input then ends and all remaining data is written to disk: .tim gets the suffix blocks f, g and their corresponding posting-list positions, and .tip gets c and the position of the suffix block for prefix ac.
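The grouping step above can be sketched in Java. This is a deliberately simplified model (not Lucene's code): it groups consecutive sorted terms by a fixed-length prefix, where each group corresponds to one suffix block flushed to the .tim file and one prefix entry in the .tip index; real Lucene computes prefixes dynamically and byte by byte.

```java
import java.util.*;

public class FstBuildSketch {
    // Simplified model of .tip/.tim writing: collect the suffixes of
    // consecutive sorted terms under a fixed-length prefix; each map entry
    // stands for one suffix block (.tim) indexed by its prefix (.tip).
    static Map<String, List<String>> buildBlocks(List<String> sortedTerms, int prefixLen) {
        Map<String, List<String>> tim = new LinkedHashMap<>();
        for (String term : sortedTerms) {
            String prefix = term.substring(0, Math.min(prefixLen, term.length()));
            tim.computeIfAbsent(prefix, p -> new ArrayList<>())
               .add(term.substring(prefix.length())); // store only the suffix
        }
        return tim;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("abd", "abe", "acf", "acg");
        System.out.println(buildBlocks(terms, 2));
        // → {ab=[d, e], ac=[f, g]} : suffix blocks d,e and f,g indexed by ab / ac
    }
}
```

Because the input is sorted, a prefix group is complete as soon as a term with a different prefix arrives, which is exactly why the real builder can flush blocks to disk incrementally.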

The above is a simplified process. The main optimization strategies in Lucene's FST implementation are:

1. Minimum suffix count. Lucene requires a minimum number of suffixes (default 25) before a prefix is written to .tip; this further reduces memory usage. With a threshold of 25, the prefixes ab and ac in our example would not exist: there would be only a root node, and abd, abe, acf, acg would all be stored as suffixes in the .tim file. For one of our 10 GB index libraries, the in-memory index consumes only about 20 MB. 2. Prefixes are computed on bytes rather than chars, which reduces the number of distinct suffixes and prevents too many suffixes from hurting performance. For example, for the three Chinese characters 宇 (e9 b8 a2), 守 (e9 b8 a3), 安 (e9 b8 a4), the FST is not built with only a root node and the three characters as suffixes; instead, starting from the encoded bytes, e9 and b8 become prefixes and a2, a3, a4 become suffixes, as follows:

Posting list structure

A posting list is a collection of document numbers, but how it is stored and read deserves great care. The posting-list structure Lucene now uses is called Frame of Reference (FOR), and it has two main features:
1. Data compression. You can see how six numbers are compressed from the original 24 bytes down to 7 bytes.

2. Skip lists to speed up merging. Boolean queries with AND and OR operations need to merge posting lists, which requires quickly locating the same document number, so a skip list is used for that same-document-number lookup.
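The compression in feature 1 can be illustrated with a short sketch. This is a simplified model of Frame of Reference, and the doc ids below are illustrative: sorted doc ids are delta-encoded, and then each gap only needs as many bits as the largest gap in the block, instead of a full 32-bit int.

```java
public class FrameOfReferenceSketch {
    // Delta-encode an ascending list of doc ids into gaps.
    static int[] deltas(int[] docIds) {
        int[] d = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            d[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return d;
    }

    // Bits needed to represent the largest value in the block.
    static int bitsRequired(int[] values) {
        int max = 0;
        for (int v : values) max = Math.max(max, v);
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    public static void main(String[] args) {
        int[] docIds = {73, 300, 302, 332, 343, 372};  // 6 ints = 24 bytes raw
        int[] gaps = deltas(docIds);                    // → 73, 227, 2, 30, 11, 29
        int bits = bitsRequired(gaps);                  // → 8 bits per gap
        // 6 gaps * 8 bits = 6 bytes, plus 1 header byte for the bit width = 7 bytes
        System.out.println(bits);
    }
}
```

This matches the 24-bytes-to-7-bytes figure mentioned above: six packed 8-bit gaps plus one byte recording the bit width per block.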
For this section you can refer to a blog post from Elasticsearch, which includes some performance tests:
Elasticsearch posting lists

Forward file

The forward file holds the original documents. Lucene also provides storage for original documents, characterized by block-based storage plus compression. The .fdt file stores the original documents and occupies about 90% of the index library's disk space. The .fdx file serves as its index, allowing a document's position to be fetched quickly from its document number (an auto-incrementing number). Their file structure is as follows:
  
The .fnm file stores meta information such as column (field) type, column name, and storage method.
The .fdt file holds the document values; a chunk inside it is one block. When Lucene indexes a document it first caches it, and once the cache exceeds 16 KB the buffered documents are compressed and stored. A chunk contains its starting document number, the number of documents it holds, and the compressed document content.
The .fdx file is the document-number index. The posting lists store document numbers, and the .fdx is used to quickly locate a document's position, i.e. its chunk's position. Its index structure is relatively simple, a skip-list-like structure: every 1024 chunks are grouped into one block, and each block records its starting document value; a block is thus equivalent to one level of a skip list.
Looking up a document is therefore a three-step process:
First, find which block the document belongs to and locate the block.
Second, using the starting document numbers within the block, find which chunk the document belongs to and the chunk's position.
Third, load that chunk from the .fdt file and find the document. One more detail here: the chunk starting document values and chunk positions are not stored as plain arrays but with an average-based compression scheme. The starting document value of the nth chunk is restored as DocBase + AvgChunkDocs * n + DocBaseDeltas[n], and its position in the .fdt file as StartPointerBase + AvgChunkSize * n + StartPointerDeltas[n].
As the analysis above shows, Lucene stores the original documents row-wise, and to improve space utilization it compresses multiple documents together, so reading one document requires reading and decompressing neighboring documents as well. Fetching documents therefore relies heavily on random I/O, and although Lucene lets you fetch a specific column (field), the storage structure shows that this does not reduce document-fetch time.

That concludes the introduction to Lucene's internals. The next article will describe how we started from these internals to optimize our full-text retrieval system.

