Lucene Inverted Index Principle

Last Update:2018-12-07 Source: Internet

Author: User


Lucene is a high-performance Java full-text retrieval toolkit that uses an inverted file index structure. The structure and the algorithm that generates it are as follows:

0) Suppose there are two articles, 1 and 2.
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.

1) Because Lucene indexes and queries by keyword, we first need to obtain the keywords of the two articles. This usually takes the following steps.
A. What we have is the content of an article, that is, a string. First we need to find all the words in the string, that is, perform word segmentation. English is easier to process because words are separated by spaces. Chinese words run together and require special word segmentation.
B. Words such as "in", "once", and "too" carry no practical meaning, and certain common Chinese words likewise have no specific meaning. Words that do not represent a concept can be filtered out.
C. Users usually want a query for "He" to also find articles containing "he" and "HE". Therefore, all words must be converted to the same case.
D. Users usually want a query for "live" to also find articles containing "lives" and "lived". Therefore, "lives" and "lived" must be reduced to "live".
E. Punctuation marks do not represent a concept either, so they can be filtered out.
In Lucene, the steps above are carried out by the Analyzer class.

After the above processing:
All the keywords in article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All the keywords in article 2 are: [he] [live] [shanghai]
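
Steps A to E can be sketched as a toy analysis pipeline. This is only an illustration in Python, not Lucene's Analyzer: the stop-word list and the lookup-table "stemmer" (`STOP_WORDS`, `STEMS`) are assumptions made up for these two articles, and keywords are shown in lowercase.

```python
import re

# Assumed for illustration: a tiny stop-word list and a lookup-table stemmer.
STOP_WORDS = {"in", "once", "too"}
STEMS = {"lives": "live", "lived": "live"}

def analyze(text):
    """Return the keywords of `text` after steps A to E."""
    # A + E: split into words; punctuation is dropped because it never matches.
    tokens = re.findall(r"[A-Za-z]+", text)
    keywords = []
    for token in tokens:
        word = token.lower()                    # C: normalize case
        if word in STOP_WORDS:                  # B: filter out stop words
            continue
        keywords.append(STEMS.get(word, word))  # D: reduce to the root form
    return keywords

print(analyze("Tom lives in Guangzhou, I live in Guangzhou too."))
print(analyze("He once lived in Shanghai."))
```

A real analyzer would use a proper stemming algorithm and a language-specific tokenizer instead of the hard-coded tables used here.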

2) With the keywords, we can build the inverted index. The correspondence above maps an "article number" to "all keywords in the article". An inverted index reverses this relationship: it maps a "keyword" to "the numbers of all articles that contain the keyword". Articles 1 and 2 then become:
Keyword      Article number
guangzhou    1
he           2
i            1
live         1, 2
shanghai     2
tom          1
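
The reversal can be sketched in a few lines of Python. The function name `build_inverted_index` and the in-memory dictionaries are assumptions for illustration; the keyword lists are the ones derived above, written in lowercase.

```python
from collections import defaultdict

# Keywords per article, as derived above (lowercased for illustration).
docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_inverted_index(docs):
    """Map each keyword to the sorted list of articles that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    # Sort keywords and article numbers so the result is deterministic.
    return {kw: sorted(ids) for kw, ids in sorted(index.items())}

inverted = build_inverted_index(docs)
for keyword, doc_ids in inverted.items():
    print(keyword, doc_ids)
```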

Knowing only which articles a keyword appears in is usually not enough. We also need to know how many times and at which positions the keyword appears in each article. There are usually two kinds of position: a) the character position, which records the character offset of the keyword in the article (the advantage is that the keyword can be located quickly when it is highlighted); b) the keyword position, which records the ordinal of the keyword among the keywords of the article (the advantage is that it saves index space and makes phrase queries fast). Lucene records the keyword position.

After the "occurrence frequency" and "occurrence position" information is added, our index structure becomes:
Keyword      Article number [frequency]    Positions
guangzhou    1 [2]                         3, 6
he           2 [1]                         1
i            1 [1]                         4
live         1 [2], 2 [1]                  2, 5, 2
shanghai     2 [1]                         3
tom          1 [1]                         1

Take the "live" row as an example of how to read this structure. "live" appears twice in article 1 and once in article 2, and its positions are "2, 5, 2". What does this mean? The positions must be read together with the article numbers and frequencies. Because "live" appears twice in article 1, the first two values, "2, 5", are its two positions in article 1. Because it appears once in article 2, the remaining "2" means that "live" is the 2nd keyword in article 2.
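
A small Python sketch can reproduce this row. It is an in-memory illustration, not Lucene's on-disk format; `build_positional_index` and `describe` are hypothetical helper names, and the keyword lists are written in lowercase.

```python
docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_positional_index(docs):
    """Map keyword -> {article number: [keyword positions, counted from 1]}."""
    index = {}
    for doc_id in sorted(docs):
        for position, keyword in enumerate(docs[doc_id], start=1):
            index.setdefault(keyword, {}).setdefault(doc_id, []).append(position)
    return dict(sorted(index.items()))

def describe(postings):
    """Format one entry as 'article [frequency], ...' followed by the flat positions."""
    docs_part = ", ".join(f"{d} [{len(p)}]" for d, p in postings.items())
    positions = ", ".join(str(x) for p in postings.values() for x in p)
    return f"{docs_part} {positions}"

positional = build_positional_index(docs)
print("live", describe(positional["live"]))
```

The frequency is simply the length of each position list, which is why the flat list "2, 5, 2" can only be split correctly once the frequencies are known.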

The above is the core of the Lucene index structure. Note that the keywords are arranged in character order (Lucene does not use a B-tree structure), so Lucene can use binary search to locate a keyword quickly.
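
As a minimal sketch of this lookup, the following Python uses the standard `bisect` module on a sorted keyword list. The list `terms` and the function `find_term` are assumptions for illustration and say nothing about how Lucene lays out its dictionary internally.

```python
from bisect import bisect_left

# The keywords of the example index, already in character order.
terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]

def find_term(terms, keyword):
    """Binary-search the sorted keyword list; return the index, or -1 if absent."""
    i = bisect_left(terms, keyword)
    if i < len(terms) and terms[i] == keyword:
        return i
    return -1

print(find_term(terms, "live"))
print(find_term(terms, "beijing"))
```

Each lookup inspects only about log2(n) keywords, which is why keeping the dictionary sorted is enough to make it fast.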

In the implementation, Lucene saves the three columns above as a dictionary file (term dictionary), a frequency file (frequencies), and a position file (positions). The dictionary file stores each keyword together with pointers into the frequency file and the position file. Following these pointers yields the frequency and position information of the keyword.
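
The pointer idea can be sketched with three Python structures standing in for the three files. The layout below (list offsets as "pointers", a document count stored in the dictionary) is an assumption chosen for readability and is not Lucene's actual file format.

```python
# Position "file": all positions, flattened in keyword order, then article order.
positions_file = [3, 6, 1, 4, 2, 5, 2, 3, 1]

# Frequency "file": (article number, frequency) pairs in keyword order.
frequency_file = [(1, 2), (2, 1), (1, 1), (1, 2), (2, 1), (2, 1), (1, 1)]

# Dictionary "file": keyword -> (frequency pointer, article count, position pointer).
dictionary = {
    "guangzhou": (0, 1, 0),
    "he":        (1, 1, 2),
    "i":         (2, 1, 3),
    "live":      (3, 2, 4),
    "shanghai":  (5, 1, 7),
    "tom":       (6, 1, 8),
}

def lookup(term):
    """Follow the dictionary pointers to recover {article number: positions}."""
    if term not in dictionary:
        return {}
    freq_ptr, doc_count, pos_ptr = dictionary[term]
    result = {}
    for doc_id, freq in frequency_file[freq_ptr:freq_ptr + doc_count]:
        result[doc_id] = positions_file[pos_ptr:pos_ptr + freq]
        pos_ptr += freq  # the next article's positions follow immediately
    return result

print(lookup("live"))
```

Only the small dictionary has to be searched; the larger frequency and position data are read only at the offsets the dictionary points to.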

Lucene uses the concept of a field to express where a piece of information is located (for example, in the title, the body, or the URL). When the index is built, the field information is also recorded in the dictionary file: every keyword carries field information, because every keyword must belong to one or more fields.

To reduce the size of the index files, Lucene also compresses the index. First, the keywords in the dictionary file are compressed: each keyword is stored as <prefix length, suffix>, where the prefix length is the number of leading characters it shares with the previous keyword. For example, if the current word is "阿拉伯语" (Arabic language) and the previous word is "阿拉伯" (Arabia), then "阿拉伯语" is compressed to <3, 语>. Second, numbers, which make up most of the data, are compressed by saving only the difference from the previous value (this makes the numbers smaller and so reduces the bytes needed to store them). For example, if the current article number is 16389 (which needs 3 bytes without compression) and the previous article number is 16382, the value saved after compression is 7 (which needs only one byte).
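
Both techniques can be sketched in Python. The function names are hypothetical, the word pair "index"/"indexing" is an illustrative stand-in for two adjacent dictionary entries, and the byte encoding is a generic variable-length integer (7 data bits per byte, high bit set when more bytes follow) assumed here to show where the byte counts come from.

```python
def front_code(previous, current):
    """Prefix compression: return (shared prefix length, remaining suffix)."""
    n = 0
    while n < min(len(previous), len(current)) and previous[n] == current[n]:
        n += 1
    return n, current[n:]

def delta_encode(numbers):
    """Replace each number in an increasing list by its gap from the previous one."""
    previous = 0
    deltas = []
    for value in numbers:
        deltas.append(value - previous)
        previous = value
    return deltas

def vint(n):
    """Encode a non-negative integer in 7-bit groups, low-order group first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)

print(front_code("index", "indexing"))
print(delta_encode([16382, 16389]))
print(len(vint(16389)), len(vint(7)))
```

With 7 data bits per byte, two bytes cover values only up to 16383, so 16389 needs three bytes while the gap 7 fits in one.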

Next, by looking at how a query uses the index, we can explain why the index is built in the first place.
Suppose you query the word "live". Lucene first binary-searches the dictionary and finds the word, follows the pointer into the frequency file to read all the article numbers, and then returns the result. The dictionary is usually very small, so the whole process takes only milliseconds.
Without an index, an ordinary sequential matching algorithm has to run string matching over the content of every article. This is quite slow, and when the number of articles is large the time taken is often intolerable.
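
The contrast can be sketched in Python with the two example articles. Both functions are illustrative assumptions: `sequential_search` stands for the index-free approach, and `indexed_search` looks the word up in a prebuilt keyword-to-articles mapping (shown in lowercase).

```python
articles = {
    1: "Tom lives in Guangzhou, I live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}

# Prebuilt inverted index for the same two articles.
index = {
    "guangzhou": [1], "he": [2], "i": [1],
    "live": [1, 2], "shanghai": [2], "tom": [1],
}

def sequential_search(articles, word):
    """Scan the full text of every article; the cost grows with total text size."""
    return [doc_id for doc_id, text in sorted(articles.items())
            if word in text.lower()]

def indexed_search(index, word):
    """Look the word up directly; the cost does not depend on article length."""
    return index.get(word, [])

print(sequential_search(articles, "live"))
print(indexed_search(index, "live"))
```

Both calls return the same articles here, but the sequential version must read every character of every article on each query, and its plain substring test would also match unrelated words that merely contain "live".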

Http://www.lucene.com.cn/yanli.htm

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
