Principles of the Lucene Inverted Index

Source: Internet
Author: User

Lucene is a high-performance, Java-based full-text retrieval toolkit that uses an inverted file index structure. The structure and the algorithm that generates it are as follows:

0) Suppose there are two articles, 1 and 2.
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.

1) Because Lucene indexes and queries by keyword, we first need to obtain the keywords of the two articles. This usually takes the following steps.
A. What we have is the article content, that is, a string. First we need to find all the words in the string, that is, perform word segmentation. English words are easy to handle because they are separated by spaces. Chinese words run together and require special word segmentation.
B. Words such as "in", "once", and "too" carry no practical meaning. Chinese has similar function words that usually have no specific meaning. Words that do not represent concepts can be filtered out.
C. Users usually want a query for "he" to also find articles containing "He" and "HE". Therefore, all words must be converted to the same case.
D. Users usually want a query for "live" to also find articles containing "lives" and "lived". Therefore, "lives" and "lived" need to be reduced to "live".
E. Punctuation marks usually do not represent a concept and can be filtered out.
In Lucene, the above steps are performed by the Analyzer class.

After the above processing:
All the keywords in article 1 are: [Tom] [live] [Guangzhou] [I] [live] [Guangzhou]
All the keywords in article 2 are: [he] [live] [Shanghai]
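
The analysis steps A to E can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's Analyzer: the stop-word set and the stem table are hypothetical and cover just the two sample articles, and the sketch lowercases every keyword, so its output is all lowercase.

```python
import re

# Hypothetical stop-word list and stem table; they only cover the sample articles.
STOP_WORDS = {"in", "once", "too"}
STEMS = {"lives": "live", "lived": "live"}

def analyze(text):
    """Return the list of keywords for one article."""
    # Steps A and E: split into words; punctuation never matches the pattern.
    tokens = re.findall(r"[A-Za-z]+", text)
    keywords = []
    for token in tokens:
        word = token.lower()                    # step C: unify case
        if word in STOP_WORDS:                  # step B: drop stop words
            continue
        keywords.append(STEMS.get(word, word))  # step D: reduce to base form
    return keywords

article1 = "Tom lives in Guangzhou, I live in Guangzhou too."
article2 = "He once lived in Shanghai."

print(analyze(article1))  # ['tom', 'live', 'guangzhou', 'i', 'live', 'guangzhou']
print(analyze(article2))  # ['he', 'live', 'shanghai']
```

A real analyzer would use a proper stemming algorithm and a language-specific tokenizer instead of the two lookup tables used here.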

2) With the keywords, we can create the inverted index. The correspondence above is from "article number" to "all keywords in the article". An inverted index reverses this relationship, mapping each "keyword" to "all article numbers that contain the keyword". Articles 1 and 2 become:
Keyword     Article number
Guangzhou   1
He          2
I           1
Live        1, 2
Shanghai    2
Tom         1
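
As a minimal sketch of this reversal (the function name and the lowercase normalization are assumptions made for illustration), the keyword lists of the two articles can be turned into a keyword-to-article mapping like this:

```python
from collections import defaultdict

# Keyword lists per article number, as produced by the analysis step.
docs = {
    1: ["Tom", "live", "Guangzhou", "I", "live", "Guangzhou"],
    2: ["he", "live", "Shanghai"],
}

def build_inverted_index(docs):
    """Map each keyword to the sorted list of article numbers containing it."""
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            index[keyword.lower()].add(doc_id)
    # Sort the keywords so the result has the same order as the table.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

The loop prints one line per keyword, for example `live [1, 2]`.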

Generally, it is not enough to know which articles a keyword appears in. We also need to know how many times and at which positions it appears in each article. There are usually two kinds of position: a) character position, which records the keyword's character offset in the article (the advantage is that the keyword can be located quickly when results are highlighted); b) keyword position, which records the keyword's ordinal among the article's keywords (the advantage is that it saves index space and makes phrase queries fast). Lucene records the keyword position.

After the "occurrence frequency" and "position" information are added, our index structure becomes:
Keyword     Article number [frequency]   Positions
Guangzhou   1 [2]                        3, 6
He          2 [1]                        1
I           1 [1]                        4
Live        1 [2], 2 [1]                 2, 5, 2
Shanghai    2 [1]                        3
Tom         1 [1]                        1

Take the Live row as an example to illustrate this structure: Live appears twice in article 1 and once in article 2, and its positions are "2, 5, 2". What does this mean? The positions have to be read together with the article numbers and frequencies. Live appears twice in article 1, so "2, 5" are its two positions in article 1; it appears once in article 2, so the remaining "2" means that Live is the 2nd keyword in article 2.
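
The way a flat position list such as "2, 5, 2" is read back with the help of the frequencies can be sketched as follows. The function names are hypothetical, and positions are counted from 1 as in the table.

```python
docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_postings(docs):
    """Map keyword -> {article number: [1-based keyword positions]}."""
    postings = {}
    for doc_id in sorted(docs):
        for position, keyword in enumerate(docs[doc_id], start=1):
            postings.setdefault(keyword, {}).setdefault(doc_id, []).append(position)
    return postings

def flatten(term_postings):
    """Return one table row: [(article, frequency), ...] and a flat position list."""
    freqs = [(doc, len(pos)) for doc, pos in sorted(term_postings.items())]
    positions = [p for _, pos in sorted(term_postings.items()) for p in pos]
    return freqs, positions

def split_positions(freqs, positions):
    """Use each frequency to cut the flat list back into per-article positions."""
    result, start = {}, 0
    for doc, freq in freqs:
        result[doc] = positions[start:start + freq]
        start += freq
    return result

freqs, positions = flatten(build_postings(docs)["live"])
print(freqs)      # [(1, 2), (2, 1)]
print(positions)  # [2, 5, 2]
print(split_positions(freqs, positions))  # {1: [2, 5], 2: [2]}
```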

The above is the core of the Lucene index structure. Notice that the keywords are stored in character order (Lucene does not use a B-tree structure), so Lucene can use binary search to locate a keyword quickly.
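
Binary search over a sorted keyword list can be sketched with Python's standard `bisect` module. This only illustrates the idea described above; it is not how Lucene reads its term dictionary from disk, and `find_term` is a hypothetical helper.

```python
from bisect import bisect_left

# Keywords of the sample index, kept in sorted (character) order.
terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]

def find_term(terms, term):
    """Return the index of term in the sorted list, or -1 if it is absent."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return i
    return -1

print(find_term(terms, "live"))     # 3
print(find_term(terms, "beijing"))  # -1
```

Each lookup needs only about log2(n) comparisons for n keywords, which is why keeping the dictionary sorted matters.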

In the implementation, Lucene saves the three columns above as a dictionary file (term dictionary), a frequency file (frequencies), and a position file (positions). The dictionary file stores each keyword together with pointers into the frequency file and the position file; following these pointers yields the keyword's frequency and position information.
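
The relationship between the three files can be illustrated with three in-memory lists, where list offsets stand in for file pointers. The layout below is an assumption made for illustration and does not reproduce Lucene's actual file formats.

```python
# Sample postings: keyword -> {article number: [keyword positions]}.
postings = {
    "guangzhou": {1: [3, 6]},
    "he": {2: [1]},
    "i": {1: [4]},
    "live": {1: [2, 5], 2: [2]},
    "shanghai": {2: [3]},
    "tom": {1: [1]},
}

dictionary = {}   # keyword -> (frequency offset, article count, position offset)
frequencies = []  # stands in for the frequency file: (article, frequency) pairs
positions = []    # stands in for the position file: a flat list of positions

for term in sorted(postings):
    dictionary[term] = (len(frequencies), len(postings[term]), len(positions))
    for doc, pos in sorted(postings[term].items()):
        frequencies.append((doc, len(pos)))
        positions.extend(pos)

def read_postings(term):
    """Follow the dictionary pointers to recover {article: [positions]}."""
    freq_offset, doc_count, pos_offset = dictionary[term]
    result = {}
    for doc, freq in frequencies[freq_offset:freq_offset + doc_count]:
        result[doc] = positions[pos_offset:pos_offset + freq]
        pos_offset += freq
    return result

print(dictionary["live"])     # (3, 2, 4)
print(read_postings("live"))  # {1: [2, 5], 2: [2]}
```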

Lucene uses the concept of a field to express where information is located (for example in the title, the body, or the URL). When the index is built, the field information is also recorded in the dictionary file: each keyword carries field information, because every keyword must belong to one or more fields.

To reduce the size of the index files, Lucene also compresses the index. First, the keywords in the dictionary file are compressed into <prefix length, suffix> pairs. For example, if the current word is "阿拉伯语" (Arabic language) and the previous word is "阿拉伯" (Arabia), then "阿拉伯语" is compressed to <3, 语>. Second, numbers are compressed heavily: only the difference between a number and the previous value is saved, which makes the number smaller and so reduces the bytes needed to store it. For example, if the current article number is 16389 (3 bytes without compression) and the previous article number is 16382, the value saved after compression is 7 (only one byte).
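
Both compression ideas can be sketched briefly. The term list is a hypothetical English example, and the byte encoding assumes a variable-length integer with 7 payload bits per byte, which is consistent with 16389 needing 3 bytes and 7 needing 1 byte.

```python
def front_code(sorted_terms):
    """Encode each term as (shared prefix length with previous term, suffix)."""
    encoded, prev = [], ""
    for term in sorted_terms:
        n = 0
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1
        encoded.append((n, term[n:]))
        prev = term
    return encoded

def delta_encode(sorted_numbers):
    """Store each number as its difference from the previous one."""
    deltas, prev = [], 0
    for number in sorted_numbers:
        deltas.append(number - prev)
        prev = number
    return deltas

def vint(n):
    """Variable-length encoding: 7 bits per byte, high bit set if more bytes follow."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

print(front_code(["live", "lived", "lives"]))  # [(0, 'live'), (4, 'd'), (4, 's')]
print(delta_encode([16382, 16389]))            # [16382, 7]
print(len(vint(16389)), len(vint(7)))          # 3 1
```

Delta encoding pays off because article numbers in a posting list are sorted, so the differences are usually much smaller than the numbers themselves.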

Next, by looking at how a query runs against the index, we can explain why the index is built.
Suppose you query the word "live". Lucene first uses binary search to find the word in the dictionary, follows the pointer into the frequency file to read all the article numbers, and then returns the result. The dictionary is usually very small, so the whole process takes only milliseconds.
Without an index, an ordinary sequential matching algorithm has to run string matching against the content of every article. This is quite slow, and when the number of articles is large the time is often intolerable.
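
The difference between the two approaches can be illustrated on the sample articles. This is a toy comparison with hypothetical function names; the index is written out by hand and the scan uses plain substring matching.

```python
articles = {
    1: "Tom lives in Guangzhou, I live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}

# Hand-written inverted index for the sample articles.
index = {
    "guangzhou": [1],
    "he": [2],
    "i": [1],
    "live": [1, 2],
    "shanghai": [2],
    "tom": [1],
}

def search_with_index(term):
    """One dictionary lookup, regardless of how much text the articles contain."""
    return index.get(term.lower(), [])

def search_by_scanning(word):
    """String matching against the full content of every article."""
    word = word.lower()
    return [doc for doc, text in sorted(articles.items()) if word in text.lower()]

print(search_with_index("live"))   # [1, 2]
print(search_by_scanning("live"))  # [1, 2]
```

Both calls return the same articles here, but the scan reads every character of every article on each query, while the indexed search only touches one dictionary entry.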

Http://www.lucene.com.cn/yanli.htm
