"Turn" Lucene working principle

Source: Internet
Author: User

Original link http://www.cnblogs.com/dewin/archive/2009/11/24/1609905.html

Lucene is a high-performance full-text search toolkit written in Java. It is built on an inverted index structure.

The index structure and the algorithm that builds it are described below.

0) Suppose we have two articles, 1 and 2.
The content of article 1 is: Tom lives in Guangzhou,i live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.
  
1) Because Lucene indexes and queries by keyword, we first need to extract the keywords of the two articles. This usually involves the following steps:
A. What we have is the article content, that is, a string. We first need to find all the words in the string, which is called tokenization. English words are separated by spaces, so they are easy to handle. Chinese words run together and need a dedicated word-segmentation step.
B. Words such as "in", "once" and "too" carry no real meaning, and neither do Chinese words such as "的" and "是". Because these words do not represent a concept, they can be filtered out.
C. A user who searches for "he" usually expects articles containing "He" or "HE" to be found as well, so all words need to be converted to the same case.
D. A user who searches for "live" usually expects articles containing "lives" or "lived" to be found as well, so "lives" and "lived" need to be reduced to "live".
E. Punctuation usually does not represent a concept either and can be filtered out.
In Lucene, the steps above are carried out by the Analyzer class.
  
After this processing:
The keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
The keywords of article 2 are: [he] [live] [shanghai]
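
The steps A to E can be sketched in a few lines of Python. This is only an illustration and not Lucene's Analyzer: the stop-word set, the stem table and the function name analyze are assumptions chosen to be just large enough for the two sample articles, and a real analyzer would use a proper stemming algorithm.

```python
import re

# Assumed stop-word list and stem table, just large enough for the two sample articles.
STOP_WORDS = {"in", "once", "too"}
STEMS = {"lives": "live", "lived": "live"}

def analyze(text):
    """Return the list of keywords for one article."""
    # A + E: keep runs of letters only, which splits words and drops punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    keywords = []
    for token in tokens:
        word = token.lower()                    # C: normalize case
        if word in STOP_WORDS:                  # B: drop stop words
            continue
        keywords.append(STEMS.get(word, word))  # D: reduce to the base form
    return keywords

print(analyze("Tom lives in Guangzhou,i live in Guangzhou too."))
print(analyze("He once lived in Shanghai."))
```

Note that the keywords come out in lowercase, because step C is applied before the words are stored.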
  
2) With the keywords in hand, we can build the inverted index. The mapping above goes from "article number" to "all keywords in the article". The inverted index reverses this mapping so that it goes from "keyword" to "all article numbers that contain the keyword". After inversion, articles 1 and 2 become:

keyword     article number
guangzhou   1
he          2
i           1
live        1,2
shanghai    2
tom         1
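
As a rough sketch of this inversion, assuming the keyword lists from step 1 are already available (the names articles, invert and inverted are made up for the example):

```python
from collections import defaultdict

# Keywords per article, as produced by the analysis step (lowercased here).
articles = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def invert(articles):
    """Map each keyword to the sorted list of article numbers that contain it."""
    index = defaultdict(set)
    for number, keywords in articles.items():
        for keyword in keywords:
            index[keyword].add(number)
    # Sort the keywords alphabetically, as in the table.
    return {keyword: sorted(numbers) for keyword, numbers in sorted(index.items())}

inverted = invert(articles)
for keyword, numbers in inverted.items():
    print(keyword, numbers)
```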
  
Knowing which articles a keyword appears in is not enough. We also need to know how many times the keyword appears in each article and where it appears.

There are usually two kinds of position:

a) Character position: the keyword's offset in characters from the start of the article (the advantage is fast positioning when highlighting keywords).

b) Keyword position: the keyword's ordinal among the keywords of the article (the advantages are a smaller index and fast phrase queries). Lucene records this kind of position.
  
After adding the "occurrence frequency" and "occurrence position" information, the index structure becomes:

keyword     article number[occurrence frequency]   occurrence positions
guangzhou   1[2]                                    3,6
he          2[1]                                    1
i           1[1]                                    4
live        1[2],2[1]                               2,5,2
shanghai    2[1]                                    3
tom         1[1]                                    1
  
Take the "live" row as an example of how to read this structure: live appears 2 times in article 1 and once in article 2. What do the positions "2,5,2" mean? They have to be read together with the article numbers and the frequencies. Because live appears 2 times in article 1, "2,5" are its two positions in article 1. It appears once in article 2, so the remaining "2" means that live is the 2nd keyword of article 2.
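
The table with frequencies and positions can be reproduced with a small sketch. The names build_postings and format_row are hypothetical, and positions are counted from 1 over the keyword list of each article, as described above.

```python
articles = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_postings(articles):
    """Map keyword -> {article number: [keyword positions, starting at 1]}."""
    postings = {}
    for number in sorted(articles):
        for position, keyword in enumerate(articles[number], start=1):
            postings.setdefault(keyword, {}).setdefault(number, []).append(position)
    return dict(sorted(postings.items()))

def format_row(keyword, by_article):
    """Render one row as: keyword  article[frequency],...  positions."""
    freqs = ",".join(f"{n}[{len(p)}]" for n, p in by_article.items())
    positions = ",".join(str(x) for p in by_article.values() for x in p)
    return f"{keyword} {freqs} {positions}"

postings = build_postings(articles)
for keyword, by_article in postings.items():
    print(format_row(keyword, by_article))
```

The frequency is simply the length of each position list, which is why the flat "2,5,2" can be split back into "2,5" for article 1 and "2" for article 2.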
  
The above is the core of the Lucene index structure. Note that the keywords are stored in alphabetical order (Lucene does not use a B-tree structure), so Lucene can locate a keyword quickly with a binary search.
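
A minimal sketch of such a lookup over the sorted keyword list, using Python's standard bisect module (the function name find_term is an assumption for the example):

```python
from bisect import bisect_left

# The keywords of the sample index, in alphabetical order.
terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]

def find_term(terms, keyword):
    """Return the index of keyword in the sorted list, or -1 if it is absent."""
    i = bisect_left(terms, keyword)
    if i < len(terms) and terms[i] == keyword:
        return i
    return -1

print(find_term(terms, "live"))
print(find_term(terms, "beijing"))
```

Each lookup halves the remaining range, so the number of comparisons grows with the logarithm of the number of keywords.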
  
In the implementation, Lucene stores the three columns above in a term dictionary file, a frequencies file and a positions file, respectively. The dictionary file holds each keyword together with pointers into the frequency file and the position file, through which the keyword's frequency and position information can be found.
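
The relationship between the three files can be modelled in memory as follows. This is a simplified sketch and not Lucene's on-disk format: the three Python lists stand in for the files, list offsets stand in for the pointers, and the tuple layout and the name read_term are assumptions made for the example.

```python
# keyword -> {article number: [positions]}, already in alphabetical order
postings = {
    "guangzhou": {1: [3, 6]},
    "he": {2: [1]},
    "i": {1: [4]},
    "live": {1: [2, 5], 2: [2]},
    "shanghai": {2: [3]},
    "tom": {1: [1]},
}

dictionary = []   # (keyword, number of articles, frequency pointer, position pointer)
frequencies = []  # flat list of (article number, occurrence frequency)
positions = []    # flat list of keyword positions

for term, by_article in postings.items():
    # The pointers are the current ends of the two flat lists.
    dictionary.append((term, len(by_article), len(frequencies), len(positions)))
    for number, where in by_article.items():
        frequencies.append((number, len(where)))
        positions.extend(where)

def read_term(entry):
    """Follow the pointers of one dictionary entry back to its postings."""
    term, doc_count, freq_ptr, pos_ptr = entry
    result = {}
    for number, count in frequencies[freq_ptr:freq_ptr + doc_count]:
        result[number] = positions[pos_ptr:pos_ptr + count]
        pos_ptr += count
    return result

print(dictionary[3])
print(read_term(dictionary[3]))
```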
  
Lucene uses the concept of a field to express where a piece of information is located (for example in the title, in the body or in the URL). When the index is built, the field information is also recorded in the dictionary file: every keyword carries field information, because every keyword must belong to one or more fields.
  
To reduce the size of the index files, Lucene also compresses the index. First, the keywords in the dictionary file are compressed: each keyword is stored as <prefix length, suffix>, where the prefix is the part shared with the preceding keyword. For example, if the current word is "阿拉伯语" ("Arabic language") and the previous word is "阿拉伯" ("Arab"), then "阿拉伯语" is compressed to <3, 语>.
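
A small sketch of this prefix compression, assuming the keywords are already sorted and the prefix length is counted in characters. The function names are made up for the example, and the pair 阿拉伯 ("Arab") / 阿拉伯语 ("Arabic language") is used as sample data.

```python
def compress_terms(sorted_terms):
    """Encode each term as (length of prefix shared with the previous term, suffix)."""
    compressed = []
    previous = ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(previous), len(term))
        while shared < limit and previous[shared] == term[shared]:
            shared += 1
        compressed.append((shared, term[shared:]))
        previous = term
    return compressed

def decompress_terms(compressed):
    """Rebuild the original terms from (prefix length, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in compressed:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms

pairs = compress_terms(["阿拉伯", "阿拉伯语"])
print(pairs)
print(decompress_terms(pairs))
```

Because neighbouring keywords in a sorted dictionary tend to share long prefixes, most entries shrink to a small number plus a short suffix.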

Second, numbers are compressed heavily: each number is stored only as its difference from the previous value, which makes the numbers smaller and so reduces the bytes needed to store them. For example, if the current article number is 16389 (3 bytes when stored uncompressed) and the previous article number is 16382, only the difference 7 is stored after compression (a single byte).
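
The byte counts in this example can be checked with a short sketch. It assumes a variable-length integer encoding with 7 data bits per byte, where the high bit marks that more bytes follow; the function names delta_encode and vint_bytes are made up for the example.

```python
def delta_encode(numbers):
    """Replace each number in an ascending list by its difference from the previous one."""
    previous = 0
    deltas = []
    for number in numbers:
        deltas.append(number - previous)
        previous = number
    return deltas

def vint_bytes(value):
    """Encode a non-negative int with 7 data bits per byte, low-order bits first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)

print(delta_encode([16382, 16389]))
print(len(vint_bytes(16389)), len(vint_bytes(7)))
```

With 7 data bits per byte, two bytes hold values up to 16383, so 16389 needs a third byte while the difference 7 fits in one.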
  
Next, we can explain why an index is worth building by looking at a query against it.
Suppose we query for the word "live". Lucene first runs a binary search over the dictionary to find the word, follows the pointer into the frequency file to read all the article numbers, and returns the result. The dictionary is usually very small, so the whole process takes only milliseconds.
Without an index, an ordinary sequential matching algorithm has to string-match the content of every article. That process is quite slow, and when the number of articles is large the time taken is often intolerable.
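
The two approaches can be contrasted in a small sketch. The function names are made up for the example, and a plain Python dict stands in for the dictionary lookup described above.

```python
articles = {
    1: "Tom lives in Guangzhou,i live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}

# The inverted index built earlier: keyword -> article numbers.
inverted = {
    "guangzhou": [1], "he": [2], "i": [1],
    "live": [1, 2], "shanghai": [2], "tom": [1],
}

def search_with_index(keyword):
    """One dictionary lookup, independent of how much article text there is."""
    return inverted.get(keyword.lower(), [])

def search_by_scanning(keyword):
    """Sequential matching: every article's full text is scanned on every query."""
    needle = keyword.lower()
    return [n for n, text in articles.items() if needle in text.lower()]

print(search_with_index("live"))
print(search_by_scanning("live"))
```

Both calls find the same articles here, but the scan has to read all of the text for each query, while the indexed lookup only touches the entry for one keyword.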

The relationship and difference between the forward index and the inverted index

A forward index is obtained by tokenizing the text of a page, eliminating noise, removing duplicates and extracting keywords, which yields a set of keywords that reflects the main content of the page. For each keyword, its frequency (number of occurrences), format and position on the page are recorded as well. Each page can thus be recorded as a string of keywords together with weight information such as each keyword's frequency, format and position.

The forward index cannot be used directly for ranking. With only a forward index, the ranking program would have to scan every file in the index library, find the files that contain the keyword, and then compute relevance. That is too much computation to return ranked results in real time.

For this reason the search engine rebuilds the forward index database into an inverted index, turning the mapping from files to keywords into a mapping from keywords to files. In the inverted index the keyword is the primary key, and each keyword maps to the series of files in which it appears. When a user searches for a keyword, the ranking program locates it in the inverted index and immediately finds all the files that contain it.

"Turn" Lucene working principle
