How Lucene Works

Source: Internet
Author: User

The goal of full-text search is to quickly retrieve the articles that contain the text a user is looking for. The technique is similar in principle to the index pages of a Chinese dictionary. Suppose we have two articles:

The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.
  
1) Lucene indexes and queries by keyword, so we first need to extract the keywords of these two articles. This usually involves the following steps:
A. The article content is a string, so we first need to find all the words in it, that is, tokenize it. English is easier to handle because words are separated by spaces; Chinese words run together and need special word segmentation.
B. Words such as "in", "once" and "too" carry no real meaning, and the same is usually true of Chinese characters such as "的" and "是". Words that do not represent a concept can be filtered out.
C. A user who searches for "He" usually also wants articles containing "he" or "HE", so all words need to be converted to the same case.
D. A user who searches for "live" usually also wants articles containing "lives" or "lived", so "lives" and "lived" need to be reduced to "live".
E. Punctuation usually does not represent a concept either, so it can be filtered out as well.
In Lucene, these steps are performed by the Analyzer class.
  
After this processing:
All the keywords in article 1 are: [Tom] [live] [Guangzhou] [I] [live] [Guangzhou]
All the keywords in article 2 are: [He] [live] [Shanghai]
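
The analysis steps A to E can be sketched in a few lines of Python. This is only a toy illustration, not Lucene's Analyzer: the function name `analyze`, the stop-word set and the stemming table are assumptions made up for this example, and the keywords come out lowercased (step C).

```python
import re

# Toy stand-ins for what an analyzer does; both tables are illustrative only.
STOP_WORDS = {"in", "once", "too"}
STEMS = {"lives": "live", "lived": "live"}

def analyze(text):
    """Return the keywords of `text` in order of appearance."""
    tokens = re.findall(r"[A-Za-z]+", text)              # A and E: split into words, drop punctuation
    tokens = [t.lower() for t in tokens]                 # C: unify case
    tokens = [t for t in tokens if t not in STOP_WORDS]  # B: drop words without meaning
    return [STEMS.get(t, t) for t in tokens]             # D: reduce to the base form

article1 = "Tom lives in Guangzhou, I live in Guangzhou too."
article2 = "He once lived in Shanghai."
print(analyze(article1))
print(analyze(article2))
```

A real analyzer would use a proper stemming algorithm and a much larger stop-word list; the lookup tables here only cover the two sample articles.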
  
2) With the keywords, we can build an inverted index. The correspondence above maps an "article number" to "all keywords in the article". The inverted index turns this relationship around and maps a "keyword" to "all article numbers that contain that keyword". After inversion, articles 1 and 2 become:
Keyword     Article number
Guangzhou   1
He          2
I           1
Live        1,2
Shanghai    2
Tom         1
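
The inversion can be sketched as follows. This is a minimal in-memory illustration, assuming the keywords have already been extracted and lowercased; `build_inverted_index` is a name chosen for this example, not a Lucene API.

```python
from collections import defaultdict

# Keywords per article, in order of appearance (lowercased).
docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_inverted_index(docs):
    """Map each keyword to the sorted list of article numbers containing it."""
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for kw in keywords:
            index[kw].add(doc_id)
    # Sort keywords alphabetically and article numbers numerically.
    return {kw: sorted(ids) for kw, ids in sorted(index.items())}

inverted = build_inverted_index(docs)
for kw, ids in inverted.items():
    print(kw, ids)
```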
  
Knowing which articles a keyword appears in is usually not enough. We also need to know how many times the keyword occurs in each article and where it occurs. There are usually two kinds of position: a) character position, which records the character offset at which the word starts in the article (the advantage is fast positioning when highlighting keywords); b) keyword position, which records the ordinal of the word among the keywords of the article (the advantage is that it saves index space and makes phrase queries fast). Lucene records the keyword position.
  
With the "occurrence frequency" and "occurrence position" information added, our index structure becomes:

Keyword     Article number [occurrence frequency]   Occurrence positions
Guangzhou   1[2]                                    3,6
He          2[1]                                    1
I           1[1]                                    4
Live        1[2],2[1]                               2,5,2
Shanghai    2[1]                                    3
Tom         1[1]                                    1
  
Take the Live row as an example of how to read this structure. Live occurs 2 times in article 1 and once in article 2, and its positions are "2,5,2". What does that mean? The positions have to be read together with the article numbers and frequencies: Live occurs 2 times in article 1, so "2,5" are its two positions in article 1; it occurs once in article 2, so the remaining "2" means that Live is the 2nd keyword in article 2.
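
A minimal sketch of an index that also records frequency and keyword positions is shown below. The nested dictionary layout and the name `build_positional_index` are assumptions for illustration; positions are 1-based keyword ordinals, as in the table above, and the frequency is simply the length of each position list.

```python
from collections import defaultdict

docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_positional_index(docs):
    """Map keyword -> {article number: [keyword positions]}."""
    index = defaultdict(dict)
    for doc_id, keywords in docs.items():
        for pos, kw in enumerate(keywords, start=1):
            index[kw].setdefault(doc_id, []).append(pos)
    return dict(sorted(index.items()))

positional = build_positional_index(docs)
for kw, postings in positional.items():
    # Print in the "article[frequency] positions" style used in the table.
    row = ", ".join(f"{d}[{len(p)}] {p}" for d, p in postings.items())
    print(kw, row)
```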
  
This is the core of the Lucene index structure. Note that the keywords are kept in alphabetical order (Lucene does not use a B-tree structure), so Lucene can locate a keyword quickly with a binary search.
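
Binary search over a sorted keyword list can be sketched with Python's standard `bisect` module. The list below holds the lowercased keywords of the two sample articles; `find_term` is a helper name invented for this example.

```python
from bisect import bisect_left

# Keywords in alphabetical order, as in the index above.
terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]

def find_term(terms, term):
    """Return the index of `term` in the sorted list, or -1 if it is absent."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return i
    return -1

print(find_term(terms, "live"))
print(find_term(terms, "beijing"))
```

Each lookup halves the remaining search range, so the number of comparisons grows with the logarithm of the number of keywords.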
  
In the implementation, Lucene stores the three columns above as a dictionary file (term dictionary), a frequency file (frequencies) and a position file (positions). The dictionary file not only holds each keyword but also keeps pointers into the frequency file and the position file; following these pointers yields the frequency and position information for the keyword.
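
The relationship between the three files can be sketched with three in-memory lists. This is not Lucene's actual file format: the tuple layout, the stored article count per keyword and the name `read_postings` are assumptions chosen to keep the example small. The "pointers" are plain list offsets.

```python
# Dictionary: (keyword, number of articles, offset into frequencies, offset into positions)
term_dictionary = [
    ("guangzhou", 1, 0, 0),
    ("he",        1, 1, 2),
    ("i",         1, 2, 3),
    ("live",      2, 3, 4),
    ("shanghai",  1, 5, 7),
    ("tom",       1, 6, 8),
]
# Frequency "file": (article number, frequency) pairs, one run per keyword.
frequencies = [(1, 2), (2, 1), (1, 1), (1, 2), (2, 1), (2, 1), (1, 1)]
# Position "file": keyword positions, laid out in the same order.
positions = [3, 6, 1, 4, 2, 5, 2, 3, 1]

def read_postings(keyword):
    """Follow the dictionary pointers and return {article number: [positions]}."""
    for term, doc_count, freq_ptr, pos_ptr in term_dictionary:
        if term == keyword:
            result = {}
            for doc_id, freq in frequencies[freq_ptr:freq_ptr + doc_count]:
                result[doc_id] = positions[pos_ptr:pos_ptr + freq]
                pos_ptr += freq  # the next article's positions follow directly
            return result
    return {}

print(read_postings("live"))
```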
  
Lucene uses the concept of a field to express where a piece of information is located (for example in the title, in the body or in the URL). When the index is built, the field information is also recorded in the dictionary file: every keyword carries field information, because every keyword must belong to one or more fields.
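
One simple way to picture field information is to key the index by a (field, keyword) pair. The sketch below assumes exactly that; the field names and the title values are hypothetical and are not taken from the sample articles.

```python
from collections import defaultdict

# Hypothetical documents: the "title" values are made up for illustration.
docs = {
    1: {"title": ["tom"], "body": ["tom", "live", "guangzhou", "i", "live", "guangzhou"]},
    2: {"title": ["shanghai"], "body": ["he", "live", "shanghai"]},
}

def build_field_index(docs):
    """Map (field, keyword) -> sorted list of article numbers."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, keywords in fields.items():
            for kw in keywords:
                index[(field, kw)].add(doc_id)
    return {key: sorted(ids) for key, ids in sorted(index.items())}

field_index = build_field_index(docs)
print(field_index[("body", "live")])
print(field_index.get(("title", "live"), []))
```

With this layout the same keyword can be searched in the body without matching titles, and the other way around.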
  
To reduce the size of the index files, Lucene compresses the index. First, the keywords in the dictionary file are compressed: each keyword is stored as <prefix length, suffix>, where the prefix is the part shared with the previous keyword. For example, if the current word is "阿拉伯语" ("Arabic language") and the previous word is "阿拉伯" ("Arab"), then "阿拉伯语" is compressed to <3, 语>. Second, numbers are compressed heavily: only the difference from the previous value is stored, which makes the number smaller and so reduces the bytes needed to store it. For example, if the current article number is 16389 (which takes 3 bytes uncompressed) and the previous article number is 16382, only the difference 7 is stored after compression (a single byte).
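
Both ideas can be sketched in Python. The function names are invented for this example, the English word pair is a hypothetical stand-in for the example in the text, and the byte count assumes a variable-length integer encoding that carries 7 payload bits per byte, which is consistent with the 3-byte and 1-byte figures quoted above.

```python
def common_prefix_len(a, b):
    """Number of leading characters that `a` and `b` share."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def prefix_compress(sorted_terms):
    """Encode each term as (shared prefix length, remaining suffix)."""
    out, prev = [], ""
    for term in sorted_terms:
        n = common_prefix_len(prev, term)
        out.append((n, term[n:]))
        prev = term
    return out

def delta_encode(sorted_numbers):
    """Store each number as its difference from the previous one."""
    out, prev = [], 0
    for n in sorted_numbers:
        out.append(n - prev)
        prev = n
    return out

def vint_length(n):
    """Bytes needed when every byte carries 7 bits of the value."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

print(prefix_compress(["arab", "arabic"]))
print(delta_encode([16382, 16389]))
print(vint_length(16389), vint_length(7))
```

Delta encoding pays off because article numbers in a posting list are sorted, so the differences are small and non-negative.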
  
Walking through a query shows why building an index pays off.
Suppose we query for the word "live". Lucene binary-searches the dictionary, finds the word, follows the pointer into the frequency file to read all the article numbers, and returns the results. The dictionary is usually very small, so the whole process takes only milliseconds.
Without an index, an ordinary sequential matching algorithm has to compare the query string against the content of every article. That process is quite slow, and when the number of articles is large the time is often unacceptable.
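
The two approaches can be contrasted with a small sketch. The parallel `terms` and `postings` lists and both function names are assumptions for illustration; the scan uses plain substring matching as a stand-in for sequential matching.

```python
from bisect import bisect_left

articles = {
    1: "Tom lives in Guangzhou, I live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}
# Sorted keywords and, at the same index, the article numbers for each keyword.
terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]
postings = [[1], [2], [1], [1, 2], [2], [1]]

def search_index(keyword):
    """Binary-search the dictionary, then read the article numbers directly."""
    i = bisect_left(terms, keyword)
    if i < len(terms) and terms[i] == keyword:
        return postings[i]
    return []

def search_scan(keyword):
    """Sequential matching: inspect the full text of every article."""
    return [doc_id for doc_id, text in articles.items() if keyword in text.lower()]

print(search_index("live"))
print(search_scan("live"))
```

Both calls find articles 1 and 2 here, but the scan reads every character of every article on each query, while the index lookup touches only a handful of dictionary entries. Substring matching is also cruder: it would match "live" inside an unrelated longer word.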
