Lucene is a high-performance Java full-text retrieval toolkit that uses an inverted-file index structure. The structure and the algorithm that builds it are described below.
Suppose we have two articles:
Article 1: Tom lives in Guangzhou, I live in Guangzhou too.
Article 2: He once lived in Shanghai.
Because Lucene indexes and queries by keyword, we first need to extract the keywords of the two articles. The processing steps are as follows:
A. Word segmentation. English is easy to tokenize because words are separated by spaces. Chinese characters run together and require a dedicated word-segmentation step.
B. The words "in", "once", and "too" in the articles carry no real meaning; likewise, Chinese function words such as "的" (of) and "是" (is) usually carry none. Such words, which do not represent concepts, can be filtered out as stop words.
C. Words are reduced to a single case (typically lowercase), so matching is case-insensitive.
D. Users who query for "live" usually also want articles containing "lives" and "lived", so these forms are stemmed back to "live".
E. Punctuation marks can be filtered out of the articles because they do not represent concepts.
In Lucene, the steps above are performed by the Analyzer class.
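Steps A–E can be sketched in plain Java. This is a minimal illustration of the analysis pipeline, not Lucene's actual Analyzer implementation; the stop-word list and the lives/lived stemming rule are hard-coded just for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy analyzer: tokenize on non-letters (which also drops punctuation),
// lowercase, remove stop words, and crudely stem lives/lived back to live.
public class SimpleAnalyzer {
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("in", "once", "too"));

    public static List<String> analyze(String text) {
        List<String> keywords = new ArrayList<>();
        // Splitting on non-letter characters filters punctuation (step E).
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) continue;
            String term = token.toLowerCase();          // step C
            if (STOP_WORDS.contains(term)) continue;    // step B
            if (term.equals("lives") || term.equals("lived")) {
                term = "live";                          // step D (toy stemming)
            }
            keywords.add(term);
        }
        return keywords;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Tom lives in Guangzhou, I live in Guangzhou too."));
        System.out.println(analyze("He once lived in Shanghai."));
    }
}
```

Running this on the two articles reproduces the keyword lists shown next.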
After this processing:
The keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
The keywords of article 2 are: [he] [live] [shanghai]
With the keywords extracted, we can build the inverted index. The mapping so far is "article number" → "all keywords in that article". An inverted index reverses this relationship into "keyword" → "all article numbers containing that keyword".
Inverting articles 1 and 2 gives:
Keyword      Article No.
guangzhou    1
he           2
i            1
live         1, 2
shanghai     2
tom          1
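The inversion step itself is simple to sketch. The following is an illustrative sketch (not Lucene's implementation): it takes the already-analyzed keyword lists keyed by article number and produces the keyword → article-numbers table above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Invert "doc -> keywords" into "keyword -> sorted list of doc numbers".
public class InvertedIndex {
    public static Map<String, List<Integer>> build(Map<Integer, List<String>> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            // TreeSet deduplicates repeated keywords within one document.
            for (String term : new TreeSet<>(doc.getValue())) {
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> docs = new TreeMap<>();
        docs.put(1, Arrays.asList("tom", "live", "guangzhou", "i", "live", "guangzhou"));
        docs.put(2, Arrays.asList("he", "live", "shanghai"));
        System.out.println(build(docs)); // "live" maps to [1, 2], etc.
    }
}
```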
Usually it is not enough to know which articles a keyword appears in; we also want to know how often and where it appears in each article. There are two common kinds of position: a) character position, i.e. the keyword's character offset within the article (advantage: the keyword can be located quickly when highlighting matches); b) keyword position, i.e. the keyword's ordinal among the article's keywords (advantages: it saves index space, and phrase queries are fast). Lucene records keyword positions.
After adding the "occurrence frequency" and "occurrence position" information, our index structure becomes:
Keyword      Article No. [Frequency]    Position
guangzhou    1 [2]                      3, 6
he           2 [1]                      1
i            1 [1]                      4
live         1 [2], 2 [1]               2, 5, 2
shanghai     2 [1]                      3
tom          1 [1]                      1
Take the "live" row as an example. "live" appears twice in article 1 and once in article 2, and its position list reads "2, 5, 2". What does this mean? The positions must be read together with the document numbers and frequencies: article 1 has frequency 2, so the first two positions (2 and 5) belong to it, and the remaining position (2) belongs to article 2, which has frequency 1.
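The structure above can be sketched as a positional index. This is an illustrative model, not Lucene's on-disk format: each keyword maps to a list of postings, and each posting holds a document number, the frequency (derived from the position count), and the 1-based keyword positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Positional index sketch: keyword -> list of (doc, frequency, positions).
public class PositionalIndex {
    static class Posting {
        final int doc;
        final List<Integer> positions = new ArrayList<>();
        Posting(int doc) { this.doc = doc; }
        @Override public String toString() {
            return doc + "[" + positions.size() + "]" + positions;
        }
    }

    public static Map<String, List<Posting>> build(Map<Integer, List<String>> docs) {
        Map<String, List<Posting>> index = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            for (int pos = 0; pos < doc.getValue().size(); pos++) {
                String term = doc.getValue().get(pos);
                List<Posting> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
                // Start a new posting when we see this term in a new document.
                if (postings.isEmpty() || postings.get(postings.size() - 1).doc != doc.getKey()) {
                    postings.add(new Posting(doc.getKey()));
                }
                postings.get(postings.size() - 1).positions.add(pos + 1); // 1-based
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> docs = new TreeMap<>();
        docs.put(1, Arrays.asList("tom", "live", "guangzhou", "i", "live", "guangzhou"));
        docs.put(2, Arrays.asList("he", "live", "shanghai"));
        // Prints the "live" row: positions 2 and 5 in article 1, position 2 in article 2.
        System.out.println(build(docs).get("live"));
    }
}
```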
In the implementation, Lucene stores the three columns above in separate files: the term dictionary, the frequency file, and the position file. The dictionary file stores not only each keyword but also pointers into the frequency file and the position file, through which the keyword's frequency and position information can be found.
Lucene uses the concept of a field to express where a piece of text came from (such as the title, the body, or a URL). During indexing, field information is also recorded in the dictionary file; each keyword carries field information, because every keyword must belong to one or more fields.
To reduce the size of the index files, Lucene also compresses the index. First, the keywords in the dictionary file are prefix-compressed: each keyword is stored as <prefix length, suffix>, where the prefix is the part shared with the previous keyword. For example, if the previous word is "Arab" and the current word is "Arabic", then "Arabic" is compressed to <4, ic>. Second, large numbers are delta-encoded: only the difference between a number and the previous value is stored, which shortens the numbers and thus reduces the bytes needed to store them. For example, if the current article number is 16389 (which would need 3 bytes uncompressed) and the previous article number is 16382, then only the difference 7 is stored, which fits in a single byte.
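Both compression ideas are easy to demonstrate. The sketch below is illustrative only (Lucene's actual file format uses variable-length byte encodings on top of these ideas): one method prefix-compresses a term against its predecessor, the other delta-encodes a sorted list of document numbers.

```java
import java.util.Arrays;

// Sketch of the two compression techniques described above.
public class IndexCompression {
    // Encode `current` against `previous` as "<prefix length, suffix>".
    public static String prefixCompress(String previous, String current) {
        int p = 0;
        while (p < previous.length() && p < current.length()
                && previous.charAt(p) == current.charAt(p)) p++;
        return "<" + p + ", " + current.substring(p) + ">";
    }

    // Delta-encode: store only differences between consecutive doc numbers.
    public static int[] deltaEncode(int[] docNumbers) {
        int[] deltas = new int[docNumbers.length];
        int previous = 0;
        for (int i = 0; i < docNumbers.length; i++) {
            deltas[i] = docNumbers[i] - previous;
            previous = docNumbers[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        System.out.println(prefixCompress("Arab", "Arabic"));   // <4, ic>
        // 16389 needs 3 bytes raw; the delta 16389 - 16382 = 7 fits in 1 byte.
        System.out.println(Arrays.toString(deltaEncode(new int[]{16382, 16389})));
    }
}
```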
With this structure in hand, we can explain, by walking through a query, why the index is worth building.
Suppose we query for the word "live". Lucene binary-searches the dictionary for the word, follows its pointer into the frequency file to read out all the article numbers, and returns the result. Because the dictionary is usually very small, the whole process takes only milliseconds.
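The query path can be sketched as follows. In this toy model (an assumption for illustration, not Lucene's implementation), the sorted dictionary is a string array, and the "pointer" into the frequency file is simply an index into a parallel postings array.

```java
import java.util.Arrays;

// Toy lookup: binary-search the sorted dictionary, then follow the
// term's "pointer" (here, its index) to the posting list of doc numbers.
public class DictionaryLookup {
    static final String[] TERMS = {"guangzhou", "he", "i", "live", "shanghai", "tom"};
    static final int[][] POSTINGS = {{1}, {2}, {1}, {1, 2}, {2}, {1}};

    public static int[] search(String term) {
        int i = Arrays.binarySearch(TERMS, term); // dictionary is sorted
        return i >= 0 ? POSTINGS[i] : new int[0]; // follow pointer to postings
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(search("live"))); // [1, 2]
    }
}
```

Binary search keeps the dictionary lookup at O(log n) comparisons, which is why even large dictionaries can be probed quickly.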
By contrast, an ordinary sequential matching algorithm, which builds no index and instead performs string matching against the content of every article, is quite slow; when the number of articles is large, the time required is often intolerable.