1. Introduction
The inverted index arises from the practical need to find records by the value of an attribute. Each entry in the index table holds an attribute value and the addresses of all records that have that value. Because the position of a record is determined from the attribute value, rather than the attribute value being determined from the record, the structure is called an inverted index. A file with an inverted index is called an inverted file.
In an inverted file, the indexed objects are the words in a document or a collection of documents, and the index stores where each word occurs in that document or collection. It is the most commonly used indexing mechanism for document retrieval.
The key step in a search engine is building an inverted index. An inverted index entry generally consists of a keyword followed by its frequency (the number of occurrences) and its locations (which article or web page it appears in, along with related information such as date and author). This is equivalent to building an index over the hundreds of millions of pages on the Internet, much like the table of contents and tabs of a book. A reader who wants the chapters on a particular topic can go straight to the relevant pages from the table of contents, instead of searching page by page from the first page of the book to the last.
2. Lucene Inverted Index Principle
Lucene is an open-source, high-performance full-text search toolkit written in Java. It is not a complete full-text search engine but a full-text search engine architecture that provides a complete query engine and indexing engine, plus part of a text analysis engine. Its aim is to give software developers an easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.
Lucene uses an inverted file index structure. The structure and the algorithm that generates it are as follows.
Suppose there are two articles, 1 and 2:
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.
<1> Get Keywords
Since Lucene indexes and queries by keyword, we first need to extract the keywords of both articles. This usually involves the following steps:
A. What we have is the article content, that is, a string. We first need to find all the words in the string, which is called tokenization (word segmentation). English is easier to handle because words are separated by spaces. Chinese words run together and require special word segmentation.
B. Words such as "in", "once" and "too" in the articles carry no practical meaning, and Chinese words such as "的" and "是" ("is") usually have no specific meaning either. Words that do not represent a concept can be filtered out.
C. Users usually expect a search for "He" to also find articles that contain "he" or "HE", so all words need to be converted to a uniform case.
D. Users usually expect a search for "live" to also find articles that contain "lives" or "lived", so "lives" and "lived" need to be reduced to "live".
E. Punctuation in an article usually does not represent a concept, so it can be filtered out as well.
In Lucene, the steps above are carried out by the Analyzer class. After this processing:
All the keywords in article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All the keywords in article 2 are: [he] [live] [shanghai]
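Steps A to E can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's Analyzer: the stop-word set and the stem table are hypothetical and cover just the two sample articles, and the function name `analyze` is made up for this example.

```python
import re

# Hypothetical stop-word list and stem table, sized for the two sample articles only.
STOP_WORDS = {"in", "once", "too"}
STEMS = {"lives": "live", "lived": "live"}

def analyze(text):
    """Return the list of keywords for one article."""
    # Steps A and E: keep runs of letters, which splits words and drops punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    keywords = []
    for token in tokens:
        word = token.lower()                    # Step C: uniform case
        if word in STOP_WORDS:                  # Step B: filter out stop words
            continue
        keywords.append(STEMS.get(word, word))  # Step D: reduce to the base form
    return keywords

article1 = "Tom lives in Guangzhou, I live in Guangzhou too."
article2 = "He once lived in Shanghai."
print(analyze(article1))  # ['tom', 'live', 'guangzhou', 'i', 'live', 'guangzhou']
print(analyze(article2))  # ['he', 'live', 'shanghai']
```

A real analyzer would use a proper stemming algorithm and a much larger stop-word list; the lookup table here only keeps the sketch short.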
<2> Set Up the Inverted Index
With the keywords in hand, we can build the inverted index. The correspondence above maps an article number to all the keywords in that article. The inverted index reverses this relationship: it maps a keyword to all the article numbers that contain that keyword.
After inversion, articles 1 and 2 become:
Keyword      Article numbers
guangzhou    1
he           2
i            1
live         1,2
shanghai     2
tom          1
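The inversion itself is a small transformation. The following sketch assumes the keyword lists produced by the analysis step are already available; the name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

# Keyword lists per article number, as produced by the analysis step.
articles = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_inverted_index(articles):
    """Map each keyword to the sorted list of article numbers containing it."""
    index = defaultdict(set)
    for number, keywords in articles.items():
        for keyword in keywords:
            index[keyword].add(number)
    # Sort the keywords alphabetically and the article numbers numerically.
    return {keyword: sorted(numbers) for keyword, numbers in sorted(index.items())}

index = build_inverted_index(articles)
for keyword, numbers in index.items():
    print(keyword, numbers)
```

Using a set per keyword means a word that occurs twice in the same article, such as "live" in article 1, is recorded only once at this stage.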
Knowing only which articles a keyword appears in is often not enough. We also need to know how many times the keyword occurs in each article and where it occurs. There are usually two kinds of position:
A. Character position: the word starts at the n-th character of the article (the advantage is fast positioning when highlighting keywords).
B. Keyword position: the word is the n-th keyword in the article (the advantage is a smaller index and fast phrase queries). This is the position that Lucene records.
With the "occurrence frequency" and "occurrence position" information added, our index structure becomes:
Keyword      Article number[frequency]    Positions
guangzhou    1[2]                         3,6
he           2[1]                         1
i            1[1]                         4
live         1[2]                         2,5
             2[1]                         2
shanghai     2[1]                         3
tom          1[1]                         1
Take the live row as an example of how to read the structure. live appears 2 times in article 1 and once in article 2, and its positions are listed as "2,5,2". What does this mean? It has to be read together with the article numbers and frequencies. Article 1 has 2 occurrences, so "2,5" gives the two positions of live in article 1. Article 2 has 1 occurrence, so the remaining "2" means that live is the 2nd keyword in article 2.
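The frequency and position columns can be derived in one pass over the keyword lists. This is a minimal sketch with a hypothetical function name, using keyword positions that start at 1 as in the table above.

```python
# Keyword lists per article number, as produced by the analysis step.
articles = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_positional_index(articles):
    """Map keyword -> {article number: [keyword positions, counted from 1]}."""
    index = {}
    for number in sorted(articles):
        for position, keyword in enumerate(articles[number], start=1):
            index.setdefault(keyword, {}).setdefault(number, []).append(position)
    return dict(sorted(index.items()))

index = build_positional_index(articles)
for keyword, postings in index.items():
    for number, positions in postings.items():
        # The frequency is simply the number of recorded positions.
        print(f"{keyword:<10} {number}[{len(positions)}]  {positions}")
```

The frequency does not need to be stored separately in this in-memory form, because it equals the length of each position list.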
This is the core of the Lucene index structure. Note that the keywords are arranged in alphabetical order (Lucene does not use a B-tree structure), so Lucene can locate a keyword quickly with a binary search.
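Because the keywords are sorted, a lookup needs only a logarithmic number of comparisons. The sketch below uses Python's standard `bisect` module on an in-memory list; `find_term` is a hypothetical helper, not a Lucene function.

```python
from bisect import bisect_left

# Keywords in alphabetical order, as in the index above.
terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]

def find_term(terms, keyword):
    """Return the index of keyword in the sorted list, or -1 if it is absent."""
    i = bisect_left(terms, keyword)
    if i < len(terms) and terms[i] == keyword:
        return i
    return -1

print(find_term(terms, "live"))     # 3
print(find_term(terms, "beijing"))  # -1
```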
<3> Implementation
In its implementation, Lucene saves the three columns above as a dictionary file (term dictionary), a frequency file (frequencies) and a position file (positions). The dictionary file holds each keyword together with pointers into the frequency file and the position file, and through these pointers the frequency and position information of the keyword can be found.
Lucene also uses the concept of a field, which expresses where a piece of information is located (in the title, in the body, in the URL). When the index is built, the field information is recorded in the dictionary file as well, so every keyword carries field information (because every keyword must belong to one or more fields).
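The relationship between the three files can be pictured with three plain lists. This is a simplified sketch and an assumption about the layout, not Lucene's actual file format: each dictionary entry here stores the keyword, the number of articles, and one pointer (a list offset) into each of the other two lists.

```python
# Positional index from the worked example: keyword -> {article number: positions}.
postings = {
    "guangzhou": {1: [3, 6]},
    "he": {2: [1]},
    "i": {1: [4]},
    "live": {1: [2, 5], 2: [2]},
    "shanghai": {2: [3]},
    "tom": {1: [1]},
}

dictionary = []   # (keyword, article count, pointer into frequencies, pointer into positions)
frequencies = []  # (article number, frequency) entries, one per keyword/article pair
positions = []    # keyword positions, concatenated in the same order

for keyword in sorted(postings):
    dictionary.append((keyword, len(postings[keyword]), len(frequencies), len(positions)))
    for number, where in sorted(postings[keyword].items()):
        frequencies.append((number, len(where)))
        positions.extend(where)

def lookup(keyword):
    """Follow the dictionary pointers and return {article number: positions}."""
    # A linear scan keeps the sketch short; a sorted dictionary allows binary search.
    for term, count, freq_ptr, pos_ptr in dictionary:
        if term == keyword:
            result = {}
            for number, freq in frequencies[freq_ptr:freq_ptr + count]:
                result[number] = positions[pos_ptr:pos_ptr + freq]
                pos_ptr += freq
            return result
    return {}

print(lookup("live"))  # {1: [2, 5], 2: [2]}
```

The positions for live are stored as the run 2, 5, 2, and the frequencies tell the reader where one article's positions end and the next begin.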
<4> Compression Algorithms
To reduce the size of the index file, Lucene uses compression techniques for the index.
First, the keywords in the dictionary file are compressed. Each keyword is stored as <prefix length, suffix>, where the prefix length is the number of leading characters shared with the previous keyword. For example, if the current word is "阿拉伯语" ("Arabic language") and the previous word is "阿拉伯" ("Arabia"), then "阿拉伯语" is compressed to <3, 语>.
Second, numbers are compressed heavily. A number is saved only as its difference from the previous value, which shortens the number and therefore reduces the bytes needed to store it. For example, if the current article number is 16389 (which takes 3 bytes to store uncompressed) and the previous article number is 16382, then after compression only 7 is saved (a single byte).
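The <prefix length, suffix> scheme can be sketched as follows. The function names and the sample words are illustrative assumptions; the point is only that each keyword is stored relative to the one before it in sorted order.

```python
def common_prefix_length(a, b):
    """Count how many leading characters a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def compress_terms(sorted_terms):
    """Encode each term as (shared prefix length with previous term, remaining suffix)."""
    compressed = []
    previous = ""
    for term in sorted_terms:
        n = common_prefix_length(previous, term)
        compressed.append((n, term[n:]))
        previous = term
    return compressed

def decompress_terms(compressed):
    """Rebuild the original terms by reusing the prefix of the previous term."""
    terms = []
    previous = ""
    for n, suffix in compressed:
        previous = previous[:n] + suffix
        terms.append(previous)
    return terms

terms = ["live", "lived", "lives", "living"]
packed = compress_terms(terms)
print(packed)  # [(0, 'live'), (4, 'd'), (4, 's'), (3, 'ing')]
print(decompress_terms(packed) == terms)  # True
```

The scheme works well precisely because the dictionary is sorted: neighbouring keywords tend to share long prefixes.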
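The byte counts in this example make sense if numbers are stored with a variable-length encoding that carries 7 data bits per byte, which is the assumption made in the sketch below. The function names are hypothetical and this is not Lucene's code.

```python
def encode_vint(value):
    """Encode a non-negative integer using 7 data bits per byte.

    The high bit of each byte is set when more bytes follow.
    """
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def delta_encode(numbers):
    """Replace each number in an ascending list by its difference from the previous one."""
    deltas = []
    previous = 0
    for n in numbers:
        deltas.append(n - previous)
        previous = n
    return deltas

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    numbers = []
    total = 0
    for d in deltas:
        total += d
        numbers.append(total)
    return numbers

print(len(encode_vint(16389)))        # 3 bytes without delta compression
print(delta_encode([16382, 16389]))   # [16382, 7]
print(len(encode_vint(7)))            # 1 byte for the stored difference
```

Under this encoding any value below 128 fits in one byte, so closely spaced article numbers cost one byte each after delta compression.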
<5> Why Build an Index
We can now use a query against this index to explain why an index is worth building.
Suppose we query for the word "live". Lucene first performs a binary search on the dictionary to find the word, then follows the pointer to the frequency file to read all the article numbers, and returns the result. The dictionary is usually very small, so the whole process takes only milliseconds.
If there were no index and an ordinary sequential matching algorithm were used instead, the query string would have to be matched against the content of every article. That process is quite slow, and when the number of articles is large the time required is often unacceptable.
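The difference between the two approaches can be seen in a toy comparison. The sketch assumes a prebuilt keyword-to-articles mapping and uses a Python dict as a stand-in for the sorted dictionary file; the helper names and the small stem table are hypothetical.

```python
import re

articles = {
    1: "Tom lives in Guangzhou, I live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}

# Hypothetical stem table covering only the sample articles.
STEMS = {"lives": "live", "lived": "live"}

def words_of(text):
    """Lowercase and stem every word of one article."""
    return [STEMS.get(w.lower(), w.lower()) for w in re.findall(r"[A-Za-z]+", text)]

def sequential_search(articles, keyword):
    """Without an index: re-read every article on every query."""
    return [number for number, text in sorted(articles.items()) if keyword in words_of(text)]

# With an index: the articles are read once, when the index is built.
index = {}
for number, text in sorted(articles.items()):
    for word in set(words_of(text)):
        index.setdefault(word, []).append(number)

def indexed_search(index, keyword):
    """With an index: a single dictionary lookup per query."""
    return index.get(keyword, [])

print(sequential_search(articles, "live"))  # [1, 2]
print(indexed_search(index, "live"))        # [1, 2]
```

Both functions return the same answer, but the cost of the sequential version grows with the total amount of text, while the indexed lookup depends only on the size of the dictionary.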