1. Word-document Matrix
The most common search scenario is that a user supplies several keywords and wants the documents that contain them.
The key to search is therefore finding the documents that contain a given keyword quickly. The word-document matrix model answers both questions directly: which words a document contains, and which documents contain a word.
A search engine index is a concrete data structure that implements the word-document matrix. Candidates include inverted indexes, signature files, and suffix trees. Inverted indexes are the most common, and Lucene is built on them.
2. Inverted Indexes
2.1. Composition of Inverted Indexes
Inverted indexes generally consist of a vocabulary and a record table.
Vocabulary: the set of distinct words that appear in the document collection.
Record table: for every word in the vocabulary, a list of the numbers of the documents that contain that word (other information, such as the positions of the word within each document, may also be stored).
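The two parts can be sketched in a few lines of Python. This is only an illustration, assuming whitespace tokenization and document numbers taken from list order; the function name build_inverted_index is made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Return {word: [(doc_id, [positions]), ...]} for a list of texts."""
    postings = defaultdict(dict)  # word -> {doc_id: [positions]}
    for doc_id, text in enumerate(docs):
        # Assumption: lowercase + whitespace split stands in for a real tokenizer.
        for pos, word in enumerate(text.lower().split()):
            postings[word].setdefault(doc_id, []).append(pos)
    # Keys form the vocabulary; each value is that word's record table,
    # ordered by document number.
    return {word: sorted(by_doc.items()) for word, by_doc in postings.items()}

docs = ["the quick brown fox", "the lazy dog", "quick quick dog"]
index = build_inverted_index(docs)
print(sorted(index))     # the vocabulary
print(index["quick"])    # the record table for "quick"
```

Here the dictionary keys play the role of the vocabulary and each value plays the role of one entry in the record table.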
2.2. Using the Inverted Index (Search)
How can such a structure be used for retrieval?
Generally, the vocabulary and the record table are stored separately. Each word in the vocabulary file holds a pointer into the record table file. To search, first look up the query word in the vocabulary, then follow its pointer to retrieve the corresponding record table. If the query contains a single word, that record table is the result. If the query contains multiple words, the record tables must be merged (by intersection or union).
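Merging two record tables can be sketched as follows, assuming each record table is a list of document numbers in ascending order. The function names intersect and union are illustrative only.

```python
def intersect(a, b):
    """AND query: document numbers present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR query: document numbers present in either sorted list, no duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # whatever remains in one list belongs to the union
    out.extend(b[j:])
    return out

print(intersect([1, 3, 5, 9], [3, 4, 9, 10]))
print(union([1, 3, 5, 9], [3, 4, 9, 10]))
```

Because both inputs are sorted, each merge is a single linear pass over the two lists.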
Efficiency is what matters here: how do we find a word quickly and fetch its record table? If the vocabulary is small enough to fit in main memory, a sorted array, a B-tree, a trie, or a hash table are all viable data structures. A hash table gives the fastest exact lookups, while B-trees and tries can also handle prefix queries and range queries. Lucene does not assume that the whole vocabulary fits in main memory. It splits the vocabulary into two files that work much like a two-level skip list, with binary search used for the lookup.
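As one example of these options, here is a minimal sketch of a vocabulary kept as a sorted array and searched with binary search. It is not Lucene's file layout; the class name SortedVocabulary and the offsets are hypothetical.

```python
from bisect import bisect_left

class SortedVocabulary:
    def __init__(self, entries):
        # entries: {word: offset of its record table}; offsets are made up here.
        self.words = sorted(entries)
        self.offsets = [entries[w] for w in self.words]

    def lookup(self, word):
        """Exact match by binary search; returns the offset or None."""
        i = bisect_left(self.words, word)
        if i < len(self.words) and self.words[i] == word:
            return self.offsets[i]
        return None

    def prefix(self, p):
        """All words starting with p: binary search, then scan forward."""
        i = bisect_left(self.words, p)
        out = []
        while i < len(self.words) and self.words[i].startswith(p):
            out.append(self.words[i])
            i += 1
        return out

vocab = SortedVocabulary({"index": 0, "invert": 40, "lucene": 96, "inverted": 72})
print(vocab.lookup("invert"))
print(vocab.prefix("inv"))
```

A hash table would answer lookup faster on average, but it keeps no order, so a prefix query like the one above would need a full scan.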
2.3. Creating an Index
There are three methods:
2.3.1. Two-pass document traversal (2-pass in-memory inversion)
This method relies entirely on memory. The first pass scans all documents, computes the TF of each word, and sums these into a total TF over all words. That total directly determines how much memory the final index needs. In memory, an ordered array is created to hold the words, and a contiguous block (sized by the total TF) is allocated to hold the inverted lists; each word holds a "pointer" to the position of its inverted list. The second pass fills in the inverted list of each word. Once all documents have been scanned, the result is persisted in some form.
This method has three disadvantages: 1) it scans the documents twice, which is inefficient; 2) it depends entirely on memory, so it cannot handle large document sets; 3) it does not support dynamic index updates.
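The two passes can be sketched as below. This is a simplified illustration, assuming whitespace tokenization and one (document number, position) entry per occurrence, so that the total TF equals the number of slots needed; all names are invented for the example.

```python
def two_pass_invert(docs):
    # Pass 1: TF of each word, summed over all documents.
    tf = {}
    for text in docs:
        for word in text.split():
            tf[word] = tf.get(word, 0) + 1

    words = sorted(tf)            # ordered array of words
    total = sum(tf.values())      # total TF fixes the memory needed
    slots = [None] * total        # one contiguous block for all inverted lists

    # Each word's "pointer" is the start offset of its list inside slots.
    start = {}
    offset = 0
    for w in words:
        start[w] = offset
        offset += tf[w]

    # Pass 2: fill each word's region with (doc_id, position) entries.
    cursor = dict(start)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            slots[cursor[word]] = (doc_id, pos)
            cursor[word] += 1
    return words, start, tf, slots

def inverted_list(word, start, tf, slots):
    """Slice the contiguous block to get one word's inverted list."""
    return slots[start[word]:start[word] + tf[word]]

words, start, tf, slots = two_pass_invert(["a b a", "b c"])
print(inverted_list("b", start, tf, slots))
```

The block is allocated once, before the second pass, which is why the first pass over the documents cannot be skipped.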
2.3.2. Sorting method (sort-based inversion)
To overcome the drawbacks of the two-pass method, the sorting method sets aside only a fixed-size space in memory for building the index. When that space is used up, its contents are written to disk and the memory is cleared so that indexing can continue with the remaining documents, until the index is complete. The sorting method keeps all words in memory and keeps collecting inverted items (word ID, document ID, and word frequency). When the space is used up, all inverted items are sorted by word ID and then by document ID and written to a temporary file. The memory occupied by the inverted items is then cleared and the process repeats. Note that the word ID is the primary sort key. Words are not all loaded into memory at the start; each word is added only when it first appears in a document during traversal. As the traversal proceeds, the words therefore take up more and more space and less and less is left for inverted items. In other words, the fewer documents remain to be traversed, the smaller the temporary files become.
After traversing the entire document set, we have a set of temporary files (storing inverted items) that must be "integrated". Because the inverted items in each file are already sorted, this step is relatively simple: merge the inverted items of the same word.
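A minimal sketch of this process follows. In-memory lists stand in for the temporary files on disk, and the max_items limit is a hypothetical stand-in for the fixed memory budget.

```python
import heapq
from itertools import groupby

def sort_based_invert(docs, max_items=4):
    word_ids = {}   # vocabulary kept in memory: word -> word ID
    buffer = []     # inverted items: (word_id, doc_id, tf)
    runs = []       # stand-in for the temporary files on disk

    def flush():
        if buffer:
            runs.append(sorted(buffer))  # sort by word ID, then document ID
            buffer.clear()

    for doc_id, text in enumerate(docs):
        counts = {}
        for word in text.split():
            # A word gets its ID the first time it is seen.
            wid = word_ids.setdefault(word, len(word_ids))
            counts[wid] = counts.get(wid, 0) + 1
        for wid, tf in counts.items():
            buffer.append((wid, doc_id, tf))
            if len(buffer) >= max_items:   # the "space" is used up
                flush()
    flush()

    # Every run is sorted, so a k-way merge yields one globally sorted stream.
    merged = heapq.merge(*runs)
    id_to_word = {wid: w for w, wid in word_ids.items()}
    return {id_to_word[wid]: [(d, tf) for _, d, tf in group]
            for wid, group in groupby(merged, key=lambda item: item[0])}

print(sort_based_invert(["a b a", "b c", "a c c"]))
```

The final merge only ever needs the head of each run, which is what keeps this step cheap even when the runs live on disk.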
2.3.3. Merge method (merge-based inversion)
The most critical problem with the sorting method is that it keeps all words in memory. If there are too many words for memory to hold, the method breaks down. The merge method solves this problem. It also sets aside a fixed-size space in memory. Each document is converted into a standard in-memory inverted index structure. Suppose that, within the memory limit, separate in-memory inverted indexes have been built for N documents; these are then merged into an index segment, which is written to disk, and the memory is cleared. The in-memory index and the index segment differ only in the number of documents they cover and in their structure. During this process the words must also be sorted to make later merging easier.
After traversing the entire document set, we again have a set of temporary files (index segments). As in the sorting method, we merge the inverted lists of the same word.
Lucene adopts this method.
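The idea can be sketched as follows. This is a toy illustration and not Lucene's actual implementation: segments are plain Python lists rather than files, and batch_size is a hypothetical stand-in for the memory limit.

```python
import heapq

def build_segment(batch):
    """Build an in-memory inverted index for one batch and freeze it.

    batch: list of (doc_id, text). Returns [(word, [doc_ids])] sorted by word,
    so each segment carries its own small vocabulary.
    """
    mem = {}
    for doc_id, text in batch:
        for word in set(text.split()):
            mem.setdefault(word, []).append(doc_id)
    return sorted(mem.items())

def merge_segments(segments):
    """Merge word-sorted segments, concatenating lists of the same word."""
    merged = {}
    # heapq.merge is stable, so equal words arrive in segment order and
    # document numbers stay ascending.
    for word, doc_ids in heapq.merge(*segments, key=lambda entry: entry[0]):
        merged.setdefault(word, []).extend(doc_ids)
    return merged

def merge_based_invert(docs, batch_size=2):
    numbered = list(enumerate(docs))
    segments = []
    for i in range(0, len(numbered), batch_size):
        # In a real system the segment is written to disk and memory is cleared.
        segments.append(build_segment(numbered[i:i + batch_size]))
    return merge_segments(segments)

print(merge_based_invert(["a b", "b c", "a c", "d"]))
```

Unlike the sorting method, no global vocabulary has to stay in memory: each segment is self-contained.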
2.4. Dynamic Indexing
If an index never needs to be adjusted after it is created (no documents are added or deleted), it is called a static index. Otherwise, it is a dynamic index.
Dynamic indexing also involves real-time indexing (real-time retrieval) and other issues, which are not covered here.
A dynamic index consists of three parts: the disk index file, the in-memory temporary index, and the list of deleted documents.
The in-memory temporary index, like the in-memory index in the merge method, temporarily holds newly added documents. The deleted-document list records the documents that have been deleted. A query searches the disk index file and the in-memory temporary index at the same time, and the results are then filtered against the deleted-document list. When the in-memory temporary index reaches a certain threshold, it is merged into the disk index file.
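The interplay of the three parts can be sketched as below. A dictionary stands in for the disk index file, and the class name DynamicIndex and its threshold parameter are invented for this example.

```python
class DynamicIndex:
    def __init__(self, threshold=2):
        self.disk = {}           # stand-in for the disk index file
        self.memory = {}         # in-memory temporary index
        self.deleted = set()     # list of deleted documents
        self.pending = 0         # documents currently held in memory
        self.threshold = threshold

    def add(self, doc_id, text):
        for word in set(text.split()):
            self.memory.setdefault(word, []).append(doc_id)
        self.pending += 1
        if self.pending >= self.threshold:
            self.flush()

    def delete(self, doc_id):
        # Deletion only records the document number; the indexes are untouched.
        self.deleted.add(doc_id)

    def flush(self):
        """Merge the in-memory temporary index into the disk index."""
        for word, doc_ids in self.memory.items():
            self.disk.setdefault(word, []).extend(doc_ids)
        self.memory.clear()
        self.pending = 0

    def search(self, word):
        # Query both indexes, then filter out deleted documents.
        hits = self.disk.get(word, []) + self.memory.get(word, [])
        return [d for d in hits if d not in self.deleted]

idx = DynamicIndex(threshold=2)
idx.add(0, "a b")
idx.add(1, "b c")   # threshold reached, memory is merged into disk
idx.add(2, "b d")   # stays in the in-memory temporary index
idx.delete(1)
print(idx.search("b"))
```

Deleted documents remain physically present in both indexes; they are only hidden at query time.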
2.5. Index Update Policy
Index update policies mainly include complete rebuild and incremental update (the re-merge policy and the in-place update policy). Lucene uses the re-merge policy.
Important:
1. Storage and lookup of the vocabulary (data structure)
Sorted array, B-tree, trie, and hash table
(Sina Weibo: @ quanliang _ machine learning)
From: http://www.cnblogs.com/huangfox/archive/2012/07/18/2597603.html