The inverted index is one of the most important technologies in a search engine and can fairly be called its cornerstone. With an inverted index, a search engine can carry out lookups, deletions, and other operations on its document collection efficiently.
1. The idea of inverted index
The inverted index arises from a practical need: finding records by the value of an attribute. Each entry in the index table holds an attribute value together with the addresses of all records that have that value. Here the record does not determine the attribute value; instead, the attribute value determines the location of the record, which is why the index is called inverted.
In a search engine, a query can be split into several words, so the attribute in the inverted index is the word and the record is a web page (more generally, a document). The inverted index of a search engine is thus a concrete storage form of the "word-document matrix": given a word (the attribute), it quickly returns the list of documents (the records) that contain it. An inverted index consists of two main parts: the word dictionary and the inverted file.
2. "Word-document Matrix"
The word-document matrix is a conceptual model of the containment relationship between words and documents; Figure 1 illustrates it. Each column in Figure 1 represents a document, each row represents a word, and a tick marks that the document contains the word:
Figure 1 Word-document matrix
Read vertically, by document, each column shows which words a document contains. Document 1, for example, contains word 1 and word 4 and no other words. Read horizontally, by word, each row shows which documents contain a word. Word 1, for example, appears in document 1 and document 4 and in no other document. The other rows and columns of the matrix are read the same way.
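The two ways of reading the matrix can be sketched in Python. The matrix below is a hypothetical stand-in for Figure 1: only the entries for document 1 and word 1 follow the text, while the matrix size and all other cells are assumptions made for illustration.

```python
# Rows are words, columns are documents; 1 means "the document contains the word".
# Only column doc1 and row word1 follow the text; every other cell is assumed.
WORDS = ["word1", "word2", "word3", "word4"]
DOCS = ["doc1", "doc2", "doc3", "doc4", "doc5"]
MATRIX = [
    [1, 0, 0, 1, 0],  # word1 appears in doc1 and doc4
    [0, 1, 1, 0, 0],  # word2 (assumed)
    [0, 0, 1, 0, 1],  # word3 (assumed)
    [1, 1, 0, 0, 1],  # word4 appears in doc1 (other cells assumed)
]


def words_in(doc):
    """Read one column: which words does this document contain?"""
    col = DOCS.index(doc)
    return [word for word, row in zip(WORDS, MATRIX) if row[col]]


def docs_containing(word):
    """Read one row: which documents contain this word?"""
    row = MATRIX[WORDS.index(word)]
    return [doc for doc, flag in zip(DOCS, row) if flag]


print(words_in("doc1"))          # ['word1', 'word4']
print(docs_containing("word1"))  # ['doc1', 'doc4']
```

Reading a row is exactly the question a search engine asks at query time, which is why the index is organized by word rather than by document.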
A search engine's index is the concrete data structure that implements the "word-document matrix". This conceptual model can be implemented in different ways, such as the inverted index, the signature file, and the suffix tree. However, experimental results show that the inverted index is the best way to implement the word-to-document mapping.
3. Basic framework for inverted indexes
Word dictionary: the usual indexing unit of a search engine is the word. The word dictionary is the collection of strings formed by all the words that appear in the document collection. Each entry in the word dictionary records some information about the word itself and a pointer to its inverted list.
Inverted list: the inverted list of a word records all documents in which the word appears, together with the positions at which it appears in each document. Each such record is called a posting. From the inverted list you can tell which documents contain a word.
Inverted file: the inverted lists of all words are usually stored one after another in a file on disk. This file is called the inverted file; it is the physical file that stores the inverted index.
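How the three parts fit together can be sketched as follows. This is a minimal illustration, not the layout of any real search engine: the byte format (little-endian 4-byte document numbers), the function names, and the sample data are all assumptions, and an in-memory buffer stands in for the file on disk.

```python
import io
import struct


def write_inverted_file(postings, out):
    """Write each inverted list back to back and return the word dictionary.

    The dictionary maps word -> (byte offset, number of document numbers),
    which plays the role of the "pointer to the inverted list".
    """
    dictionary = {}
    for word in sorted(postings):
        doc_ids = postings[word]
        dictionary[word] = (out.tell(), len(doc_ids))
        out.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return dictionary


def read_inverted_list(dictionary, inverted_file, word):
    """Follow the dictionary pointer and decode one inverted list."""
    if word not in dictionary:
        return []
    offset, count = dictionary[word]
    inverted_file.seek(offset)
    data = inverted_file.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))


# Hypothetical inverted lists (document numbers only, no positions).
postings = {"google": [1, 2, 3, 4, 5], "map": [1, 2, 5], "father": [1, 4]}
inverted_file = io.BytesIO()  # stands in for the inverted file on disk
dictionary = write_inverted_file(postings, inverted_file)

print(dictionary["map"])                                     # (28, 3)
print(read_inverted_list(dictionary, inverted_file, "map"))  # [1, 2, 5]
```

Only the dictionary has to be searched to answer "where is this word's list"; the lists themselves are read with a single seek.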
The rough query flow of a search engine built on an inverted index is as follows. When a user enters a query in the search box, the search engine splits the query into words and looks up synonyms, producing a list of words derived from the original query. It then uses its internal word dictionary to find the inverted list of each word and thereby locates the pages or documents that contain the word. Finally, the search engine sorts the retrieved pages with its page-ranking algorithm and displays the results to the user through the front end. Figure 2 shows the main process:
Figure 2 Inverted Index process framework
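The flow can be sketched end to end in a few lines. Everything here is a toy assumption: the index contents, the synonym table, whitespace splitting in place of real word segmentation, and a ranking rule that simply counts matched words in place of a real page-ranking algorithm.

```python
# Hypothetical data: a tiny index (word -> document numbers) and a synonym table.
INDEX = {"car": [1, 3], "automobile": [2, 3], "price": [2, 3, 4]}
SYNONYMS = {"car": ["automobile"]}


def expand(query):
    """Split the query into words and add the assumed synonyms."""
    words = []
    for word in query.lower().split():
        for candidate in [word] + SYNONYMS.get(word, []):
            if candidate not in words:
                words.append(candidate)
    return words


def search(query):
    """Look up the inverted list of each word, then rank the documents.

    The score is the number of expanded query words a document matches,
    a placeholder for a real ranking algorithm.
    """
    hits = {}
    for word in expand(query):
        for doc_id in INDEX.get(word, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


print(expand("car price"))  # ['car', 'automobile', 'price']
print(search("car price"))  # [3, 2, 1, 4]
```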
4. Word dictionary
As the process above shows, a key technique in the inverted index is building the word dictionary.
The word dictionary maintains information about every word that appears in the document collection and records where each word's inverted list is located in the inverted file. At query time, looking up the user's query words in the word dictionary yields the corresponding inverted lists, which serve as the basis for subsequent ranking.
A large document collection may contain hundreds of thousands or even millions of distinct words. How quickly a word can be located directly affects search response time, so efficient data structures are needed to build and search the word dictionary. Commonly used structures include the hash table with linked lists (hashing with chaining) and the tree-shaped dictionary.
1) Hash plus linked list (chaining)
Figure 3 shows this dictionary structure. It consists of two parts:
The main part is a hash table. Each hash table entry holds a pointer to a collision list, and in each collision list the words with the same hash value form a linked list. Collision lists exist because two different words may get the same hash value. In hashing this is called a collision, and words with the same hash value are stored in the same linked list for later lookup.
Figure 3 Dictionary structure: hash plus linked list
The dictionary is built as the index is built. For example, when a new document is parsed, for each word T in the document the hash function is first applied to obtain the word's hash value. The pointer stored in the hash table entry for that value is then read to find the corresponding collision list. If the word is already in the collision list, it appeared in a previously parsed document. If it is not found, this is the first time the word has been encountered, and it is added to the collision list. Once every document in the collection has been parsed in this way, the dictionary is complete.
Responding to a user query follows a similar procedure, except that a word is never added to the dictionary, even if the dictionary does not contain it. In Figure 3, for example, suppose the user submits a query for the word X. The word is hashed, which locates slot 4 of the hash table, and the pointer stored there leads to the collision list. X is then compared with each word in the collision list in turn. In this example X is found in the list, so its inverted list can be read for further processing. If the word were not found, no document in the collection would contain it and the search result would be empty.
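The build-time and query-time walks over the hash table can be sketched as below. This is an illustrative sketch under stated assumptions: the class and method names are hypothetical, a Python list stands in for each linked list, and the hash function (sum of character codes) is a toy chosen so that collisions are easy to produce.

```python
class ChainedDictionary:
    """Word dictionary as a hash table whose slots hold collision lists."""

    def __init__(self, num_slots=8):
        # Each slot is a Python list playing the role of the linked list.
        self.slots = [[] for _ in range(num_slots)]

    def _slot(self, word):
        # Toy hash function (an assumption): sum of character codes.
        return sum(ord(ch) for ch in word) % len(self.slots)

    def add(self, word):
        """Index time: return the word's entry, creating it if it is new."""
        chain = self.slots[self._slot(word)]
        for entry in chain:
            if entry["word"] == word:
                return entry  # already seen in an earlier document
        entry = {"word": word, "postings": []}
        chain.append(entry)
        return entry

    def find(self, word):
        """Query time: the same walk as add(), but it never inserts."""
        for entry in self.slots[self._slot(word)]:
            if entry["word"] == word:
                return entry
        return None


d = ChainedDictionary()
for doc_id, text in [(1, "tea eat"), (2, "eat")]:
    for word in text.split():
        postings = d.add(word)["postings"]
        if doc_id not in postings:
            postings.append(doc_id)

print(d._slot("tea") == d._slot("eat"))  # True: anagrams collide under this hash
print(d.find("eat")["postings"])         # [1, 2]
print(d.find("ate"))                     # None: same slot, but not in the list
```

The last line shows the empty-result case: the query word hashes to an occupied slot, yet comparing it against every word in the collision list finds no match.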
2) Tree structure
The B-tree (or B+ tree) is another efficient lookup structure; figure 1-8 shows a B-tree. Unlike the hash method, a B-tree requires the dictionary entries to be sorted (in numeric or character order), whereas hashing places no such requirement on the data.
A B-tree forms a hierarchical search structure. The intermediate nodes indicate which subtree stores the dictionary entries in a given ordered range and are used for navigation by comparing dictionary entries. The leaf nodes at the bottom level store the address information of the words, and the word string can be retrieved from that address.
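The navigation idea can be illustrated with a deliberately simplified structure. The sketch below is not a real B-tree: it is a static, two-level tree with no insertion or node splitting, its leaves hold the words themselves rather than addresses, and all names are hypothetical. It only shows how an intermediate node routes a lookup to the right leaf by comparison, and why sorted order matters.

```python
from bisect import bisect_left, bisect_right


class TwoLevelDictionary:
    """Static two-level search tree over sorted words (not a full B-tree)."""

    def __init__(self, words, leaf_size=3):
        ordered = sorted(set(words))  # tree dictionaries require sorted entries
        # Leaves: fixed-size runs of consecutive sorted words.
        self.leaves = [ordered[i:i + leaf_size]
                       for i in range(0, len(ordered), leaf_size)]
        # Intermediate node: the smallest word of each leaf, used for routing.
        self.separators = [leaf[0] for leaf in self.leaves]

    def _leaf_for(self, word):
        # Index of the last leaf whose smallest word is <= word (or leaf 0).
        return max(bisect_right(self.separators, word) - 1, 0)

    def contains(self, word):
        if not self.leaves:
            return False
        leaf = self.leaves[self._leaf_for(word)]
        i = bisect_left(leaf, word)
        return i < len(leaf) and leaf[i] == word

    def with_prefix(self, prefix):
        """Sorted storage also supports prefix scans, which hashing does not."""
        if not self.leaves:
            return []
        found = []
        for leaf in self.leaves[self._leaf_for(prefix):]:
            for word in leaf:
                if word.startswith(prefix):
                    found.append(word)
                elif word > prefix:
                    return found  # past the end of the prefix range
        return found


tree = TwoLevelDictionary(
    ["cat", "band", "apple", "dog", "banana", "apply", "bandit"])
print(tree.separators)          # ['apple', 'band', 'dog']
print(tree.contains("bandit"))  # True
print(tree.contains("bat"))     # False
print(tree.with_prefix("ban"))  # ['banana', 'band', 'bandit']
```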
5. Example of inverted index
Suppose the document collection contains five documents whose contents are shown in Figure 4; the leftmost column of the figure is the document number. Our task is to build an inverted index for this collection.
Figure 4 Document Collection
Unlike English and similar languages, Chinese has no explicit delimiter between words, so a word segmentation system is first used to cut each document automatically into a sequence of words. Each document thus becomes a stream of words. For ease of subsequent processing, each distinct word is assigned a unique word number, and the documents that contain each word are recorded. At the end of this process we obtain the simplest inverted index (see Figure 5). In Figure 5, the "Word ID" column records the number of each word, the second column is the word itself, and the third column is the word's inverted list. For example, the word "Google" has word number 1 and inverted list {1,2,3,4,5}, which means that every document in the collection contains it.
Figure 5 A simple inverted index
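A minimal sketch of this construction is shown below. The documents are hypothetical and already segmented into words; they are not the five documents of Figure 4, and the function name is an assumption.

```python
def build_simple_index(documents):
    """Return (word_ids, index): word -> number, and number -> document list."""
    word_ids = {}
    index = {}
    for doc_id in sorted(documents):
        for word in documents[doc_id]:
            # Assign the next free number the first time a word is seen.
            word_id = word_ids.setdefault(word, len(word_ids) + 1)
            doc_list = index.setdefault(word_id, [])
            # Documents are visited in order, so checking the tail avoids
            # recording the same document twice.
            if not doc_list or doc_list[-1] != doc_id:
                doc_list.append(doc_id)
    return word_ids, index


# Hypothetical, already-segmented documents (not the ones in Figure 4).
documents = {
    1: ["google", "map", "google"],
    2: ["google", "earth"],
    3: ["google", "map"],
}
word_ids, index = build_simple_index(documents)

print(word_ids)                # {'google': 1, 'map': 2, 'earth': 3}
print(index[word_ids["map"]])  # [1, 3]
```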
The inverted index in Figure 5 is the simplest kind because it records only which documents contain a word; an index can in fact record more than that. Besides the document number, the inverted list can also record the word frequency (TF), that is, the number of times the word occurs in a document. This is recorded because word frequency is an important factor when computing the similarity between a query and a document during ranking, and storing it in the inverted list makes the later score calculation convenient. A practical inverted index can record still more. In addition to document numbers and word frequencies, the index in Figure 6 records two further kinds of information, the first being the document frequency of each word (the third column in Figure 6), that is, the number of documents in the collection that contain the word.
Figure 6 Inverted index with word frequency, document frequency, and occurrence position information
In addition to the information above, the inverted list can also record the positions at which a word occurs in a document.
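An index carrying all three kinds of information might be built as follows. This is a sketch only: the record layout, the 1-based positions, and the sample documents are assumptions, not the layout used in Figure 6.

```python
from collections import defaultdict


def build_index(documents):
    """Build word -> {"df": int, "postings": [(doc_id, tf, [positions])]}."""
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, words in documents.items():
        # Positions are counted from 1 (an assumption for this sketch).
        for pos, word in enumerate(words, start=1):
            positions[word][doc_id].append(pos)

    index = {}
    for word, per_doc in positions.items():
        postings = [(doc_id, len(where), where)
                    for doc_id, where in sorted(per_doc.items())]
        # Document frequency is the number of documents containing the word.
        index[word] = {"df": len(postings), "postings": postings}
    return index


# Hypothetical, already-segmented documents.
documents = {
    1: ["google", "map", "google"],
    2: ["google", "earth"],
    3: ["google", "map"],
}
index = build_index(documents)

print(index["google"])
# {'df': 3, 'postings': [(1, 2, [1, 3]), (2, 1, [1]), (3, 1, [1])]}
print(index["map"])
# {'df': 2, 'postings': [(1, 1, [2]), (3, 1, [2])]}
```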
The inverted index in Figure 6 is already a fairly complete index. The index structure of a real search system is basically the same; the differences lie only in the specific data structures used to implement this logical structure.
With this index, a search engine can readily respond to user queries. Suppose the user enters the query word "Facebook". The search system looks it up in the inverted index and reads out the documents that contain the word; these documents are the candidate search results. Using the word frequency and document frequency information, the system then computes the similarity between each document and the query, sorts the candidates from high to low by similarity score, and presents the results to the user.
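One common way to combine the two statistics is TF-IDF weighting; the text does not say which similarity formula is used, so the sketch below is an assumption rather than the method of any particular search engine. The index contents and the document count are also made up.

```python
import math

# Hypothetical index: word -> {document number: word frequency}.
INDEX = {
    "facebook": {1: 1, 3: 2, 5: 1},
    "google": {1: 1, 2: 1, 3: 1, 4: 1, 5: 1},
}
NUM_DOCS = 5  # assumed size of the document collection


def rank(query_words):
    """Score documents by the sum of tf * log(N / df) over the query words."""
    scores = {}
    for word in query_words:
        postings = INDEX.get(word, {})
        if not postings:
            continue  # the word occurs in no document
        # Document frequency is the length of the inverted list.
        idf = math.log(NUM_DOCS / len(postings))  # rarer words weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Highest score first; ties are broken by document number.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


for doc_id, score in rank(["facebook"]):
    print(doc_id, round(score, 3))
```

With this data, document 3 ranks first for "facebook" because the word occurs there twice. A word such as "google" that appears in every document gets a weight of zero, which is the usual motivation for including document frequency in the score.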