I have recently been studying how Sphinx works. An earlier post, "[Search engine] Sphinx introduction and principle exploration", briefly covered its working principle but left many questions open, such as the underlying data structures and algorithms, so I wanted to understand how it works at the data-structure level. I searched the Internet for material and found few articles on this topic. I then found the book "This Is the Search Engine" and read its third chapter, which introduces the data structures used by mainstream search engines and how they work. Sphinx uses the same data structure: the inverted index.
Note: This article does not strictly separate Sphinx from search engines in general; both are treated simply as search engines.
First, an overview diagram:
Index basics
First, some basic concepts related to search engines. Understanding them is important for following the working mechanisms described later.
Word-Document Matrix
The word-document matrix is a conceptual model that expresses the containment relationship between words and documents. As shown in the figure, each column represents a document, each row represents a word, and a check mark indicates that the document contains the word.
Read vertically, each column shows which words a document contains; read horizontally, each row shows which documents contain a word. A search engine's index is the concrete data structure that implements the word-document matrix. This conceptual model can be implemented in different ways, such as inverted indexes, signature files, and suffix trees, but experimental data show that the inverted index is the best way to implement the word-to-document mapping.
Inverted index: basic concepts
Document: A storage object that exists in text form, such as web pages, Word, PDF, XML, and other files in different formats.
Document collection: A set of documents, such as a large number of web pages.
Document ID: A unique number that identifies a document inside the search engine.
Word ID: A unique number that identifies a word inside the search engine.
Inverted index: A concrete storage form that implements the word-document matrix. An inverted index consists mainly of a word dictionary and inverted files.
Word dictionary (lexicon): The set of strings of all words that appear in the document collection. Each entry records some information about the word itself and a pointer to its inverted list.
Inverted list (posting list): The list of all documents that contain a given word, together with the positions at which the word appears in each document. Each record in the list is called a posting.
Inverted file: The file that holds the inverted lists of all words; it is the physical file that stores the inverted index.
The relationship between these concepts:
Inverted index: a simple example
The following example gives a more intuitive feel for the inverted index.
Assume that the document collection contains 5 documents, with the following contents:
The inverted index built from them looks like this:
Word ID: the number assigned to each word
Word: the corresponding word
Document frequency: how many documents in the collection contain the word
Inverted list: contains the document ID and other necessary information
TF: the number of times the word appears in a document
POS: the position at which the word appears in the document
Take the word "join" as an example: its word ID is 6 and its document frequency is 3, meaning that three documents in the collection contain it. Its inverted list is {(2;1;<4>),(3;1;<7>),(5;1;<5>)}, which means the word appears in documents 2, 3, and 5, once in each. In document 2 its POS is 4, that is, the fourth word of that document is "join"; the other entries read the same way.
This inverted index is already a fairly complete index system, and the index structure of real search systems is basically the same.
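To make the table layout concrete, here is a minimal Python sketch that builds the same kind of structure (document frequency plus postings of document ID, TF, and positions). The toy documents and the function name `build_inverted_index` are assumptions for illustration; they are not the five documents from the original example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps a document ID to its list of words (already tokenized)."""
    postings = defaultdict(list)  # word -> [(doc_id, tf, [positions])]
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, word in enumerate(docs[doc_id], start=1):  # 1-based positions
            positions[word].append(pos)
        for word, pos_list in positions.items():
            postings[word].append((doc_id, len(pos_list), pos_list))
    # The document frequency is the length of each inverted list.
    return {word: {"df": len(plist), "postings": plist}
            for word, plist in postings.items()}

# Hypothetical toy collection, not the article's original documents.
docs = {
    1: ["google", "map", "founder", "leaves"],
    2: ["founder", "to", "join", "facebook"],
    3: ["facebook", "hires", "map", "founder"],
}
index = build_inverted_index(docs)
print(index["join"])
```

Each posting is a tuple in the same (document ID; TF; positions) order used in the example above.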
Word dictionary
The word dictionary maintains information about all words that appear in the document collection and records where each word's inverted list is located in the inverted file. At query time, looking a word up in the dictionary yields its inverted list, which serves as the basis for subsequent ranking.
Common data structures: hash table plus linked lists, and tree-shaped dictionary structures.
Hash plus linked list
The figure shows the structure of a hash-plus-linked-list dictionary. The main body is a hash table; each hash table entry holds a pointer to a collision chain, and words with the same hash value form a linked list.
Build process:
1. Tokenize the document.
2. For each token, compute its hash value with the hash function.
3. Use the hash value to find the hash table entry and its collision chain.
4. If the word already exists in the collision chain, do nothing; otherwise add it to the chain.
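The build process above can be sketched roughly as follows. The class name `HashDictionary`, the toy hash function, and the use of Python lists as collision chains are all assumptions made for illustration, not how Sphinx or any particular engine implements its dictionary.

```python
class HashDictionary:
    """Toy word dictionary: a hash table whose entries are collision chains."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]  # one chain per entry
        self.next_word_id = 1

    def _chain(self, word):
        # Deliberately simple, deterministic hash (an assumption for the demo).
        h = sum(ord(c) for c in word)
        return self.buckets[h % len(self.buckets)]

    def add(self, word):
        chain = self._chain(word)
        for entry_word, word_id in chain:
            if entry_word == word:      # already in the chain: do nothing
                return word_id
        word_id = self.next_word_id     # otherwise append to the chain
        self.next_word_id += 1
        chain.append((word, word_id))
        return word_id

    def lookup(self, word):
        for entry_word, word_id in self._chain(word):
            if entry_word == word:
                return word_id
        return None

d = HashDictionary(num_buckets=4)
for token in ["ab", "ba", "ab"]:  # "ab" and "ba" collide under the toy hash
    d.add(token)
```

A real dictionary entry would also carry the pointer to the word's inverted list; the sketch stores only the word ID to keep the chain logic visible.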
Tree-shaped structure
Use a B-tree or B+ tree structure. Unlike a hash table, it requires dictionary entries to be sortable, that is, ordered numerically or by character order. The tree structure supports hierarchical lookup: intermediate nodes record which subtree holds the dictionary entries in a given range, and the leaf nodes at the bottom store the address information of the words.
Inverted list
The inverted list records which documents contain a word. It consists of inverted index entries (postings), each made up of the document ID, the number of times the word occurs in that document (TF), and the positions at which it occurs. The series of postings for the documents containing a word forms that word's inverted list. The figure shows an inverted list:
Building the index
The index structure was described above, so how is the index built once the data is available? There are three main methods.
Two-pass document traversal (2-pass in-memory inversion)
This method completes index creation entirely in memory, so memory must be large enough.
First pass
Collect global statistics: the number of documents N in the collection, the number of distinct words in the collection, and, for each word, the document frequency DF (the number of documents in which it appears).
Summing the DF values of all words tells you how much memory the final index needs. Once these statistics are available, memory and other resources are allocated accordingly, and at the same time the position in memory where each word's inverted list should start is recorded.
Second pass
Build the inverted list information word by word: obtain the ID of each document containing a word and the word's frequency TF in that document, and keep filling the memory allocated during the first pass. When the second pass finishes, the allocated memory is exactly full, and each word's pointer identifies a "fragment" of the memory area; the data between that fragment's start and end positions is the word's inverted list.
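A minimal sketch of the two passes, assuming tokenized documents held in a dict and a flat Python list standing in for the preallocated memory area. The names `two_pass_inversion` and `posting_list` are hypothetical.

```python
from collections import Counter

def two_pass_inversion(docs):
    # Pass 1: document frequency of every word.
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    # The sum of all DF values is the number of postings to allocate.
    postings = [None] * sum(df.values())
    start, offset = {}, 0
    for word in sorted(df):          # record where each word's list starts
        start[word] = offset
        offset += df[word]
    # Pass 2: fill the preallocated area with (doc_id, tf) postings.
    next_free = dict(start)
    for doc_id in sorted(docs):
        for word, tf in Counter(docs[doc_id]).items():
            postings[next_free[word]] = (doc_id, tf)
            next_free[word] += 1
    return postings, start, df

def posting_list(word, postings, start, df):
    """The slice between a word's start and end positions is its inverted list."""
    return postings[start[word]:start[word] + df[word]]

docs = {1: ["a", "b", "a"], 2: ["b", "c"]}  # hypothetical toy collection
postings, start, df = two_pass_inversion(docs)
```

After the second pass no slot is left empty, which mirrors the statement that the allocated memory is exactly filled.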
Sorting method (sort-based inversion)
During indexing, a fixed-size space is allocated in memory to hold the dictionary and the intermediate index results. When that space is used up, the intermediate results are written to disk and the memory they occupied is cleared so it can hold the next round of intermediate results. As shown in the figure:
The figure shows how the sorting method builds the intermediate results of the index. Build process:
1. After a document is read in, number it, giving it a unique document ID, and parse its content.
2. Map each word to a word ID.
3. Build (word ID, document ID, word frequency) triples.
4. Append the triples to the end of the intermediate result storage area.
5. Process the next document in turn.
6. When the allocated memory quota is full, sort the intermediate results (by word ID, then by document ID).
7. Write the sorted triples to a disk file.
Note: With the sorting method the dictionary stays in memory throughout indexing. Because the allocated memory is fixed in size and the dictionary takes up more and more of it, less and less space remains for triples as indexing proceeds.
After the intermediate results are built, they need to be merged.
When merging, the system creates a data buffer in memory for each intermediate result file to hold part of that file's data. Triples with the same word ID from the different buffers are merged; once all triples for a word ID have been merged, that word's inverted list is complete, so it is written to the final index and, at the same time, the triples for that word ID are cleared from each buffer. Each buffer then reads the following triples from its intermediate result file for the next round of merging. When all intermediate result files have been read through in order and merged, the final index file is complete.
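The sort-then-merge flow can be sketched as below. This is only an in-memory simulation: the list `runs` stands in for the temporary files on disk, `max_triples` stands in for the memory quota, and the function name is hypothetical. `heapq.merge` performs the multi-way merge of the sorted runs.

```python
import heapq
from collections import Counter
from itertools import groupby

def sort_based_inversion(docs, max_triples=4):
    word_ids = {}            # the dictionary stays in memory: word -> word ID
    buffer, runs = [], []    # runs simulates the sorted temporary files

    def flush():
        if buffer:
            runs.append(sorted(buffer))  # sort by (word ID, document ID)
            buffer.clear()

    for doc_id in sorted(docs):
        for word, tf in Counter(docs[doc_id]).items():
            wid = word_ids.setdefault(word, len(word_ids) + 1)
            buffer.append((wid, doc_id, tf))
            if len(buffer) >= max_triples:   # memory quota used up
                flush()
    flush()

    # Merge phase: triples with the same word ID form one inverted list.
    index = {}
    for wid, group in groupby(heapq.merge(*runs), key=lambda t: t[0]):
        index[wid] = [(doc_id, tf) for _, doc_id, tf in group]
    return word_ids, index

docs = {1: ["a", "b", "a"], 2: ["b", "c"], 3: ["a", "c", "c"]}  # toy data
word_ids, index = sort_based_inversion(docs, max_triples=2)
```

Because every run is already sorted by word ID and then document ID, the merged stream delivers all triples of one word consecutively, so each inverted list can be written out as soon as the word ID changes.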
Merge method (merge-based inversion)
The merge method is similar to the sorting method, except that each time data is written to disk, all intermediate results, including the dictionary, are written out. All memory can then be cleared, and subsequent indexing can use the whole fixed memory quota. The merge method is shown in the figure:
Differences from the sorting method:
1. With the sorting method, memory holds the dictionary and the triples, and the two are not directly linked; the dictionary only maps words to word IDs. The merge method instead builds a complete index structure in memory, which is one part of the final index.
2. When intermediate results are written to a temporary file on disk, the merge method writes this in-memory inverted index to the file and then clears the memory completely. The sorting method writes only the sorted triples to the temporary file, while the dictionary stays in memory as a mapping table.
3. When merging, the sorting method merges the triples of the same word one after another. The temporary files of the merge method hold partial inverted lists for each word, so the partial lists of each word are merged to form that word's final inverted list.
Dynamic indexing
In the real world, documents in the collection a search engine handles may be deleted or modified. If such changes must be reflected in search results immediately, a dynamic index can meet this real-time requirement. A dynamic index has three key structures: the inverted index, the temporary index, and the deleted document list.
Temporary index: An inverted index built in memory in real time. When a new document enters the system, it is parsed immediately and appended to the temporary index.
Deleted document list: Stores the IDs of documents that have been deleted, forming a list of document IDs. A modified document can be treated as deleting the old document and then adding a new one, which supports content changes indirectly.
When a new document enters the system, it is added to the temporary index immediately. When a document is deleted, it is added to the deleted document list. When a document is changed, the original document is placed in the deleted document list, and the changed content is parsed and added to the temporary index. This satisfies the real-time requirement.
When processing a user query, the search engine reads the inverted lists of the query words from both the inverted index and the temporary index, finds the documents containing the query, merges the two result sets, and then uses the deleted document list to filter out deleted documents. The result is the final search output returned to the user.
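The interplay of the three structures might look like the following sketch. The class `DynamicIndex` and its methods are hypothetical, postings are reduced to plain document IDs, and a changed document is assumed to re-enter the system under a new document ID.

```python
class DynamicIndex:
    def __init__(self, main_index):
        self.main = main_index   # word -> [doc IDs]; stands in for the disk index
        self.temp = {}           # temporary index built in memory
        self.deleted = set()     # deleted document list

    def add(self, doc_id, words):
        for word in set(words):
            self.temp.setdefault(word, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def update(self, old_doc_id, new_doc_id, words):
        # A change is handled as delete-old plus add-new.
        self.delete(old_doc_id)
        self.add(new_doc_id, words)

    def search(self, word):
        # Merge results from both indexes, then filter out deleted documents.
        found = set(self.main.get(word, [])) | set(self.temp.get(word, []))
        return sorted(found - self.deleted)

idx = DynamicIndex({"sphinx": [1, 2]})
idx.add(3, ["sphinx", "index"])
idx.delete(2)
print(idx.search("sphinx"))
```

Note that deleted documents are never removed from the inverted lists here; they are only hidden at query time, which is what makes deletion immediate.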
Index Update Policy
Dynamic indexing meets the needs of real-time search, but as more documents are added, the temporary index consumes more and more memory. The contents of the temporary index therefore have to be moved into the disk index to free memory for later documents, which calls for a reasonable and effective index update policy.
Full rebuild strategy (complete re-build)
Re-index all documents. Once the new index is built, the old index is discarded and user queries are answered entirely from the new index. During the rebuild, the old index must still be maintained to answer user queries.
Re-merge strategy (Re-merge)
When a new document enters the search system, its information is recorded in a temporary inverted index maintained in memory. When the number of new documents reaches a threshold, or a specified amount of memory has been used, the temporary index is merged with the inverted index of the old documents to produce a new index. The process is shown in the figure:
Update steps:
1. When a new document enters the system, it is parsed and the temporary index in memory is updated: for each word in the document, a posting is appended to the end of that word's inverted list. This temporary index can be called the incremental index.
2. Once the incremental index has used up the specified memory, its contents are merged with the old inverted index.
Why it is efficient: the words in the index are sorted in dictionary order from low to high, so when the old inverted index is traversed its file can be read sequentially, which reduces disk seek time.
Cons: Because a new inverted index file is generated, inverted lists in the old index that have not changed must also be read and written to the new index, which increases I/O.
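The merge step of the re-merge strategy is essentially a two-way merge over word-sorted indexes. The sketch below assumes each index is a list of (word, document IDs) pairs sorted by word; the function name `remerge` is hypothetical.

```python
def remerge(old_index, delta_index):
    """Merge two word-sorted indexes into a new one in a single sequential pass."""
    merged, i, j = [], 0, 0
    while i < len(old_index) and j < len(delta_index):
        old_word, old_docs = old_index[i]
        new_word, new_docs = delta_index[j]
        if old_word == new_word:
            merged.append((old_word, old_docs + new_docs))
            i += 1
            j += 1
        elif old_word < new_word:
            merged.append((old_word, old_docs))  # unchanged, yet still rewritten
            i += 1
        else:
            merged.append((new_word, new_docs))
            j += 1
    merged.extend(old_index[i:])
    merged.extend(delta_index[j:])
    return merged

old = [("apple", [1, 2]), ("cat", [1])]
delta = [("bear", [3]), ("cat", [3, 4])]
new_index = remerge(old, delta)
```

The entry for "apple" illustrates the drawback: its inverted list did not change, but it is still copied into the new index.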
In-place update policy (in-place)
The in-place update strategy sets out to address the drawback of the re-merge strategy.
When indexes are merged, no new index file is generated. Instead, the inverted list of each word in the incremental index is appended to the end of that word's inverted list in the old index. Only the words that appear in the incremental index are updated; the information for all other words stays unchanged.
To support appending, the in-place update policy reserves some disk space at the end of each word's inverted list when the index is first built, so that the incremental index can be appended into the reserved space during a merge. As shown in the figure:
Experimental data show that the in-place update strategy updates the index less efficiently than the re-merge strategy, for two reasons: 1. Inverted lists have to be migrated when the reserved space is insufficient, so this strategy must maintain and manage free disk space, which is very costly. 2. During migration, some words and their inverted lists are moved out of the old index, which breaks the contiguity of words, so a mapping table from each word to the location of its inverted list must be maintained. This slows disk reads and consumes a lot of memory (to store the mapping information).
Hybrid strategy (Hybrid)
Words are classified by their properties, and different classes use different index update strategies. A common practice is to classify by inverted list length: some words appear in many documents and so have long inverted lists, while others are rare and have short ones. Words are thus divided into long-list words and short-list words. Long-list words use the in-place update strategy, and short-list words use the re-merge strategy.
Because reading and writing a long inverted list costs far more than a short one, the in-place update strategy saves disk reads and writes for long-list words. The many short-list words are cheap to read and write, so the re-merge strategy can take full advantage of sequential I/O.
Query processing
Once the index is built, how is the inverted index used to answer user queries? There are three query processing mechanisms.
One document at a time (Doc at a time)
Taking the documents contained in the inverted lists as the unit, compute the final similarity score between one document and the query, then move on to the next document, until every document has been scored. Then sort by score and output the K highest-scoring documents as the search results, which completes the response to one user query. In real implementations, only a priority queue of size K needs to be kept in memory. The figure shows the one-document-at-a-time computation:
The dashed arrows mark the direction in which the computation progresses. For document 1, because it appears in the inverted lists of both query words, its similarity to each query word can be computed from parameters such as TF and IDF, and the two scores are then added to give Score1, the similarity between document 1 and the user query. Other documents are scored the same way. Finally, the K highest-scoring documents are output as the search results.
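A rough sketch of one-document-at-a-time scoring with a size-K priority queue follows. It assumes each posting already carries a precomputed partial score (a stand-in for a TF/IDF-based value); the data and the function name `doc_at_a_time` are hypothetical.

```python
import heapq

def doc_at_a_time(postings, query_words, k):
    """postings: word -> list of (doc_id, partial_score), sorted by doc_id."""
    lists = [postings[w] for w in query_words if w in postings]
    cursors = [0] * len(lists)
    top_k = []  # min-heap of (score, doc_id), never larger than k
    while True:
        heads = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not heads:
            break
        doc_id = min(heads)  # next document to score completely
        score = 0.0
        for i, lst in enumerate(lists):
            c = cursors[i]
            if c < len(lst) and lst[c][0] == doc_id:
                score += lst[c][1]
                cursors[i] += 1
        if len(top_k) < k:
            heapq.heappush(top_k, (score, doc_id))
        else:
            heapq.heappushpop(top_k, (score, doc_id))  # drop the current lowest
    return [(doc_id, score) for score, doc_id in sorted(top_k, reverse=True)]

# Hypothetical partial scores, chosen only to exercise the code.
postings = {
    "search": [(1, 1.0), (3, 2.0), (4, 0.5)],
    "engine": [(1, 2.0), (2, 1.0), (3, 0.5)],
}
result = doc_at_a_time(postings, ["search", "engine"], k=2)
```

Each document's score is final before the next document is touched, so no per-document accumulator table is needed, only the heap.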
One word at a time (Term at a time)
Unlike one document at a time, one word at a time takes a "horizontal first, then vertical" approach. It first computes a partial similarity score for every document ID in the inverted list of one word, that is, it moves horizontally in the word-document matrix. After all documents in that word's inverted list have been processed, it moves on to the document IDs in the next word's inverted list, that is, it moves vertically, and if a document ID already has a score, the new score is added to it. When all words have been processed, every document has its final similarity score; the documents are then sorted by score and the K highest-scoring documents are output as the search results. The figure shows the one-word-at-a-time computation:
The dashed arrows indicate the direction of the computation. A hash table in memory holds the intermediate and final scores. At query time, the similarity score of document 1 for "search engine" is computed from parameters such as TF and IDF, and the score is stored in the hash table under the document ID. After the other documents in that list have been processed in turn, scoring for the next word ("technology") begins. For document 1, after its score for this word is computed, document 1 and its existing score are looked up in the hash table, the two scores are added, and the hash table entry for document 1 is updated with the sum, which is the final similarity score between document 1 and the user query. Other documents are handled the same way. Finally, the results are sorted and the K highest-scoring documents are output as the search results.
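The hash-table accumulation described above can be sketched in a few lines. As an assumption, each posting carries a precomputed partial score in place of a real TF/IDF computation, and the name `term_at_a_time` is hypothetical.

```python
import heapq

def term_at_a_time(postings, query_words, k):
    """postings: word -> list of (doc_id, partial_score)."""
    accumulators = {}  # the hash table: doc_id -> accumulated score
    for word in query_words:                          # one inverted list at a time
        for doc_id, partial in postings.get(word, []):
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + partial
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])

# Hypothetical partial scores, chosen only to exercise the code.
postings = {
    "search": [(1, 1.0), (3, 2.0), (4, 0.5)],
    "engine": [(1, 2.0), (2, 1.0), (3, 0.5)],
}
top = term_at_a_time(postings, ["search", "engine"], k=2)
```

Compared with one document at a time, the code is shorter but the accumulator table must hold a score for every candidate document until the last word is processed.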
Skip pointers
Basic idea: split an inverted list into several fixed-size data blocks, treat each block as a group, and add meta information to each block recording information about it. Even when the inverted list is compressed, this brings two benefits when merging inverted lists:
1. Not all postings need to be decompressed; decompressing part of the data is enough.
2. There is no need to compare every pair of document IDs.
The figure shows the inverted list for the query word "Google" after skip pointers are added.
Suppose the block size is 3 for the inverted list of the word "Google". Management information is added before each block; for example, the management information of the first block is <5,Pos1>, where 5 is the first document ID in the block and Pos1 is the skip pointer, pointing to the start of the second block. Suppose we want to find document ID 7 in the compressed inverted list of "Google". First, decompress the first two values of the inverted list to read the first skip pointer entry, <5,Pos1>. Pos1 gives the position of the second block's skip pointer entry in the inverted list, so decompressing two consecutive values at Pos1 yields <13,Pos2>. 5 and 13 are the smallest document IDs in the two blocks (that is, the first document ID of each block), and we are looking for 7, so if document 7 is in the inverted list of "Google" it must be in the first block; otherwise the list does not contain it. After decompressing the first block, the original document IDs are restored from the smallest document ID: the entry <2,1> here corresponds to the original document ID 5+2=7, which matches the ID we are looking for, so document 7 is in the inverted list of "Google" and the search can end.
As this search shows, only one data block needs to be decompressed and searched to get the result, rather than the whole list, which clearly speeds up lookup and saves memory.
Disadvantage: the number of pointer comparison operations increases.
Practice shows that if the inverted list has length L (that is, it contains L document IDs), using the square root of L as the block size works well.
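The lookup walked through above can be sketched as follows. Several simplifications are assumed: each block is stored as its first document ID plus a list of gaps (a stand-in for real compression), a block's position in a Python list replaces the byte offset a real skip pointer would hold, and every document ID other than 5, 7, and 13 is made up for the demo.

```python
import math

def build_blocks(doc_ids, block_size=None):
    """Split a sorted doc ID list into blocks of (first_id, gaps)."""
    if block_size is None:
        block_size = max(1, math.isqrt(len(doc_ids)))  # sqrt(L) heuristic
    blocks = []
    for i in range(0, len(doc_ids), block_size):
        chunk = doc_ids[i:i + block_size]
        gaps = [b - a for a, b in zip(chunk, chunk[1:])]
        blocks.append((chunk[0], gaps))
    return blocks

def contains(blocks, target):
    # Use only the block headers to find the one block that may hold target.
    candidate = None
    for first_id, gaps in blocks:
        if first_id > target:
            break
        candidate = (first_id, gaps)
    if candidate is None:
        return False
    current, gaps = candidate
    if current == target:
        return True
    for gap in gaps:          # "decompress" just this block
        current += gap
        if current == target:
            return True
        if current > target:
            return False
    return False

doc_ids = [5, 7, 9, 13, 14, 17, 20, 25, 31]  # hypothetical list for "Google"
blocks = build_blocks(doc_ids, block_size=3)
print(contains(blocks, 7))
```

With 9 document IDs the square-root heuristic also gives a block size of 3, so `build_blocks(doc_ids)` produces the same blocks.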
Multi-field Index
That is, several fields of a document are indexed. There are three ways to implement a multi-field index: the multi-index method, the inverted list method, and the extended list method.
Multi-index method
Build a separate index for each field. When the user specifies a field as the search scope, results are taken from the corresponding index. When the user does not specify a field, the search engine searches all fields and merges the relevance scores from the multiple fields, which is less efficient. The multi-index method is shown in the figure:
Inverted list method
Store field information in each keyword's inverted list by appending it to the end of every posting. When the inverted list of a query keyword is read, the field information shows whether the keyword appears in a given field, so results can be filtered accordingly. The inverted list method is shown in the figure:
Extended list method
This is the more widely used way to support multi-field indexes. Build a list for each field that records, for every document, the positions at which that field occurs. The figure shows an extended list:
For simplicity, an extended list is built only for the title field. For example, the first entry <1,(1,4)> means that in document 1 the title spans the 1st to the 4th word; the other entries have similar meanings.
At query time, suppose the user searches for "search engine" in the title field. The inverted list shows that documents 1, 3, and 4 contain the query word, so the next step is to decide whether the word appears in the title field of each. In document 1, "search engine" appears at positions 6 and 10. The title extended list gives the title range of document 1 as 1 to 4, so the title of document 1 does not contain the query word and document 1 does not qualify. In document 3, "search engine" appears at positions 2, 8, and 15, and the title extended list gives a title range of 1 to 3, so the occurrence at position 2 falls inside the title; document 3 qualifies and can be output as a search result. Document 4 is handled the same way.
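The title check in this walkthrough amounts to a range test per document. In the sketch below the values for documents 1 and 3 come from the text, while the positions and title range for document 4 are invented, since the article does not give them; `field_matches` is a hypothetical name.

```python
def field_matches(word_positions, field_ranges):
    """Return doc IDs where at least one occurrence falls inside the field.

    word_positions: doc_id -> positions of the query word (from the inverted list)
    field_ranges:   doc_id -> (start, end) of the field (from the extended list)
    """
    hits = []
    for doc_id, positions in word_positions.items():
        if doc_id not in field_ranges:
            continue
        start, end = field_ranges[doc_id]
        if any(start <= p <= end for p in positions):
            hits.append(doc_id)
    return sorted(hits)

word_positions = {1: [6, 10], 3: [2, 8, 15], 4: [1, 9]}  # doc 4 is made up
title_ranges = {1: (1, 4), 3: (1, 3), 4: (1, 5)}         # doc 4 is made up
print(field_matches(word_positions, title_ranges))
```

Document 1 is rejected because neither position 6 nor 10 lies in its title range, matching the reasoning above.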
Phrase Query
The essence of phrase queries is how to maintain the order or position information of words in the index. Common techniques for supporting phrase queries include the position index, the two-word index, and the phrase index. The three can also be combined.
Position information index (Position index)
Recording word positions in the index makes phrase queries easy to support, but the storage and computation costs are high. For example:
<5,2,[3,7]>
This means that document 5 contains the word "love", which appears 2 times in the document, at positions 3 and 7; other entries read the same way.
At query time, the inverted lists show that documents 5 and 9 both contain the two query words. To decide whether the query occurs as a phrase in each document, the position information must be checked. "Love" occurs at positions 3 and 7 in document 5, and "buy and sell" occurs at position 4, so positions 3 and 4 of document 5 hold "love" and "buy and sell" respectively, which means the two words form a phrase there. The same analysis shows that document 9 does not contain the phrase, so document 5 is returned as the search result.
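The adjacency test for a two-word phrase can be sketched like this. The positions for document 5 follow the text; the positions for document 9 are invented, because the article does not list them. The function name `phrase_match` is hypothetical.

```python
def phrase_match(first_postings, second_postings):
    """Each argument maps doc_id -> list of positions of one query word.

    A document matches when the second word occurs right after the first.
    """
    result = []
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        second_positions = set(second_postings[doc_id])
        if any(p + 1 in second_positions for p in first_postings[doc_id]):
            result.append(doc_id)
    return result

love = {5: [3, 7], 9: [2]}         # document 9 positions are made up
buy_and_sell = {5: [4], 9: [8]}    # document 9 positions are made up
print(phrase_match(love, buy_and_sell))
```

Both documents contain both words, but only document 5 has them at adjacent positions (3 and 4).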
Two-word index (Nextword index)
Statistics show that two-word phrases make up the largest share of phrases, so providing fast lookup for two-word phrases solves most of the phrase query problem. But doing so makes the number of inverted lists explode. The two-word index data structure is shown in the figure:
As the figure shows, memory holds two dictionaries, the "first word" dictionary and the "next word" dictionary. Each entry in the "first word" dictionary has a pointer to a location in the "next word" dictionary, which stores the second words of common phrases that follow that first word. Each entry in the "next word" dictionary in turn points to the inverted list of the corresponding phrase. For example, the inverted list of the phrase "my" contains documents 5 and 7, and the inverted list of the phrase "father" contains document 5; the other dictionary entries have similar meanings.
At query time, when the user enters "my father", the search engine splits it into the two phrases "my" and "father" and looks them up in the dictionaries. It finds that "my" occurs in documents 5 and 7 and "father" occurs in document 5. Checking the corresponding positions then shows that document 5 is a qualifying result, which completes support for the phrase query.
A two-word index can make the index grow sharply, so implementations generally do not build it for all words, only for phrases that are computationally expensive.
Phrase index (Phrase index)
Add multi-word phrases directly to the dictionary and maintain their inverted lists. The drawback is that not all phrases can be indexed in advance; the common practice is to mine popular phrases. The figure shows the overall index structure after a phrase index is added:
At query time, when the search engine receives a user query it first looks in the phrase index. If the phrase is found, the results are computed and returned to the user; otherwise the query is processed with the regular index.
Hybrid method
The three methods can be combined so that each plays to its strengths. The index structure that mixes the three is shown in the figure:
The phrase index covers popular and high-frequency phrases, and the two-word index covers high-cost phrases, including those that contain stop words.
At query time, the system first looks in the phrase index and returns the result if the phrase is found; otherwise it looks in the two-word index and returns the result if found; otherwise it processes the phrase with the regular index.
Distributed index (Parallel indexing)
When a search engine has to handle a very large document collection, a distributed solution is needed. Each machine maintains part of the whole index, and multiple machines cooperate to build the index and answer queries.
Partition by document (document partitioning)
The whole document collection is split into sub-collections, and each machine is responsible for indexing one sub-collection and answering queries on it. Partitioning by document is shown in the figure:
How it works: after the query distributor receives a user query, it broadcasts the query to all index servers. Each index server is responsible for index maintenance and query response for part of the document collection. When an index server receives the query, it scores the relevant documents and sends its K highest-scoring documents back to the query distributor. The query distributor combines the results from all index servers, merges them, and returns the m highest-scoring documents to the user as the final search results.
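The broadcast-and-merge flow can be sketched as below. As a simplifying assumption, each index server is represented by a dict of already computed scores for the current query, and the loop stands in for a network broadcast; the function names are hypothetical.

```python
import heapq

def search_shard(shard_scores, k):
    """One index server: return its k highest-scoring (doc_id, score) pairs."""
    return heapq.nlargest(k, shard_scores.items(), key=lambda item: item[1])

def distribute_query(shards, k, m):
    """Query distributor: collect each server's top k, then keep the top m."""
    partial = []
    for shard in shards:  # stands in for broadcasting to all index servers
        partial.extend(search_shard(shard, k))
    return heapq.nlargest(m, partial, key=lambda item: item[1])

# Hypothetical per-server scores for one query.
shards = [
    {1: 0.9, 2: 0.1, 3: 0.5},
    {4: 0.7, 5: 0.8},
]
final = distribute_query(shards, k=2, m=3)
```

Because each document lives on exactly one server, the per-server scores are final and the distributor only has to merge ranked lists.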
Partition by word (term partitioning)
Each index server is responsible for building and maintaining the inverted lists of some of the words in the dictionary. Partitioning by word is shown in the figure:
How it works: one word at a time. Suppose the query contains three words, A, B, and C. When the query distributor receives the query, it forwards it to index server node 1, which holds the inverted list of word A. Node 1 reads A's inverted list and computes intermediate search results, then passes the query and the intermediate results to the index server node that holds the inverted list of word B. Index server node 2 processes them similarly and passes them on to index server node 3. The final result is returned to the query distributor, which outputs the K highest-scoring documents as the search results.
Comparison of the two schemes
Partitioning by document is the common choice; partitioning by word is used only in special applications. The shortcomings of partitioning by word are:
Scalability
The documents a search engine handles change constantly. With partitioning by document, you only need to add index servers, which is easy to operate. With partitioning by word, a change affects almost every index server directly, because a new document may contain words from across the whole dictionary, so the inverted list of each of those words has to be updated, which makes the implementation relatively complex.
Load balancing
The inverted lists of common words are very large, up to tens of MB. With partitioning by document, such a list is spread fairly evenly across the index servers; with partitioning by word, the entire inverted list of a common word is maintained by one index server. If that word is also a popular query term, the server becomes an overloaded performance bottleneck.
Fault tolerance
Suppose a server fails. With partitioning by document, only one sub-collection of documents is affected, and the other index servers can still respond. With partitioning by word, the inverted lists of some words become inaccessible, and users who query those words get no search results, which directly affects the user experience.
Support for query processing methods
Partitioning by word supports only one-word-at-a-time query processing, whereas partitioning by document has no such limitation.
Summary
Understanding the data structures and algorithms used by search engines gives a deeper understanding of how they work. For Sphinx, a production environment can combine incremental indexing with a full index to achieve near-real-time results.
Because my fundamentals are weak, I spent most of a month rereading the third chapter several times before I understood it, and I came to appreciate how important data structures and algorithms really are. Although daily work seldom uses them directly, knowing the common data structures and algorithms gives you more ways to approach a problem when one comes up. Steady accumulation pays off over time.
Search engine index data structure and algorithm