Index of the search engine
1. Word-document matrix
The word-document matrix is a conceptual model of the containment relationship between words and documents; Figure 3-1 illustrates it. Each column in Figure 3-1 represents a document, each row represents a word, and a tick marks a cell where the document contains the word.
Figure 3-1 Word-document matrix
Read vertically (by document), each column shows which words a document contains: document 1, for example, contains word 1 and word 4 and no other words. Read horizontally (by word), each row shows which documents contain a given word: word 1 appears in document 1 and document 4, and no other document contains it. The remaining rows and columns of the matrix can be read in the same way.
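The two ways of reading the matrix can be sketched in a few lines of Python. This is only an illustration: the entries for document 1 and word 1 follow the text, while all other entries, and the helper names `words_in_document` and `documents_with_word`, are assumptions made for the example.

```python
# Hypothetical containment data. Only the entries for document 1 and word 1
# follow the text; every other entry is invented for illustration.
doc_words = {
    1: {1, 4},
    2: {2, 3},
    3: {3, 5},
    4: {1, 2},
    5: {4, 5},
}
doc_ids = sorted(doc_words)
word_ids = sorted(set().union(*doc_words.values()))

# matrix[i][j] is True when word_ids[i] occurs in doc_ids[j]
matrix = [[w in doc_words[d] for d in doc_ids] for w in word_ids]

def words_in_document(doc_id):
    """Read one column: the words a document contains."""
    j = doc_ids.index(doc_id)
    return [w for w, row in zip(word_ids, matrix) if row[j]]

def documents_with_word(word_id):
    """Read one row: the documents that contain a word."""
    i = word_ids.index(word_id)
    return [d for d, hit in zip(doc_ids, matrix[i]) if hit]

print(words_in_document(1))    # [1, 4]
print(documents_with_word(1))  # [1, 4]
```

Storing the full matrix is wasteful for real collections because most cells are empty, which is one reason practical indexes store only the non-empty entries.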
The index of a search engine is the concrete data structure that implements the word-document matrix. There are several ways to implement this conceptual model, such as the inverted index, the signature file, and the suffix tree. Experimental data show, however, that the inverted index is the best way to implement the word-to-document mapping, so this chapter focuses on the technical details of the inverted index.
2. Basic concepts of the inverted index
Document: Search engines typically process Internet web pages, but the concept of a document is broader: it denotes any storage object that exists in text form. Compared with web pages it covers more forms. Files in formats such as Word, PDF, HTML, and XML can all be called documents, and so can an email, a text message, or a microblog post. In the rest of the book, "document" is often used to refer to a piece of text information.
Document Collection: A set of documents is called a document collection. A massive number of Internet web pages or a large number of emails are concrete examples of document collections.
Document ID: Inside the search engine, each document in the document collection is assigned a unique internal number that serves as its unique identifier and simplifies internal processing. This internal number is called the document ID. The text below sometimes uses DocId as shorthand for the document ID.
Word ID: As with document IDs, the search engine internally represents each word by a unique number, and the word ID serves as the unique identifier of a word.
Inverted Index: An inverted index is a concrete storage form that implements the word-document matrix. Given a word, the inverted index makes it possible to quickly obtain the list of documents that contain it. An inverted index consists mainly of two parts: the word dictionary and the inverted file.
Word Dictionary (Lexicon): The usual index unit of a search engine is the word. The word dictionary is the set of strings formed by all the words that appear in the document collection. Each index entry in the word dictionary records some information about the word itself and a pointer to its inverted list.
Inverted List (posting list): The inverted list of a word records all the documents in which the word appears, along with the positions at which it occurs in each document; each record is called a posting. From the inverted list you can tell which documents contain the word.
Inverted File: The inverted lists of all words are usually stored sequentially in a file on disk. This file is called the inverted file, and it is the physical file that stores the inverted index.
The relationships among these concepts are shown in Figure 3-2.
Figure 3-2 Inverted index basic concepts
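To make the relationship between the word dictionary and the inverted file concrete, here is a minimal sketch in Python. It assumes a made-up on-disk layout (each DocId packed as a 4-byte little-endian integer) and hypothetical function names `write_inverted_file` and `read_postings`; real systems use their own formats and usually compress the lists.

```python
import struct

def write_inverted_file(postings_by_word):
    """Pack every inverted list back to back; return (file_bytes, lexicon).

    The lexicon maps each word to (byte offset, number of postings), which
    plays the role of the dictionary's pointer into the inverted file.
    """
    blob = bytearray()
    lexicon = {}
    for word in sorted(postings_by_word):
        doc_ids = postings_by_word[word]
        lexicon[word] = (len(blob), len(doc_ids))
        blob += struct.pack(f"<{len(doc_ids)}I", *doc_ids)
    return bytes(blob), lexicon

def read_postings(file_bytes, lexicon, word):
    """Follow the dictionary pointer and decode the word's inverted list."""
    if word not in lexicon:
        return []
    offset, count = lexicon[word]
    return list(struct.unpack_from(f"<{count}I", file_bytes, offset))

# Hypothetical inverted lists, invented for illustration.
inverted_file, lexicon = write_inverted_file({"engine": [1, 3], "search": [1, 2, 4]})
print(lexicon["search"])                               # (8, 3)
print(read_postings(inverted_file, lexicon, "search"))  # [1, 2, 4]
```

In this sketch the bytes object stands in for the file on disk; a lookup touches only the dictionary entry and the slice of the file it points to.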
3. A simple inverted index example
The logical structure and basic idea of an inverted index are very simple. The concrete example below is meant to give readers a direct, high-level impression of the inverted index.
Suppose the document collection contains five documents, whose contents are shown in Figure 3-3; the leftmost column in the figure is the document ID of each document. Our task is to build an inverted index over this document collection.
Figure 3-3 Document Collection
Unlike English and similar languages, Chinese has no explicit separators between words, so a word segmentation system must first cut each document automatically into a sequence of words. Each document is thus converted into a data stream of words. For ease of subsequent processing, each distinct word is assigned a unique word ID, and at the same time the index records which documents contain that word. At the end of this process we obtain the simplest inverted index (see Figure 3-4). In Figure 3-4, the Word ID column records the ID of each word, the second column is the corresponding word, and the third column is the word's inverted list. For example, the word "Google" has word ID 1 and inverted list {1,2,3,4,5}, which means that every document in the collection contains this word.
Figure 3-4 A simple inverted index
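The construction just described can be sketched as follows. This is a minimal illustration, not the book's implementation: it assumes the documents have already been segmented into word lists, and the sample documents and the name `build_simple_index` are invented for the example (they are not the contents of Figure 3-3).

```python
def build_simple_index(docs):
    """docs maps DocId -> list of words (already segmented).

    Returns (word_ids, postings): word -> word ID, and
    word ID -> ascending list of DocIds that contain the word.
    """
    word_ids = {}   # word IDs are assigned in order of first appearance
    postings = {}
    for doc_id in sorted(docs):
        for word in docs[doc_id]:
            wid = word_ids.setdefault(word, len(word_ids) + 1)
            plist = postings.setdefault(wid, [])
            # Record each document at most once per word.
            if not plist or plist[-1] != doc_id:
                plist.append(doc_id)
    return word_ids, postings

# Hypothetical, already-segmented documents.
docs = {
    1: ["search", "engine", "index"],
    2: ["search", "index", "search"],
    3: ["engine", "query"],
}
word_ids, postings = build_simple_index(docs)
print(word_ids["search"], postings[word_ids["search"]])  # 1 [1, 2]
print(word_ids["query"], postings[word_ids["query"]])    # 4 [3]
```

Processing documents in ascending DocId order keeps every inverted list sorted, which is what makes the duplicate check against the last element sufficient.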
The inverted index in Figure 3-4 is the simplest possible because the index only records which documents contain a word; in practice, the index can record more information than that. Figure 3-5 shows a somewhat more complex inverted index. Compared with the basic index in Figure 3-4, the inverted list of each word records not only the document ID but also word frequency information (TF), that is, the number of times the word occurs in a document. This information is recorded because word frequency is an important factor when computing the similarity between a query and a document during search result ranking, so storing it in the inverted list makes later score calculation easier. In the example in Figure 3-5, the word "founder" has word ID 7 and its inverted list is (3:1), where 3 means that the document with ID 3 contains the word and 1 is the word frequency, meaning that the word occurs only once in document 3. The inverted lists of the other words are read the same way.
Figure 3-5 Inverted index with word frequency information
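Adding word frequency to each posting is a small change to the construction. The sketch below assumes pre-segmented documents; the sample data and the name `build_tf_index` are hypothetical and are not taken from Figure 3-5.

```python
from collections import Counter

def build_tf_index(docs):
    """docs maps DocId -> list of words; returns word -> [(DocId, TF), ...]."""
    index = {}
    for doc_id in sorted(docs):
        # Counter gives the number of occurrences of each word in this document.
        for word, tf in Counter(docs[doc_id]).items():
            index.setdefault(word, []).append((doc_id, tf))
    return index

# Hypothetical, already-segmented documents.
docs = {
    1: ["search", "engine", "index"],
    2: ["search", "index", "search"],
    3: ["engine", "query"],
}
tf_index = build_tf_index(docs)
print(tf_index["search"])  # [(1, 1), (2, 2)]
print(tf_index["query"])   # [(3, 1)]
```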
A practical inverted index can record still more information. In addition to the document ID and word frequency, the index in Figure 3-6 records two further kinds of information: the document frequency of each word (the third column of Figure 3-6), and, in the inverted list, the positions at which the word occurs in each document.
Figure 3-6 Inverted index with word frequency, document frequency, and occurrence location information
Document frequency is the number of documents in the collection that contain a word. It is recorded for the same reason as word frequency: it is an important factor in search result ranking. The positions of a word within a document do not have to be recorded; a real index may include or omit them, because a search system does not strictly need them. Position information is useful only for supporting phrase queries.
Take the word "la" as an example. Its word ID is 8 and its document frequency is 2, meaning that two documents in the collection contain the word. The corresponding inverted list is {(3;1;<4>), (5;1;<4>)}, which means that the word appears in document 3 and document 5 with a word frequency of 1 in each, and that in both documents it occurs at position 4, that is, the fourth word of the document is "la".
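A positional index of this kind, together with a simple two-word phrase check, might be sketched as below. The placeholder documents are invented so that "la" is the fourth word of documents 3 and 5, mirroring the inverted list in the text; the other tokens and the function names `build_positional_index`, `describe`, and `phrase_docs` are assumptions.

```python
def build_positional_index(docs):
    """docs maps DocId -> list of words; returns word -> {DocId: [positions]}.

    Positions are 1-based, matching the convention used in the text.
    """
    index = {}
    for doc_id in sorted(docs):
        for pos, word in enumerate(docs[doc_id], start=1):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def describe(index, word):
    """Return (document frequency, [(DocId, TF, positions), ...]) for a word."""
    by_doc = index.get(word, {})
    return len(by_doc), [(d, len(p), p) for d, p in by_doc.items()]

def phrase_docs(index, first, second):
    """Documents in which `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return hits

# Placeholder tokens, invented for illustration.
docs = {
    1: ["x", "y"],
    3: ["a", "b", "c", "la", "d"],
    5: ["e", "f", "g", "la"],
}
pos_index = build_positional_index(docs)
print(describe(pos_index, "la"))          # (2, [(3, 1, [4]), (5, 1, [4])])
print(phrase_docs(pos_index, "c", "la"))  # [3]
```

The phrase check shows why positions matter: knowing that two words share a document is not enough to tell whether they are adjacent.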
The inverted index in Figure 3-6 is already a fairly complete index. The index structure of a real search system is essentially the same; the only difference is which concrete data structures are used to implement this logical structure.
With this index, a search engine can easily respond to user queries. Suppose the user enters the query word "Facebook". The search system looks the word up in the inverted index and reads out the documents that contain it; these documents are the candidate search results. Word frequency and document frequency information can then be used to rank the candidates: the system computes the similarity between each document and the query and outputs the results in descending order of similarity score. This is part of the internal workflow of a search system; the specific implementation is described in detail in Chapter 5.
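The query flow can be sketched for a single query word as follows. The text defers the actual similarity computation to Chapter 5, so the scoring here is an assumption: a simple TF-IDF style weight chosen only to show how word frequency and document frequency can both feed into ranking. The postings for "facebook" and the function name `search` are invented for the example.

```python
import math

def search(index, num_docs, word):
    """index maps word -> [(DocId, TF), ...]; returns DocIds ranked by score."""
    postings = index.get(word, [])
    if not postings:
        return []  # no document contains the word
    df = len(postings)  # document frequency
    # Assumed weighting, not taken from the text: rarer words score higher.
    idf = math.log(num_docs / df) + 1.0
    scored = [(tf * idf, doc_id) for doc_id, tf in postings]
    # Highest score first; ties broken by ascending DocId.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored]

# Hypothetical inverted list with word frequencies.
index = {"facebook": [(1, 1), (2, 3), (4, 2)]}
print(search(index, 5, "facebook"))  # [2, 4, 1]
print(search(index, 5, "missing"))   # []
```

For a single query word the document frequency factor is the same for every candidate, so the order is decided by word frequency alone; document frequency starts to matter once a query contains several words of differing rarity.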
4. Word dictionary
The word dictionary is an important part of the inverted index. It maintains information about every word that appears in the document collection and records where the inverted list of each word is located in the inverted file. At search time, the user's query words are looked up in the word dictionary to obtain the corresponding inverted lists, which serve as the basis for subsequent ranking.
A large document collection may contain hundreds of thousands or even millions of distinct words, and how quickly a word can be located directly affects search response time. Efficient data structures are therefore needed to build and search the word dictionary. Commonly used structures include the hash plus linked list structure and the tree-shaped dictionary structure.
4.1 Hash plus linked list
Figure 1-7 shows the structure of this kind of dictionary. It consists of two parts:
The main part is a hash table. Each hash table entry holds a pointer to a collision list, and within a collision list, the words that share a hash value form a linked list. Collision lists exist because two different words may receive the same hash value; in hashing this is called a collision, and words with the same hash value are stored in the same linked list for later lookup.
Figure 1-7 Hash plus linked list dictionary structure
The dictionary is built up as the index is built. For example, when a new document is parsed, for each word t that appears in it, a hash function is first applied to obtain its hash value; the pointer stored in the corresponding hash table entry is then read to find the collision list. If the word is already in the collision list, it has appeared in a previously parsed document. If it is not found in the collision list, the word is being encountered for the first time and is added to the list. Once all the documents in the collection have been parsed, the dictionary structure is complete.
Responding to a user query is similar to building the dictionary, except that a query word is never added to the dictionary, even if the dictionary does not contain it. Taking Figure 1-7 as an example, suppose the user's query contains word 3. The word is hashed, which locates slot 2 of the hash table, and the pointer stored there gives the collision list. Word 3 is then compared with the words in the collision list in turn and is found there, so the corresponding inverted list can be read for subsequent processing. If the word is not found, no document in the collection contains it and the search result is empty.
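The build-time and query-time behavior described above can be sketched as a small chained hash table. This is an illustration under stated assumptions: the byte-sum hash function, the `postings_offset` field, and the class names `Entry` and `HashDictionary` are all invented for the example, and a production dictionary would use a much better hash function.

```python
class Entry:
    """One node of a collision list."""
    def __init__(self, word, postings_offset, next_entry=None):
        self.word = word
        self.postings_offset = postings_offset  # hypothetical pointer into the inverted file
        self.next = next_entry

class HashDictionary:
    def __init__(self, num_slots=8):
        self.slots = [None] * num_slots  # each slot points to a collision list

    def _slot(self, word):
        # Toy hash function (an assumption): byte sum modulo table size.
        return sum(word.encode("utf-8")) % len(self.slots)

    def lookup(self, word):
        """Query time: walk the collision list; never modifies the dictionary."""
        entry = self.slots[self._slot(word)]
        while entry is not None:
            if entry.word == word:
                return entry
            entry = entry.next
        return None

    def add(self, word, postings_offset):
        """Index time: insert the word only when it is seen for the first time."""
        entry = self.lookup(word)
        if entry is None:
            slot = self._slot(word)
            entry = Entry(word, postings_offset, self.slots[slot])
            self.slots[slot] = entry
        return entry

word_dict = HashDictionary(num_slots=4)
word_dict.add("ab", 0)
word_dict.add("ba", 16)  # same byte sum as "ab", so it lands in the same slot
word_dict.add("ab", 99)  # already present: the existing entry is kept
print(word_dict.lookup("ba").postings_offset)  # 16
print(word_dict.lookup("zz"))                  # None
```

Keeping `lookup` free of side effects matches the point made in the text: a query for an unknown word reports an empty result and leaves the dictionary unchanged.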
4.2 Tree structure
The B-tree (or B+ tree) is another efficient lookup structure; Figure 1-8 shows a B-tree. Unlike hashing, a B-tree requires dictionary items to be ordered by size (numeric or character order), whereas hashing places no such requirement on the data.
A B-tree forms a hierarchical lookup structure. Intermediate nodes indicate which subtree stores the dictionary items in a given ordered range, and so guide the search by comparing dictionary items. The leaf nodes at the bottom store the address information of words, from which the word string can be retrieved.
Figure 1-8 B-Tree lookup structure
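The navigation idea can be illustrated with a deliberately simplified stand-in. The sketch below is not a full B-tree: it is a static, two-level sorted structure with no insertion, splitting, or rebalancing, in which one level of separator keys directs the search to a leaf block. The sample words, addresses, and the names `build_tree` and `lookup` are assumptions made for the example.

```python
from bisect import bisect_right

def build_tree(sorted_items, leaf_size=2):
    """sorted_items: list of (word, address) pairs sorted by word.

    Returns (separators, leaves). separators[i] is the smallest word stored
    in leaves[i + 1], so it tells the search which leaf covers a given range.
    """
    leaves = [sorted_items[i:i + leaf_size]
              for i in range(0, len(sorted_items), leaf_size)]
    separators = [leaf[0][0] for leaf in leaves[1:]]
    return separators, leaves

def lookup(tree, word):
    """Compare against the separators to pick a leaf, then scan that leaf."""
    separators, leaves = tree
    if not leaves:
        return None
    leaf = leaves[bisect_right(separators, word)]
    for key, address in leaf:
        if key == word:
            return address
    return None

# Hypothetical dictionary items: (word, address of the word's information).
items = [("apple", 0), ("banana", 10), ("cherry", 20), ("grape", 30), ("melon", 40)]
tree = build_tree(items)
print(lookup(tree, "grape"))  # 30
print(lookup(tree, "kiwi"))   # None
```

The lookup relies on the items being sorted, which is the ordering requirement that distinguishes tree-shaped dictionaries from hashing; a real B-tree applies the same comparison step at every level of a multi-level tree.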
Inverted index principles and examples