Search engine index
1. Word-document Matrix
The word-document matrix is a conceptual model that expresses the containment relationship between words and documents. Figure 3-1 illustrates it: each column represents a document, each row represents a word, and a check mark means that the document contains the word.
Figure 3-1 word-document Matrix
Read vertically, that is, along the document dimension, each column shows which words a document contains. For example, document 1 contains word 1 and word 4 but no other words. Read horizontally, along the word dimension, each row shows which documents contain a word. For example, word 1 appears in documents 1 and 4, and no other document contains it. The other rows and columns of the matrix can be read in the same way.
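As a minimal sketch of this conceptual model, the matrix can be written as a table of booleans. Only the entries for document 1 and word 1 follow the text; the remaining check marks and the names `DOCS`, `MATRIX`, `words_in`, and `docs_with` are made up for illustration, because Figure 3-1 is not reproduced here.

```python
# Rows are words, columns are documents; True plays the role of a check mark.
# Only row "word1" and column "doc1" follow the text; other cells are hypothetical.
DOCS = ["doc1", "doc2", "doc3", "doc4", "doc5"]
MATRIX = {
    "word1": [True,  False, False, True,  False],
    "word2": [False, True,  True,  False, False],
    "word3": [False, False, True,  False, True],
    "word4": [True,  True,  False, False, False],
}

def words_in(doc):
    """Read one column: which words does this document contain?"""
    col = DOCS.index(doc)
    return [word for word, row in MATRIX.items() if row[col]]

def docs_with(word):
    """Read one row: which documents contain this word?"""
    return [doc for doc, present in zip(DOCS, MATRIX[word]) if present]

print(words_in("doc1"))    # ['word1', 'word4']
print(docs_with("word1"))  # ['doc1', 'doc4']
```

Storing the full matrix wastes space because most cells are empty, which is one reason a real index keeps only the check marks.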
A search engine index is a concrete data structure that implements the word-document matrix. The conceptual model can be implemented in different ways, such as the "inverted index", the "signature file", and the "suffix tree". However, experimental data shows that the inverted index is the best way to implement the mapping between words and documents, so this chapter mainly introduces the technical details of the inverted index.
2. Basic concepts of inverted Indexes
Document: the term covers many forms of text. Files in different formats, such as Word, PDF, HTML, and XML, are all documents, and an email, a text message, or a microblog post can also be called a document. In the rest of this book, "document" is often used to refer to any such piece of text information.
Document Collection: a set of documents is called a document collection. For example, a large number of Internet web pages or a large number of emails each form a document collection.
Document ID: inside the search engine, each document in the document collection is assigned a unique internal number that serves as its unique identifier. This internal number is called the "document number" and is sometimes written as DocID.
Word ID: similar to the document number, the search engine uses a unique number to represent each word. The word number serves as the unique identifier of the word.
Inverted Index: the inverted index is a concrete storage form that implements the word-document matrix. With an inverted index, the list of documents containing a given word can be obtained quickly. An inverted index consists of two main parts: the "word dictionary" and the "inverted file".
Word Dictionary (Lexicon): the index unit of a search engine is usually the word. The word dictionary is the set of strings of all words that have occurred in the document collection. Each entry in the word dictionary records some information about the word and a pointer to its inverted list.
Inverted List (Posting List): the inverted list records all documents in which a word occurs and the positions at which the word appears in each document. Each record is called a posting. From the inverted list, you can tell which documents contain a word.
Inverted File: the inverted lists of all words are usually stored one after another in a file on disk. This file is called the inverted file, and it is the physical file that stores the inverted index.
The relationship between these concepts can be clearly seen in Figure 3-2.
Figure 3-2 Basic inverted index concepts
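The relationship between the word dictionary, the inverted lists, and the inverted file can be sketched in a few lines. This is only an illustration under stated assumptions: the byte layout (4-byte little-endian document numbers), the `LexiconEntry` type, and the sample postings are hypothetical and not the format of any real engine.

```python
import io
import struct
from dataclasses import dataclass

@dataclass
class LexiconEntry:
    word_id: int   # unique word number
    offset: int    # where the word's inverted list starts in the inverted file
    count: int     # how many document numbers the list holds

def write_index(postings, out):
    """Write every inverted list to `out` in sequence and return the dictionary."""
    lexicon = {}
    for word_id, (word, doc_ids) in enumerate(sorted(postings.items()), start=1):
        lexicon[word] = LexiconEntry(word_id, out.tell(), len(doc_ids))
        out.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return lexicon

def read_postings(word, lexicon, inverted_file):
    """Follow the dictionary pointer into the inverted file."""
    entry = lexicon.get(word)
    if entry is None:
        return []
    inverted_file.seek(entry.offset)
    raw = inverted_file.read(4 * entry.count)
    return list(struct.unpack(f"<{entry.count}I", raw))

# Hypothetical postings; an in-memory buffer stands in for the file on disk.
inverted_file = io.BytesIO()
lexicon = write_index({"search": [1, 3], "engine": [1, 2, 5]}, inverted_file)
print(read_postings("engine", lexicon, inverted_file))  # [1, 2, 5]
print(read_postings("absent", lexicon, inverted_file))  # []
```

The dictionary stays small enough to search quickly, while the bulk of the data sits in the sequential file and is read only for the words a query actually needs.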
3. Simple inverted index example
The inverted index is simple in both logical structure and basic idea. The following example shows how one is built.
Assume that the document set contains five documents, as shown in Figure 3-3. The leftmost column in the figure shows the document number of each document. Our task is to build an inverted index for this document set.
Figure 3-3 Document Set
Unlike English, Chinese has no explicit separator between words, so a word segmentation system must first split each document into a sequence of words automatically. Each document is thus converted into a data stream of words. To make later processing easier, each distinct word is assigned a unique word number, and the system records which documents contain it. The result is the simplest inverted index (see Figure 3-4). In Figure 3-4, the "Word ID" column records the word number of each word, the second column is the corresponding word, and the third column is the inverted list of that word. For example, the word "google" has word number 1 and inverted list {1, 2, 3, 4, 5}, which indicates that every document in the document set contains this word.
Figure 3-4 simple inverted index
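The construction just described can be sketched as follows. The documents are hypothetical English stand-ins (Figure 3-3 is not reproduced here), and splitting on whitespace is an assumption that replaces the word segmentation step Chinese text would need.

```python
# Hypothetical stand-in documents: document number -> text.
DOCS = {
    1: "google maps founder leaves google",
    2: "google acquires startup",
    3: "founder joins google",
}

def tokenize(text):
    # Assumption: whitespace splitting; Chinese text would need a word segmenter.
    return text.lower().split()

def build_simple_index(docs):
    word_ids = {}   # word -> word number, assigned in order of first appearance
    postings = {}   # word number -> ascending list of document numbers
    for doc_id in sorted(docs):
        for word in tokenize(docs[doc_id]):
            wid = word_ids.setdefault(word, len(word_ids) + 1)
            plist = postings.setdefault(wid, [])
            if not plist or plist[-1] != doc_id:   # record each document once
                plist.append(doc_id)
    return word_ids, postings

word_ids, postings = build_simple_index(DOCS)
print(word_ids["google"], postings[word_ids["google"]])    # 1 [1, 2, 3]
print(word_ids["founder"], postings[word_ids["founder"]])  # 3 [1, 3]
```

Processing documents in ascending document number keeps every inverted list sorted without an extra sorting pass.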
The inverted index in Figure 3-4 is the simplest kind because it only records which documents contain a given word. An index can record more information. Figure 3-5 shows a more complex inverted index: compared with the basic index in Figure 3-4, the inverted list of each word records not only the document number but also the word frequency (TF), that is, the number of times the word appears in that document. Word frequency is recorded because it is a very important factor in computing the similarity between a query and a document when search results are ranked, so storing it in the inverted list makes scoring easier at ranking time. In the example in Figure 3-5, the word "founder" has word number 7 and its inverted list is {(3; 1)}, where 3 means that the document numbered 3 contains the word and 1 is the word frequency, that is, the word appears only once in document 3. The inverted lists of the other words are read in the same way.
Figure 3-5 inverted index with Word Frequency Information
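A minimal sketch of an index that also stores word frequency, again over hypothetical stand-in documents and with whitespace tokenization as an assumption. Each posting becomes a (document number, TF) pair.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in documents: document number -> text.
DOCS = {
    1: "google maps founder leaves google",
    2: "google acquires startup",
    3: "founder joins google",
}

def build_tf_index(docs):
    """Return word -> [(doc_id, tf), ...] with postings in ascending doc_id order."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for word, tf in counts.items():
            index[word].append((doc_id, tf))
    return dict(index)

tf_index = build_tf_index(DOCS)
print(tf_index["google"])   # [(1, 2), (2, 1), (3, 1)]
print(tf_index["founder"])  # [(1, 1), (3, 1)]
```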
A practical inverted index can record still more information. The index in Figure 3-6 records not only document numbers and word frequency but also two further kinds of information: the "document frequency" of each word (the third column in Figure 3-6) and, inside the inverted list, the positions at which the word appears in each document.
Figure 3-6 inverted indexes with Word Frequency, document frequency, and location information
"Document frequency" is the number of documents in the document set that contain a word. It is recorded for the same reason as word frequency: it is a very important factor in ranking search results. The positions of a word within a document, by contrast, do not have to be recorded. A real index system may or may not include them, because they are not essential to the search system: position information is only useful when "phrase queries" are supported.
Take the word "Las" as an example. Its word number is 8 and its document frequency is 2, which means that two documents in the document set contain the word. The corresponding inverted list is {(3; 1; <4>), (5; 1; <4>)}, meaning that the word appears in document 3 and document 5 with a word frequency of 1 in each. In both documents "Las" appears at position 4, that is, the fourth word of the document is "Las".
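The following sketch builds an index with document frequency and positions, and uses the positions for a two-word phrase query. The documents are hypothetical and were chosen only so that "las" ends up with the same posting as in the text; `build_positional_index` and `phrase_docs` are illustrative names, and positions are counted from 1.

```python
from collections import defaultdict

# Hypothetical documents chosen so that "las" is the fourth word of documents 3 and 5.
DOCS = {
    3: "hotel deals in las vegas",
    5: "cheap flights to las vegas",
    6: "vegas shows",
}

def build_positional_index(docs):
    """Return word -> (document frequency, [(doc_id, tf, positions), ...])."""
    positions = defaultdict(dict)   # word -> {doc_id: [positions]}
    for doc_id in sorted(docs):
        for pos, word in enumerate(docs[doc_id].lower().split(), start=1):
            positions[word].setdefault(doc_id, []).append(pos)
    index = {}
    for word, per_doc in positions.items():
        plist = [(d, len(p), p) for d, p in per_doc.items()]
        index[word] = (len(plist), plist)
    return index

def phrase_docs(index, first, second):
    """Documents in which `second` directly follows `first`."""
    a = {d: p for d, _, p in index.get(first, (0, []))[1]}
    b = {d: p for d, _, p in index.get(second, (0, []))[1]}
    return [d for d in sorted(a.keys() & b.keys())
            if any(pos + 1 in b[d] for pos in a[d])]

index = build_positional_index(DOCS)
print(index["las"])                        # (2, [(3, 1, [4]), (5, 1, [4])])
print(phrase_docs(index, "las", "vegas"))  # [3, 5]
```

Document 6 contains "vegas" but not "las", so it is excluded from the phrase result even though it would match a single-word query for "vegas".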
The inverted index shown in Figure 3-6 is already a fairly complete index system. The index structure of a real search system is essentially the same; the only difference is which concrete data structures are used to implement this logical structure.
With this index, the search engine can easily respond to user queries. For example, if a user enters the query word "Facebook", the search system looks it up in the inverted index and reads out the documents that contain the word; these documents are the search results returned to the user. Using the word frequency and document frequency information, the similarity between each candidate document and the query can be computed, and the results are output in descending order of similarity score. This is part of the internal workflow of the search system; the specific implementation is described in detail in Chapter 5.
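As a rough illustration of how word frequency and document frequency can feed into ranking, the sketch below scores documents with a plain TF-IDF sum. The scoring formula is an assumption chosen for brevity and is not necessarily the method the book describes in Chapter 5; the index contents and the `search` function are hypothetical.

```python
import math

# Hypothetical index: word -> [(doc_id, tf), ...]; document frequency is the list length.
INDEX = {
    "facebook": [(1, 1), (2, 3), (4, 1)],
    "founder":  [(2, 1), (3, 1)],
    "google":   [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)],
}
NUM_DOCS = 5

def search(query, index=INDEX, num_docs=NUM_DOCS):
    """Return (doc_id, score) pairs, highest score first."""
    scores = {}
    for word in query.lower().split():
        plist = index.get(word, [])
        if not plist:
            continue                                  # unknown word: no candidates
        idf = math.log(num_docs / len(plist))         # rarer words weigh more
        for doc_id, tf in plist:
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Sort by descending score; ties are broken by ascending document number.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print([doc for doc, _ in search("Facebook")])          # [2, 1, 4]
print([doc for doc, _ in search("facebook founder")])  # [2, 3, 1, 4]
```

In this sketch a word that occurs in every document, such as "google" here, gets a weight of zero and so cannot separate one document from another.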
4. Word Dictionary
The word dictionary is a very important part of an inverted index. It maintains information about every word that has occurred in the document set and records where each word's inverted list is located in the inverted file. At search time, the user's query words are looked up in the word dictionary to obtain the corresponding inverted lists, which serve as the basis for subsequent ranking.
A large document collection may contain hundreds of thousands or even millions of distinct words, and how quickly a word can be located directly affects search response time. An efficient data structure is therefore needed to build and search the word dictionary. Common choices are the hash-plus-linked-list structure and the tree-based dictionary structure.
4.1 hash and linked list
Figure 1-7 shows this dictionary structure, which consists of two parts: a hash table and collision linked lists.
The main part is the hash table. Each hash table entry stores a pointer to a collision linked list, and the words in that list all share the same hash value. Collision lists are needed because two different words may obtain the same hash value, which in hashing is called a collision. Words with the same hash value are stored in the same linked list so that they can be found later.
Figure 1-7 structure of the dictionary with hash and linked list
This dictionary structure is built up while the index is being created. When a new document is parsed, for each word T in the document, a hash function first computes its hash value; the pointer stored in the corresponding hash table entry is then read to find the collision linked list. If the word is already in the collision list, it appeared in a previously parsed document. If it is not found there, the word is being encountered for the first time and is added to the collision list. Once all documents in the document set have been parsed, the dictionary structure is complete.
Responding to a user's query follows a similar process to building the dictionary, except that a query word that is not in the dictionary is not added to it. Take Figure 1-7 as an example and assume that the user's query is word 3. Hashing the word locates slot 2 in the hash table, and the pointer stored there leads to a collision linked list. Word 3 is compared with the words in the list one by one until it is found, and the inverted list for that word is then read for subsequent processing. If the word is not found, no document in the document set contains it, and the search result is empty.
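The build and lookup paths described above can be sketched as a small hash table with collision linked lists. The toy hash function (sum of byte values), the slot count, and the class names `Node` and `HashDictionary` are assumptions for illustration; a Python list stands in for the pointer to the inverted list.

```python
class Node:
    """One entry in a collision linked list."""
    def __init__(self, word, postings_ref, next_node=None):
        self.word = word
        self.postings_ref = postings_ref   # stand-in for the pointer to the inverted list
        self.next = next_node

class HashDictionary:
    def __init__(self, num_slots=8):
        self.slots = [None] * num_slots    # each slot points to the head of a collision list

    def _slot(self, word):
        # Assumption: a toy hash (sum of byte values) so that collisions are easy to see.
        return sum(word.encode("utf-8")) % len(self.slots)

    def _find(self, word):
        node = self.slots[self._slot(word)]
        while node is not None and node.word != word:
            node = node.next
        return node

    def add(self, word, doc_id):
        """Index-time path: insert the word if it is seen for the first time."""
        node = self._find(word)
        if node is None:
            i = self._slot(word)
            node = Node(word, [], self.slots[i])
            self.slots[i] = node
        if not node.postings_ref or node.postings_ref[-1] != doc_id:
            node.postings_ref.append(doc_id)

    def lookup(self, word):
        """Query-time path: same walk, but a missing word is never added."""
        node = self._find(word)
        return node.postings_ref if node is not None else []

d = HashDictionary(num_slots=4)
for word, doc_id in [("ab", 1), ("ba", 2), ("ab", 3)]:
    d.add(word, doc_id)
print(d._slot("ab") == d._slot("ba"))  # True: the two words collide
print(d.lookup("ab"), d.lookup("ba"))  # [1, 3] [2]
print(d.lookup("zz"))                  # []
```

Lookup cost depends on the length of the collision list, so a real dictionary would want a hash function that spreads words evenly and a table large enough to keep the lists short.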
4.2 Tree Structure
The B-tree (or B+ tree) is another efficient search structure. Figure 1-8 shows a B-tree. Unlike the hash method, a B-tree requires dictionary items to be sorted by size (numeric or character order); the hash method imposes no such requirement.
The B-tree is a hierarchical search structure. Intermediate nodes indicate which subtree stores the dictionary items in a given range and are used to navigate according to the size of the dictionary item. The bottom-level leaf nodes store the address information of the words, from which the word string can be extracted.
Figure 1-8 B-tree search structure
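The navigation idea can be illustrated with a deliberately simplified stand-in. The sketch below is not a real B-tree: it is a static two-level structure with no insertion, node splitting, or rebalancing, with one internal node over sorted leaves. The class name `TwoLevelDictionary`, the leaf size, and the word-to-address entries are all hypothetical.

```python
from bisect import bisect_right

class TwoLevelDictionary:
    """Static two-level search tree: one internal node over sorted leaf nodes."""

    def __init__(self, entries, leaf_size=3):
        items = sorted(entries.items())   # dictionary items must be kept in order
        self.leaves = [items[i:i + leaf_size]
                       for i in range(0, len(items), leaf_size)]
        # The internal node keeps the smallest word of each leaf as a separator key.
        self.separators = [leaf[0][0] for leaf in self.leaves]

    def lookup(self, word):
        # Navigate: pick the last leaf whose smallest word is <= the query word.
        i = bisect_right(self.separators, word) - 1
        if i < 0:
            return None
        for key, address in self.leaves[i]:
            if key == word:
                return address
        return None

# Hypothetical word -> address entries.
tree = TwoLevelDictionary({
    "apple": 0, "banana": 16, "cherry": 40, "date": 48,
    "fig": 72, "grape": 80, "kiwi": 96,
})
print(tree.separators)         # ['apple', 'date', 'kiwi']
print(tree.lookup("fig"))      # 72
print(tree.lookup("elder"))    # None
```

Because the items are ordered, the same structure could also answer range or prefix lookups, which a hash table cannot do without scanning every slot.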
References:
This Is the Search Engine: Detailed Explanation of Core Technologies, Chapter 3