Boolean retrieval and its query optimization

Source: Internet
Author: User

Boolean retrieval answers Boolean queries, that is, queries that combine terms with the operators AND, OR, and NOT.

A simple example: which of Shakespeare's plays contain Brutus and Caesar but not Calpurnia? The Boolean expression is Brutus AND Caesar AND NOT Calpurnia. The most naive approach is to scan every play from start to finish and check whether it contains Brutus and Caesar but not Calpurnia. This approach has several disadvantages: it is very slow, especially for large document collections; handling NOT Calpurnia is not straightforward, because a play can be rejected as soon as the word is found but can only be accepted after it has been scanned to the end; other operations, such as finding the word Romans near countrymen, are hard to support; and the results cannot be ranked, so there is no way to return only the best matches.

The alternative to a linear scan is to index the documents in advance. Suppose we record beforehand, for each document (here, each play), whether it contains each word in the vocabulary. The result is a Boolean term-document incidence matrix, with one row per term and one column per document, as shown below:


To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the row vectors for Brutus, Caesar, and Calpurnia, complement the Calpurnia vector, and then combine the three with a bitwise AND:

110100 AND 110111 AND 101111 = 100100
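The bitwise evaluation above can be sketched in a few lines of Python. This is only an illustration: the three bit vectors are the ones used in the text, while the play titles for columns 2, 3, 5 and 6 are assumed from the standard textbook version of this example, and the names `INCIDENCE`, `row` and `boolean_and_not` are hypothetical helpers.

```python
# Column order of the incidence matrix. Only "Antony and Cleopatra" (1st)
# and "Hamlet" (4th) are named in the text; the other titles are assumed.
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One row per term: the i-th bit is 1 if the i-th play contains the term.
INCIDENCE = {
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
}


def row(term):
    """Return the incidence row of a term as an integer bit vector."""
    return int(INCIDENCE[term], 2)


def boolean_and_not(include, exclude):
    """AND together the rows in `include` and the complemented rows in `exclude`."""
    mask = (1 << len(PLAYS)) - 1          # keeps the complement to 6 bits
    result = mask
    for term in include:
        result &= row(term)
    for term in exclude:
        result &= ~row(term) & mask       # NOT term
    return format(result, f"0{len(PLAYS)}b")


answer = boolean_and_not(["Brutus", "Caesar"], ["Calpurnia"])
matching_plays = [play for play, bit in zip(PLAYS, answer) if bit == "1"]
print(answer, matching_plays)
```

Storing each row as an integer makes AND and NOT single machine operations, which is why the matrix view is attractive for small collections.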

The 1st and 4th elements of the result vector are 1, so the plays that match the query are Antony and Cleopatra and Hamlet.

Retrieval effectiveness is generally evaluated by precision and recall. Precision is the fraction of the returned documents that are relevant: if 80 documents are returned and 20 of them are relevant, precision is 20/80 = 1/4. Recall is the fraction of all relevant documents that are returned: if 20 of the returned documents are relevant but the collection contains 100 relevant documents in total, recall is 20/100 = 1/5. Precision and recall capture two different aspects of retrieval effectiveness, and neither can be dropped. Returning every document gives 100% recall but low precision; returning only a few very reliable results can give 100% precision but low recall. The F-measure, the weighted harmonic mean of precision and recall, is therefore introduced to combine the two into a single number.
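The numbers in the example can be checked with a short Python sketch. The F-measure formula of the original figure is not available here, so the code assumes the common parameterized form F = (1 + β²)PR / (β²P + R), which reduces to the balanced harmonic mean for β = 1; the function names are illustrative.

```python
def precision(relevant_returned, returned):
    """Fraction of returned documents that are relevant."""
    return relevant_returned / returned


def recall(relevant_returned, relevant_total):
    """Fraction of all relevant documents that were returned."""
    return relevant_returned / relevant_total


def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (assumed standard form)."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# The example from the text: 80 returned, 20 of them relevant, 100 relevant overall.
p = precision(20, 80)
r = recall(20, 100)
f1 = f_measure(p, r)
print(p, r, round(f1, 4))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a system cannot score well by maximizing only one of precision or recall.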


Back to the example. For a realistic collection we clearly cannot build and store the term-document matrix in this form. Assume N = 1 million documents (1M), each about 1,000 words (1K) long, with an average of 6 bytes per word (including spaces and punctuation); the whole collection then occupies about 6 GB. Assume further that the vocabulary contains 500,000 (500K) terms. The term-document matrix then has 500K × 1M = 500G entries, each a 0 or a 1. A rough calculation shows how sparse it is: since each document is about 1,000 words long, the matrix contains at most 1 billion (1,000 × 1,000,000) 1s, so at least 99.8% (1 − 1 billion / 500 billion) of its entries are 0.

Obviously, it is better to record only the positions of the 1s than to store the whole term-document matrix. This idea leads to the inverted index. For each term t, we record a list of all documents that contain t, called the postings list of t. Each document is identified by a unique docID, usually a positive integer such as 1, 2, 3, ... A variable-length list is usually used to store the docIDs. The inverted index is shown below:
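The back-of-the-envelope sizes used in the estimate above can be reproduced directly. The variable names are illustrative; the figures are the ones assumed in the text.

```python
num_docs = 1_000_000        # N = 1M documents
words_per_doc = 1_000       # about 1K words per document
bytes_per_word = 6          # average, including spaces and punctuation
vocab_size = 500_000        # 500K distinct terms

collection_bytes = num_docs * words_per_doc * bytes_per_word
matrix_cells = vocab_size * num_docs

# Each document holds at most `words_per_doc` distinct terms,
# so its column contains at most that many 1s.
max_ones = num_docs * words_per_doc
min_zero_fraction = 1 - max_ones / matrix_cells

print(collection_bytes, matrix_cells, max_ones, min_zero_fraction)
```

The bound is loose in practice, since repeated words inside a document contribute only a single 1, which makes the real matrix even sparser.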


Note that the dictionary is usually kept in memory, while the inverted record tables (postings lists) that its pointers refer to are usually stored on disk. The dictionary is sorted alphabetically, and each postings list is sorted by docID. The process of constructing the inverted index is shown below:


The index is built in three steps: generate (term, docID) pairs from the documents, sort the pairs by term, and merge the entries for each term into a single postings list, as shown in the following three figures:
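The three build steps can be sketched in Python as follows. This is a minimal in-memory illustration, not a production indexer: the tokenizer is a deliberately crude regular expression, the two sample documents are illustrative, and `build_inverted_index` is a hypothetical name.

```python
import re
from itertools import groupby


def build_inverted_index(docs):
    """Build {term: sorted list of docIDs} from {docID: text}."""
    # Step 1: generate (term, docID) pairs from the documents.
    pairs = []
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            pairs.append((token, doc_id))

    # Step 2: sort the pairs by term, then by docID.
    pairs.sort()

    # Step 3: merge the pairs of each term into one postings list,
    # dropping repeated docIDs.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        index[term] = postings
    return index


docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
index = build_inverted_index(docs)
print(index["brutus"], index["killed"], index["ambitious"])
```

The length of each postings list is the document frequency of its term, which is the quantity used later to order the terms of a query.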




Consider the query Brutus AND Calpurnia. It is processed in the following steps:

(1) Locate Brutus in the dictionary;

(2) Retrieve its postings list;

(3) Locate Calpurnia in the dictionary;

(4) Retrieve its postings list;

(5) Intersect the two postings lists, as shown in the following figure:


The pseudocode of the merge (intersection) algorithm above is as follows:
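The original pseudocode figure is not available here, so the following Python sketch shows the usual two-pointer merge under the assumption that both lists are sorted by docID. The docIDs in the example are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The docID occurs in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1      # p1 is behind, advance it
        else:
            j += 1      # p2 is behind, advance it
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, so the merge takes time linear in the total length of the two lists; this is why the lists must be kept sorted by docID.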


We now turn to query optimization, the process of organizing the evaluation of a query so that the total work is minimized. A major factor in optimizing a Boolean query is the order in which the postings lists are accessed. A useful heuristic is to process the terms in order of increasing document frequency, that is, in order of increasing postings list length. If we start by intersecting the two shortest postings lists, no intermediate result can be longer than the shortest postings list, so the total work is likely to be smallest. The following figure computes the AND of the two shortest postings lists first.
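The heuristic can be sketched as follows for a purely conjunctive query. The index contents are illustrative, and `order_by_document_frequency` and `intersect_terms` are hypothetical helper names.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two docID-sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def order_by_document_frequency(index, terms):
    """Sort query terms by increasing postings list length."""
    return sorted(terms, key=lambda term: len(index.get(term, [])))


def intersect_terms(index, terms):
    """Evaluate t1 AND t2 AND ... starting with the rarest term."""
    if not terms:
        return []
    ordered = order_by_document_frequency(index, terms)
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break       # nothing left to match, skip the remaining lists
        result = intersect(result, index.get(term, []))
    return result


index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132, 173],
    "calpurnia": [2, 31, 54, 101],
}
query = ["brutus", "caesar", "calpurnia"]
print(order_by_document_frequency(index, query), intersect_terms(index, query))
```

Starting from the rarest term keeps every intermediate result at most as long as the shortest postings list, and the early exit avoids touching the longer lists once the result is empty.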


For an arbitrary Boolean query, we must compute and temporarily store the results of intermediate expressions. In many cases, however, whether because of the nature of the query language or simply because this is the most common type of query that users submit, a query is a pure conjunction of terms (AND only). In this case, rather than treating the intersection of postings lists as a function with two inputs and a separate output, it is more efficient to intersect each retrieved postings list with the intermediate result held in memory, where the intermediate result is initialized with the postings list of the term with the lowest document frequency.

This intersection is asymmetric: the intermediate result is in memory, while the postings list it is intersected with is typically being read from disk. Moreover, the intermediate result is never longer than that postings list, and in many cases it is one or more orders of magnitude shorter. When the lengths differ this widely, several strategies can speed up the intersection. It can be computed in place, by destructively deleting the eliminated elements from the intermediate result or merely marking them as invalid. Alternatively, each element of the intermediate result can be looked up in the long postings list by binary search. Another possibility is to store the long postings list as a hash table, so that each element of the intermediate result can be checked in constant time rather than linear or logarithmic time.
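Two of the strategies for very unequal list lengths, binary search and hashing, can be sketched in Python. Both functions are illustrative sketches that assume the short intermediate result is sorted by docID; the function names and sample data are hypothetical.

```python
from bisect import bisect_left


def intersect_by_binary_search(short, long_sorted):
    """Look up each element of the short list in the long sorted list."""
    result = []
    lo = 0
    for doc_id in short:
        # Search only the part of the long list not yet passed.
        pos = bisect_left(long_sorted, doc_id, lo)
        if pos == len(long_sorted):
            break       # every remaining docID is larger than the whole long list
        if long_sorted[pos] == doc_id:
            result.append(doc_id)
        lo = pos
    return result


def intersect_by_hash(short, long_set):
    """Check each element of the short list against a hash set of the long list."""
    return [doc_id for doc_id in short if doc_id in long_set]


intermediate = [2, 31]
long_postings = [1, 2, 4, 5, 6, 16, 57, 132, 173]
by_search = intersect_by_binary_search(intermediate, long_postings)
by_hash = intersect_by_hash(intermediate, set(long_postings))
print(by_search, by_hash)
```

With a short list of length m and a long list of length n, the binary search variant costs roughly m log n comparisons and the hash variant roughly m lookups, compared with m + n steps for the linear merge; the hash variant trades extra memory for that speed.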



Reference:

Introduction to Information Retrieval, Christopher D. Manning et al.
Modern Information Retrieval (CAS lecture slides), Wang Bin
