Contents
- 1. Westlaw Query
- 2. Introduction to Memex
I. Information Retrieval concepts
Information retrieval (IR) finds the information a user needs in a large collection of unstructured documents;
In a broad sense, information retrieval covers much more than this, for example taking a credit card out of a wallet to check the card number, or searching for files on a computer;
Unstructured data: data without a clear semantic structure, which is hard for computers to process;
Strictly unstructured data does not exist. For example, although text counts as unstructured, it still has some fixed structure, such as a title;
Semi-structured data: data that mixes structured and unstructured information;
Classification: given a set of categories, each document is assigned to one of them. Generally, there are training sets and test sets;
Clustering: automatically groups a given document set into clusters, with no categories specified in advance;
grep is a UNIX command that searches text for lines matching a pattern;
Corpus = document collection;
Ad-hoc retrieval: the document collection is relatively static, while user needs change constantly and each one is a one-off. The user submits a request and the system returns the relevant documents;
Most information retrieval systems perform ad-hoc retrieval;
Information need: what the user originally wants to know, such as "I want an apple and a banana";
Query: the statement actually submitted to the system after preprocessing such as tokenization, such as "want apple banana";
For example, the original information need is "I have an apple and banana"; the query is "apple AND banana";
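The step from information need to query can be sketched in a few lines. This is only an illustration: the stop-word list and the function name to_boolean_query are assumptions made for the example, and joining the remaining terms with AND is one possible choice, not the only one.

```python
import re

# Hypothetical stop-word list for illustration; a real system would use its own.
STOP_WORDS = {"i", "have", "want", "a", "an", "and", "the"}


def to_boolean_query(information_need):
    """Tokenize, lowercase, drop stop words, and join the rest with AND."""
    tokens = re.findall(r"[a-z]+", information_need.lower())
    terms = [t for t in tokens if t not in STOP_WORDS]
    return " AND ".join(terms)


print(to_boolean_query("I have an apple and banana"))  # apple AND banana
```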
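Evaluation metrics for information retrieval systems
Precision: F11 / (F10 + F11), the fraction of retrieved documents that are relevant;
Recall: F11 / (F01 + F11), the fraction of relevant documents that are retrieved;
Here F11 is the number of relevant documents retrieved, F10 the number of non-relevant documents retrieved, and F01 the number of relevant documents not retrieved;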
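The two ratios can be computed directly from sets of document IDs. The sketch below assumes that F11 counts documents that are both retrieved and relevant, F10 those retrieved but not relevant, and F01 those relevant but not retrieved (a reading of the subscripts that is consistent with the two formulas). The function name precision_recall and the sample IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document IDs returned by the system
    relevant:  set of document IDs judged relevant
    """
    f11 = len(retrieved & relevant)  # retrieved and relevant
    f10 = len(retrieved - relevant)  # retrieved but not relevant
    f01 = len(relevant - retrieved)  # relevant but not retrieved
    precision = f11 / (f10 + f11) if retrieved else 0.0
    recall = f11 / (f01 + f11) if relevant else 0.0
    return precision, recall


# Made-up example: 4 documents returned, 3 documents actually relevant.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(precision, recall)
```

In this made-up example two of the four returned documents are relevant, so precision is 2/4, and two of the three relevant documents are found, so recall is 2/3.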
II. Boolean retrieval model
The Boolean retrieval model treats a document simply as a set of words: it only records whether a word appears in a document, not how many times;
The simplest retrieval method is a linear scan of the document collection;
However, to cope with requirements such as proximity (NEAR) queries, we need to build an index in advance;
Term-document matrix: rows represent terms, columns represent documents, and each cell records whether the term occurs in the document. Boolean queries are supported;
However, one drawback is that it consumes too much space: the matrix is sparse, so most cells are 0;
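A tiny term-document matrix makes both the query mechanism and the space problem visible. The sketch below uses three made-up documents; the helper names boolean_and and boolean_not are hypothetical.

```python
# Three made-up documents, keyed by document ID.
docs = {
    0: "apple banana apple",
    1: "banana cherry",
    2: "apple cherry",
}

doc_ids = sorted(docs)
terms = sorted({w for text in docs.values() for w in text.split()})

# One row per term, one column per document; 1 means the term occurs in the
# document at least once (the number of occurrences is ignored).
matrix = {
    t: [1 if t in docs[d].split() else 0 for d in doc_ids]
    for t in terms
}


def boolean_and(a, b):
    return [x & y for x, y in zip(a, b)]


def boolean_not(a):
    return [1 - x for x in a]


# Query: apple AND NOT banana
row = boolean_and(matrix["apple"], boolean_not(matrix["banana"]))
hits = [d for d, bit in zip(doc_ids, row) if bit]
print(hits)  # [2]
```

Every term stores one cell per document, even for documents it never occurs in, which is why the matrix wastes space on a real collection.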
III. Inverted index
An inverted index consists of a dictionary and postings lists. Generally, the dictionary is kept in memory, while the postings are kept on disk;
The dictionary can hold additional information, such as the document frequency (the length of the postings list), to speed up later operations (such as AND);
The dictionary can be compressed so that as much of it as possible fits in memory;
Postings data structures:
1. Singly linked list: suitable for postings that are updated frequently, but requires more space for pointers;
2. Variable-length array: suitable for postings that are updated infrequently. Because storage is contiguous, traversal is faster;
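The structure described in this section can be sketched in memory as follows. This is a simplified illustration: everything lives in one Python dict instead of being split between memory and disk, Python lists play the role of the variable-length array, and the function name build_inverted_index and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to (document frequency, sorted postings list)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The dictionary entry keeps the document frequency next to the postings,
    # so later operations can read the list length without scanning it.
    return {term: (len(ids), sorted(ids)) for term, ids in postings.items()}


docs = {1: "apple banana", 2: "banana cherry", 3: "apple banana cherry"}
index = build_inverted_index(docs)
print(index["banana"])  # (3, [1, 2, 3])
print(index["apple"])   # (2, [1, 3])
```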
IV. Boolean queries and operations
A Boolean query such as A AND B can be answered with a merge algorithm of complexity O(x + y), where x and y are the lengths of the two postings lists;
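The merge for A AND B can be sketched with two pointers, assuming both postings lists are sorted by document ID. The function name intersect and the sample lists are chosen for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result


print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54, 101]))  # [2, 31]
```

Each loop iteration advances at least one pointer, so the number of steps is bounded by the combined length of the two lists.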
A query with several consecutive operators, such as A AND B AND C, can be sped up through query optimization;
Query optimization method: at each step, merge the two shortest postings lists;
For example, if the length of A is 10, the length of B is 20, and the length of C is 15, A and C are merged first, and the intermediate result is then merged with B;
Note: when such a query is processed, the intermediate results are generally kept in memory, while the remaining postings stay on disk;
This method is not guaranteed to be optimal, but it works well in most cases;
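The shortest-first ordering can be sketched as below. It sorts the postings lists by length once and folds them together; since an AND result is never longer than its shortest input, this matches the rule of always merging the two shortest lists. The function names and the sample lists (lengths 10, 20, and 15, as in the example) are hypothetical.

```python
def intersect(p1, p2):
    """Two-pointer intersection of sorted postings lists."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_many(postings_lists):
    """AND together several postings lists, shortest first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:  # nothing can match any more, stop early
            break
        result = intersect(result, plist)
    return result


a = list(range(0, 20, 2))  # length 10
b = list(range(20))        # length 20
c = list(range(0, 45, 3))  # length 15
print(intersect_many([a, b, c]))  # [0, 6, 12, 18]
```

Here A and C are merged first, and the four-element intermediate result is then merged with B.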
For (A OR B) AND (C OR D) AND (E OR F), a heuristic query optimization method is as follows:
Because OR behaves like addition, the length of A OR B is estimated as len(A) + len(B). Therefore, among (A OR B), (C OR D), and (E OR F), we can first take the two with the smallest estimated lengths and perform the AND operation on them;
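The size estimate for OR groups can be sketched as follows. The document frequencies are made-up numbers, and the function name order_or_groups is hypothetical; the sum is only an upper bound on the real size of each OR result.

```python
def order_or_groups(doc_freq, groups):
    """Sort OR groups by estimated result size (sum of document frequencies)."""
    return sorted(groups, key=lambda group: sum(doc_freq[t] for t in group))


# Made-up document frequencies for the six terms.
doc_freq = {"A": 100, "B": 50, "C": 10, "D": 20, "E": 300, "F": 5}
groups = [("A", "B"), ("C", "D"), ("E", "F")]

ordered = order_or_groups(doc_freq, groups)
print(ordered)  # [('C', 'D'), ('A', 'B'), ('E', 'F')]
```

With these numbers the estimates are 150, 30, and 305, so (C OR D) and (A OR B) would be ANDed first and (E OR F) last.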
Processing an A AND NOT B query:
(1) Naive method: compute NOT B to generate a new postings list, which is O(N) where N is the number of documents in the collection, and then merge it with A, which is O(len(A) + len(NOT B));
(2) Efficient method: as with A AND B, advance two pointers and compare, which needs only O(len(A) + len(B)). Take care of the case where pointer B reaches the end before pointer A does: the remaining entries of A all belong to the result;
If B is a high-frequency word, compute NOT B first, because the postings list of NOT B is then small;
If B is a low-frequency word, method (2) is faster;
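Method (2) can be sketched with the same two-pointer walk used for AND, assuming both postings lists are sorted by document ID. The function name and_not and the sample lists are hypothetical.

```python
def and_not(p1, p2):
    """Return the document IDs in p1 that do not appear in p2."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            i += 1  # document is in B, so it is excluded
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])  # B has already passed this ID: keep it
            i += 1
        else:
            j += 1
    # Pointer B reached the end first: everything left in A qualifies.
    result.extend(p1[i:])
    return result


print(and_not([1, 3, 5, 7, 9], [3, 4, 5]))  # [1, 7, 9]
```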
For an A OR B query, the complexity is O(x + y);
For an A OR NOT B query, the complexity is O(N), where N is the number of documents in the collection;
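Both OR variants can be sketched as below. The union walks the two sorted lists once, while OR NOT has to look at every document ID in the collection, which is where the O(N) cost comes from. The function names and sample data are hypothetical, and or_not uses sets for brevity rather than a pointer walk.

```python
def union(p1, p2):
    """Merge two sorted postings lists, keeping each document ID once."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    result.extend(p1[i:])
    result.extend(p2[j:])
    return result


def or_not(p1, p2, all_doc_ids):
    """A OR NOT B: every document ID in the collection is examined once."""
    in_a, in_b = set(p1), set(p2)
    return [d for d in all_doc_ids if d in in_a or d not in in_b]


print(union([1, 3, 5], [2, 3, 6]))             # [1, 2, 3, 5, 6]
print(or_not([1, 3], [2, 3, 4], range(1, 6)))  # [1, 3, 5]
```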
Disadvantages of Boolean queries:
1. AND operations tend to give low recall;
2. OR operations tend to give low precision;
Supplement 1. Westlaw Query
1. A space means OR rather than AND; & means AND;
2. /s means within the same sentence, /p means within the same paragraph, and /k means within k words;
3. ! is a trailing wildcard;
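A trailing wildcard can be answered from a sorted dictionary by locating the prefix with binary search and reading forward. This is only a sketch of the idea, not how Westlaw itself is implemented; the function name expand_trailing_wildcard and the sample vocabulary are hypothetical.

```python
from bisect import bisect_left


def expand_trailing_wildcard(sorted_terms, pattern):
    """Return all dictionary terms that start with the text before '!'."""
    prefix = pattern.rstrip("!")
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


vocabulary = sorted(["liable", "liability", "liberty", "library"])
print(expand_trailing_wildcard(vocabulary, "liab!"))  # ['liability', 'liable']
```

The matching terms can then be combined with OR, using the postings lists of each expanded term.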
Supplement 2. Introduction to Memex
Proposed by Vannevar Bush in his 1945 paper "As We May Think": a device that stores personal archives, notes, books, and other materials, and lets the user access them conveniently;