Summary of chapter 1 of Introduction to Information Retrieval

Source: Internet
Author: User
Document directory
  • 1. Westlaw Query
  • 2. Introduction to Memex
I. Information Retrieval concepts

Information retrieval (IR) is the task of finding the information a user needs within a large collection of unstructured documents;

In a broader sense it covers much more than this, for example taking a credit card out of your wallet to read off the card number, or searching for files on your computer;

Unstructured data: data without a clear semantic structure, which makes it hard for computers to process;

Strictly unstructured data does not really exist. For example, although text counts as unstructured, it still has some fixed format, such as titles;

Semi-structured data: data that mixes structured and unstructured information;

 

Classification: given a set of categories, each document is assigned to one of them. There is generally a training set and a test set;

Clustering: automatically groups a given document set into clusters, with no categories specified in advance;

grep is a UNIX command that scans text linearly for a pattern;

 

Corpus = collection (the set of documents being searched);

 

Ad-hoc retrieval: the document collection is relatively static, while user needs keep changing and are usually one-off. The user submits a request and the system returns the relevant documents;

Most information retrieval systems perform ad-hoc retrieval;

 

Information need: what the user originally wants, stated in natural language, such as "I want an apple and a banana";

Query: the statement given to the system after preprocessing such as tokenization, such as "want apple banana";

For example, if the original information need is "I have an apple and a banana", the query is "apple and banana";

Evaluation metrics for information retrieval systems

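Precision: F11 / (F10 + F11);

Recall: F11 / (F01 + F11);

Here F11 is the number of documents that are retrieved and relevant, F10 the number retrieved but not relevant, and F01 the number relevant but not retrieved;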
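
As a small illustration of the two ratios, here is a minimal sketch. It assumes that F11 counts documents that are both retrieved and relevant, F10 those retrieved but not relevant, and F01 those relevant but not retrieved. The function name precision_recall and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    f11 = len(retrieved & relevant)  # retrieved and relevant
    f10 = len(retrieved - relevant)  # retrieved but not relevant
    f01 = len(relevant - retrieved)  # relevant but not retrieved
    # Guard against empty sets to avoid division by zero.
    precision = f11 / (f11 + f10) if retrieved else 0.0
    recall = f11 / (f11 + f01) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).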

II. Boolean retrieval model

The Boolean retrieval model treats each document as a set of words: it records only whether a word occurs in a document, not how many times;

The simplest retrieval method is a linear scan of the document set;

However, to handle large collections and more flexible operations such as NEAR (proximity) queries, we need to build an index in advance;

Term-document incidence matrix: rows represent terms, columns represent documents, and Boolean queries are answered with bitwise operations on the rows;

One drawback is that it wastes space: the matrix is sparse, so most of the stored entries are 0;
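
A minimal sketch of a term-document incidence matrix, assuming each row is stored as a Python integer used as a bit vector (bit d is 1 when the term occurs in document d). The three toy documents and the helper names build_incidence and matching_docs are invented for the example.

```python
docs = {
    0: "apple banana",
    1: "apple cherry",
    2: "banana cherry apple",
}


def build_incidence(docs):
    """Map each term to an int whose bit d is set if the term occurs in doc d."""
    rows = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            rows[term] = rows.get(term, 0) | (1 << doc_id)
    return rows


def matching_docs(bits, n_docs):
    """Decode a bit vector back into a sorted list of document IDs."""
    return [d for d in range(n_docs) if (bits >> d) & 1]


rows = build_incidence(docs)
all_docs = (1 << len(docs)) - 1  # mask with one bit per document

# apple AND banana AND NOT cherry
result = rows["apple"] & rows["banana"] & (all_docs & ~rows["cherry"])
print(matching_docs(result, len(docs)))  # [0]
```

Every term costs one bit per document whether or not it occurs there, which is the space problem that motivates a sparser structure.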

III. Inverted index

An inverted index consists of a dictionary and postings lists. Generally, the dictionary is kept in memory, while the postings are kept on disk;

Extra information can be stored in the dictionary, such as the document frequency (that is, the length of the postings list), to help later operations such as AND;

The dictionary can be compressed so that as much of it as possible fits in memory;
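
The structure described above can be sketched as follows. This is an in-memory toy, not a disk-backed index: it assumes whitespace tokenization and lowercasing, and the name build_inverted_index and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Return a dictionary: term -> (document frequency, sorted postings list)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The document frequency is simply the length of the postings list.
    return {term: (len(ids), sorted(ids)) for term, ids in postings.items()}


docs = {
    1: "Apple banana",
    2: "apple cherry",
    3: "banana cherry apple",
}
index = build_inverted_index(docs)
print(index["apple"])   # (3, [1, 2, 3])
print(index["banana"])  # (2, [1, 3])
```

Keeping the postings sorted by document ID is what makes the linear-time merges for Boolean operations possible.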

 

Postings data structures:

1. Singly linked list: suits postings that are updated frequently, but needs more space (for the pointers);

2. Variable-length array: suits postings that are updated infrequently. Because storage is contiguous, traversal is faster;

IV. Boolean queries and operations

A Boolean query such as A AND B can be answered with a merge algorithm of complexity O(x + y), where x and y are the lengths of the two postings lists;
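
A minimal sketch of the merge for A AND B, assuming both postings lists are Python lists of document IDs sorted in ascending order. The name intersect and the sample lists are chosen for the example.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document occurs in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller document ID
        else:
            j += 1
    return answer


print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54, 101]))  # [2, 31]
```

Each step advances at least one pointer, so the loop runs at most x + y times.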

A conjunction of several terms, such as A AND B AND C, can be sped up through query optimization;

Query optimization method: at each step, merge the two shortest postings lists;

For example, if the length of A is 10, the length of B is 20, and the length of C is 15, then A and C are merged first and the intermediate result is then merged with B;

Note: when several AND operations are chained, the intermediate results are generally kept in memory, while the remaining postings lists are read from disk;

This order is not guaranteed to be optimal, but it works well in most cases;
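
The shortest-first heuristic can be sketched as below, assuming sorted in-memory postings lists. The helper intersect is the usual two-pointer merge, repeated here so the block runs on its own; intersect_many and the sample lists are hypothetical.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND several postings lists together, shortest list first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:
            break  # an empty intermediate result can never grow again
        result = intersect(result, plist)
    return result


a = [1, 3, 5]
b = [1, 2, 3, 4, 5, 6]
c = [2, 3, 5, 7]
print(intersect_many([a, b, c]))  # [3, 5]
```

Here a and c are merged first (giving [3, 5]) and only then is the result merged with the longest list b. An intermediate result is never longer than the shorter of its two inputs, which is why starting small tends to keep the total work low.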

 

For (A OR B) AND (C OR D) AND (E OR F), a heuristic query optimization method is as follows:

Because OR behaves like addition, the length of A OR B can be estimated (as an upper bound) by len(A) + len(B). We can therefore estimate the sizes of (A OR B), (C OR D) and (E OR F), and first apply AND to the two groups with the smallest estimates;
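
A minimal sketch of this estimate, assuming the document frequency of every term is available from the dictionary. The frequencies below and the function name order_or_groups are invented for illustration.

```python
def order_or_groups(groups, doc_freq):
    """Sort OR groups by the sum of their terms' document frequencies.

    The sum is an upper bound on the size of the OR result, used here
    only as a cheap estimate for choosing the processing order.
    """
    def estimate(group):
        return sum(doc_freq[term] for term in group)

    return sorted(groups, key=estimate)


# Hypothetical document frequencies.
doc_freq = {"A": 10, "B": 20, "C": 5, "D": 8, "E": 30, "F": 1}
groups = [("A", "B"), ("C", "D"), ("E", "F")]

print(order_or_groups(groups, doc_freq))
# [('C', 'D'), ('A', 'B'), ('E', 'F')]
```

With these numbers the estimates are 30, 13 and 31, so (C OR D) and (A OR B) would be combined with AND first, and (E OR F) last.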

 

 

 

Processing an A AND NOT B query:

(1) Naive method: compute NOT B to produce a new postings list, which costs O(N) where N is the number of documents in the collection, and then merge it with A, which costs O(len(A) + len(NOT B));

(2) Efficient method: as with A AND B, advance two pointers and compare, which needs only O(len(A) + len(B)). Take care of the case where the pointer into B reaches the end before the pointer into A does: the remaining entries of A all belong to the result;

If B is a high-frequency term, computing NOT B first can pay off, because the postings list of NOT B is then short;

If B is a low-frequency term, method (2) is faster;
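
Method (2) can be sketched as follows, assuming sorted postings lists of document IDs. The function name and_not and the sample lists are chosen for the example.

```python
def and_not(a, b):
    """Return the documents in a that are not in b, in O(len(a) + len(b))."""
    answer = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Document is in B, so it is excluded from the result.
            i += 1
            j += 1
        elif a[i] < b[j]:
            # B has already moved past a[i], so a[i] cannot be in B.
            answer.append(a[i])
            i += 1
        else:
            j += 1
    # B is exhausted: every remaining document in A is kept.
    answer.extend(a[i:])
    return answer


print(and_not([1, 2, 4, 8, 9], [2, 3, 8]))  # [1, 4, 9]
```

The final extend call handles the case where the pointer into B reaches the end while entries of A remain.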

 

For an A OR B query, the complexity is O(x + y);

For an A OR NOT B query, the complexity is O(N), where N is the number of documents in the collection;
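
A minimal sketch of both cases, assuming sorted postings lists and, for the NOT case, that document IDs run from 0 to n_docs - 1. The names union and or_not and the sample data are hypothetical.

```python
def union(a, b):
    """Merge two sorted postings lists without duplicates, O(len(a) + len(b))."""
    answer = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            answer.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            answer.append(a[i])
            i += 1
        else:
            answer.append(b[j])
            j += 1
    # At most one of these tails is non-empty.
    answer.extend(a[i:])
    answer.extend(b[j:])
    return answer


def or_not(a, b, n_docs):
    """A OR NOT B: must look at every document ID, hence O(n_docs)."""
    in_a, in_b = set(a), set(b)
    return [d for d in range(n_docs) if d in in_a or d not in in_b]


print(union([1, 3, 5], [2, 3, 6]))  # [1, 2, 3, 5, 6]
print(or_not([1], [1, 2, 3], 5))    # [0, 1, 4]
```

The second function shows why NOT is expensive when it is not paired with AND: the result can contain almost the whole collection, so the cost depends on the collection size rather than on the two postings lists.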

 

Disadvantages of Boolean queries:

1. AND operations tend to give recall that is too low;

2. OR operations tend to give precision that is too low;

Supplement 1. Westlaw Query

1. A space means OR rather than AND; & means AND;

2. /s means within the same sentence, /p means within the same paragraph, and /k means within k words;

3. ! is a trailing wildcard, for example liab! matches all words starting with liab;

Supplement 2. Introduction to Memex

Memex was proposed by Vannevar Bush in a 1945 article: a device that stores a person's archives, notes, books, and other materials and lets the user access them conveniently;
