Summary of chapter 1 of Introduction to Information Retrieval

Last Update:2018-12-05 Source: Internet

Author: User


Document directory

1. Westlaw Query
2. Introduction to Memex

I. Information Retrieval concepts

Information retrieval is finding the desired information in a large collection of unstructured documents.

Of course, information retrieval covers far more than this: taking a credit card out of your wallet to check the card number, or searching for files on your computer, are also forms of information retrieval.

Unstructured data: data without a clear semantic structure, which computers cannot process easily.

Strictly unstructured data does not exist. For example, although text is considered unstructured, it still has some fixed structure, such as titles.

Semi-structured data: data that mixes structured and unstructured information.

Classification: given a set of predefined categories, each document is assigned to one of them. Typically there is a training set and a test set.

Clustering: automatically groups the documents of a given collection into clusters, with no categories specified in advance.

grep is a UNIX command for searching text.

Corpus = collection (the set of documents being searched).

Ad hoc retrieval: the document collection is relatively static, while user needs change constantly and are one-off. The user submits a query and the system returns the relevant documents.

Most information retrieval systems perform ad hoc retrieval.

Information need: what the user originally wants, for example "I want an apple and a banana".

Query: the statement actually submitted to the system after preprocessing such as tokenization, for example "want apple banana".

For example, if the original information need is "I have an apple and a banana", the query is "apple and banana".

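Metrics for evaluating an information retrieval system

Here F11 is the number of documents that are retrieved and relevant, F10 the number retrieved but not relevant, and F01 the number relevant but not retrieved.

Precision: F11 / (F10 + F11).

Recall: F11 / (F01 + F11).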
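The two metrics can be computed directly from the set of returned documents and the set of relevant documents. The following is a minimal sketch; the function name `precision_recall` and the sample document IDs are assumptions made for illustration, not part of the original notes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: set of document IDs returned by the system
    relevant:  set of document IDs that are actually relevant
    """
    f11 = len(retrieved & relevant)  # retrieved and relevant
    f10 = len(retrieved - relevant)  # retrieved but not relevant
    f01 = len(relevant - retrieved)  # relevant but not retrieved
    # Guard against empty sets to avoid division by zero.
    precision = f11 / (f11 + f10) if retrieved else 0.0
    recall = f11 / (f11 + f01) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 documents relevant, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(p, r)
```

In this example two of the four returned documents are relevant, and two of the three relevant documents were found.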

II. Boolean retrieval model

The Boolean retrieval model treats a document simply as a set of words: it records only whether a word appears in a document, not how many times.

The most straightforward retrieval method is to scan the document collection linearly.

However, to handle many requirements, such as proximity (near) queries, we need to build an index in advance.

Term-document matrix: rows represent terms and columns represent documents; Boolean queries are answered by combining rows.

One drawback is that it wastes a great deal of space, because the matrix is sparse (most entries are 0).
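As an illustration of how a term-document matrix answers Boolean queries, the sketch below stores each row as a Python integer used as a bit vector, where bit d is 1 if the term occurs in document d. The toy documents and the helper names `build_incidence` and `matching_docs` are assumptions for this example.

```python
docs = {
    0: "apple banana",
    1: "banana cherry",
    2: "apple cherry banana",
}


def build_incidence(docs):
    """Map each term to an int whose bit d is set if the term occurs in document d."""
    rows = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            rows[term] = rows.get(term, 0) | (1 << doc_id)
    return rows


def matching_docs(bits, n_docs):
    """Decode a bit vector back into a list of document IDs."""
    return [d for d in range(n_docs) if (bits >> d) & 1]


rows = build_incidence(docs)
all_docs = (1 << len(docs)) - 1  # bit vector with every document set

# Query: apple AND banana AND NOT cherry
result = rows["apple"] & rows["banana"] & (all_docs & ~rows["cherry"])
print(matching_docs(result, len(docs)))
```

Each row needs one bit per document whether or not the term occurs there, which is where the space cost of a sparse matrix comes from.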

III. Inverted index

An inverted index consists of a dictionary and postings lists. Generally, the dictionary is kept in memory, while the postings lists are kept on disk.

You can store additional information in the dictionary, such as the document frequency (that is, the length of the postings list), to speed up later operations such as and.

You can compress the dictionary so that as much of it as possible fits in memory.

Postings data structures:

1. Singly linked list: suitable for postings that are updated frequently, but requires more space for the pointers.

2. Variable-length array: suitable for postings that are updated infrequently. Because storage is contiguous, traversal is faster.
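The structure described above can be sketched in memory as a dictionary that maps each term to a sorted array of document IDs. This is only an illustration under simplifying assumptions: tokenization is a plain whitespace split, everything lives in memory rather than on disk, and the function name and sample documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Return {term: sorted list of doc IDs}; docs maps doc ID -> text."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)  # a set ignores repeated occurrences
    # Sorted arrays make the later merge operations linear.
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index({
    1: "new home sales",
    2: "home sales rise",
    3: "new rise",
})

# Document frequency = length of the postings list, kept alongside the dictionary.
doc_freq = {term: len(plist) for term, plist in index.items()}
print(index["home"], doc_freq["home"])
```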

IV. Boolean queries and operations

A Boolean query such as A and B can be answered with a merge algorithm whose complexity is O(x + y), where x and y are the lengths of the two postings lists.
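The merge for A and B can be sketched as a two-pointer walk over two sorted postings lists. The function name `intersect` and the sample lists are chosen for illustration.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller doc ID
        else:
            j += 1
    return answer


print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
```

Each step advances at least one pointer, so the loop runs at most x + y times.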

A conjunction of several terms, such as A and B and C, can be sped up through query optimization.

Query optimization method: at each step, merge the two shortest postings lists.

For example, if the postings list of A has length 10, that of B has length 20, and that of C has length 15, then A and C are merged first and the intermediate result is then merged with B.

Note: when a conjunction of several terms is processed, the intermediate results are generally held in memory, while the remaining postings lists are read from disk.

This ordering is not guaranteed to be optimal, but it works well in most cases.
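One way to sketch this ordering is to sort the postings lists by length and fold them into a running intersection, so the intermediate result is never longer than the shortest list. The helper names and the sample lists (lengths 10, 20 and 15, matching the example above) are assumptions for illustration.

```python
def intersect(p1, p2):
    """Two-pointer intersection of sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND together several postings lists, shortest first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:
            break  # an empty intermediate result can never grow again
        result = intersect(result, plist)
    return result


a = list(range(0, 20, 2))  # length 10: even numbers below 20
b = list(range(0, 20))     # length 20: every number below 20
c = list(range(0, 45, 3))  # length 15: multiples of 3 below 45
print(intersect_many([a, b, c]))
```

Here A and C are intersected first, and only their short intermediate result is compared against the longer list B.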

For (A or B) and (C or D) and (E or F), a heuristic query optimization works as follows:

Because or behaves like addition, the length of A or B can be estimated as len(A) + len(B). We therefore estimate the length of each of (A or B), (C or D) and (E or F) in this way, and perform the and operation on the two groups with the shortest estimated lengths first.
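The estimate can be sketched as a sort key: each or-group is scored by the sum of its terms' document frequencies, and groups are processed in increasing order of that score. The function name and the frequency values below are invented purely for illustration.

```python
def order_or_groups(groups, doc_freq):
    """Sort or-groups by estimated result size (sum of document frequencies)."""
    return sorted(groups, key=lambda group: sum(doc_freq[t] for t in group))


# Hypothetical document frequencies for six terms.
doc_freq = {"A": 10, "B": 20, "C": 5, "D": 8, "E": 30, "F": 1}
groups = [("A", "B"), ("C", "D"), ("E", "F")]

# Estimated sizes: (A or B) = 30, (C or D) = 13, (E or F) = 31.
plan = order_or_groups(groups, doc_freq)
print(plan)
```

The sum is an upper bound on the true length of the union, since documents containing both terms are counted twice.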

Processing A and not B queries:

(1) Naive method: compute not B to produce a new postings list (O(n), where n is the number of documents), then merge it with A (O(len(A) + len(not B))).

(2) Efficient method: as with A and B, walk two pointers through the lists and compare; this needs only O(len(A) + len(B)). Take care of the case where the pointer into B reaches the end before the pointer into A: the remaining entries of A all belong to the result.

If B is a high-frequency term, compute not B first, because not B is then a short postings list.

If B is a low-frequency term, method (2) is faster.
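Method (2) can be sketched as a variant of the intersection merge that keeps the entries of A that have no match in B. The function name `and_not` and the sample lists are assumptions for the example.

```python
def and_not(p1, p2):
    """Return doc IDs in sorted list p1 that are absent from sorted list p2."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains B as well, so it is excluded.
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # p2 has already moved past this ID, so it cannot contain it.
            answer.append(p1[i])
            i += 1
        else:
            j += 1
    # B is exhausted first: everything left in A qualifies.
    answer.extend(p1[i:])
    return answer


print(and_not([1, 2, 4, 11, 31], [2, 31, 54]))
print(and_not([1, 5, 9], [1]))
```

The final `extend` handles the case where the pointer into B reaches the end while entries of A remain.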

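For A or B queries, the complexity is O(x + y), where x and y are the lengths of the two postings lists.

For A or not B queries, the complexity is O(n), where n is the number of documents in the collection.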
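The O(x + y) bound for A or B comes from the same two-pointer walk, this time emitting every document ID seen in either list. A minimal sketch, with the function name `union` chosen for illustration:

```python
def union(p1, p2):
    """Merge two sorted postings lists into their sorted union without duplicates."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # present in both lists: emit once
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # At most one of these tails is non-empty.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer


print(union([1, 3, 5], [2, 3, 6, 7]))
```

For A or not B, the result can include almost every document in the collection, which is why its cost depends on n rather than on the two list lengths.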

Disadvantages of Boolean queries:

1. and queries tend to have low recall.

2. or queries tend to have low precision.

Supplement 1. Westlaw Query

1. A space between terms means "or" rather than "and"; & means "and".

2. /s means within the same sentence, /p means within the same paragraph, and /k means within k words.

3. ! is a trailing wildcard.

2. Introduction to Memex

Proposed by Vannevar Bush in a 1945 paper ("As We May Think"): a device that stores a person's archives, notes, books, and other materials and lets the user access them conveniently.

This article is an English version of an article originally published in Chinese on aliyun.com.
