Summary of Chapter 7 of Introduction to Information Retrieval

Last Update:2018-07-23 Source: Internet

Author: User

Tags relative sort square root idf


I. Characteristics of scoring and ranking

For ranked retrieval we only need the relative order of the documents, not their exact scores. The scoring algorithm can therefore be simplified in any way that leaves the relative order unchanged.
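As a small illustration of this point, the sketch below compares full cosine similarity with a simplified score that skips the query norm. The query norm is the same constant for every document, so dropping it cannot change the order. The function names and the toy weights are assumptions made for this example, not part of the original summary.

```python
import math

def cosine(q, d):
    """Full cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    d_norm = math.sqrt(sum(w * w for w in d.values()))
    return dot / (q_norm * d_norm)

def simplified(q, d):
    """Same score without the query norm, which is constant across documents."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    d_norm = math.sqrt(sum(w * w for w in d.values()))
    return dot / d_norm

# Hypothetical toy collection and query.
docs = {
    "d1": {"a": 1.0, "b": 1.0},
    "d2": {"a": 2.0},
    "d3": {"b": 1.0, "c": 3.0},
}
query = {"a": 1.0, "b": 1.0}

rank_full = sorted(docs, key=lambda n: cosine(query, docs[n]), reverse=True)
rank_simple = sorted(docs, key=lambda n: simplified(query, docs[n]), reverse=True)
```

Both rankings come out in the same order even though the individual scores differ.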

II. Efficient scoring and ranking methods

The scoring methods discussed so far compute the cosine similarity between the query and every document and then pick the K highest-scoring documents, which is expensive. If an algorithm can return a good approximation of the top K documents at much lower cost (without scoring every document), we usually prefer it.
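For reference, the exact baseline described above can be sketched as follows: score every document, then select the K best with a heap instead of sorting the whole list. The `scores` dict is a hypothetical stand-in for precomputed cosine similarities.

```python
import heapq

def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical cosine scores for four documents.
scores = {"d1": 0.12, "d2": 0.87, "d3": 0.45, "d4": 0.66}
best = top_k(scores, 2)
```

The heap keeps selection cheap, but every document still has to be scored first; the methods in this section try to avoid exactly that cost.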

General approach: first locate a subset A of the documents that is much smaller than the full collection but contains most of the likely top-scoring candidates, then compute the K highest-scoring documents within A. The methods below all follow this pattern.

1. Index elimination

(1) Only consider the postings of query terms whose idf exceeds a threshold. Terms with low idf are usually stop words with very long postings lists, so skipping them greatly reduces the cost while changing the ranking very little.

After such filtering, fewer than K documents may remain. A tiered index handles this case.

Tiered index: the postings are split into tiers, for example documents with tf above 20 in the first tier and documents with tf above 10 in the second. To find the top K documents, search the first tier; if it yields fewer than K documents, continue with the second tier, and so on.

The tiered index is therefore a way to deal with a result set that would otherwise contain fewer than K documents.
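The tier fallback described above can be sketched as follows for a single term. The tier boundaries follow the tf thresholds from the text; the data layout (a list of term -> doc list dicts) and the function name are assumptions for illustration only.

```python
def tiered_candidates(tiers, term, k):
    """Walk the tiers from highest tf to lowest until at least k docs are found."""
    found = []
    for tier in tiers:
        found.extend(tier.get(term, []))
        if len(found) >= k:
            break  # enough candidates, lower tiers are never touched
    return found

# Hypothetical tiers for one term.
tiers = [
    {"retrieval": ["d3"]},               # tier 1: tf > 20
    {"retrieval": ["d1", "d7"]},         # tier 2: 10 < tf <= 20
    {"retrieval": ["d2", "d5", "d9"]},   # tier 3: everything else
]

one = tiered_candidates(tiers, "retrieval", 1)
two = tiered_candidates(tiers, "retrieval", 2)
```

With K = 1 only the first tier is read; with K = 2 the search falls through to the second tier.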

(2) Only consider documents that contain several of the query terms.
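Both index elimination heuristics can be combined in a few lines. This is a minimal sketch, assuming postings are stored as sets of document ids and idf is computed as log10(N / df); the parameter names `min_idf` and `min_terms` are made up for this example.

```python
import math

def candidate_docs(query_terms, postings, num_docs, min_idf, min_terms):
    """Apply both heuristics: skip low-idf terms, keep docs matching enough terms."""
    hits = {}
    for term in query_terms:
        docs = postings.get(term, set())
        if not docs:
            continue
        if math.log10(num_docs / len(docs)) <= min_idf:
            continue  # heuristic (1): ignore postings of low-idf terms
        for doc in docs:
            hits[doc] = hits.get(doc, 0) + 1
    # heuristic (2): keep only docs containing at least min_terms query terms
    return {doc for doc, count in hits.items() if count >= min_terms}

# Hypothetical collection of 10 documents.
postings = {
    "the": {"d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"},
    "cosine": {"d1", "d2"},
    "score": {"d2", "d3", "d4"},
}
result = candidate_docs(["the", "cosine", "score"], postings,
                        num_docs=10, min_idf=0.3, min_terms=2)
```

Here "the" is dropped because its idf is close to zero, and only the document containing both remaining terms survives.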

2. Champion lists

Champion list: for each term t, precompute the r documents in its postings with the highest tf values. This list is called the champion list of t.

Given a query q, we take the union of the champion lists of all terms in q. This union is the document subset A of the general approach, and cosine similarity is computed only for the documents in A.
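A minimal sketch of champion lists, assuming postings are stored as term -> {doc_id: tf} dicts. The function names and sample data are hypothetical.

```python
import heapq

def build_champion_lists(postings, r):
    """For each term, keep the r documents with the highest tf."""
    return {
        term: set(heapq.nlargest(r, docs, key=docs.get))
        for term, docs in postings.items()
    }

def candidate_set(query_terms, champions):
    """Union of the champion lists of all query terms (the subset A)."""
    a = set()
    for term in query_terms:
        a |= champions.get(term, set())
    return a

# Hypothetical postings with term frequencies.
postings = {
    "quality": {"d1": 5, "d2": 1, "d3": 3},
    "mercy": {"d2": 4, "d4": 2, "d5": 9},
}
champions = build_champion_lists(postings, r=2)
subset_a = candidate_set(["quality", "mercy"], champions)
```

The champion lists are built once at index time; at query time only the union has to be scored.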

3. Static quality scores and ordering

Each document d has a static quality score g(d) that does not depend on the query, and the postings in the inverted index are sorted in descending order of g(d).

The final score is score(q, d) = g(d) + v(q) · v(d), where v(q) · v(d) is the cosine similarity of the query and document vectors.

PageRank, covered in Chapter 21, is one such static quality score; it is based on web link analysis.
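The combined score can be sketched as below, assuming v(q) and v(d) are length-normalized sparse vectors so that their dot product is the cosine similarity. The helper names and the example g(d) values are assumptions for illustration.

```python
import math

def unit(vec):
    """Length-normalize a sparse term -> weight dict."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def net_score(g_d, q_vec, d_vec):
    """score(q, d) = g(d) + v(q) . v(d) with unit-length vectors."""
    q, d = unit(q_vec), unit(d_vec)
    return g_d + sum(w * d.get(t, 0.0) for t, w in q.items())

# Hypothetical static scores; postings are kept in descending g(d) order.
g = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
postings_order = sorted(g, key=g.get, reverse=True)

same_direction = net_score(0.5, {"a": 1.0}, {"a": 2.0})
no_overlap = net_score(0.2, {"a": 1.0}, {"b": 1.0})
```

A document pointing in the same direction as the query gains a full 1.0 on top of g(d), while a document with no shared terms keeps only its static score.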

4. Ranking with high and low lists

For each term t, maintain two postings lists: a high list (the m documents with the highest tf values) and a low list (all remaining documents). Both lists are sorted by g(d).

To retrieve the K highest-scoring documents, score the documents in the high lists first. If the high lists already yield K documents, stop; otherwise continue with the documents in the low lists.
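The sketch below illustrates the high and low lists for a single term. It is a simplification: a real system would process the high lists of all query terms together, and the function names and data are hypothetical.

```python
def split_high_low(postings, g, m):
    """Split one term's postings into high (top-m tf) and low lists, both ordered by g(d)."""
    by_tf = sorted(postings, key=postings.get, reverse=True)
    high = sorted(by_tf[:m], key=g.get, reverse=True)
    low = sorted(by_tf[m:], key=g.get, reverse=True)
    return high, low

def candidates(high, low, k):
    """Use the high list alone if it holds k docs, otherwise fall back to the low list."""
    if len(high) >= k:
        return high
    return high + low

# Hypothetical tf values and static scores.
postings = {"d1": 9, "d2": 7, "d3": 2, "d4": 1}
g = {"d1": 0.1, "d2": 0.8, "d3": 0.5, "d4": 0.9}

high, low = split_high_low(postings, g, m=2)
enough = candidates(high, low, 2)
fallback = candidates(high, low, 3)
```

With K = 2 the low list is never read; with K = 3 the high list is too short and the low list is appended.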

5. Cluster pruning

Leaders: from the N documents, pick √N documents at random as leaders.

Followers: every other document is attached to the leader it is closest to, so each leader ends up with about √N followers.

Query processing: given a query q, compute its cosine similarity with each leader and find the nearest leader L. The document subset A consists of L together with its followers.
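The three steps above can be sketched as follows. This is a minimal illustration, assuming documents are small dense 2-D vectors and leaders are sampled with a fixed seed for reproducibility; all names and data are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two dense 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def build_clusters(docs, seed=0):
    """Pick about sqrt(N) random leaders and attach every other doc to its nearest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)
    followers = {leader: [] for leader in leaders}
    for name, vec in docs.items():
        if name in followers:
            continue  # leaders are not followers of anyone
        nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
        followers[nearest].append(name)
    return followers

def candidate_set(query, docs, followers):
    """Subset A: the leader nearest to the query plus that leader's followers."""
    nearest = max(followers, key=lambda l: cosine(query, docs[l]))
    return [nearest] + followers[nearest]

# Hypothetical collection of N = 9 documents, so 3 leaders.
docs = {
    "d1": (1.0, 0.1), "d2": (0.9, 0.2), "d3": (0.8, 0.3),
    "d4": (0.5, 0.5), "d5": (0.6, 0.4), "d6": (0.4, 0.6),
    "d7": (0.1, 1.0), "d8": (0.2, 0.9), "d9": (0.3, 0.8),
}
followers = build_clusters(docs)
subset_a = candidate_set((1.0, 0.0), docs, followers)
```

At query time only √N leader comparisons are needed before scoring the chosen cluster, instead of N comparisons against the whole collection.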

III. Other considerations

1. Query term proximity

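We prefer documents in which the query terms occur close to each other, because such documents are more likely to be relevant to the query.

Minimum window size: the width, in words, of the smallest window in the document that contains all query terms. For the document "The quality of mercy is not strained" and the query "strained quality", the minimum window size is 6 ("quality of mercy is not strained").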
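The minimum window can be computed with a simple scan. The sketch below is a straightforward quadratic version chosen for readability, not an optimized one; the function name is an assumption.

```python
def smallest_window(doc_tokens, query_terms):
    """Width in words of the smallest window containing every query term, or None."""
    needed = set(query_terms)
    best = None
    for start, token in enumerate(doc_tokens):
        if token not in needed:
            continue  # a minimal window always starts on a query term
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in needed:
                seen.add(doc_tokens[end])
                if seen == needed:
                    width = end - start + 1
                    if best is None or width < best:
                        best = width
                    break
    return best

doc = "the quality of mercy is not strained".split()
w_quality = smallest_window(doc, ["strained", "quality"])
w_mercy = smallest_window(doc, ["strained", "mercy"])
```

For the query "strained quality" this gives 6, matching the example above; for "strained mercy" the window shrinks to 4.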

Soft conjunction: the document does not have to contain all query terms; containing most of them is enough.

Proximity can therefore be included as one more factor in the scoring weights.

IV. Components of a search engine

The indexer generates the various kinds of indexes, such as parametric indexes, zone indexes, k-gram indexes, and tiered indexes.

The vector space model differs from the Boolean retrieval model: the Boolean model only considers whether a term occurs in a document. It ignores how many times the term occurs and assigns no weights.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
