Summary of Chapter 7 of Introduction to Information Retrieval

Source: Internet
Author: User
Tags: relative order, square root, idf
I. Characteristics of scored ranking

For ranked retrieval we only need the relative order of the documents, so the scoring algorithm can be simplified in any way that preserves that relative order;
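As a small illustration of this point (the document names and scores are invented, not from the chapter), any order-preserving transformation of the scores leaves the ranking unchanged, which is why constant factors common to all documents can be dropped from the scoring formula:

```python
# Invented scores for three documents.
scores = {"d1": 0.9, "d2": 0.3, "d3": 0.6}

def rank(s):
    """Document IDs sorted by descending score."""
    return sorted(s, key=s.get, reverse=True)

# A monotone (order-preserving) transformation of every score.
scaled = {d: 2 * v + 1 for d, v in scores.items()}

assert rank(scores) == rank(scaled)  # same ranking either way
```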

II. Fast scoring and ranking methods

Our earlier scoring methods compute the cosine similarity between the query and every document, then extract the K highest-scoring documents; this is very expensive. In practice, if an algorithm can approximately find the top K documents at much lower cost (without scoring every document), we usually prefer that algorithm;

General scheme: precompute a subset A of documents (much smaller than the full collection) that contains most of the plausible candidates, then compute the top K highest-scoring documents within A. The methods below all follow this scheme.

1. Index elimination
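The "top K within A" step above can be sketched in Python (the candidate scores are invented for illustration); a heap finds the K largest in O(|A| log K) without fully sorting A:

```python
import heapq

# Hypothetical precomputed scores for a candidate subset A.
candidate_scores = {"d7": 0.81, "d2": 0.42, "d9": 0.77, "d4": 0.15, "d1": 0.63}
K = 3

# heapq.nlargest returns the K highest-scoring (doc, score) pairs,
# in descending order of score.
top_k = heapq.nlargest(K, candidate_scores.items(), key=lambda kv: kv[1])
```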

(1) Only consider postings for query terms whose IDF exceeds a threshold. Low-IDF terms are usually stop words with very long postings lists, so skipping them greatly reduces the cost;

The documents that survive the threshold may number fewer than K; a tiered index is used to address this;

Tiered index: layer the inverted postings list, e.g. documents with TF over 20 in the first tier and TF over 10 in the second; to find the top K documents, search the first tier, and if it yields fewer than K, continue to the second tier;

The tiered index is thus a way to handle the case where a single tier would return fewer than K documents;
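A minimal sketch of the tiered lookup described above (the tier thresholds follow the example; the terms and document IDs are invented):

```python
# Tier 1 holds postings with tf > 20, tier 2 those with tf > 10.
tiers = [
    {"information": [3, 17], "retrieval": [3]},         # tf > 20
    {"information": [5, 9], "retrieval": [5, 17, 40]},  # tf > 10
]

def top_k_candidates(term, k):
    """Walk the tiers in order, stopping once k candidate docs are found."""
    result = []
    for tier in tiers:
        for doc in tier.get(term, []):
            if doc not in result:
                result.append(doc)
            if len(result) >= k:
                return result
    return result
```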

(2) Only consider documents that contain several (or all) of the query terms;

2. Champion lists

Champion list: for a term t, precompute the r documents in t's postings list with the highest TF values; this list is called t's champion list;

Given a query q, take the union of the champion lists of the terms in q; this union serves as the candidate subset A of the general scheme, and cosine similarities are computed only within A;
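The union step can be sketched as follows (the champion lists here are invented for illustration):

```python
# Hypothetical champion lists: the r best documents for each term.
champions = {
    "mercy": {2, 5, 9},
    "strained": {5, 11},
}

def candidate_set(query_terms):
    """A = union of the champion lists of all query terms."""
    a = set()
    for t in query_terms:
        a |= champions.get(t, set())
    return a
```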

3. Static quality scores

Each document d has a query-independent static score g(d), and the postings in the inverted index are sorted in descending order of g(d);

The final score combines the static score with the query-dependent cosine similarity: net-score(q, d) = g(d) + cos(q, d);
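A sketch of this net score, with query and document represented as sparse term-weight dictionaries (the vector representation is my own choice for illustration):

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def net_score(g_d, q_vec, d_vec):
    # net-score(q, d) = g(d) + cos(q, d)
    return g_d + cosine(q_vec, d_vec)
```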

Chapter 21 introduces PageRank, a static quality score based on web link analysis;

4. High and low lists

For each term t, maintain two lists: a high list (the m documents with the highest TF for t) and a low list (the remaining documents), both sorted by g(d);

To retrieve the top K documents: first score the documents in the high lists; if the high lists alone already yield the K highest-scoring documents, stop; otherwise continue with the documents in the low lists;
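The two-phase traversal can be sketched like this (the lists and ordering are invented; both are assumed pre-sorted as described above):

```python
# Hypothetical high list (m docs with highest tf for the term) and low list
# (the rest), each pre-sorted by descending g(d).
high = [("d3", 0.9), ("d8", 0.7)]
low = [("d1", 0.6), ("d5", 0.4), ("d2", 0.2)]

def top_k(k):
    """Scan the high list first; fall back to the low list only if needed."""
    hits = [doc for doc, _ in high][:k]
    if len(hits) < k:
        hits += [doc for doc, _ in low][:k - len(hits)]
    return hits
```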

5. Cluster pruning

Leaders: from the N documents, pick sqrt(N) documents as leaders;

Followers: each leader has about sqrt(N) followers, namely the documents closest to that leader;

Query processing: given a query q, compute the cosine similarity between q and each leader, find the nearest leader, and take the candidate subset A to be that leader plus its followers;
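The preprocessing and query steps can be sketched as follows. This is a minimal illustration, assuming documents and queries are dense vectors; the random choice of leaders follows the chapter's scheme, but all data is invented:

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_clusters(docs):
    """Pick ~sqrt(N) random leaders; attach each doc to its nearest leader."""
    leaders = random.sample(range(len(docs)), int(math.sqrt(len(docs))))
    followers = {l: [] for l in leaders}
    for i, d in enumerate(docs):
        best = max(leaders, key=lambda l: cosine(d, docs[l]))
        followers[best].append(i)
    return leaders, followers

def query(q, docs, leaders, followers):
    """Candidate subset A = nearest leader plus its followers."""
    best = max(leaders, key=lambda l: cosine(q, docs[l]))
    return sorted({best, *followers[best]})
```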

III. Other factors to consider

1. Query term proximity

We prefer documents in which the query terms occur close together, since such documents are more likely to be relevant to the query;

Smallest window size: for the document "The quality of mercy is not strained" and the query "strained quality", the smallest window containing both terms is 6 words (quality of mercy is not strained);

Soft conjunction: the document does not have to contain all the query terms; containing most of them is enough;

Therefore, proximity can be incorporated into the scoring weights;
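The smallest-window computation above can be sketched with a brute-force scan (fine for illustration, not for production use):

```python
def smallest_window(doc_tokens, query_terms):
    """Width of the smallest token window covering all query_terms.

    Returns None when some query term does not occur in the document.
    """
    needed = set(query_terms)
    best = None
    for i, tok in enumerate(doc_tokens):
        if tok not in needed:
            continue
        seen = set()
        for j in range(i, len(doc_tokens)):
            if doc_tokens[j] in needed:
                seen.add(doc_tokens[j])
            if seen == needed:
                width = j - i + 1
                best = width if best is None else min(best, width)
                break
    return best

doc = "the quality of mercy is not strained".split()
```

For the example above, `smallest_window(doc, ["quality", "strained"])` gives 6, matching the window "quality of mercy is not strained".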

IV. Components of a search engine

The indexer generates the various kinds of indexes, such as the parametric index, zone index, k-gram index, and tiered index;

The vector space model differs from the Boolean retrieval model: the Boolean model only considers whether a term occurs in a document, ignoring how many times it occurs, and has no weights;
