I. Characteristics of scoring and ranking
When ranking by score, we only need the relative order of the documents, so the scoring function can be simplified as long as the relative order stays the same.
II. Efficient scoring and ranking methods
The scoring methods so far compute the cosine similarity between the query and every document and then select the K highest-scoring documents, which is very expensive. In practice, if an algorithm can approximate the top K documents at much lower cost (without scoring every document), we usually prefer it.
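To make the cost of the exhaustive baseline concrete, here is a minimal sketch of it. It assumes documents and queries are sparse vectors stored as `{term: weight}` dicts; the names `cosine` and `exact_top_k` are illustrative, not taken from the source.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def exact_top_k(query, docs, k):
    """Score every document against the query, then keep the k best with a heap."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in docs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]
```

Every document is scored once, so the work grows with the size of the whole collection; the methods in this section try to avoid exactly that.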
General approach: first find a subset A of the documents that is much smaller than the full collection but contains most of the likely candidates, then compute the K highest-scoring documents within A. The methods below all follow this scheme.
1. Index elimination
(1) Only consider the postings of terms whose idf exceeds a threshold. Low-idf terms are usually stop words with very long postings lists, so skipping them greatly reduces the cost.
After this filtering, fewer than K documents may remain; a tiered (hierarchical) index addresses this.
Tiered index: the postings list is split into tiers, for example documents with tf above 20 in the first tier and documents with tf above 10 in the second. To find the top K documents, search the first tier first; if it yields fewer than K, continue in the second tier.
The tiered index is therefore a way to handle the case where fewer than K documents would otherwise be returned.
(2) Only consider documents that contain several of the query terms.
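The two elimination rules and the tiered fallback can be sketched as follows. This is a simplified illustration under stated assumptions: a toy index of the form `{term: {doc_id: tf}}`, idf computed as log10(N / df), and hypothetical function names (`eliminate`, `build_tiers`, `search_tiers`) and parameters (`idf_threshold`, `min_terms`) that do not come from the source.

```python
import math

def eliminate(query_terms, index, n_docs, idf_threshold, min_terms):
    """Keep docs containing at least min_terms of the high-idf query terms."""
    hits = {}
    for term in query_terms:
        postings = index.get(term)
        if not postings:
            continue  # term does not occur in the index
        idf = math.log10(n_docs / len(postings))
        if idf <= idf_threshold:
            continue  # low-idf term (likely a stop word): skip its long postings
        for doc_id in postings:
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return {doc_id for doc_id, n in hits.items() if n >= min_terms}

def build_tiers(postings, thresholds):
    """Split one term's postings {doc_id: tf} into tiers.

    thresholds must be descending, e.g. [20, 10]: tier 0 holds tf > 20,
    tier 1 holds 10 < tf <= 20. Docs below the last threshold are dropped.
    """
    tiers = [[] for _ in thresholds]
    for doc_id, tf in postings.items():
        for i, threshold in enumerate(thresholds):
            if tf > threshold:
                tiers[i].append(doc_id)
                break
    return tiers

def search_tiers(tiers, k):
    """Walk tiers from the top and stop once at least k docs are collected."""
    found = []
    for tier in tiers:
        found.extend(tier)
        if len(found) >= k:
            break
    return found
```

Note that `search_tiers` returns whole tiers, so it may return more than k candidates; the final top-K selection would still be done by scoring those candidates.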
2. Champion lists
Champion list: for each term t, precompute the r documents in t's postings list with the highest tf values; this list is called the champion list of t.
Given a query q, take the union of the champion lists of the terms in q. This union is the document subset A from the general approach, and cosine similarity is computed only within A.
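A minimal sketch of building champion lists and forming the candidate set A, assuming a toy index of the form `{term: {doc_id: tf}}`. The function names are illustrative only.

```python
import heapq

def build_champion_lists(index, r):
    """For each term, keep the r doc ids with the highest tf.

    index maps term -> {doc_id: tf}.
    """
    return {
        term: set(heapq.nlargest(r, postings, key=postings.get))
        for term, postings in index.items()
    }

def candidate_set(query_terms, champions):
    """Union of the champion lists of the query terms: the subset A."""
    result = set()
    for term in query_terms:
        result |= champions.get(term, set())
    return result
```

The champion lists are built once at indexing time; at query time only the union is computed, and cosine scoring is then restricted to that set. Because r is fixed in advance and independent of K, the union can contain fewer than K documents.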
3. Static quality scores
Each document has a query-independent static score g(d), and the postings in the inverted index are sorted in descending order of g(d).
The final score is score(q, d) = g(d) + v(q) · v(d), where v(q) · v(d) is the cosine similarity of the query and document vectors.
PageRank, covered in Chapter 21, is one such static quality score; it is based on web link analysis.
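The combined score and the g(d)-ordered postings can be sketched as below. This assumes sparse `{term: weight}` vectors and a dict `g` of static scores; the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def net_score(g_d, q_vec, d_vec):
    """score(q, d) = g(d) + cosine(q, d)."""
    return g_d + cosine(q_vec, d_vec)

def order_postings_by_quality(doc_ids, g):
    """Sort one term's postings by descending static score g(d)."""
    return sorted(doc_ids, key=lambda d: g[d], reverse=True)
```

Because every postings list uses the same g(d) ordering, the lists can still be intersected in a single merge pass, and high-quality documents are reached first.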
4. Ranking with high and low lists
For each term t, maintain two lists: a high list (the m documents with the highest tf values) and a low list (the remaining documents); both are sorted by g(d).
To retrieve the K highest-scoring documents: score the documents in the high lists first; if they already yield K documents, stop; otherwise take the rest from the low lists.
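The procedure above can be sketched for a single term as follows. This is a simplification: a real query would merge the high lists of all its terms before falling back to the low lists. The names `split_high_low` and `top_k_high_low`, and the caller-supplied `score` function, are assumptions made for illustration.

```python
import heapq

def split_high_low(postings, m, g):
    """Split one term's postings {doc_id: tf} into high and low lists.

    high = the m docs with the largest tf, low = the rest;
    both are ordered by descending static score g(d).
    """
    high_ids = set(heapq.nlargest(m, postings, key=postings.get))
    by_quality = sorted(postings, key=lambda d: g[d], reverse=True)
    high = [d for d in by_quality if d in high_ids]
    low = [d for d in by_quality if d not in high_ids]
    return high, low

def top_k_high_low(high, low, score, k):
    """Score the high list first; use the low list only if it is needed."""
    ranked = sorted(high, key=score, reverse=True)
    if len(ranked) < k:
        # Not enough documents in the high list: fill up from the low list.
        ranked += sorted(low, key=score, reverse=True)
    return ranked[:k]
```

Documents from the low list are appended after those from the high list, so the result is an approximation of the true top K rather than a guaranteed exact ranking.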
5. Cluster pruning
Leaders: select √N of the N documents as leaders (usually chosen at random).
Followers: every other document is attached to its nearest leader, so each leader has about √N followers.
Query processing: given a query q, compute its cosine similarity with each leader and find the nearest leader L; the document subset A is L together with L's followers.
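A minimal sketch of cluster pruning, assuming sparse `{term: weight}` document vectors keyed by sortable document ids. The optional `leaders` argument and the `seed` parameter are illustrative additions so that the example is reproducible.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_clusters(docs, leaders=None, seed=0):
    """Pick about sqrt(N) leaders and attach each other doc to its nearest one."""
    if leaders is None:
        n_leaders = max(1, round(math.sqrt(len(docs))))
        leaders = random.Random(seed).sample(sorted(docs), n_leaders)
    clusters = {leader: [] for leader in leaders}
    for doc_id, vec in docs.items():
        if doc_id in clusters:
            continue  # leaders are not followers of anyone
        nearest = max(leaders, key=lambda lead: cosine(vec, docs[lead]))
        clusters[nearest].append(doc_id)
    return clusters

def candidates(query, docs, clusters):
    """Subset A: the leader nearest to the query plus that leader's followers."""
    nearest = max(clusters, key=lambda lead: cosine(query, docs[lead]))
    return [nearest] + clusters[nearest]
```

At query time only about √N leader comparisons are needed, followed by scoring the roughly √N documents in the chosen cluster.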
III. Other considerations
1. Query term proximity
We prefer documents in which the query terms occur close to each other, since such documents are more likely to be relevant to the query.
Minimum window size: for the document "The quality of mercy is not strained" and the query "strained quality", the smallest window is 6 words (quality of mercy is not strained).
Soft conjunction: a document does not have to contain every query term; containing most of them is enough.
Proximity can therefore be added as a factor in the scoring weights.
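The minimum window size described above can be computed with a sliding window over the document tokens. This is a sketch that assumes the document is already tokenized and lowercased; `min_window` is an illustrative name.

```python
def min_window(doc_tokens, query_terms):
    """Length in tokens of the smallest window containing every query term.

    Returns None if some query term does not occur in the document.
    """
    needed = set(query_terms)
    if not needed:
        return 0
    counts = {}  # query term -> occurrences inside the current window
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
        # Shrink from the left while the window still covers all query terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            leaving = doc_tokens[left]
            if leaving in counts:
                counts[leaving] -= 1
                if counts[leaving] == 0:
                    del counts[leaving]
            left += 1
    return best

doc = "the quality of mercy is not strained".split()
print(min_window(doc, ["strained", "quality"]))  # 6
```

A smaller window could then be turned into a larger proximity weight; how exactly to combine it with the cosine score is left open here.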
IV. Components of a search engine
The indexer generates the various indexes, such as the parametric index, zone index, k-gram index, and tiered (hierarchical) index.
The vector space model differs from the Boolean retrieval model: the Boolean model only considers whether a term occurs in a document, not how many times it occurs, and assigns no weights.
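The difference can be illustrated with a toy comparison. This sketch assumes tokenized documents and uses a raw term-frequency sum as a stand-in for a vector space score; a real system would use tf-idf weights and cosine normalization.

```python
from collections import Counter

def boolean_match(query_terms, doc_tokens):
    """Boolean AND: True if every query term is present; counts are ignored."""
    present = set(doc_tokens)
    return all(term in present for term in query_terms)

def tf_score(query_terms, doc_tokens):
    """Weighted score: sum of the raw term frequencies of the query terms."""
    tf = Counter(doc_tokens)
    return sum(tf[term] for term in query_terms)

d1 = "cat cat cat hat".split()
d2 = "cat hat".split()
query = ["cat", "hat"]
# Both documents match the Boolean query, but only the weighted score ranks them.
print(boolean_match(query, d1), boolean_match(query, d2))  # True True
print(tf_score(query, d1), tf_score(query, d2))  # 4 2
```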