I. Ranked Retrieval
In the previous section we discussed Boolean queries, where a document either matches or does not. When many documents match, we need to rank (sort) the results.
II. Parametric and Zone Indexes
Besides its text, a document has metadata such as its creation time and title, so we can also restrict results by these fields, e.g. requiring that matching documents were published in 2010.
Parametric index: the field's values are constrained, for example to a fixed range; a date field is a typical parametric index.
Zone index: the field's values are unconstrained, such as the title, which can be arbitrary text.
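As a minimal sketch of a parametric restriction, the following filters Boolean match results by a metadata field. The document ids, field names, and values are all illustrative, not from the text:

```python
# Sketch: combining a Boolean result list with a parametric (date) restriction.
# doc_meta maps doc id -> metadata; all values here are illustrative.
doc_meta = {
    1: {"year": 2009, "title": "intro to retrieval"},
    2: {"year": 2010, "title": "ranked retrieval"},
    3: {"year": 2010, "title": "boolean queries"},
}

def restrict_by_year(doc_ids, year):
    """Keep only documents whose parametric 'year' field equals `year`."""
    return [d for d in doc_ids if doc_meta[d]["year"] == year]

print(restrict_by_year([1, 2, 3], 2010))  # -> [2, 3]
```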
III. Weighted Zone Scoring
A document has zones f1, f2, f3, and each zone carries a different weight.
Each zone has a weight w1, w2, w3; the weights are determined by machine learning from a training set.
s(q, d) = w1·s(q, f1) + w2·s(q, f2) + w3·s(q, f3)
So s(q, d) is the weighted sum of the per-zone scores between the query and the document.
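The formula above can be sketched as follows for a one-term query. The per-zone match score here is a simple 0/1 occurrence test, and the zone names and weights are illustrative:

```python
# Weighted zone scoring sketch for a one-term query:
#   s(q, d) = w1*s(q, f1) + w2*s(q, f2) + w3*s(q, f3)
# Here s(q, f_i) is 1.0 if the term occurs in zone f_i and 0.0 otherwise.
def zone_match(term, zone_text):
    return 1.0 if term in zone_text.split() else 0.0

def weighted_zone_score(term, doc, weights):
    return sum(w * zone_match(term, doc[zone]) for zone, w in weights.items())

doc = {
    "title": "ranked retrieval",
    "body": "boolean queries return unranked matches",
    "anchor": "retrieval",
}
weights = {"title": 0.5, "body": 0.2, "anchor": 0.3}  # illustrative; sum to 1
print(round(weighted_zone_score("retrieval", doc, weights), 2))  # 0.5 + 0.3 = 0.8
```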
IV. Term Frequency
For a given query, we iterate over its terms, compute each term's score in the document separately, and add these per-term scores to obtain the query's score for the document.
The weight of a term in a document is the number of times the term appears in it.
Bag-of-words model: the order of the terms in the document is ignored; only the number of occurrences matters.
This is a great improvement over the Boolean retrieval model.
TF: the number of occurrences of a term in a document.
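A minimal sketch of computing TF as a bag-of-words count, using Python's `Counter` (the sample sentence is illustrative):

```python
from collections import Counter

# Bag-of-words: discard term order, keep only occurrence counts (TF).
def term_frequencies(text):
    return Counter(text.lower().split())

tf = term_frequencies("to be or not to be")
print(tf["to"], tf["be"], tf["not"])  # 2 2 1
```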
If a term's weight depends only on its occurrence count, that can be problematic: a common word such as "the" may appear many times in a document, giving it a very high TF, even though the word is not important at all.
For convenience, the TF value is generally stored in the postings of the inverted index, while the IDF value is stored in the dictionary.
Relying on TF alone as the weight assumes that every term is equally important to the document.
So we need to introduce additional weighting schemes.
V. Inverse Document Frequency
DF: the number of documents in which a term appears.
IDF: inversely related to DF; the more documents a term appears in, the lower its IDF value.
CF: the total number of times a term appears in the whole document collection.
idf_t = log(N / df_t), where N is the number of documents in the collection.
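As a worked example of idf_t = log(N / df_t), assuming base-10 logarithms and N = 1,000,000 documents; the document-frequency counts below are illustrative, chosen so the idf values match the figures used later in this section:

```python
import math

# idf_t = log10(N / df_t); N and the df counts below are illustrative.
N = 1_000_000
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
idf = {t: math.log10(N / d) for t, d in df.items()}
print({t: round(v, 1) for t, v in idf.items()})
# {'auto': 2.3, 'best': 1.3, 'car': 2.0, 'insurance': 3.0}
```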
Therefore, we combine TF and IDF into TF-IDF for the weight calculation.
The score of a document is the sum of the TF-IDF scores of each query term for that document.
Document scoring method:
Given a query and a document collection, we need to find the documents in the collection that are relevant to the query.
We split the query into a set of terms; for each term we compute its TF-IDF value for a document, and the sum of all these TF-IDF values is the document's score.
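The scoring method above can be sketched as follows, using a toy three-document collection, raw counts for TF, and base-10 logs for IDF (all names and texts are illustrative):

```python
import math
from collections import Counter

# score(q, d) = sum over query terms t of tf(t, d) * idf(t)
docs = ["the car is fast", "the car insurance is cheap", "best insurance rates"]
N = len(docs)
tfs = [Counter(d.split()) for d in docs]

def idf(term):
    df = sum(1 for tf in tfs if term in tf)  # document frequency of the term
    return math.log10(N / df) if df else 0.0

def score(query, doc_id):
    return sum(tfs[doc_id][t] * idf(t) for t in query.split())

best_doc = max(range(N), key=lambda i: score("car insurance", i))
print(best_doc)  # document 1 contains both query terms
```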
VI. Vector Space Model (VSM)
Main idea: represent each document as a vector of term weights.
This gives a term-document matrix, where each entry is the weight of a particular term in a particular document, indicating how important that term is to the document.
Euclidean normalization: given a vector (n1, n2, n3, n4), normalize it to (n1/n, n2/n, n3/n, n4/n), where n = sqrt(n1^2 + n2^2 + n3^2 + n4^2) is its Euclidean length.
1. Similarity between two documents
0. Represent each document as a vector.
1. Normalize the two document vectors separately.
2. Take the dot product of the two vectors.
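The three steps above can be sketched with term-count vectors (the sample texts are illustrative):

```python
import math
from collections import Counter

def normalize(vec):
    """Euclidean normalization: divide each component by the vector's length."""
    n = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / n for t, v in vec.items()}

def similarity(doc_a, doc_b):
    a = normalize(Counter(doc_a.split()))  # steps 0 and 1 for document a
    b = normalize(Counter(doc_b.split()))  # steps 0 and 1 for document b
    return sum(w * b.get(t, 0.0) for t, w in a.items())  # step 2: dot product

print(round(similarity("car insurance", "car insurance rates"), 3))  # 0.816
```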
With this method, a user can find web pages similar to a given web page.
2. Query Vector
The query is itself a vector, and the document collection is represented as a set of document vectors.
In the query vector, each value represents the importance (weight) of the corresponding term in the query; the weights can be TF, IDF, or TF-IDF values.
For example:
The query weights are TF-IDF values, normalized.
The document vector weights are TF values, normalized.
Finally, the dot product of the two vectors is the score s(q, d).
In the example, computing the weights this way gives the query vector (0, 1.3, 2.0, 3.0).
Note: the query vector's weights depend on the chosen weighting scheme. For the query "auto car", the vector is not necessarily (1, 0, 1, 0): if TF is not used and only IDF is, each term is weighted by its IDF value, giving (2.3, 1.3, 2.0, 3.0).
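A sketch of this query-document score, assuming the vocabulary (auto, best, car, insurance) with the idf values quoted above, a hypothetical query "best car insurance", and a hypothetical document TF vector. For simplicity the query vector is left unnormalized here; normalizing it would scale every document's score by the same constant and not change the ranking:

```python
import math

terms = ["auto", "best", "car", "insurance"]  # vocabulary from the example
idf = [2.3, 1.3, 2.0, 3.0]                    # idf values from the example

def query_tfidf(query):
    """Query weights: tf(t, q) * idf(t) for each vocabulary term."""
    return [query.split().count(t) * idf[i] for i, t in enumerate(terms)]

def normalize(vec):
    n = math.sqrt(sum(v * v for v in vec))
    return [v / n for v in vec] if n else vec

q = query_tfidf("best car insurance")  # -> [0.0, 1.3, 2.0, 3.0]
d = normalize([1, 0, 1, 2])            # document TF vector, Euclidean-normalized
score = sum(qi * di for qi, di in zip(q, d))
print(round(score, 3))  # 3.266
```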